* [PATCH RFC 00/15] KVM: Dirty ring interface
@ 2019-11-29 21:34 Peter Xu
  2019-11-29 21:34 ` [PATCH RFC 01/15] KVM: Move running VCPU from ARM to common code Peter Xu
                   ` (17 more replies)
  0 siblings, 18 replies; 123+ messages in thread
From: Peter Xu @ 2019-11-29 21:34 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Sean Christopherson, Paolo Bonzini, Dr . David Alan Gilbert,
	peterx, Vitaly Kuznetsov

Branch is here: https://github.com/xzpeter/linux/tree/kvm-dirty-ring

Overview
============

This is continued work from Lei Cao <lei.cao@stratus.com> and Paolo
on the KVM dirty ring interface.  To keep things simple, I'll start
again from version 1 as an RFC.

The new dirty ring interface is another way to collect dirty pages for
the virtual machine, but it differs from the existing dirty logging
interface in a few ways, mainly:

  - Data format: The dirty data is kept in a ring rather than a
    bitmap, so the amount of data to sync for dirty logging no longer
    depends on the size of guest memory, only on the speed of
    dirtying.  Also, the dirty ring is per-vcpu (currently plus
    another per-vm ring, so the total number of rings is N+1), while
    the dirty bitmap is per-vm.

  - Data copy: Syncing dirty pages no longer requires copying data;
    instead the ring is shared between userspace and the kernel by
    page sharing (mmap() on either the vm fd or the vcpu fd).

  - Interface: Instead of the old KVM_GET_DIRTY_LOG and
    KVM_CLEAR_DIRTY_LOG ioctls, the ring uses a new ioctl,
    KVM_RESET_DIRTY_RINGS, to put the collected dirty pages back into
    the write-protected state (it works like KVM_CLEAR_DIRTY_LOG, but
    is ring based).

And more.
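
As a rough picture of how the new interface is meant to be driven from
userspace, here is a minimal sketch only, assuming the structures and
ioctls introduced by this series; the mmap setup and track_dirty_page()
are hypothetical, and error handling is omitted:

  /* Assumed to be set up elsewhere:
   *   run:  mmap(vcpu_fd, offset 0), i.e. the kvm_run page
   *   gfns: mmap(vcpu_fd, KVM_DIRTY_LOG_PAGE_OFFSET * page_size)
   *   size: number of ring entries (a power of two)
   */
  void harvest_vcpu_ring(int vm_fd, struct kvm_run *run,
                         struct kvm_dirty_gfn *gfns, uint32_t size)
  {
          struct kvm_dirty_ring_indexes *idx = &run->vcpu_ring_indexes;
          uint32_t fetch = idx->fetch_index;
          /* avail_index is advanced by the kernel as pages are dirtied */
          uint32_t avail = __atomic_load_n(&idx->avail_index,
                                           __ATOMIC_ACQUIRE);

          while (fetch != avail) {
                  struct kvm_dirty_gfn *e = &gfns[fetch & (size - 1)];

                  /* e->slot is (as_id << 16) | slot_id, e->offset the gfn */
                  track_dirty_page(e->slot, e->offset);  /* hypothetical */
                  fetch++;
          }

          /* Tell the kernel how far we got, then re-protect the pages */
          idx->fetch_index = fetch;
          ioctl(vm_fd, KVM_RESET_DIRTY_RINGS, 0);
  }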

I would appreciate it if reviewers could start with the patch "KVM:
Implement ring-based dirty memory tracking", especially its
documentation update, for the big picture.  That also saves me from
copying most of it into the cover letter again.

I marked this series as RFC because I'm still uncertain about this
change to vcpu_enter_guest():

        if (kvm_check_request(KVM_REQ_DIRTY_RING_FULL, vcpu)) {
                vcpu->run->exit_reason = KVM_EXIT_DIRTY_RING_FULL;
                /*
                 * If this is requested, it means that we've
                 * marked the dirty bit in the dirty ring BUT
                 * we've not written the data.  Do it now.
                 */
                r = kvm_emulate_instruction(vcpu, 0);
                r = r >= 0 ? 0 : r;
                goto out;
        }

I do a kvm_emulate_instruction() when the dirty ring reaches the soft
limit and we want to exit to userspace, however I'm not really sure
whether that could have any side effects.  I'd appreciate any comments
on the above, or on anything else.
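
On the userspace side, handling the new exit could look roughly like
the sketch below (again only a sketch based on this series;
harvest_vcpu_ring() is the hypothetical helper sketched earlier in this
letter):

  for (;;) {
          int ret = ioctl(vcpu_fd, KVM_RUN, 0);

          if (ret < 0 && errno == EINTR)
                  continue;

          switch (run->exit_reason) {
          case KVM_EXIT_DIRTY_RING_FULL:
                  /*
                   * The ring hit the soft limit: harvest the entries
                   * and reset the rings before re-entering the guest.
                   */
                  harvest_vcpu_ring(vm_fd, run, gfns, size);
                  break;
          /* ... handle the other exit reasons as usual ... */
          }
  }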

Tests
===========

I wanted to continue working on the QEMU part, but since the
interface might still be prone to change, I'm posting this series
first.  To make sure it at least works, I've provided unit tests
together with the series.  The unit tests should exercise the series
along at least three major paths:

  (1) ./dirty_log_test -M dirty-ring

      This tests asynchronous ring operations, which should be the
      major working mode for the dirty ring interface: while the
      kernel is queuing more data, userspace is collecting at the same
      time.  The ring can hardly become full in this mode, because in
      most cases the collection is fast.

  (2) ./dirty_log_test -M dirty-ring -c 1024

      This sets the ring size to be very small so that the soft-full
      condition always triggers (soft-full is a soft limit on the ring
      state: when the dirty ring reaches the soft limit, it does a
      userspace exit and lets userspace collect the data).

  (3) ./dirty_log_test -M dirty-ring-wait-queue

      This solely tests the extreme case where the ring is full.
      When the ring is completely full, the thread (whether a vcpu
      thread or not) is put onto a per-vm waitqueue, and
      KVM_RESET_DIRTY_RINGS will wake the threads up (on the
      assumption that by then the ring is no longer full).

Thanks,

Cao, Lei (2):
  KVM: Add kvm/vcpu argument to mark_dirty_page_in_slot
  KVM: X86: Implement ring-based dirty memory tracking

Paolo Bonzini (1):
  KVM: Move running VCPU from ARM to common code

Peter Xu (12):
  KVM: Add build-time error check on kvm_run size
  KVM: Implement ring-based dirty memory tracking
  KVM: Make dirty ring exclusive to dirty bitmap log
  KVM: Introduce dirty ring wait queue
  KVM: selftests: Always clear dirty bitmap after iteration
  KVM: selftests: Sync uapi/linux/kvm.h to tools/
  KVM: selftests: Use a single binary for dirty/clear log test
  KVM: selftests: Introduce after_vcpu_run hook for dirty log test
  KVM: selftests: Add dirty ring buffer test
  KVM: selftests: Let dirty_log_test async for dirty ring test
  KVM: selftests: Add "-c" parameter to dirty log test
  KVM: selftests: Test dirty ring waitqueue

 Documentation/virt/kvm/api.txt                | 116 +++++
 arch/arm/include/asm/kvm_host.h               |   2 -
 arch/arm64/include/asm/kvm_host.h             |   2 -
 arch/x86/include/asm/kvm_host.h               |   5 +
 arch/x86/include/uapi/asm/kvm.h               |   1 +
 arch/x86/kvm/Makefile                         |   3 +-
 arch/x86/kvm/mmu/mmu.c                        |   6 +
 arch/x86/kvm/vmx/vmx.c                        |   7 +
 arch/x86/kvm/x86.c                            |  12 +
 include/linux/kvm_dirty_ring.h                |  67 +++
 include/linux/kvm_host.h                      |  37 ++
 include/linux/kvm_types.h                     |   1 +
 include/uapi/linux/kvm.h                      |  36 ++
 tools/include/uapi/linux/kvm.h                |  47 ++
 tools/testing/selftests/kvm/Makefile          |   2 -
 .../selftests/kvm/clear_dirty_log_test.c      |   2 -
 tools/testing/selftests/kvm/dirty_log_test.c  | 452 ++++++++++++++++--
 .../testing/selftests/kvm/include/kvm_util.h  |   6 +
 tools/testing/selftests/kvm/lib/kvm_util.c    | 103 ++++
 .../selftests/kvm/lib/kvm_util_internal.h     |   5 +
 virt/kvm/arm/arm.c                            |  29 --
 virt/kvm/arm/perf.c                           |   6 +-
 virt/kvm/arm/vgic/vgic-mmio.c                 |  15 +-
 virt/kvm/dirty_ring.c                         | 156 ++++++
 virt/kvm/kvm_main.c                           | 315 +++++++++++-
 25 files changed, 1329 insertions(+), 104 deletions(-)
 create mode 100644 include/linux/kvm_dirty_ring.h
 delete mode 100644 tools/testing/selftests/kvm/clear_dirty_log_test.c
 create mode 100644 virt/kvm/dirty_ring.c

-- 
2.21.0



* [PATCH RFC 01/15] KVM: Move running VCPU from ARM to common code
  2019-11-29 21:34 [PATCH RFC 00/15] KVM: Dirty ring interface Peter Xu
@ 2019-11-29 21:34 ` Peter Xu
  2019-12-03 19:01   ` Sean Christopherson
  2019-11-29 21:34 ` [PATCH RFC 02/15] KVM: Add kvm/vcpu argument to mark_dirty_page_in_slot Peter Xu
                   ` (16 subsequent siblings)
  17 siblings, 1 reply; 123+ messages in thread
From: Peter Xu @ 2019-11-29 21:34 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Sean Christopherson, Paolo Bonzini, Dr . David Alan Gilbert,
	peterx, Vitaly Kuznetsov

From: Paolo Bonzini <pbonzini@redhat.com>

For ring-based dirty log tracking, it will be more efficient to account
writes during schedule-out or schedule-in to the currently running VCPU.
We would like to do it even if the write doesn't use the current VCPU's
address space, as is the case for cached writes (see commit 4e335d9e7ddb,
"Revert "KVM: Support vCPU-based gfn->hva cache"", 2017-05-02).

Therefore, add a mechanism to track the currently-loaded kvm_vcpu struct.
There is already something similar in KVM/ARM; one important difference
is that kvm_arch_vcpu_{load,put} have two callers in virt/kvm/kvm_main.c:
we have to update both the architecture-independent vcpu_{load,put} and
the preempt notifiers.

Another change made in the process is to allow using kvm_get_running_vcpu()
in preemptible code.  This is allowed because preempt notifiers ensure
that the value does not change even after the VCPU thread is migrated.
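
As an illustration only (not part of this patch), code that has no
explicit vcpu argument can then still attribute work to the vcpu whose
thread it runs in; the two accounting helpers below are hypothetical:

  static void account_write(struct kvm *kvm, gfn_t gfn)
  {
          struct kvm_vcpu *vcpu = kvm_get_running_vcpu();

          if (vcpu)
                  /* vcpu thread; preempt notifiers keep this stable */
                  account_to_vcpu(vcpu, gfn);     /* hypothetical */
          else
                  /* no vcpu context, e.g. a VM-wide ioctl path */
                  account_to_vm(kvm, gfn);        /* hypothetical */
  }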

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 arch/arm/include/asm/kvm_host.h   |  2 --
 arch/arm64/include/asm/kvm_host.h |  2 --
 include/linux/kvm_host.h          |  3 +++
 virt/kvm/arm/arm.c                | 29 -----------------------------
 virt/kvm/arm/perf.c               |  6 +++---
 virt/kvm/arm/vgic/vgic-mmio.c     | 15 +++------------
 virt/kvm/kvm_main.c               | 25 ++++++++++++++++++++++++-
 7 files changed, 33 insertions(+), 49 deletions(-)

diff --git a/arch/arm/include/asm/kvm_host.h b/arch/arm/include/asm/kvm_host.h
index 556cd818eccf..abc3f6f3ad76 100644
--- a/arch/arm/include/asm/kvm_host.h
+++ b/arch/arm/include/asm/kvm_host.h
@@ -284,8 +284,6 @@ int kvm_arm_copy_reg_indices(struct kvm_vcpu *vcpu, u64 __user *indices);
 int kvm_age_hva(struct kvm *kvm, unsigned long start, unsigned long end);
 int kvm_test_age_hva(struct kvm *kvm, unsigned long hva);
 
-struct kvm_vcpu *kvm_arm_get_running_vcpu(void);
-struct kvm_vcpu __percpu **kvm_get_running_vcpus(void);
 void kvm_arm_halt_guest(struct kvm *kvm);
 void kvm_arm_resume_guest(struct kvm *kvm);
 
diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index b36dae9ee5f9..d97855e41469 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -446,8 +446,6 @@ int kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte);
 int kvm_age_hva(struct kvm *kvm, unsigned long start, unsigned long end);
 int kvm_test_age_hva(struct kvm *kvm, unsigned long hva);
 
-struct kvm_vcpu *kvm_arm_get_running_vcpu(void);
-struct kvm_vcpu * __percpu *kvm_get_running_vcpus(void);
 void kvm_arm_halt_guest(struct kvm *kvm);
 void kvm_arm_resume_guest(struct kvm *kvm);
 
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 7ed1e2f8641e..498a39462ac1 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1342,6 +1342,9 @@ static inline void kvm_vcpu_set_dy_eligible(struct kvm_vcpu *vcpu, bool val)
 }
 #endif /* CONFIG_HAVE_KVM_CPU_RELAX_INTERCEPT */
 
+struct kvm_vcpu *kvm_get_running_vcpu(void);
+struct kvm_vcpu __percpu **kvm_get_running_vcpus(void);
+
 #ifdef CONFIG_HAVE_KVM_IRQ_BYPASS
 bool kvm_arch_has_irq_bypass(void);
 int kvm_arch_irq_bypass_add_producer(struct irq_bypass_consumer *,
diff --git a/virt/kvm/arm/arm.c b/virt/kvm/arm/arm.c
index 12e0280291ce..1df9c39024fa 100644
--- a/virt/kvm/arm/arm.c
+++ b/virt/kvm/arm/arm.c
@@ -51,9 +51,6 @@ __asm__(".arch_extension	virt");
 DEFINE_PER_CPU(kvm_host_data_t, kvm_host_data);
 static DEFINE_PER_CPU(unsigned long, kvm_arm_hyp_stack_page);
 
-/* Per-CPU variable containing the currently running vcpu. */
-static DEFINE_PER_CPU(struct kvm_vcpu *, kvm_arm_running_vcpu);
-
 /* The VMID used in the VTTBR */
 static atomic64_t kvm_vmid_gen = ATOMIC64_INIT(1);
 static u32 kvm_next_vmid;
@@ -62,31 +59,8 @@ static DEFINE_SPINLOCK(kvm_vmid_lock);
 static bool vgic_present;
 
 static DEFINE_PER_CPU(unsigned char, kvm_arm_hardware_enabled);
-
-static void kvm_arm_set_running_vcpu(struct kvm_vcpu *vcpu)
-{
-	__this_cpu_write(kvm_arm_running_vcpu, vcpu);
-}
-
 DEFINE_STATIC_KEY_FALSE(userspace_irqchip_in_use);
 
-/**
- * kvm_arm_get_running_vcpu - get the vcpu running on the current CPU.
- * Must be called from non-preemptible context
- */
-struct kvm_vcpu *kvm_arm_get_running_vcpu(void)
-{
-	return __this_cpu_read(kvm_arm_running_vcpu);
-}
-
-/**
- * kvm_arm_get_running_vcpus - get the per-CPU array of currently running vcpus.
- */
-struct kvm_vcpu * __percpu *kvm_get_running_vcpus(void)
-{
-	return &kvm_arm_running_vcpu;
-}
-
 int kvm_arch_vcpu_should_kick(struct kvm_vcpu *vcpu)
 {
 	return kvm_vcpu_exiting_guest_mode(vcpu) == IN_GUEST_MODE;
@@ -406,7 +380,6 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 	vcpu->cpu = cpu;
 	vcpu->arch.host_cpu_context = &cpu_data->host_ctxt;
 
-	kvm_arm_set_running_vcpu(vcpu);
 	kvm_vgic_load(vcpu);
 	kvm_timer_vcpu_load(vcpu);
 	kvm_vcpu_load_sysregs(vcpu);
@@ -432,8 +405,6 @@ void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
 	kvm_vcpu_pmu_restore_host(vcpu);
 
 	vcpu->cpu = -1;
-
-	kvm_arm_set_running_vcpu(NULL);
 }
 
 static void vcpu_power_off(struct kvm_vcpu *vcpu)
diff --git a/virt/kvm/arm/perf.c b/virt/kvm/arm/perf.c
index 918cdc3839ea..d45b8b9a4415 100644
--- a/virt/kvm/arm/perf.c
+++ b/virt/kvm/arm/perf.c
@@ -13,14 +13,14 @@
 
 static int kvm_is_in_guest(void)
 {
-        return kvm_arm_get_running_vcpu() != NULL;
+        return kvm_get_running_vcpu() != NULL;
 }
 
 static int kvm_is_user_mode(void)
 {
 	struct kvm_vcpu *vcpu;
 
-	vcpu = kvm_arm_get_running_vcpu();
+	vcpu = kvm_get_running_vcpu();
 
 	if (vcpu)
 		return !vcpu_mode_priv(vcpu);
@@ -32,7 +32,7 @@ static unsigned long kvm_get_guest_ip(void)
 {
 	struct kvm_vcpu *vcpu;
 
-	vcpu = kvm_arm_get_running_vcpu();
+	vcpu = kvm_get_running_vcpu();
 
 	if (vcpu)
 		return *vcpu_pc(vcpu);
diff --git a/virt/kvm/arm/vgic/vgic-mmio.c b/virt/kvm/arm/vgic/vgic-mmio.c
index 0d090482720d..d656ebd5f9d4 100644
--- a/virt/kvm/arm/vgic/vgic-mmio.c
+++ b/virt/kvm/arm/vgic/vgic-mmio.c
@@ -190,15 +190,6 @@ unsigned long vgic_mmio_read_pending(struct kvm_vcpu *vcpu,
  * value later will give us the same value as we update the per-CPU variable
  * in the preempt notifier handlers.
  */
-static struct kvm_vcpu *vgic_get_mmio_requester_vcpu(void)
-{
-	struct kvm_vcpu *vcpu;
-
-	preempt_disable();
-	vcpu = kvm_arm_get_running_vcpu();
-	preempt_enable();
-	return vcpu;
-}
 
 /* Must be called with irq->irq_lock held */
 static void vgic_hw_irq_spending(struct kvm_vcpu *vcpu, struct vgic_irq *irq,
@@ -221,7 +212,7 @@ void vgic_mmio_write_spending(struct kvm_vcpu *vcpu,
 			      gpa_t addr, unsigned int len,
 			      unsigned long val)
 {
-	bool is_uaccess = !vgic_get_mmio_requester_vcpu();
+	bool is_uaccess = !kvm_get_running_vcpu();
 	u32 intid = VGIC_ADDR_TO_INTID(addr, 1);
 	int i;
 	unsigned long flags;
@@ -274,7 +265,7 @@ void vgic_mmio_write_cpending(struct kvm_vcpu *vcpu,
 			      gpa_t addr, unsigned int len,
 			      unsigned long val)
 {
-	bool is_uaccess = !vgic_get_mmio_requester_vcpu();
+	bool is_uaccess = !kvm_get_running_vcpu();
 	u32 intid = VGIC_ADDR_TO_INTID(addr, 1);
 	int i;
 	unsigned long flags;
@@ -335,7 +326,7 @@ static void vgic_mmio_change_active(struct kvm_vcpu *vcpu, struct vgic_irq *irq,
 				    bool active)
 {
 	unsigned long flags;
-	struct kvm_vcpu *requester_vcpu = vgic_get_mmio_requester_vcpu();
+	struct kvm_vcpu *requester_vcpu = kvm_get_running_vcpu();
 
 	raw_spin_lock_irqsave(&irq->irq_lock, flags);
 
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 00268290dcbd..fac0760c870e 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -108,6 +108,7 @@ struct kmem_cache *kvm_vcpu_cache;
 EXPORT_SYMBOL_GPL(kvm_vcpu_cache);
 
 static __read_mostly struct preempt_ops kvm_preempt_ops;
+static DEFINE_PER_CPU(struct kvm_vcpu *, kvm_running_vcpu);
 
 struct dentry *kvm_debugfs_dir;
 EXPORT_SYMBOL_GPL(kvm_debugfs_dir);
@@ -197,6 +198,8 @@ bool kvm_is_reserved_pfn(kvm_pfn_t pfn)
 void vcpu_load(struct kvm_vcpu *vcpu)
 {
 	int cpu = get_cpu();
+
+	__this_cpu_write(kvm_running_vcpu, vcpu);
 	preempt_notifier_register(&vcpu->preempt_notifier);
 	kvm_arch_vcpu_load(vcpu, cpu);
 	put_cpu();
@@ -208,6 +211,7 @@ void vcpu_put(struct kvm_vcpu *vcpu)
 	preempt_disable();
 	kvm_arch_vcpu_put(vcpu);
 	preempt_notifier_unregister(&vcpu->preempt_notifier);
+	__this_cpu_write(kvm_running_vcpu, NULL);
 	preempt_enable();
 }
 EXPORT_SYMBOL_GPL(vcpu_put);
@@ -4304,8 +4308,8 @@ static void kvm_sched_in(struct preempt_notifier *pn, int cpu)
 	WRITE_ONCE(vcpu->preempted, false);
 	WRITE_ONCE(vcpu->ready, false);
 
+	__this_cpu_write(kvm_running_vcpu, vcpu);
 	kvm_arch_sched_in(vcpu, cpu);
-
 	kvm_arch_vcpu_load(vcpu, cpu);
 }
 
@@ -4319,6 +4323,25 @@ static void kvm_sched_out(struct preempt_notifier *pn,
 		WRITE_ONCE(vcpu->ready, true);
 	}
 	kvm_arch_vcpu_put(vcpu);
+	__this_cpu_write(kvm_running_vcpu, NULL);
+}
+
+/**
+ * kvm_get_running_vcpu - get the vcpu running on the current CPU.
+ * Thanks to preempt notifiers, this can also be called from
+ * preemptible context.
+ */
+struct kvm_vcpu *kvm_get_running_vcpu(void)
+{
+        return __this_cpu_read(kvm_running_vcpu);
+}
+
+/**
+ * kvm_get_running_vcpus - get the per-CPU array of currently running vcpus.
+ */
+struct kvm_vcpu * __percpu *kvm_get_running_vcpus(void)
+{
+        return &kvm_running_vcpu;
 }
 
 static void check_processor_compat(void *rtn)
-- 
2.21.0



* [PATCH RFC 02/15] KVM: Add kvm/vcpu argument to mark_dirty_page_in_slot
  2019-11-29 21:34 [PATCH RFC 00/15] KVM: Dirty ring interface Peter Xu
  2019-11-29 21:34 ` [PATCH RFC 01/15] KVM: Move running VCPU from ARM to common code Peter Xu
@ 2019-11-29 21:34 ` Peter Xu
  2019-12-02 19:32   ` Sean Christopherson
  2019-11-29 21:34 ` [PATCH RFC 03/15] KVM: Add build-time error check on kvm_run size Peter Xu
                   ` (15 subsequent siblings)
  17 siblings, 1 reply; 123+ messages in thread
From: Peter Xu @ 2019-11-29 21:34 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Sean Christopherson, Paolo Bonzini, Dr . David Alan Gilbert,
	peterx, Vitaly Kuznetsov

From: "Cao, Lei" <Lei.Cao@stratus.com>

Signed-off-by: Cao, Lei <Lei.Cao@stratus.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 virt/kvm/kvm_main.c | 26 +++++++++++++++++---------
 1 file changed, 17 insertions(+), 9 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index fac0760c870e..8f8940cc4b84 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -145,7 +145,10 @@ static void hardware_disable_all(void);
 
 static void kvm_io_bus_destroy(struct kvm_io_bus *bus);
 
-static void mark_page_dirty_in_slot(struct kvm_memory_slot *memslot, gfn_t gfn);
+static void mark_page_dirty_in_slot(struct kvm *kvm,
+				    struct kvm_vcpu *vcpu,
+				    struct kvm_memory_slot *memslot,
+				    gfn_t gfn);
 
 __visible bool kvm_rebooting;
 EXPORT_SYMBOL_GPL(kvm_rebooting);
@@ -2077,7 +2080,8 @@ int kvm_vcpu_read_guest_atomic(struct kvm_vcpu *vcpu, gpa_t gpa,
 }
 EXPORT_SYMBOL_GPL(kvm_vcpu_read_guest_atomic);
 
-static int __kvm_write_guest_page(struct kvm_memory_slot *memslot, gfn_t gfn,
+static int __kvm_write_guest_page(struct kvm *kvm, struct kvm_vcpu *vcpu,
+				  struct kvm_memory_slot *memslot, gfn_t gfn,
 			          const void *data, int offset, int len)
 {
 	int r;
@@ -2089,7 +2093,7 @@ static int __kvm_write_guest_page(struct kvm_memory_slot *memslot, gfn_t gfn,
 	r = __copy_to_user((void __user *)addr + offset, data, len);
 	if (r)
 		return -EFAULT;
-	mark_page_dirty_in_slot(memslot, gfn);
+	mark_page_dirty_in_slot(kvm, vcpu, memslot, gfn);
 	return 0;
 }
 
@@ -2098,7 +2102,8 @@ int kvm_write_guest_page(struct kvm *kvm, gfn_t gfn,
 {
 	struct kvm_memory_slot *slot = gfn_to_memslot(kvm, gfn);
 
-	return __kvm_write_guest_page(slot, gfn, data, offset, len);
+	return __kvm_write_guest_page(kvm, NULL, slot, gfn, data,
+				      offset, len);
 }
 EXPORT_SYMBOL_GPL(kvm_write_guest_page);
 
@@ -2107,7 +2112,8 @@ int kvm_vcpu_write_guest_page(struct kvm_vcpu *vcpu, gfn_t gfn,
 {
 	struct kvm_memory_slot *slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
 
-	return __kvm_write_guest_page(slot, gfn, data, offset, len);
+	return __kvm_write_guest_page(vcpu->kvm, vcpu, slot, gfn, data,
+				      offset, len);
 }
 EXPORT_SYMBOL_GPL(kvm_vcpu_write_guest_page);
 
@@ -2221,7 +2227,7 @@ int kvm_write_guest_offset_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
 	r = __copy_to_user((void __user *)ghc->hva + offset, data, len);
 	if (r)
 		return -EFAULT;
-	mark_page_dirty_in_slot(ghc->memslot, gpa >> PAGE_SHIFT);
+	mark_page_dirty_in_slot(kvm, NULL, ghc->memslot, gpa >> PAGE_SHIFT);
 
 	return 0;
 }
@@ -2286,7 +2292,9 @@ int kvm_clear_guest(struct kvm *kvm, gpa_t gpa, unsigned long len)
 }
 EXPORT_SYMBOL_GPL(kvm_clear_guest);
 
-static void mark_page_dirty_in_slot(struct kvm_memory_slot *memslot,
+static void mark_page_dirty_in_slot(struct kvm *kvm,
+				    struct kvm_vcpu *vcpu,
+				    struct kvm_memory_slot *memslot,
 				    gfn_t gfn)
 {
 	if (memslot && memslot->dirty_bitmap) {
@@ -2301,7 +2309,7 @@ void mark_page_dirty(struct kvm *kvm, gfn_t gfn)
 	struct kvm_memory_slot *memslot;
 
 	memslot = gfn_to_memslot(kvm, gfn);
-	mark_page_dirty_in_slot(memslot, gfn);
+	mark_page_dirty_in_slot(kvm, NULL, memslot, gfn);
 }
 EXPORT_SYMBOL_GPL(mark_page_dirty);
 
@@ -2310,7 +2318,7 @@ void kvm_vcpu_mark_page_dirty(struct kvm_vcpu *vcpu, gfn_t gfn)
 	struct kvm_memory_slot *memslot;
 
 	memslot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
-	mark_page_dirty_in_slot(memslot, gfn);
+	mark_page_dirty_in_slot(vcpu->kvm, vcpu, memslot, gfn);
 }
 EXPORT_SYMBOL_GPL(kvm_vcpu_mark_page_dirty);
 
-- 
2.21.0



* [PATCH RFC 03/15] KVM: Add build-time error check on kvm_run size
  2019-11-29 21:34 [PATCH RFC 00/15] KVM: Dirty ring interface Peter Xu
  2019-11-29 21:34 ` [PATCH RFC 01/15] KVM: Move running VCPU from ARM to common code Peter Xu
  2019-11-29 21:34 ` [PATCH RFC 02/15] KVM: Add kvm/vcpu argument to mark_dirty_page_in_slot Peter Xu
@ 2019-11-29 21:34 ` Peter Xu
  2019-12-02 19:30   ` Sean Christopherson
  2019-11-29 21:34 ` [PATCH RFC 04/15] KVM: Implement ring-based dirty memory tracking Peter Xu
                   ` (14 subsequent siblings)
  17 siblings, 1 reply; 123+ messages in thread
From: Peter Xu @ 2019-11-29 21:34 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Sean Christopherson, Paolo Bonzini, Dr . David Alan Gilbert,
	peterx, Vitaly Kuznetsov

It's already going to reach 2400 bytes (which is over half of the
page size on 4K-page archs), so maybe it's good to have this
build-time check in case it overflows the page when new fields are
added.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 virt/kvm/kvm_main.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 8f8940cc4b84..681452d288cd 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -352,6 +352,8 @@ int kvm_vcpu_init(struct kvm_vcpu *vcpu, struct kvm *kvm, unsigned id)
 	}
 	vcpu->run = page_address(page);
 
+	BUILD_BUG_ON(sizeof(struct kvm_run) > PAGE_SIZE);
+
 	kvm_vcpu_set_in_spin_loop(vcpu, false);
 	kvm_vcpu_set_dy_eligible(vcpu, false);
 	vcpu->preempted = false;
-- 
2.21.0



* [PATCH RFC 04/15] KVM: Implement ring-based dirty memory tracking
  2019-11-29 21:34 [PATCH RFC 00/15] KVM: Dirty ring interface Peter Xu
                   ` (2 preceding siblings ...)
  2019-11-29 21:34 ` [PATCH RFC 03/15] KVM: Add build-time error check on kvm_run size Peter Xu
@ 2019-11-29 21:34 ` Peter Xu
  2019-12-02 20:10   ` Sean Christopherson
                     ` (4 more replies)
  2019-11-29 21:34 ` [PATCH RFC 05/15] KVM: Make dirty ring exclusive to dirty bitmap log Peter Xu
                   ` (13 subsequent siblings)
  17 siblings, 5 replies; 123+ messages in thread
From: Peter Xu @ 2019-11-29 21:34 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Sean Christopherson, Paolo Bonzini, Dr . David Alan Gilbert,
	peterx, Vitaly Kuznetsov

This patch is heavily based on previous work from Lei Cao
<lei.cao@stratus.com> and Paolo Bonzini <pbonzini@redhat.com>. [1]

KVM currently uses large bitmaps to track dirty memory.  These bitmaps
are copied to userspace when userspace queries KVM for its dirty page
information.  The use of bitmaps is mostly sufficient for live
migration, as large parts of memory are dirtied from one log-dirty
pass to another.  However, in a checkpointing system, the number of
dirty pages is small and in fact it is often bounded---the VM is
paused when it has dirtied a pre-defined number of pages. Traversing a
large, sparsely populated bitmap to find set bits is time-consuming,
as is copying the bitmap to user-space.

A similar issue exists for live migration when guest memory is huge
while the page dirtying workload is trivial.  In that case, for each
dirty sync we need to pull the whole dirty bitmap to userspace and
analyse every bit even if it's mostly zeros.

The preferred data structure for the above scenarios is a dense list of
guest frame numbers (GFN).  This patch series stores the dirty list in
kernel memory that can be memory mapped into userspace to allow speedy
harvesting.

We defined two new data structures:

  struct kvm_dirty_ring;
  struct kvm_dirty_ring_indexes;

Firstly, kvm_dirty_ring is defined to represent a ring of dirty
pages.  When dirty tracking is enabled, we can push dirty GFNs onto the
ring.

Secondly, kvm_dirty_ring_indexes is defined to represent the
user/kernel interface of each ring.  Currently it contains two
indexes: (1) avail_index represents where we should push our next
GFN (written by the kernel), while (2) fetch_index represents where
userspace should fetch the next dirty GFN (written by userspace).

One complete ring is composed of one kvm_dirty_ring plus its
corresponding kvm_dirty_ring_indexes.

Currently, we have N+1 rings for each VM of N vcpus:

  - for each vcpu, we have 1 per-vcpu dirty ring,
  - for each vm, we have 1 per-vm dirty ring

Please refer to the documentation update in this patch for more
details.
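
For orientation, mapping the N+1 rings from userspace could look
roughly like the sketch below (only a sketch following the offsets
described in the documentation; ring_bytes and run_bytes are assumed
to come from the enabled ring size and KVM_GET_VCPU_MMAP_SIZE
respectively, and error handling is omitted):

  long psz = sysconf(_SC_PAGESIZE);

  /* per-vm ring: kvm_vm_run (indexes) at offset 0 of the vm fd */
  struct kvm_vm_run *vm_run =
          mmap(NULL, psz, PROT_READ | PROT_WRITE, MAP_SHARED, vm_fd, 0);
  struct kvm_dirty_gfn *vm_gfns =
          mmap(NULL, ring_bytes, PROT_READ | PROT_WRITE, MAP_SHARED,
               vm_fd, KVM_DIRTY_LOG_PAGE_OFFSET * psz);

  /* per-vcpu ring: indexes live inside kvm_run, gfns at the same offset */
  struct kvm_run *run =
          mmap(NULL, run_bytes, PROT_READ | PROT_WRITE, MAP_SHARED,
               vcpu_fd, 0);
  struct kvm_dirty_gfn *vcpu_gfns =
          mmap(NULL, ring_bytes, PROT_READ | PROT_WRITE, MAP_SHARED,
               vcpu_fd, KVM_DIRTY_LOG_PAGE_OFFSET * psz);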

Note that this patch implements the core logic of the dirty ring
buffer.  It's still disabled for all archs for now.  We'll also
address some of the remaining issues in follow-up patches before it's
first enabled on x86.

[1] https://patchwork.kernel.org/patch/10471409/

Signed-off-by: Lei Cao <lei.cao@stratus.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 Documentation/virt/kvm/api.txt | 109 +++++++++++++++
 arch/x86/kvm/Makefile          |   3 +-
 include/linux/kvm_dirty_ring.h |  67 +++++++++
 include/linux/kvm_host.h       |  33 +++++
 include/linux/kvm_types.h      |   1 +
 include/uapi/linux/kvm.h       |  36 +++++
 virt/kvm/dirty_ring.c          | 156 +++++++++++++++++++++
 virt/kvm/kvm_main.c            | 240 ++++++++++++++++++++++++++++++++-
 8 files changed, 642 insertions(+), 3 deletions(-)
 create mode 100644 include/linux/kvm_dirty_ring.h
 create mode 100644 virt/kvm/dirty_ring.c

diff --git a/Documentation/virt/kvm/api.txt b/Documentation/virt/kvm/api.txt
index 49183add44e7..fa622c9a2eb8 100644
--- a/Documentation/virt/kvm/api.txt
+++ b/Documentation/virt/kvm/api.txt
@@ -231,6 +231,7 @@ Based on their initialization different VMs may have different capabilities.
 It is thus encouraged to use the vm ioctl to query for capabilities (available
 with KVM_CAP_CHECK_EXTENSION_VM on the vm fd)
 
+
 4.5 KVM_GET_VCPU_MMAP_SIZE
 
 Capability: basic
@@ -243,6 +244,18 @@ The KVM_RUN ioctl (cf.) communicates with userspace via a shared
 memory region.  This ioctl returns the size of that region.  See the
 KVM_RUN documentation for details.
 
+Besides the size of the KVM_RUN communication region, other areas of
+the VCPU file descriptor can be mmap-ed, including:
+
+- if KVM_CAP_COALESCED_MMIO is available, a page at
+  KVM_COALESCED_MMIO_PAGE_OFFSET * PAGE_SIZE; for historical reasons,
+  this page is included in the result of KVM_GET_VCPU_MMAP_SIZE.
+  KVM_CAP_COALESCED_MMIO is not documented yet.
+
+- if KVM_CAP_DIRTY_LOG_RING is available, a number of pages at
+  KVM_DIRTY_LOG_PAGE_OFFSET * PAGE_SIZE.  For more information on
+  KVM_CAP_DIRTY_LOG_RING, see section 8.3.
+
 
 4.6 KVM_SET_MEMORY_REGION
 
@@ -5358,6 +5371,7 @@ CPU when the exception is taken. If this virtual SError is taken to EL1 using
 AArch64, this value will be reported in the ISS field of ESR_ELx.
 
 See KVM_CAP_VCPU_EVENTS for more details.
+
 8.20 KVM_CAP_HYPERV_SEND_IPI
 
 Architectures: x86
@@ -5365,6 +5379,7 @@ Architectures: x86
 This capability indicates that KVM supports paravirtualized Hyper-V IPI send
 hypercalls:
 HvCallSendSyntheticClusterIpi, HvCallSendSyntheticClusterIpiEx.
+
 8.21 KVM_CAP_HYPERV_DIRECT_TLBFLUSH
 
 Architecture: x86
@@ -5378,3 +5393,97 @@ handling by KVM (as some KVM hypercall may be mistakenly treated as TLB
 flush hypercalls by Hyper-V) so userspace should disable KVM identification
 in CPUID and only exposes Hyper-V identification. In this case, guest
 thinks it's running on Hyper-V and only use Hyper-V hypercalls.
+
+8.22 KVM_CAP_DIRTY_LOG_RING
+
+Architectures: x86
+Parameters: args[0] - size of the dirty log ring
+
+KVM is capable of tracking dirty memory using ring buffers that are
+mmaped into userspace; there is one dirty ring per vcpu and one global
+ring per vm.
+
+One dirty ring has the following two major structures:
+
+struct kvm_dirty_ring {
+	u16 dirty_index;
+	u16 reset_index;
+	u32 size;
+	u32 soft_limit;
+	spinlock_t lock;
+	struct kvm_dirty_gfn *dirty_gfns;
+};
+
+struct kvm_dirty_ring_indexes {
+	__u32 avail_index; /* set by kernel */
+	__u32 fetch_index; /* set by userspace */
+};
+
+While for each of the dirty entry it's defined as:
+
+struct kvm_dirty_gfn {
+        __u32 pad;
+        __u32 slot; /* as_id | slot_id */
+        __u64 offset;
+};
+
+The fields in kvm_dirty_ring will be only internal to KVM itself,
+while the fields in kvm_dirty_ring_indexes will be exposed to
+userspace to be either read or written.
+
+The two indices in the ring buffer are free running counters.
+
+In pseudocode, processing the ring buffer looks like this:
+
+	idx = load-acquire(&ring->fetch_index);
+	while (idx != ring->avail_index) {
+		struct kvm_dirty_gfn *entry;
+		entry = &ring->dirty_gfns[idx & (size - 1)];
+		...
+
+		idx++;
+	}
+	ring->fetch_index = idx;
+
+Userspace calls KVM_ENABLE_CAP ioctl right after KVM_CREATE_VM ioctl
+to enable this capability for the new guest and set the size of the
+rings.  It is only allowed before creating any vCPU, and the size of
+the ring must be a power of two.  The larger the ring buffer, the less
+likely the ring is full and the VM is forced to exit to userspace. The
+optimal size depends on the workload, but it is recommended that it be
+at least 64 KiB (4096 entries).
+
+After the capability is enabled, userspace can mmap the global ring
+buffer (kvm_dirty_gfn[], offset KVM_DIRTY_LOG_PAGE_OFFSET) and the
+indexes (kvm_dirty_ring_indexes, offset 0) from the VM file
+descriptor.  The per-vcpu dirty ring instead is mmapped when the vcpu
+is created, similar to the kvm_run struct (kvm_dirty_ring_indexes
+locates inside kvm_run, while kvm_dirty_gfn[] at offset
+KVM_DIRTY_LOG_PAGE_OFFSET).
+
+Just like for dirty page bitmaps, the buffer tracks writes to
+all user memory regions for which the KVM_MEM_LOG_DIRTY_PAGES flag was
+set in KVM_SET_USER_MEMORY_REGION.  Once a memory region is registered
+with the flag set, userspace can start harvesting dirty pages from the
+ring buffer.
+
+To harvest the dirty pages, userspace accesses the mmaped ring buffer
+to read the dirty GFNs up to avail_index, and sets the fetch_index
+accordingly.  This can be done when the guest is running or paused,
+and dirty pages need not be collected all at once.  After processing
+one or more entries in the ring buffer, userspace calls the VM ioctl
+KVM_RESET_DIRTY_RINGS to notify the kernel that it has updated
+fetch_index and to mark those pages clean.  Therefore, the ioctl
+must be called *before* reading the content of the dirty pages.
+
+However, there is a major difference comparing to the
+KVM_GET_DIRTY_LOG interface in that when reading the dirty ring from
+userspace it's still possible that the kernel has not yet flushed the
+hardware dirty buffers into the kernel buffer.  To achieve that, one
+needs to kick the vcpu out for a hardware buffer flush (vmexit).
+
+If one of the ring buffers is full, the guest will exit to userspace
+with the exit reason set to KVM_EXIT_DIRTY_LOG_FULL, and the
+KVM_RUN ioctl will return -EINTR. Once that happens, userspace
+should pause all the vcpus, then harvest all the dirty pages and
+rearm the dirty traps. It can unpause the guest after that.
diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile
index b19ef421084d..0acee817adfb 100644
--- a/arch/x86/kvm/Makefile
+++ b/arch/x86/kvm/Makefile
@@ -5,7 +5,8 @@ ccflags-y += -Iarch/x86/kvm
 KVM := ../../../virt/kvm
 
 kvm-y			+= $(KVM)/kvm_main.o $(KVM)/coalesced_mmio.o \
-				$(KVM)/eventfd.o $(KVM)/irqchip.o $(KVM)/vfio.o
+				$(KVM)/eventfd.o $(KVM)/irqchip.o $(KVM)/vfio.o \
+				$(KVM)/dirty_ring.o
 kvm-$(CONFIG_KVM_ASYNC_PF)	+= $(KVM)/async_pf.o
 
 kvm-y			+= x86.o emulate.o i8259.o irq.o lapic.o \
diff --git a/include/linux/kvm_dirty_ring.h b/include/linux/kvm_dirty_ring.h
new file mode 100644
index 000000000000..8335635b7ff7
--- /dev/null
+++ b/include/linux/kvm_dirty_ring.h
@@ -0,0 +1,67 @@
+#ifndef KVM_DIRTY_RING_H
+#define KVM_DIRTY_RING_H
+
+/*
+ * struct kvm_dirty_ring is defined in include/uapi/linux/kvm.h.
+ *
+ * dirty_ring:  shared with userspace via mmap. It is the compact list
+ *              that holds the dirty pages.
+ * dirty_index: free running counter that points to the next slot in
+ *              dirty_ring->dirty_gfns  where a new dirty page should go.
+ * reset_index: free running counter that points to the next dirty page
+ *              in dirty_ring->dirty_gfns for which dirty trap needs to
+ *              be reenabled
+ * size:        size of the compact list, dirty_ring->dirty_gfns
+ * soft_limit:  when the number of dirty pages in the list reaches this
+ *              limit, vcpu that owns this ring should exit to userspace
+ *              to allow userspace to harvest all the dirty pages
+ * lock:        protects dirty_ring, only in use if this is the global
+ *              ring
+ *
+ * The number of dirty pages in the ring is calculated by,
+ * dirty_index - reset_index
+ *
+ * kernel increments dirty_ring->indices.avail_index after dirty index
+ * is incremented. When userspace harvests the dirty pages, it increments
+ * dirty_ring->indices.fetch_index up to dirty_ring->indices.avail_index.
+ * When kernel reenables dirty traps for the dirty pages, it increments
+ * reset_index up to dirty_ring->indices.fetch_index.
+ *
+ */
+struct kvm_dirty_ring {
+	u32 dirty_index;
+	u32 reset_index;
+	u32 size;
+	u32 soft_limit;
+	spinlock_t lock;
+	struct kvm_dirty_gfn *dirty_gfns;
+};
+
+u32 kvm_dirty_ring_get_rsvd_entries(void);
+int kvm_dirty_ring_alloc(struct kvm *kvm, struct kvm_dirty_ring *ring);
+
+/*
+ * called with kvm->slots_lock held, returns the number of
+ * processed pages.
+ */
+int kvm_dirty_ring_reset(struct kvm *kvm,
+			 struct kvm_dirty_ring *ring,
+			 struct kvm_dirty_ring_indexes *indexes);
+
+/*
+ * returns 0: successfully pushed
+ *         1: successfully pushed, soft limit reached,
+ *            vcpu should exit to userspace
+ *         -EBUSY: unable to push, dirty ring full.
+ */
+int kvm_dirty_ring_push(struct kvm_dirty_ring *ring,
+			struct kvm_dirty_ring_indexes *indexes,
+			u32 slot, u64 offset, bool lock);
+
+/* for use in vm_operations_struct */
+struct page *kvm_dirty_ring_get_page(struct kvm_dirty_ring *ring, u32 i);
+
+void kvm_dirty_ring_free(struct kvm_dirty_ring *ring);
+bool kvm_dirty_ring_full(struct kvm_dirty_ring *ring);
+
+#endif
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 498a39462ac1..7b747bc9ff3e 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -34,6 +34,7 @@
 #include <linux/kvm_types.h>
 
 #include <asm/kvm_host.h>
+#include <linux/kvm_dirty_ring.h>
 
 #ifndef KVM_MAX_VCPU_ID
 #define KVM_MAX_VCPU_ID KVM_MAX_VCPUS
@@ -146,6 +147,7 @@ static inline bool is_error_page(struct page *page)
 #define KVM_REQ_MMU_RELOAD        (1 | KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
 #define KVM_REQ_PENDING_TIMER     2
 #define KVM_REQ_UNHALT            3
+#define KVM_REQ_DIRTY_RING_FULL   4
 #define KVM_REQUEST_ARCH_BASE     8
 
 #define KVM_ARCH_REQ_FLAGS(nr, flags) ({ \
@@ -321,6 +323,7 @@ struct kvm_vcpu {
 	bool ready;
 	struct kvm_vcpu_arch arch;
 	struct dentry *debugfs_dentry;
+	struct kvm_dirty_ring dirty_ring;
 };
 
 static inline int kvm_vcpu_exiting_guest_mode(struct kvm_vcpu *vcpu)
@@ -501,6 +504,10 @@ struct kvm {
 	struct srcu_struct srcu;
 	struct srcu_struct irq_srcu;
 	pid_t userspace_pid;
+	/* Data structure to be exported by mmap(kvm->fd, 0) */
+	struct kvm_vm_run *vm_run;
+	u32 dirty_ring_size;
+	struct kvm_dirty_ring vm_dirty_ring;
 };
 
 #define kvm_err(fmt, ...) \
@@ -832,6 +839,8 @@ void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm *kvm,
 					gfn_t gfn_offset,
 					unsigned long mask);
 
+void kvm_reset_dirty_gfn(struct kvm *kvm, u32 slot, u64 offset, u64 mask);
+
 int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm,
 				struct kvm_dirty_log *log);
 int kvm_vm_ioctl_clear_dirty_log(struct kvm *kvm,
@@ -1411,4 +1420,28 @@ int kvm_vm_create_worker_thread(struct kvm *kvm, kvm_vm_thread_fn_t thread_fn,
 				uintptr_t data, const char *name,
 				struct task_struct **thread_ptr);
 
+/*
+ * This defines how many reserved entries we want to keep before we
+ * kick the vcpu to the userspace to avoid dirty ring full.  This
+ * value can be tuned to higher if e.g. PML is enabled on the host.
+ */
+#define  KVM_DIRTY_RING_RSVD_ENTRIES  64
+
+/* Max number of entries allowed for each kvm dirty ring */
+#define  KVM_DIRTY_RING_MAX_ENTRIES  65536
+
+/*
+ * Arch needs to define these macro after implementing the dirty ring
+ * feature.  KVM_DIRTY_LOG_PAGE_OFFSET should be defined as the
+ * starting page offset of the dirty ring structures, while
+ * KVM_DIRTY_RING_VERSION should be defined as >=1.  By default, this
+ * feature is off on all archs.
+ */
+#ifndef KVM_DIRTY_LOG_PAGE_OFFSET
+#define KVM_DIRTY_LOG_PAGE_OFFSET 0
+#endif
+#ifndef KVM_DIRTY_RING_VERSION
+#define KVM_DIRTY_RING_VERSION 0
+#endif
+
 #endif
diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h
index 1c88e69db3d9..d9d03eea145a 100644
--- a/include/linux/kvm_types.h
+++ b/include/linux/kvm_types.h
@@ -11,6 +11,7 @@ struct kvm_irq_routing_table;
 struct kvm_memory_slot;
 struct kvm_one_reg;
 struct kvm_run;
+struct kvm_vm_run;
 struct kvm_userspace_memory_region;
 struct kvm_vcpu;
 struct kvm_vcpu_init;
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index e6f17c8e2dba..0b88d76d6215 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -236,6 +236,7 @@ struct kvm_hyperv_exit {
 #define KVM_EXIT_IOAPIC_EOI       26
 #define KVM_EXIT_HYPERV           27
 #define KVM_EXIT_ARM_NISV         28
+#define KVM_EXIT_DIRTY_RING_FULL  29
 
 /* For KVM_EXIT_INTERNAL_ERROR */
 /* Emulate instruction failed. */
@@ -247,6 +248,11 @@ struct kvm_hyperv_exit {
 /* Encounter unexpected vm-exit reason */
 #define KVM_INTERNAL_ERROR_UNEXPECTED_EXIT_REASON	4
 
+struct kvm_dirty_ring_indexes {
+	__u32 avail_index; /* set by kernel */
+	__u32 fetch_index; /* set by userspace */
+};
+
 /* for KVM_RUN, returned by mmap(vcpu_fd, offset=0) */
 struct kvm_run {
 	/* in */
@@ -421,6 +427,13 @@ struct kvm_run {
 		struct kvm_sync_regs regs;
 		char padding[SYNC_REGS_SIZE_BYTES];
 	} s;
+
+	struct kvm_dirty_ring_indexes vcpu_ring_indexes;
+};
+
+/* Returned by mmap(kvm->fd, offset=0) */
+struct kvm_vm_run {
+	struct kvm_dirty_ring_indexes vm_ring_indexes;
 };
 
 /* for KVM_REGISTER_COALESCED_MMIO / KVM_UNREGISTER_COALESCED_MMIO */
@@ -1009,6 +1022,7 @@ struct kvm_ppc_resize_hpt {
 #define KVM_CAP_PPC_GUEST_DEBUG_SSTEP 176
 #define KVM_CAP_ARM_NISV_TO_USER 177
 #define KVM_CAP_ARM_INJECT_EXT_DABT 178
+#define KVM_CAP_DIRTY_LOG_RING 179
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
@@ -1472,6 +1486,9 @@ struct kvm_enc_region {
 /* Available with KVM_CAP_ARM_SVE */
 #define KVM_ARM_VCPU_FINALIZE	  _IOW(KVMIO,  0xc2, int)
 
+/* Available with KVM_CAP_DIRTY_LOG_RING */
+#define KVM_RESET_DIRTY_RINGS     _IO(KVMIO, 0xc3)
+
 /* Secure Encrypted Virtualization command */
 enum sev_cmd_id {
 	/* Guest initialization commands */
@@ -1622,4 +1639,23 @@ struct kvm_hyperv_eventfd {
 #define KVM_HYPERV_CONN_ID_MASK		0x00ffffff
 #define KVM_HYPERV_EVENTFD_DEASSIGN	(1 << 0)
 
+/*
+ * The following are the requirements for supporting dirty log ring
+ * (by enabling KVM_DIRTY_LOG_PAGE_OFFSET).
+ *
+ * 1. Memory accesses by KVM should call kvm_vcpu_write_* instead
+ *    of kvm_write_* so that the global dirty ring is not filled up
+ *    too quickly.
+ * 2. kvm_arch_mmu_enable_log_dirty_pt_masked should be defined for
+ *    enabling dirty logging.
+ * 3. There should not be a separate step to synchronize hardware
+ *    dirty bitmap with KVM's.
+ */
+
+struct kvm_dirty_gfn {
+	__u32 pad;
+	__u32 slot;
+	__u64 offset;
+};
+
 #endif /* __LINUX_KVM_H */
diff --git a/virt/kvm/dirty_ring.c b/virt/kvm/dirty_ring.c
new file mode 100644
index 000000000000..9264891f3c32
--- /dev/null
+++ b/virt/kvm/dirty_ring.c
@@ -0,0 +1,156 @@
+#include <linux/kvm_host.h>
+#include <linux/kvm.h>
+#include <linux/vmalloc.h>
+#include <linux/kvm_dirty_ring.h>
+
+u32 kvm_dirty_ring_get_rsvd_entries(void)
+{
+	return KVM_DIRTY_RING_RSVD_ENTRIES + kvm_cpu_dirty_log_size();
+}
+
+int kvm_dirty_ring_alloc(struct kvm *kvm, struct kvm_dirty_ring *ring)
+{
+	u32 size = kvm->dirty_ring_size;
+
+	ring->dirty_gfns = vmalloc(size);
+	if (!ring->dirty_gfns)
+		return -ENOMEM;
+	memset(ring->dirty_gfns, 0, size);
+
+	ring->size = size / sizeof(struct kvm_dirty_gfn);
+	ring->soft_limit =
+	    (kvm->dirty_ring_size / sizeof(struct kvm_dirty_gfn)) -
+	    kvm_dirty_ring_get_rsvd_entries();
+	ring->dirty_index = 0;
+	ring->reset_index = 0;
+	spin_lock_init(&ring->lock);
+
+	return 0;
+}
+
+int kvm_dirty_ring_reset(struct kvm *kvm,
+			 struct kvm_dirty_ring *ring,
+			 struct kvm_dirty_ring_indexes *indexes)
+{
+	u32 cur_slot, next_slot;
+	u64 cur_offset, next_offset;
+	unsigned long mask;
+	u32 fetch;
+	int count = 0;
+	struct kvm_dirty_gfn *entry;
+
+	fetch = READ_ONCE(indexes->fetch_index);
+	if (fetch == ring->reset_index)
+		return 0;
+
+	entry = &ring->dirty_gfns[ring->reset_index & (ring->size - 1)];
+	/*
+	 * The ring buffer is shared with userspace, which might mmap
+	 * it and concurrently modify slot and offset.  Userspace must
+	 * not be trusted!  READ_ONCE prevents the compiler from changing
+	 * the values after they've been range-checked (the checks are
+	 * in kvm_reset_dirty_gfn).
+	 */
+	smp_read_barrier_depends();
+	cur_slot = READ_ONCE(entry->slot);
+	cur_offset = READ_ONCE(entry->offset);
+	mask = 1;
+	count++;
+	ring->reset_index++;
+	while (ring->reset_index != fetch) {
+		entry = &ring->dirty_gfns[ring->reset_index & (ring->size - 1)];
+		smp_read_barrier_depends();
+		next_slot = READ_ONCE(entry->slot);
+		next_offset = READ_ONCE(entry->offset);
+		ring->reset_index++;
+		count++;
+		/*
+		 * Try to coalesce the reset operations when the guest is
+		 * scanning pages in the same slot.
+		 */
+		if (next_slot == cur_slot) {
+			int delta = next_offset - cur_offset;
+
+			if (delta >= 0 && delta < BITS_PER_LONG) {
+				mask |= 1ull << delta;
+				continue;
+			}
+
+			/* Backwards visit, careful about overflows!  */
+			if (delta > -BITS_PER_LONG && delta < 0 &&
+			    (mask << -delta >> -delta) == mask) {
+				cur_offset = next_offset;
+				mask = (mask << -delta) | 1;
+				continue;
+			}
+		}
+		kvm_reset_dirty_gfn(kvm, cur_slot, cur_offset, mask);
+		cur_slot = next_slot;
+		cur_offset = next_offset;
+		mask = 1;
+	}
+	kvm_reset_dirty_gfn(kvm, cur_slot, cur_offset, mask);
+
+	return count;
+}
+
+static inline u32 kvm_dirty_ring_used(struct kvm_dirty_ring *ring)
+{
+	return ring->dirty_index - ring->reset_index;
+}
+
+bool kvm_dirty_ring_full(struct kvm_dirty_ring *ring)
+{
+	return kvm_dirty_ring_used(ring) >= ring->size;
+}
+
+/*
+ * Returns:
+ *   >0 if we should kick the vcpu out,
+ *   =0 if the gfn pushed successfully, or,
+ *   <0 if error (e.g. ring full)
+ */
+int kvm_dirty_ring_push(struct kvm_dirty_ring *ring,
+			struct kvm_dirty_ring_indexes *indexes,
+			u32 slot, u64 offset, bool lock)
+{
+	int ret;
+	struct kvm_dirty_gfn *entry;
+
+	if (lock)
+		spin_lock(&ring->lock);
+
+	if (kvm_dirty_ring_full(ring)) {
+		ret = -EBUSY;
+		goto out;
+	}
+
+	entry = &ring->dirty_gfns[ring->dirty_index & (ring->size - 1)];
+	entry->slot = slot;
+	entry->offset = offset;
+	smp_wmb();
+	ring->dirty_index++;
+	WRITE_ONCE(indexes->avail_index, ring->dirty_index);
+	ret = kvm_dirty_ring_used(ring) >= ring->soft_limit;
+	pr_info("%s: slot %u offset %llu used %u\n",
+		__func__, slot, offset, kvm_dirty_ring_used(ring));
+
+out:
+	if (lock)
+		spin_unlock(&ring->lock);
+
+	return ret;
+}
+
+struct page *kvm_dirty_ring_get_page(struct kvm_dirty_ring *ring, u32 i)
+{
+	return vmalloc_to_page((void *)ring->dirty_gfns + i * PAGE_SIZE);
+}
+
+void kvm_dirty_ring_free(struct kvm_dirty_ring *ring)
+{
+	if (ring->dirty_gfns) {
+		vfree(ring->dirty_gfns);
+		ring->dirty_gfns = NULL;
+	}
+}
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 681452d288cd..8642c977629b 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -64,6 +64,8 @@
 #define CREATE_TRACE_POINTS
 #include <trace/events/kvm.h>
 
+#include <linux/kvm_dirty_ring.h>
+
 /* Worst case buffer size needed for holding an integer. */
 #define ITOA_MAX_LEN 12
 
@@ -149,6 +151,10 @@ static void mark_page_dirty_in_slot(struct kvm *kvm,
 				    struct kvm_vcpu *vcpu,
 				    struct kvm_memory_slot *memslot,
 				    gfn_t gfn);
+static void mark_page_dirty_in_ring(struct kvm *kvm,
+				    struct kvm_vcpu *vcpu,
+				    struct kvm_memory_slot *slot,
+				    gfn_t gfn);
 
 __visible bool kvm_rebooting;
 EXPORT_SYMBOL_GPL(kvm_rebooting);
@@ -359,11 +365,22 @@ int kvm_vcpu_init(struct kvm_vcpu *vcpu, struct kvm *kvm, unsigned id)
 	vcpu->preempted = false;
 	vcpu->ready = false;
 
+	if (kvm->dirty_ring_size) {
+		r = kvm_dirty_ring_alloc(vcpu->kvm, &vcpu->dirty_ring);
+		if (r) {
+			kvm->dirty_ring_size = 0;
+			goto fail_free_run;
+		}
+	}
+
 	r = kvm_arch_vcpu_init(vcpu);
 	if (r < 0)
-		goto fail_free_run;
+		goto fail_free_ring;
 	return 0;
 
+fail_free_ring:
+	if (kvm->dirty_ring_size)
+		kvm_dirty_ring_free(&vcpu->dirty_ring);
 fail_free_run:
 	free_page((unsigned long)vcpu->run);
 fail:
@@ -381,6 +398,8 @@ void kvm_vcpu_uninit(struct kvm_vcpu *vcpu)
 	put_pid(rcu_dereference_protected(vcpu->pid, 1));
 	kvm_arch_vcpu_uninit(vcpu);
 	free_page((unsigned long)vcpu->run);
+	if (vcpu->kvm->dirty_ring_size)
+		kvm_dirty_ring_free(&vcpu->dirty_ring);
 }
 EXPORT_SYMBOL_GPL(kvm_vcpu_uninit);
 
@@ -690,6 +709,7 @@ static struct kvm *kvm_create_vm(unsigned long type)
 	struct kvm *kvm = kvm_arch_alloc_vm();
 	int r = -ENOMEM;
 	int i;
+	struct page *page;
 
 	if (!kvm)
 		return ERR_PTR(-ENOMEM);
@@ -705,6 +725,14 @@ static struct kvm *kvm_create_vm(unsigned long type)
 
 	BUILD_BUG_ON(KVM_MEM_SLOTS_NUM > SHRT_MAX);
 
+	page = alloc_page(GFP_KERNEL | __GFP_ZERO);
+	if (!page) {
+		r = -ENOMEM;
+		goto out_err_alloc_page;
+	}
+	kvm->vm_run = page_address(page);
+	BUILD_BUG_ON(sizeof(struct kvm_vm_run) > PAGE_SIZE);
+
 	if (init_srcu_struct(&kvm->srcu))
 		goto out_err_no_srcu;
 	if (init_srcu_struct(&kvm->irq_srcu))
@@ -775,6 +803,9 @@ static struct kvm *kvm_create_vm(unsigned long type)
 out_err_no_irq_srcu:
 	cleanup_srcu_struct(&kvm->srcu);
 out_err_no_srcu:
+	free_page((unsigned long)page);
+	kvm->vm_run = NULL;
+out_err_alloc_page:
 	kvm_arch_free_vm(kvm);
 	mmdrop(current->mm);
 	return ERR_PTR(r);
@@ -800,6 +831,15 @@ static void kvm_destroy_vm(struct kvm *kvm)
 	int i;
 	struct mm_struct *mm = kvm->mm;
 
+	if (kvm->dirty_ring_size) {
+		kvm_dirty_ring_free(&kvm->vm_dirty_ring);
+	}
+
+	if (kvm->vm_run) {
+		free_page((unsigned long)kvm->vm_run);
+		kvm->vm_run = NULL;
+	}
+
 	kvm_uevent_notify_change(KVM_EVENT_DESTROY_VM, kvm);
 	kvm_destroy_vm_debugfs(kvm);
 	kvm_arch_sync_events(kvm);
@@ -2301,7 +2341,7 @@ static void mark_page_dirty_in_slot(struct kvm *kvm,
 {
 	if (memslot && memslot->dirty_bitmap) {
 		unsigned long rel_gfn = gfn - memslot->base_gfn;
-
+		mark_page_dirty_in_ring(kvm, vcpu, memslot, gfn);
 		set_bit_le(rel_gfn, memslot->dirty_bitmap);
 	}
 }
@@ -2649,6 +2689,13 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me, bool yield_to_kernel_mode)
 }
 EXPORT_SYMBOL_GPL(kvm_vcpu_on_spin);
 
+static bool kvm_fault_in_dirty_ring(struct kvm *kvm, struct vm_fault *vmf)
+{
+	return (vmf->pgoff >= KVM_DIRTY_LOG_PAGE_OFFSET) &&
+	    (vmf->pgoff < KVM_DIRTY_LOG_PAGE_OFFSET +
+	     kvm->dirty_ring_size / PAGE_SIZE);
+}
+
 static vm_fault_t kvm_vcpu_fault(struct vm_fault *vmf)
 {
 	struct kvm_vcpu *vcpu = vmf->vma->vm_file->private_data;
@@ -2664,6 +2711,10 @@ static vm_fault_t kvm_vcpu_fault(struct vm_fault *vmf)
 	else if (vmf->pgoff == KVM_COALESCED_MMIO_PAGE_OFFSET)
 		page = virt_to_page(vcpu->kvm->coalesced_mmio_ring);
 #endif
+	else if (kvm_fault_in_dirty_ring(vcpu->kvm, vmf))
+		page = kvm_dirty_ring_get_page(
+		    &vcpu->dirty_ring,
+		    vmf->pgoff - KVM_DIRTY_LOG_PAGE_OFFSET);
 	else
 		return kvm_arch_vcpu_fault(vcpu, vmf);
 	get_page(page);
@@ -3259,12 +3310,162 @@ static long kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
 #endif
 	case KVM_CAP_NR_MEMSLOTS:
 		return KVM_USER_MEM_SLOTS;
+	case KVM_CAP_DIRTY_LOG_RING:
+		/* Version will be zero if arch didn't implement it */
+		return KVM_DIRTY_RING_VERSION;
 	default:
 		break;
 	}
 	return kvm_vm_ioctl_check_extension(kvm, arg);
 }
 
+static void mark_page_dirty_in_ring(struct kvm *kvm,
+				    struct kvm_vcpu *vcpu,
+				    struct kvm_memory_slot *slot,
+				    gfn_t gfn)
+{
+	u32 as_id = 0;
+	u64 offset;
+	int ret;
+	struct kvm_dirty_ring *ring;
+	struct kvm_dirty_ring_indexes *indexes;
+	bool is_vm_ring;
+
+	if (!kvm->dirty_ring_size)
+		return;
+
+	offset = gfn - slot->base_gfn;
+
+	if (vcpu) {
+		as_id = kvm_arch_vcpu_memslots_id(vcpu);
+	} else {
+		as_id = 0;
+		vcpu = kvm_get_running_vcpu();
+	}
+
+	if (vcpu) {
+		ring = &vcpu->dirty_ring;
+		indexes = &vcpu->run->vcpu_ring_indexes;
+		is_vm_ring = false;
+	} else {
+		/*
+		 * Put onto per vm ring because no vcpu context.  Kick
+		 * vcpu0 if ring is full.
+		 */
+		vcpu = kvm->vcpus[0];
+		ring = &kvm->vm_dirty_ring;
+		indexes = &kvm->vm_run->vm_ring_indexes;
+		is_vm_ring = true;
+	}
+
+	ret = kvm_dirty_ring_push(ring, indexes,
+				  (as_id << 16)|slot->id, offset,
+				  is_vm_ring);
+	if (ret < 0) {
+		if (is_vm_ring)
+			pr_warn_once("vcpu %d dirty log overflow\n",
+				     vcpu->vcpu_id);
+		else
+			pr_warn_once("per-vm dirty log overflow\n");
+		return;
+	}
+
+	if (ret)
+		kvm_make_request(KVM_REQ_DIRTY_RING_FULL, vcpu);
+}
+
+void kvm_reset_dirty_gfn(struct kvm *kvm, u32 slot, u64 offset, u64 mask)
+{
+	struct kvm_memory_slot *memslot;
+	int as_id, id;
+
+	as_id = slot >> 16;
+	id = (u16)slot;
+	if (as_id >= KVM_ADDRESS_SPACE_NUM || id >= KVM_USER_MEM_SLOTS)
+		return;
+
+	memslot = id_to_memslot(__kvm_memslots(kvm, as_id), id);
+	if (offset >= memslot->npages)
+		return;
+
+	spin_lock(&kvm->mmu_lock);
+	/* FIXME: we should use a single AND operation, but there is no
+	 * applicable atomic API.
+	 */
+	while (mask) {
+		clear_bit_le(offset + __ffs(mask), memslot->dirty_bitmap);
+		mask &= mask - 1;
+	}
+
+	kvm_arch_mmu_enable_log_dirty_pt_masked(kvm, memslot, offset, mask);
+	spin_unlock(&kvm->mmu_lock);
+}
+
+static int kvm_vm_ioctl_enable_dirty_log_ring(struct kvm *kvm, u32 size)
+{
+	int r;
+
+	/* the size should be power of 2 */
+	if (!size || (size & (size - 1)))
+		return -EINVAL;
+
+	/* Should be bigger to keep the reserved entries, or a page */
+	if (size < kvm_dirty_ring_get_rsvd_entries() *
+	    sizeof(struct kvm_dirty_gfn) || size < PAGE_SIZE)
+		return -EINVAL;
+
+	if (size > KVM_DIRTY_RING_MAX_ENTRIES *
+	    sizeof(struct kvm_dirty_gfn))
+		return -E2BIG;
+
+	/* We only allow it to set once */
+	if (kvm->dirty_ring_size)
+		return -EINVAL;
+
+	mutex_lock(&kvm->lock);
+
+	if (kvm->created_vcpus) {
+		/* We don't allow to change this value after vcpu created */
+		r = -EINVAL;
+	} else {
+		kvm->dirty_ring_size = size;
+		r = kvm_dirty_ring_alloc(kvm, &kvm->vm_dirty_ring);
+		if (r) {
+			/* Unset dirty ring */
+			kvm->dirty_ring_size = 0;
+		}
+	}
+
+	mutex_unlock(&kvm->lock);
+	return r;
+}
+
+static int kvm_vm_ioctl_reset_dirty_pages(struct kvm *kvm)
+{
+	int i;
+	struct kvm_vcpu *vcpu;
+	int cleared = 0;
+
+	if (!kvm->dirty_ring_size)
+		return -EINVAL;
+
+	mutex_lock(&kvm->slots_lock);
+
+	cleared += kvm_dirty_ring_reset(kvm, &kvm->vm_dirty_ring,
+					&kvm->vm_run->vm_ring_indexes);
+
+	kvm_for_each_vcpu(i, vcpu, kvm)
+		cleared += kvm_dirty_ring_reset(vcpu->kvm, &vcpu->dirty_ring,
+						&vcpu->run->vcpu_ring_indexes);
+
+	mutex_unlock(&kvm->slots_lock);
+
+	if (cleared)
+		kvm_flush_remote_tlbs(kvm);
+
+	return cleared;
+}
+
 int __attribute__((weak)) kvm_vm_ioctl_enable_cap(struct kvm *kvm,
 						  struct kvm_enable_cap *cap)
 {
@@ -3282,6 +3483,8 @@ static int kvm_vm_ioctl_enable_cap_generic(struct kvm *kvm,
 		kvm->manual_dirty_log_protect = cap->args[0];
 		return 0;
 #endif
+	case KVM_CAP_DIRTY_LOG_RING:
+		return kvm_vm_ioctl_enable_dirty_log_ring(kvm, cap->args[0]);
 	default:
 		return kvm_vm_ioctl_enable_cap(kvm, cap);
 	}
@@ -3469,6 +3672,9 @@ static long kvm_vm_ioctl(struct file *filp,
 	case KVM_CHECK_EXTENSION:
 		r = kvm_vm_ioctl_check_extension_generic(kvm, arg);
 		break;
+	case KVM_RESET_DIRTY_RINGS:
+		r = kvm_vm_ioctl_reset_dirty_pages(kvm);
+		break;
 	default:
 		r = kvm_arch_vm_ioctl(filp, ioctl, arg);
 	}
@@ -3517,9 +3723,39 @@ static long kvm_vm_compat_ioctl(struct file *filp,
 }
 #endif
 
+static vm_fault_t kvm_vm_fault(struct vm_fault *vmf)
+{
+	struct kvm *kvm = vmf->vma->vm_file->private_data;
+	struct page *page = NULL;
+
+	if (vmf->pgoff == 0)
+		page = virt_to_page(kvm->vm_run);
+	else if (kvm_fault_in_dirty_ring(kvm, vmf))
+		page = kvm_dirty_ring_get_page(
+		    &kvm->vm_dirty_ring,
+		    vmf->pgoff - KVM_DIRTY_LOG_PAGE_OFFSET);
+	else
+		return VM_FAULT_SIGBUS;
+
+	get_page(page);
+	vmf->page = page;
+	return 0;
+}
+
+static const struct vm_operations_struct kvm_vm_vm_ops = {
+	.fault = kvm_vm_fault,
+};
+
+static int kvm_vm_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	vma->vm_ops = &kvm_vm_vm_ops;
+	return 0;
+}
+
 static struct file_operations kvm_vm_fops = {
 	.release        = kvm_vm_release,
 	.unlocked_ioctl = kvm_vm_ioctl,
+	.mmap           = kvm_vm_mmap,
 	.llseek		= noop_llseek,
 	KVM_COMPAT(kvm_vm_compat_ioctl),
 };
-- 
2.21.0



* [PATCH RFC 05/15] KVM: Make dirty ring exclusive to dirty bitmap log
  2019-11-29 21:34 [PATCH RFC 00/15] KVM: Dirty ring interface Peter Xu
                   ` (3 preceding siblings ...)
  2019-11-29 21:34 ` [PATCH RFC 04/15] KVM: Implement ring-based dirty memory tracking Peter Xu
@ 2019-11-29 21:34 ` Peter Xu
  2019-11-29 21:34 ` [PATCH RFC 06/15] KVM: Introduce dirty ring wait queue Peter Xu
                   ` (12 subsequent siblings)
  17 siblings, 0 replies; 123+ messages in thread
From: Peter Xu @ 2019-11-29 21:34 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Sean Christopherson, Paolo Bonzini, Dr . David Alan Gilbert,
	peterx, Vitaly Kuznetsov

There's no good reason to use both the dirty bitmap logging and the
new dirty ring buffer to track dirty bits.  We could probably even
support both of them at the same time, but that would complicate
things while helping little.  Let's simply make it the rule, before we
enable the dirty ring on any arch, that these two interfaces cannot be
used together.

The big switch is the KVM_CAP_DIRTY_LOG_RING capability enablement.
That's where we switch from the default dirty logging mode to the
dirty ring mode.  Once kvm->dirty_ring_size is set up correctly, the
current virtual machine switches once and for all to dirty ring buffer
mode.
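
A minimal sketch of the expected userspace sequence (assuming only the
capability and ioctls added earlier in this series; the ring size here
is just an example):

  struct kvm_enable_cap cap = {
          .cap = KVM_CAP_DIRTY_LOG_RING,
          .args[0] = 65536,    /* 64 KiB of kvm_dirty_gfn == 4096 entries */
  };

  /* Right after KVM_CREATE_VM, before any vcpu is created */
  if (ioctl(vm_fd, KVM_CHECK_EXTENSION, KVM_CAP_DIRTY_LOG_RING) > 0)
          ioctl(vm_fd, KVM_ENABLE_CAP, &cap);

  /* From now on, KVM_GET_DIRTY_LOG / KVM_CLEAR_DIRTY_LOG return -EINVAL */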

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 Documentation/virt/kvm/api.txt |  7 +++++++
 virt/kvm/kvm_main.c            | 12 ++++++++++++
 2 files changed, 19 insertions(+)

diff --git a/Documentation/virt/kvm/api.txt b/Documentation/virt/kvm/api.txt
index fa622c9a2eb8..9f72ca1fd3e4 100644
--- a/Documentation/virt/kvm/api.txt
+++ b/Documentation/virt/kvm/api.txt
@@ -5487,3 +5487,10 @@ with the exit reason set to KVM_EXIT_DIRTY_LOG_FULL, and the
 KVM_RUN ioctl will return -EINTR. Once that happens, userspace
 should pause all the vcpus, then harvest all the dirty pages and
 rearm the dirty traps. It can unpause the guest after that.
+
+NOTE: the KVM_CAP_DIRTY_LOG_RING capability and the new ioctl
+KVM_RESET_DIRTY_RINGS are mutually exclusive with the existing
+KVM_GET_DIRTY_LOG interface.  After enabling KVM_CAP_DIRTY_LOG_RING
+with an acceptable dirty ring size, the virtual machine switches to
+dirty ring tracking mode, and the KVM_GET_DIRTY_LOG and
+KVM_CLEAR_DIRTY_LOG ioctls will stop working.
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 8642c977629b..782127d11e9d 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1236,6 +1236,10 @@ int kvm_get_dirty_log(struct kvm *kvm,
 	unsigned long n;
 	unsigned long any = 0;
 
+	/* Dirty ring tracking is exclusive to dirty log tracking */
+	if (kvm->dirty_ring_size)
+		return -EINVAL;
+
 	as_id = log->slot >> 16;
 	id = (u16)log->slot;
 	if (as_id >= KVM_ADDRESS_SPACE_NUM || id >= KVM_USER_MEM_SLOTS)
@@ -1293,6 +1297,10 @@ int kvm_get_dirty_log_protect(struct kvm *kvm,
 	unsigned long *dirty_bitmap;
 	unsigned long *dirty_bitmap_buffer;
 
+	/* Dirty ring tracking is exclusive to dirty log tracking */
+	if (kvm->dirty_ring_size)
+		return -EINVAL;
+
 	as_id = log->slot >> 16;
 	id = (u16)log->slot;
 	if (as_id >= KVM_ADDRESS_SPACE_NUM || id >= KVM_USER_MEM_SLOTS)
@@ -1364,6 +1372,10 @@ int kvm_clear_dirty_log_protect(struct kvm *kvm,
 	unsigned long *dirty_bitmap;
 	unsigned long *dirty_bitmap_buffer;
 
+	/* Dirty ring tracking is exclusive to dirty log tracking */
+	if (kvm->dirty_ring_size)
+		return -EINVAL;
+
 	as_id = log->slot >> 16;
 	id = (u16)log->slot;
 	if (as_id >= KVM_ADDRESS_SPACE_NUM || id >= KVM_USER_MEM_SLOTS)
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 123+ messages in thread

* [PATCH RFC 06/15] KVM: Introduce dirty ring wait queue
  2019-11-29 21:34 [PATCH RFC 00/15] KVM: Dirty ring interface Peter Xu
                   ` (4 preceding siblings ...)
  2019-11-29 21:34 ` [PATCH RFC 05/15] KVM: Make dirty ring exclusive to dirty bitmap log Peter Xu
@ 2019-11-29 21:34 ` Peter Xu
  2019-11-29 21:34 ` [PATCH RFC 07/15] KVM: X86: Implement ring-based dirty memory tracking Peter Xu
                   ` (11 subsequent siblings)
  17 siblings, 0 replies; 123+ messages in thread
From: Peter Xu @ 2019-11-29 21:34 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Sean Christopherson, Paolo Bonzini, Dr . David Alan Gilbert,
	peterx, Vitaly Kuznetsov

When the dirty ring is completely full, we currently print a warning
and drop the dirty bit.

A better approach is to put the thread onto a waitqueue and retry
after the next KVM_RESET_DIRTY_RINGS.

We should still allow the process to be killed, so handle that
explicitly.
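
Seen from userspace, a vcpu thread blocked in KVM_RUN on a full ring
only makes progress once some other thread resets the rings.  A rough
sketch of that collector side (vm_fd and the harvesting step are
assumptions for illustration):

        #include <linux/kvm.h>
        #include <sys/ioctl.h>

        /* Sketch: harvest, then reset; the reset wakes vcpus waiting on the ring. */
        static void collect_and_release(int vm_fd)
        {
                /* ... walk the mmap()ed dirty rings and record the GFNs ... */

                /* Re-protect collected pages and wake any blocked vcpu threads. */
                ioctl(vm_fd, KVM_RESET_DIRTY_RINGS);
        }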

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/linux/kvm_host.h |  1 +
 virt/kvm/kvm_main.c      | 22 ++++++++++++++++------
 2 files changed, 17 insertions(+), 6 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 7b747bc9ff3e..a1c9ce5f23a1 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -508,6 +508,7 @@ struct kvm {
 	struct kvm_vm_run *vm_run;
 	u32 dirty_ring_size;
 	struct kvm_dirty_ring vm_dirty_ring;
+	wait_queue_head_t dirty_ring_waitqueue;
 };
 
 #define kvm_err(fmt, ...) \
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 782127d11e9d..bd6172dbff1d 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -722,6 +722,7 @@ static struct kvm *kvm_create_vm(unsigned long type)
 	mutex_init(&kvm->irq_lock);
 	mutex_init(&kvm->slots_lock);
 	INIT_LIST_HEAD(&kvm->devices);
+	init_waitqueue_head(&kvm->dirty_ring_waitqueue);
 
 	BUILD_BUG_ON(KVM_MEM_SLOTS_NUM > SHRT_MAX);
 
@@ -3370,16 +3371,23 @@ static void mark_page_dirty_in_ring(struct kvm *kvm,
 		is_vm_ring = true;
 	}
 
+retry:
 	ret = kvm_dirty_ring_push(ring, indexes,
 				  (as_id << 16)|slot->id, offset,
 				  is_vm_ring);
 	if (ret < 0) {
-		if (is_vm_ring)
-			pr_warn_once("vcpu %d dirty log overflow\n",
-				     vcpu->vcpu_id);
-		else
-			pr_warn_once("per-vm dirty log overflow\n");
-		return;
+		/*
+		 * Ring is full, put us onto per-vm waitqueue and wait
+		 * for another KVM_RESET_DIRTY_RINGS to retry
+		 */
+		wait_event_killable(kvm->dirty_ring_waitqueue,
+				    !kvm_dirty_ring_full(ring));
+
+		/* If we're killed, no need to worry about losing dirty bits! */
+		if (fatal_signal_pending(current))
+			return;
+
+		goto retry;
 	}
 
 	if (ret)
@@ -3475,6 +3483,8 @@ static int kvm_vm_ioctl_reset_dirty_pages(struct kvm *kvm)
 	if (cleared)
 		kvm_flush_remote_tlbs(kvm);
 
+	wake_up_all(&kvm->dirty_ring_waitqueue);
+
 	return cleared;
 }
 
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 123+ messages in thread

* [PATCH RFC 07/15] KVM: X86: Implement ring-based dirty memory tracking
  2019-11-29 21:34 [PATCH RFC 00/15] KVM: Dirty ring interface Peter Xu
                   ` (5 preceding siblings ...)
  2019-11-29 21:34 ` [PATCH RFC 06/15] KVM: Introduce dirty ring wait queue Peter Xu
@ 2019-11-29 21:34 ` Peter Xu
  2019-11-29 21:34 ` [PATCH RFC 08/15] KVM: selftests: Always clear dirty bitmap after iteration Peter Xu
                   ` (10 subsequent siblings)
  17 siblings, 0 replies; 123+ messages in thread
From: Peter Xu @ 2019-11-29 21:34 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Sean Christopherson, Paolo Bonzini, Dr . David Alan Gilbert,
	peterx, Vitaly Kuznetsov

From: "Cao, Lei" <Lei.Cao@stratus.com>

Add the new KVM exit reason KVM_EXIT_DIRTY_RING_FULL and connect
KVM_REQ_DIRTY_RING_FULL to it.
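
For context (not part of this patch), a userspace run loop would
typically treat the new exit as "harvest, reset, re-enter".  A
single-vcpu sketch, where vm_fd, vcpu_fd and run are assumptions of
the example:

        #include <linux/kvm.h>
        #include <sys/ioctl.h>

        static void run_vcpu(int vm_fd, int vcpu_fd, struct kvm_run *run)
        {
                for (;;) {
                        ioctl(vcpu_fd, KVM_RUN, 0);

                        if (run->exit_reason == KVM_EXIT_DIRTY_RING_FULL) {
                                /*
                                 * The ring hit its soft limit: drain the
                                 * mmap()ed rings here, then rearm them.
                                 */
                                ioctl(vm_fd, KVM_RESET_DIRTY_RINGS);
                                continue;
                        }
                        break;  /* other exit reasons handled elsewhere */
                }
        }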

Signed-off-by: Lei Cao <lei.cao@stratus.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
[peterx: rebase, return 0 instead of -EINTR for user exits,
 emul_insn before exit to userspace]
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 arch/x86/include/asm/kvm_host.h |  5 +++++
 arch/x86/include/uapi/asm/kvm.h |  1 +
 arch/x86/kvm/mmu/mmu.c          |  6 ++++++
 arch/x86/kvm/vmx/vmx.c          |  7 +++++++
 arch/x86/kvm/x86.c              | 12 ++++++++++++
 5 files changed, 31 insertions(+)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index b79cd6aa4075..67521627f9e4 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -49,6 +49,8 @@
 
 #define KVM_IRQCHIP_NUM_PINS  KVM_IOAPIC_NUM_PINS
 
+#define KVM_DIRTY_RING_VERSION 1
+
 /* x86-specific vcpu->requests bit members */
 #define KVM_REQ_MIGRATE_TIMER		KVM_ARCH_REQ(0)
 #define KVM_REQ_REPORT_TPR_ACCESS	KVM_ARCH_REQ(1)
@@ -1176,6 +1178,7 @@ struct kvm_x86_ops {
 					   struct kvm_memory_slot *slot,
 					   gfn_t offset, unsigned long mask);
 	int (*write_log_dirty)(struct kvm_vcpu *vcpu);
+	int (*cpu_dirty_log_size)(void);
 
 	/* pmu operations of sub-arch */
 	const struct kvm_pmu_ops *pmu_ops;
@@ -1661,4 +1664,6 @@ static inline int kvm_cpu_get_apicid(int mps_cpu)
 #define GET_SMSTATE(type, buf, offset)		\
 	(*(type *)((buf) + (offset) - 0x7e00))
 
+int kvm_cpu_dirty_log_size(void);
+
 #endif /* _ASM_X86_KVM_HOST_H */
diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
index 503d3f42da16..b59bf356c478 100644
--- a/arch/x86/include/uapi/asm/kvm.h
+++ b/arch/x86/include/uapi/asm/kvm.h
@@ -12,6 +12,7 @@
 
 #define KVM_PIO_PAGE_OFFSET 1
 #define KVM_COALESCED_MMIO_PAGE_OFFSET 2
+#define KVM_DIRTY_LOG_PAGE_OFFSET 64
 
 #define DE_VECTOR 0
 #define DB_VECTOR 1
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 6f92b40d798c..f7efb69b089e 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1818,7 +1818,13 @@ int kvm_arch_write_log_dirty(struct kvm_vcpu *vcpu)
 {
 	if (kvm_x86_ops->write_log_dirty)
 		return kvm_x86_ops->write_log_dirty(vcpu);
+	return 0;
+}
 
+int kvm_cpu_dirty_log_size(void)
+{
+	if (kvm_x86_ops->cpu_dirty_log_size)
+		return kvm_x86_ops->cpu_dirty_log_size();
 	return 0;
 }
 
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index d175429c91b0..871489d92d3c 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7710,6 +7710,7 @@ static __init int hardware_setup(void)
 		kvm_x86_ops->slot_disable_log_dirty = NULL;
 		kvm_x86_ops->flush_log_dirty = NULL;
 		kvm_x86_ops->enable_log_dirty_pt_masked = NULL;
+		kvm_x86_ops->cpu_dirty_log_size = NULL;
 	}
 
 	if (!cpu_has_vmx_preemption_timer())
@@ -7774,6 +7775,11 @@ static __exit void hardware_unsetup(void)
 	free_kvm_area();
 }
 
+static int vmx_cpu_dirty_log_size(void)
+{
+	return enable_pml ? PML_ENTITY_NUM : 0;
+}
+
 static struct kvm_x86_ops vmx_x86_ops __ro_after_init = {
 	.cpu_has_kvm_support = cpu_has_kvm_support,
 	.disabled_by_bios = vmx_disabled_by_bios,
@@ -7896,6 +7902,7 @@ static struct kvm_x86_ops vmx_x86_ops __ro_after_init = {
 	.flush_log_dirty = vmx_flush_log_dirty,
 	.enable_log_dirty_pt_masked = vmx_enable_log_dirty_pt_masked,
 	.write_log_dirty = vmx_write_pml_buffer,
+	.cpu_dirty_log_size = vmx_cpu_dirty_log_size,
 
 	.pre_block = vmx_pre_block,
 	.post_block = vmx_post_block,
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 3ed167e039e5..03ff34783fa1 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -8094,6 +8094,18 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 		 */
 		if (kvm_check_request(KVM_REQ_HV_STIMER, vcpu))
 			kvm_hv_process_stimers(vcpu);
+
+		if (kvm_check_request(KVM_REQ_DIRTY_RING_FULL, vcpu)) {
+			vcpu->run->exit_reason = KVM_EXIT_DIRTY_RING_FULL;
+			/*
+			 * If this is requested, it means that we've
+			 * marked the dirty bit in the dirty ring BUT
+			 * we've not written the data.  Do it now.
+			 */
+			r = kvm_emulate_instruction(vcpu, 0);
+			r = r >= 0 ? 0 : r;
+			goto out;
+		}
 	}
 
 	if (kvm_check_request(KVM_REQ_EVENT, vcpu) || req_int_win) {
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 123+ messages in thread

* [PATCH RFC 08/15] KVM: selftests: Always clear dirty bitmap after iteration
  2019-11-29 21:34 [PATCH RFC 00/15] KVM: Dirty ring interface Peter Xu
                   ` (6 preceding siblings ...)
  2019-11-29 21:34 ` [PATCH RFC 07/15] KVM: X86: Implement ring-based dirty memory tracking Peter Xu
@ 2019-11-29 21:34 ` Peter Xu
  2019-11-29 21:34 ` [PATCH RFC 09/15] KVM: selftests: Sync uapi/linux/kvm.h to tools/ Peter Xu
                   ` (9 subsequent siblings)
  17 siblings, 0 replies; 123+ messages in thread
From: Peter Xu @ 2019-11-29 21:34 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Sean Christopherson, Paolo Bonzini, Dr . David Alan Gilbert,
	peterx, Vitaly Kuznetsov

We don't clear the dirty bitmap beforehand because KVM_GET_DIRTY_LOG
clears it for us before copying the dirty log into it.  However, it's
still better to clear it explicitly instead of assuming the kernel
will always do it for us.

More importantly, in the upcoming dirty ring tests we'll start to
fetch dirty pages from a ring buffer, so no one is going to clear the
dirty bitmap for us.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 tools/testing/selftests/kvm/dirty_log_test.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/testing/selftests/kvm/dirty_log_test.c b/tools/testing/selftests/kvm/dirty_log_test.c
index 5614222a6628..3c0ffd34b3b0 100644
--- a/tools/testing/selftests/kvm/dirty_log_test.c
+++ b/tools/testing/selftests/kvm/dirty_log_test.c
@@ -197,7 +197,7 @@ static void vm_dirty_log_verify(unsigned long *bmap)
 				    page);
 		}
 
-		if (test_bit_le(page, bmap)) {
+		if (test_and_clear_bit_le(page, bmap)) {
 			host_dirty_count++;
 			/*
 			 * If the bit is set, the value written onto
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 123+ messages in thread

* [PATCH RFC 09/15] KVM: selftests: Sync uapi/linux/kvm.h to tools/
  2019-11-29 21:34 [PATCH RFC 00/15] KVM: Dirty ring interface Peter Xu
                   ` (7 preceding siblings ...)
  2019-11-29 21:34 ` [PATCH RFC 08/15] KVM: selftests: Always clear dirty bitmap after iteration Peter Xu
@ 2019-11-29 21:34 ` Peter Xu
  2019-11-29 21:35 ` [PATCH RFC 10/15] KVM: selftests: Use a single binary for dirty/clear log test Peter Xu
                   ` (8 subsequent siblings)
  17 siblings, 0 replies; 123+ messages in thread
From: Peter Xu @ 2019-11-29 21:34 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Sean Christopherson, Paolo Bonzini, Dr . David Alan Gilbert,
	peterx, Vitaly Kuznetsov

This will be needed to extend the kvm selftest program.
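
The synced header carries the two structures that describe a dirty
ring.  A minimal consumer walk over one ring, assuming the entry array
has already been mmap()ed and with memory barriers elided for brevity,
looks roughly like:

        #include <linux/kvm.h>

        /* Sketch: drain one dirty ring that holds ring_count entries. */
        static __u32 walk_ring(struct kvm_dirty_ring_indexes *idx,
                               struct kvm_dirty_gfn *gfns, __u32 ring_count,
                               void (*consume)(__u32 slot, __u64 offset))
        {
                __u32 fetch = idx->fetch_index; /* consumer cursor (userspace) */
                __u32 avail = idx->avail_index; /* producer cursor (kernel) */
                __u32 count = 0;

                while (fetch != avail) {
                        struct kvm_dirty_gfn *cur = &gfns[fetch % ring_count];

                        /* cur->slot encodes (as_id << 16) | slot id */
                        consume(cur->slot, cur->offset);
                        fetch++;
                        count++;
                }
                idx->fetch_index = fetch;       /* publish how far we consumed */
                return count;
        }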

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 tools/include/uapi/linux/kvm.h | 47 ++++++++++++++++++++++++++++++++++
 1 file changed, 47 insertions(+)

diff --git a/tools/include/uapi/linux/kvm.h b/tools/include/uapi/linux/kvm.h
index 52641d8ca9e8..0b88d76d6215 100644
--- a/tools/include/uapi/linux/kvm.h
+++ b/tools/include/uapi/linux/kvm.h
@@ -235,6 +235,8 @@ struct kvm_hyperv_exit {
 #define KVM_EXIT_S390_STSI        25
 #define KVM_EXIT_IOAPIC_EOI       26
 #define KVM_EXIT_HYPERV           27
+#define KVM_EXIT_ARM_NISV         28
+#define KVM_EXIT_DIRTY_RING_FULL  29
 
 /* For KVM_EXIT_INTERNAL_ERROR */
 /* Emulate instruction failed. */
@@ -246,6 +248,11 @@ struct kvm_hyperv_exit {
 /* Encounter unexpected vm-exit reason */
 #define KVM_INTERNAL_ERROR_UNEXPECTED_EXIT_REASON	4
 
+struct kvm_dirty_ring_indexes {
+	__u32 avail_index; /* set by kernel */
+	__u32 fetch_index; /* set by userspace */
+};
+
 /* for KVM_RUN, returned by mmap(vcpu_fd, offset=0) */
 struct kvm_run {
 	/* in */
@@ -394,6 +401,11 @@ struct kvm_run {
 		} eoi;
 		/* KVM_EXIT_HYPERV */
 		struct kvm_hyperv_exit hyperv;
+		/* KVM_EXIT_ARM_NISV */
+		struct {
+			__u64 esr_iss;
+			__u64 fault_ipa;
+		} arm_nisv;
 		/* Fix the size of the union. */
 		char padding[256];
 	};
@@ -415,6 +427,13 @@ struct kvm_run {
 		struct kvm_sync_regs regs;
 		char padding[SYNC_REGS_SIZE_BYTES];
 	} s;
+
+	struct kvm_dirty_ring_indexes vcpu_ring_indexes;
+};
+
+/* Returned by mmap(kvm->fd, offset=0) */
+struct kvm_vm_run {
+	struct kvm_dirty_ring_indexes vm_ring_indexes;
 };
 
 /* for KVM_REGISTER_COALESCED_MMIO / KVM_UNREGISTER_COALESCED_MMIO */
@@ -1000,6 +1019,10 @@ struct kvm_ppc_resize_hpt {
 #define KVM_CAP_PMU_EVENT_FILTER 173
 #define KVM_CAP_ARM_IRQ_LINE_LAYOUT_2 174
 #define KVM_CAP_HYPERV_DIRECT_TLBFLUSH 175
+#define KVM_CAP_PPC_GUEST_DEBUG_SSTEP 176
+#define KVM_CAP_ARM_NISV_TO_USER 177
+#define KVM_CAP_ARM_INJECT_EXT_DABT 178
+#define KVM_CAP_DIRTY_LOG_RING 179
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
@@ -1227,6 +1250,8 @@ enum kvm_device_type {
 #define KVM_DEV_TYPE_ARM_VGIC_ITS	KVM_DEV_TYPE_ARM_VGIC_ITS
 	KVM_DEV_TYPE_XIVE,
 #define KVM_DEV_TYPE_XIVE		KVM_DEV_TYPE_XIVE
+	KVM_DEV_TYPE_ARM_PV_TIME,
+#define KVM_DEV_TYPE_ARM_PV_TIME	KVM_DEV_TYPE_ARM_PV_TIME
 	KVM_DEV_TYPE_MAX,
 };
 
@@ -1461,6 +1486,9 @@ struct kvm_enc_region {
 /* Available with KVM_CAP_ARM_SVE */
 #define KVM_ARM_VCPU_FINALIZE	  _IOW(KVMIO,  0xc2, int)
 
+/* Available with KVM_CAP_DIRTY_LOG_RING */
+#define KVM_RESET_DIRTY_RINGS     _IO(KVMIO, 0xc3)
+
 /* Secure Encrypted Virtualization command */
 enum sev_cmd_id {
 	/* Guest initialization commands */
@@ -1611,4 +1639,23 @@ struct kvm_hyperv_eventfd {
 #define KVM_HYPERV_CONN_ID_MASK		0x00ffffff
 #define KVM_HYPERV_EVENTFD_DEASSIGN	(1 << 0)
 
+/*
+ * The following are the requirements for supporting dirty log ring
+ * (by enabling KVM_DIRTY_LOG_PAGE_OFFSET).
+ *
+ * 1. Memory accesses by KVM should call kvm_vcpu_write_* instead
+ *    of kvm_write_* so that the global dirty ring is not filled up
+ *    too quickly.
+ * 2. kvm_arch_mmu_enable_log_dirty_pt_masked should be defined for
+ *    enabling dirty logging.
+ * 3. There should not be a separate step to synchronize hardware
+ *    dirty bitmap with KVM's.
+ */
+
+struct kvm_dirty_gfn {
+	__u32 pad;
+	__u32 slot;
+	__u64 offset;
+};
+
 #endif /* __LINUX_KVM_H */
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 123+ messages in thread

* [PATCH RFC 10/15] KVM: selftests: Use a single binary for dirty/clear log test
  2019-11-29 21:34 [PATCH RFC 00/15] KVM: Dirty ring interface Peter Xu
                   ` (8 preceding siblings ...)
  2019-11-29 21:34 ` [PATCH RFC 09/15] KVM: selftests: Sync uapi/linux/kvm.h to tools/ Peter Xu
@ 2019-11-29 21:35 ` Peter Xu
  2019-11-29 21:35 ` [PATCH RFC 11/15] KVM: selftests: Introduce after_vcpu_run hook for dirty " Peter Xu
                   ` (7 subsequent siblings)
  17 siblings, 0 replies; 123+ messages in thread
From: Peter Xu @ 2019-11-29 21:35 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Sean Christopherson, Paolo Bonzini, Dr . David Alan Gilbert,
	peterx, Vitaly Kuznetsov

Remove the clear_dirty_log test and merge it into the existing
dirty_log_test.  It's cleaner to use a single binary for both tests,
and it also prepares for the upcoming dirty ring test.

The default is still the dirty_log test.  To run the clear dirty log
test, specify "-M clear-log".
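
For example:

        ./dirty_log_test                  # dirty-log mode (the default)
        ./dirty_log_test -M clear-log     # KVM_GET + KVM_CLEAR_DIRTY_LOG mode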

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 tools/testing/selftests/kvm/Makefile          |   2 -
 .../selftests/kvm/clear_dirty_log_test.c      |   2 -
 tools/testing/selftests/kvm/dirty_log_test.c  | 131 +++++++++++++++---
 3 files changed, 110 insertions(+), 25 deletions(-)
 delete mode 100644 tools/testing/selftests/kvm/clear_dirty_log_test.c

diff --git a/tools/testing/selftests/kvm/Makefile b/tools/testing/selftests/kvm/Makefile
index 3138a916574a..130a7b1c7ad6 100644
--- a/tools/testing/selftests/kvm/Makefile
+++ b/tools/testing/selftests/kvm/Makefile
@@ -26,11 +26,9 @@ TEST_GEN_PROGS_x86_64 += x86_64/vmx_dirty_log_test
 TEST_GEN_PROGS_x86_64 += x86_64/vmx_set_nested_state_test
 TEST_GEN_PROGS_x86_64 += x86_64/vmx_tsc_adjust_test
 TEST_GEN_PROGS_x86_64 += x86_64/xss_msr_test
-TEST_GEN_PROGS_x86_64 += clear_dirty_log_test
 TEST_GEN_PROGS_x86_64 += dirty_log_test
 TEST_GEN_PROGS_x86_64 += kvm_create_max_vcpus
 
-TEST_GEN_PROGS_aarch64 += clear_dirty_log_test
 TEST_GEN_PROGS_aarch64 += dirty_log_test
 TEST_GEN_PROGS_aarch64 += kvm_create_max_vcpus
 
diff --git a/tools/testing/selftests/kvm/clear_dirty_log_test.c b/tools/testing/selftests/kvm/clear_dirty_log_test.c
deleted file mode 100644
index 749336937d37..000000000000
--- a/tools/testing/selftests/kvm/clear_dirty_log_test.c
+++ /dev/null
@@ -1,2 +0,0 @@
-#define USE_CLEAR_DIRTY_LOG
-#include "dirty_log_test.c"
diff --git a/tools/testing/selftests/kvm/dirty_log_test.c b/tools/testing/selftests/kvm/dirty_log_test.c
index 3c0ffd34b3b0..a8ae8c0042a8 100644
--- a/tools/testing/selftests/kvm/dirty_log_test.c
+++ b/tools/testing/selftests/kvm/dirty_log_test.c
@@ -128,6 +128,66 @@ static uint64_t host_dirty_count;
 static uint64_t host_clear_count;
 static uint64_t host_track_next_count;
 
+enum log_mode_t {
+	/* Only use KVM_GET_DIRTY_LOG for logging */
+	LOG_MODE_DIRTY_LOG = 0,
+
+	/* Use both KVM_[GET|CLEAR]_DIRTY_LOG for logging */
+	LOG_MODE_CLEAR_LOG = 1,
+
+	LOG_MODE_NUM,
+};
+
+/* Mode of logging.  Default is LOG_MODE_DIRTY_LOG */
+static enum log_mode_t host_log_mode;
+
+static void clear_log_create_vm_done(struct kvm_vm *vm)
+{
+	struct kvm_enable_cap cap = {};
+
+	if (!kvm_check_cap(KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2)) {
+		fprintf(stderr, "KVM_CLEAR_DIRTY_LOG not available, skipping tests\n");
+		exit(KSFT_SKIP);
+	}
+
+	cap.cap = KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2;
+	cap.args[0] = 1;
+	vm_enable_cap(vm, &cap);
+}
+
+static void dirty_log_collect_dirty_pages(struct kvm_vm *vm, int slot,
+					  void *bitmap, uint32_t num_pages)
+{
+	kvm_vm_get_dirty_log(vm, slot, bitmap);
+}
+
+static void clear_log_collect_dirty_pages(struct kvm_vm *vm, int slot,
+					  void *bitmap, uint32_t num_pages)
+{
+	kvm_vm_get_dirty_log(vm, slot, bitmap);
+	kvm_vm_clear_dirty_log(vm, slot, bitmap, 0, num_pages);
+}
+
+struct log_mode {
+	const char *name;
+	/* Hook when the vm creation is done (before vcpu creation) */
+	void (*create_vm_done)(struct kvm_vm *vm);
+	/* Hook to collect the dirty pages into the bitmap provided */
+	void (*collect_dirty_pages) (struct kvm_vm *vm, int slot,
+				     void *bitmap, uint32_t num_pages);
+} log_modes[LOG_MODE_NUM] = {
+	{
+		.name = "dirty-log",
+		.create_vm_done = NULL,
+		.collect_dirty_pages = dirty_log_collect_dirty_pages,
+	},
+	{
+		.name = "clear-log",
+		.create_vm_done = clear_log_create_vm_done,
+		.collect_dirty_pages = clear_log_collect_dirty_pages,
+	},
+};
+
 /*
  * We use this bitmap to track some pages that should have its dirty
  * bit set in the _next_ iteration.  For example, if we detected the
@@ -137,6 +197,33 @@ static uint64_t host_track_next_count;
  */
 static unsigned long *host_bmap_track;
 
+static void log_modes_dump(void)
+{
+	int i;
+
+	for (i = 0; i < LOG_MODE_NUM; i++)
+		printf("%s, ", log_modes[i].name);
+	puts("\b\b  \b\b");
+}
+
+static void log_mode_create_vm_done(struct kvm_vm *vm)
+{
+	struct log_mode *mode = &log_modes[host_log_mode];
+
+	if (mode->create_vm_done)
+		mode->create_vm_done(vm);
+}
+
+static void log_mode_collect_dirty_pages(struct kvm_vm *vm, int slot,
+					 void *bitmap, uint32_t num_pages)
+{
+	struct log_mode *mode = &log_modes[host_log_mode];
+
+	TEST_ASSERT(mode->collect_dirty_pages != NULL,
+		    "collect_dirty_pages() is required for any log mode!");
+	mode->collect_dirty_pages(vm, slot, bitmap, num_pages);
+}
+
 static void generate_random_array(uint64_t *guest_array, uint64_t size)
 {
 	uint64_t i;
@@ -257,6 +344,7 @@ static struct kvm_vm *create_vm(enum vm_guest_mode mode, uint32_t vcpuid,
 #ifdef __x86_64__
 	vm_create_irqchip(vm);
 #endif
+	log_mode_create_vm_done(vm);
 	vm_vcpu_add_default(vm, vcpuid, guest_code);
 	return vm;
 }
@@ -316,14 +404,6 @@ static void run_test(enum vm_guest_mode mode, unsigned long iterations,
 	bmap = bitmap_alloc(host_num_pages);
 	host_bmap_track = bitmap_alloc(host_num_pages);
 
-#ifdef USE_CLEAR_DIRTY_LOG
-	struct kvm_enable_cap cap = {};
-
-	cap.cap = KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2;
-	cap.args[0] = 1;
-	vm_enable_cap(vm, &cap);
-#endif
-
 	/* Add an extra memory slot for testing dirty logging */
 	vm_userspace_mem_region_add(vm, VM_MEM_SRC_ANONYMOUS,
 				    guest_test_phys_mem,
@@ -364,11 +444,8 @@ static void run_test(enum vm_guest_mode mode, unsigned long iterations,
 	while (iteration < iterations) {
 		/* Give the vcpu thread some time to dirty some pages */
 		usleep(interval * 1000);
-		kvm_vm_get_dirty_log(vm, TEST_MEM_SLOT_INDEX, bmap);
-#ifdef USE_CLEAR_DIRTY_LOG
-		kvm_vm_clear_dirty_log(vm, TEST_MEM_SLOT_INDEX, bmap, 0,
-				       host_num_pages);
-#endif
+		log_mode_collect_dirty_pages(vm, TEST_MEM_SLOT_INDEX,
+					     bmap, host_num_pages);
 		vm_dirty_log_verify(bmap);
 		iteration++;
 		sync_global_to_guest(vm, iteration);
@@ -413,6 +490,9 @@ static void help(char *name)
 	       TEST_HOST_LOOP_INTERVAL);
 	printf(" -p: specify guest physical test memory offset\n"
 	       "     Warning: a low offset can conflict with the loaded test code.\n");
+	printf(" -M: specify the host logging mode "
+	       "(default: dirty-log).  Supported modes: \n\t");
+	log_modes_dump();
 	printf(" -m: specify the guest mode ID to test "
 	       "(default: test all supported modes)\n"
 	       "     This option may be used multiple times.\n"
@@ -437,13 +517,6 @@ int main(int argc, char *argv[])
 	unsigned int host_ipa_limit;
 #endif
 
-#ifdef USE_CLEAR_DIRTY_LOG
-	if (!kvm_check_cap(KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2)) {
-		fprintf(stderr, "KVM_CLEAR_DIRTY_LOG not available, skipping tests\n");
-		exit(KSFT_SKIP);
-	}
-#endif
-
 #ifdef __x86_64__
 	vm_guest_mode_params_init(VM_MODE_PXXV48_4K, true, true);
 #endif
@@ -463,7 +536,7 @@ int main(int argc, char *argv[])
 	vm_guest_mode_params_init(VM_MODE_P40V48_4K, true, true);
 #endif
 
-	while ((opt = getopt(argc, argv, "hi:I:p:m:")) != -1) {
+	while ((opt = getopt(argc, argv, "hi:I:p:m:M:")) != -1) {
 		switch (opt) {
 		case 'i':
 			iterations = strtol(optarg, NULL, 10);
@@ -485,6 +558,22 @@ int main(int argc, char *argv[])
 				    "Guest mode ID %d too big", mode);
 			vm_guest_mode_params[mode].enabled = true;
 			break;
+		case 'M':
+			for (i = 0; i < LOG_MODE_NUM; i++) {
+				if (!strcmp(optarg, log_modes[i].name)) {
+					DEBUG("Setting log mode to: '%s'\n",
+					      optarg);
+					host_log_mode = i;
+					break;
+				}
+			}
+			if (i == LOG_MODE_NUM) {
+				printf("Log mode '%s' is invalid.  "
+				       "Please choose from: ", optarg);
+				log_modes_dump();
+				exit(-1);
+			}
+			break;
 		case 'h':
 		default:
 			help(argv[0]);
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 123+ messages in thread

* [PATCH RFC 11/15] KVM: selftests: Introduce after_vcpu_run hook for dirty log test
  2019-11-29 21:34 [PATCH RFC 00/15] KVM: Dirty ring interface Peter Xu
                   ` (9 preceding siblings ...)
  2019-11-29 21:35 ` [PATCH RFC 10/15] KVM: selftests: Use a single binary for dirty/clear log test Peter Xu
@ 2019-11-29 21:35 ` Peter Xu
  2019-11-29 21:35 ` [PATCH RFC 12/15] KVM: selftests: Add dirty ring buffer test Peter Xu
                   ` (6 subsequent siblings)
  17 siblings, 0 replies; 123+ messages in thread
From: Peter Xu @ 2019-11-29 21:35 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Sean Christopherson, Paolo Bonzini, Dr . David Alan Gilbert,
	peterx, Vitaly Kuznetsov

Provide a hook for the checks that run after vcpu_run() completes.
This prepares for the dirty ring test, where we'll need to handle
another exit reason.

While at it, drop pages_count, since the statistics already give a
better summary, and clean things up a bit.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 tools/testing/selftests/kvm/dirty_log_test.c | 39 ++++++++++++--------
 1 file changed, 23 insertions(+), 16 deletions(-)

diff --git a/tools/testing/selftests/kvm/dirty_log_test.c b/tools/testing/selftests/kvm/dirty_log_test.c
index a8ae8c0042a8..3542311f56ff 100644
--- a/tools/testing/selftests/kvm/dirty_log_test.c
+++ b/tools/testing/selftests/kvm/dirty_log_test.c
@@ -168,6 +168,15 @@ static void clear_log_collect_dirty_pages(struct kvm_vm *vm, int slot,
 	kvm_vm_clear_dirty_log(vm, slot, bitmap, 0, num_pages);
 }
 
+static void default_after_vcpu_run(struct kvm_vm *vm)
+{
+	struct kvm_run *run = vcpu_state(vm, VCPU_ID);
+
+	TEST_ASSERT(get_ucall(vm, VCPU_ID, NULL) == UCALL_SYNC,
+		    "Invalid guest sync status: exit_reason=%s\n",
+		    exit_reason_str(run->exit_reason));
+}
+
 struct log_mode {
 	const char *name;
 	/* Hook when the vm creation is done (before vcpu creation) */
@@ -175,16 +184,20 @@ struct log_mode {
 	/* Hook to collect the dirty pages into the bitmap provided */
 	void (*collect_dirty_pages) (struct kvm_vm *vm, int slot,
 				     void *bitmap, uint32_t num_pages);
+	/* Hook to call after each vcpu run */
+	void (*after_vcpu_run)(struct kvm_vm *vm);
 } log_modes[LOG_MODE_NUM] = {
 	{
 		.name = "dirty-log",
 		.create_vm_done = NULL,
 		.collect_dirty_pages = dirty_log_collect_dirty_pages,
+		.after_vcpu_run = default_after_vcpu_run,
 	},
 	{
 		.name = "clear-log",
 		.create_vm_done = clear_log_create_vm_done,
 		.collect_dirty_pages = clear_log_collect_dirty_pages,
+		.after_vcpu_run = default_after_vcpu_run,
 	},
 };
 
@@ -224,6 +237,14 @@ static void log_mode_collect_dirty_pages(struct kvm_vm *vm, int slot,
 	mode->collect_dirty_pages(vm, slot, bitmap, num_pages);
 }
 
+static void log_mode_after_vcpu_run(struct kvm_vm *vm)
+{
+	struct log_mode *mode = &log_modes[host_log_mode];
+
+	if (mode->after_vcpu_run)
+		mode->after_vcpu_run(vm);
+}
+
 static void generate_random_array(uint64_t *guest_array, uint64_t size)
 {
 	uint64_t i;
@@ -237,31 +258,17 @@ static void *vcpu_worker(void *data)
 	int ret;
 	struct kvm_vm *vm = data;
 	uint64_t *guest_array;
-	uint64_t pages_count = 0;
-	struct kvm_run *run;
-
-	run = vcpu_state(vm, VCPU_ID);
 
 	guest_array = addr_gva2hva(vm, (vm_vaddr_t)random_array);
-	generate_random_array(guest_array, TEST_PAGES_PER_LOOP);
 
 	while (!READ_ONCE(host_quit)) {
+		generate_random_array(guest_array, TEST_PAGES_PER_LOOP);
 		/* Let the guest dirty the random pages */
 		ret = _vcpu_run(vm, VCPU_ID);
 		TEST_ASSERT(ret == 0, "vcpu_run failed: %d\n", ret);
-		if (get_ucall(vm, VCPU_ID, NULL) == UCALL_SYNC) {
-			pages_count += TEST_PAGES_PER_LOOP;
-			generate_random_array(guest_array, TEST_PAGES_PER_LOOP);
-		} else {
-			TEST_ASSERT(false,
-				    "Invalid guest sync status: "
-				    "exit_reason=%s\n",
-				    exit_reason_str(run->exit_reason));
-		}
+		log_mode_after_vcpu_run(vm);
 	}
 
-	DEBUG("Dirtied %"PRIu64" pages\n", pages_count);
-
 	return NULL;
 }
 
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 123+ messages in thread

* [PATCH RFC 12/15] KVM: selftests: Add dirty ring buffer test
  2019-11-29 21:34 [PATCH RFC 00/15] KVM: Dirty ring interface Peter Xu
                   ` (10 preceding siblings ...)
  2019-11-29 21:35 ` [PATCH RFC 11/15] KVM: selftests: Introduce after_vcpu_run hook for dirty " Peter Xu
@ 2019-11-29 21:35 ` Peter Xu
  2019-11-29 21:35 ` [PATCH RFC 13/15] KVM: selftests: Let dirty_log_test async for dirty ring test Peter Xu
                   ` (5 subsequent siblings)
  17 siblings, 0 replies; 123+ messages in thread
From: Peter Xu @ 2019-11-29 21:35 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Sean Christopherson, Paolo Bonzini, Dr . David Alan Gilbert,
	peterx, Vitaly Kuznetsov

Add the initial dirty ring buffer test.

The current test implements the userspace dirty ring collection by
only reaping the dirty ring when the ring is full.

So it still runs synchronously, like this:

            vcpu                             main thread

  1. vcpu dirties pages
  2. vcpu gets dirty ring full
     (userspace exit)

                                       3. main thread waits until full
                                          (so hardware buffers flushed)
                                       4. main thread collects
                                       5. main thread continues vcpu

  6. vcpu continues, goes back to 1

We can't directly collect dirty bits during vcpu execution, because
otherwise we can't guarantee the hardware dirty bits have been flushed
when we collect, and we're very strict on the dirty bits, so otherwise
the later verify procedure can fail.  A follow-up patch will make this
test support async operation just like the existing dirty log test, by
adding a vcpu kick mechanism.
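
For sizing intuition: struct kvm_dirty_gfn is 16 bytes (two __u32 plus
a __u64, assuming no extra padding), so the ring enabled below works
out to:

        TEST_DIRTY_RING_COUNT * sizeof(struct kvm_dirty_gfn)
                = 1024 * 16 bytes = 16 KiB, i.e. four 4 KiB pages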

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 tools/testing/selftests/kvm/dirty_log_test.c  | 148 ++++++++++++++++++
 .../testing/selftests/kvm/include/kvm_util.h  |   5 +
 tools/testing/selftests/kvm/lib/kvm_util.c    |  95 +++++++++++
 .../selftests/kvm/lib/kvm_util_internal.h     |   5 +
 4 files changed, 253 insertions(+)

diff --git a/tools/testing/selftests/kvm/dirty_log_test.c b/tools/testing/selftests/kvm/dirty_log_test.c
index 3542311f56ff..968e35c5d380 100644
--- a/tools/testing/selftests/kvm/dirty_log_test.c
+++ b/tools/testing/selftests/kvm/dirty_log_test.c
@@ -12,8 +12,10 @@
 #include <unistd.h>
 #include <time.h>
 #include <pthread.h>
+#include <semaphore.h>
 #include <linux/bitmap.h>
 #include <linux/bitops.h>
+#include <asm/barrier.h>
 
 #include "test_util.h"
 #include "kvm_util.h"
@@ -57,6 +59,8 @@
 # define test_and_clear_bit_le	test_and_clear_bit
 #endif
 
+#define TEST_DIRTY_RING_COUNT		1024
+
 /*
  * Guest/Host shared variables. Ensure addr_gva2hva() and/or
  * sync_global_to/from_guest() are used when accessing from
@@ -128,6 +132,10 @@ static uint64_t host_dirty_count;
 static uint64_t host_clear_count;
 static uint64_t host_track_next_count;
 
+/* Whether dirty ring reset is requested, or finished */
+static sem_t dirty_ring_vcpu_stop;
+static sem_t dirty_ring_vcpu_cont;
+
 enum log_mode_t {
 	/* Only use KVM_GET_DIRTY_LOG for logging */
 	LOG_MODE_DIRTY_LOG = 0,
@@ -135,6 +143,9 @@ enum log_mode_t {
 	/* Use both KVM_[GET|CLEAR]_DIRTY_LOG for logging */
 	LOG_MODE_CLEAR_LOG = 1,
 
+	/* Use dirty ring for logging */
+	LOG_MODE_DIRTY_RING = 2,
+
 	LOG_MODE_NUM,
 };
 
@@ -177,6 +188,123 @@ static void default_after_vcpu_run(struct kvm_vm *vm)
 		    exit_reason_str(run->exit_reason));
 }
 
+static void dirty_ring_create_vm_done(struct kvm_vm *vm)
+{
+	/*
+	 * Switch to dirty ring mode after VM creation but before any
+	 * of the vcpu creation.
+	 */
+	vm_enable_dirty_ring(vm, TEST_DIRTY_RING_COUNT *
+			     sizeof(struct kvm_dirty_gfn));
+}
+
+static uint32_t dirty_ring_collect_one(struct kvm_dirty_gfn *dirty_gfns,
+				       struct kvm_dirty_ring_indexes *indexes,
+				       int slot, void *bitmap,
+				       uint32_t num_pages, int index)
+{
+	struct kvm_dirty_gfn *cur;
+	uint32_t avail, fetch, count = 0;
+
+	/*
+	 * We should keep it somewhere, but to be simple we read
+	 * fetch_index too.
+	 */
+	fetch = READ_ONCE(indexes->fetch_index);
+	avail = READ_ONCE(indexes->avail_index);
+
+	/* Make sure we read valid entries always */
+	rmb();
+
+	DEBUG("ring %d: fetch: 0x%x, avail: 0x%x\n", index, fetch, avail);
+
+	while (fetch != avail) {
+		cur = &dirty_gfns[fetch % test_dirty_ring_count];
+		TEST_ASSERT(cur->pad == 0, "Padding is non-zero: 0x%x", cur->pad);
+		TEST_ASSERT(cur->slot == slot, "Slot number didn't match: "
+			    "%u != %u", cur->slot, slot);
+		TEST_ASSERT(cur->offset < num_pages, "Offset overflow: "
+			    "0x%llx >= 0x%llx", cur->offset, num_pages);
+		//DEBUG("slot %d offset %llu\n", cur->slot, cur->offset);
+		test_and_set_bit(cur->offset, bitmap);
+		fetch++;
+		count++;
+	}
+	WRITE_ONCE(indexes->fetch_index, fetch);
+
+	return count;
+}
+
+static void dirty_ring_collect_dirty_pages(struct kvm_vm *vm, int slot,
+					   void *bitmap, uint32_t num_pages)
+{
+	/* We only have one vcpu */
+	struct kvm_run *state = vcpu_state(vm, VCPU_ID);
+	struct kvm_vm_run *vm_run = vm_state(vm);
+	uint32_t count = 0, cleared;
+
+	/*
+	 * Before fetching the dirty pages, we need a vmexit of the
+	 * worker vcpu to make sure the hardware dirty buffers were
+	 * flushed.  This is not needed for dirty-log/clear-log tests
+	 * because get dirty log will naturally do so.
+	 *
+	 * For now we do it in the simple way - we simply wait until
+	 * the vcpu uses up the soft dirty ring, then it'll always
+	 * do a vmexit to make sure that PML buffers will be flushed.
+	 * In real hypervisors, we probably need a vcpu kick or to
+	 * stop the vcpus (before the final sync) to make sure we'll
+	 * get all the existing dirty PFNs even cached in hardware.
+	 */
+	sem_wait(&dirty_ring_vcpu_stop);
+
+	count += dirty_ring_collect_one(kvm_map_dirty_ring(vm),
+					&vm_run->vm_ring_indexes,
+					slot, bitmap, num_pages, -1);
+
+	/* Only have one vcpu */
+	count += dirty_ring_collect_one(vcpu_map_dirty_ring(vm, VCPU_ID),
+					&state->vcpu_ring_indexes,
+					slot, bitmap, num_pages, VCPU_ID);
+
+	cleared = kvm_vm_reset_dirty_ring(vm);
+
+	/* Cleared pages should be the same as collected */
+	TEST_ASSERT(cleared == count, "Reset dirty pages (%u) mismatch "
+		    "with collected (%u)", cleared, count);
+
+	DEBUG("Notifying vcpu to continue\n");
+	sem_post(&dirty_ring_vcpu_cont);
+
+	DEBUG("Iteration %ld collected %u pages\n", iteration, count);
+}
+
+static void dirty_ring_after_vcpu_run(struct kvm_vm *vm)
+{
+	struct kvm_run *run = vcpu_state(vm, VCPU_ID);
+
+	/* A ucall-sync or ring-full event is allowed */
+	if (get_ucall(vm, VCPU_ID, NULL) == UCALL_SYNC) {
+		/* We should allow this to continue */
+		;
+	} else if (run->exit_reason == KVM_EXIT_DIRTY_RING_FULL) {
+		sem_post(&dirty_ring_vcpu_stop);
+		DEBUG("vcpu stops because dirty ring full...\n");
+		sem_wait(&dirty_ring_vcpu_cont);
+		DEBUG("vcpu continues now.\n");
+	} else {
+		TEST_ASSERT(false, "Invalid guest sync status: "
+			    "exit_reason=%s\n",
+			    exit_reason_str(run->exit_reason));
+	}
+}
+
+static void dirty_ring_before_vcpu_join(void)
+{
+	/* Kick another round of vcpu just to make sure it will quit */
+	sem_post(&dirty_ring_vcpu_cont);
+}
+
 struct log_mode {
 	const char *name;
 	/* Hook when the vm creation is done (before vcpu creation) */
@@ -186,6 +314,7 @@ struct log_mode {
 				     void *bitmap, uint32_t num_pages);
 	/* Hook to call when after each vcpu run */
 	void (*after_vcpu_run)(struct kvm_vm *vm);
+	void (*before_vcpu_join) (void);
 } log_modes[LOG_MODE_NUM] = {
 	{
 		.name = "dirty-log",
@@ -199,6 +328,13 @@ struct log_mode {
 		.collect_dirty_pages = clear_log_collect_dirty_pages,
 		.after_vcpu_run = default_after_vcpu_run,
 	},
+	{
+		.name = "dirty-ring",
+		.create_vm_done = dirty_ring_create_vm_done,
+		.collect_dirty_pages = dirty_ring_collect_dirty_pages,
+		.before_vcpu_join = dirty_ring_before_vcpu_join,
+		.after_vcpu_run = dirty_ring_after_vcpu_run,
+	},
 };
 
 /*
@@ -245,6 +381,14 @@ static void log_mode_after_vcpu_run(struct kvm_vm *vm)
 		mode->after_vcpu_run(vm);
 }
 
+static void log_mode_before_vcpu_join(void)
+{
+	struct log_mode *mode = &log_modes[host_log_mode];
+
+	if (mode->before_vcpu_join)
+		mode->before_vcpu_join();
+}
+
 static void generate_random_array(uint64_t *guest_array, uint64_t size)
 {
 	uint64_t i;
@@ -460,6 +604,7 @@ static void run_test(enum vm_guest_mode mode, unsigned long iterations,
 
 	/* Tell the vcpu thread to quit */
 	host_quit = true;
+	log_mode_before_vcpu_join();
 	pthread_join(vcpu_thread, NULL);
 
 	DEBUG("Total bits checked: dirty (%"PRIu64"), clear (%"PRIu64"), "
@@ -524,6 +669,9 @@ int main(int argc, char *argv[])
 	unsigned int host_ipa_limit;
 #endif
 
+	sem_init(&dirty_ring_vcpu_stop, 0, 0);
+	sem_init(&dirty_ring_vcpu_cont, 0, 0);
+
 #ifdef __x86_64__
 	vm_guest_mode_params_init(VM_MODE_PXXV48_4K, true, true);
 #endif
diff --git a/tools/testing/selftests/kvm/include/kvm_util.h b/tools/testing/selftests/kvm/include/kvm_util.h
index 29cccaf96baf..5ad52f38af8d 100644
--- a/tools/testing/selftests/kvm/include/kvm_util.h
+++ b/tools/testing/selftests/kvm/include/kvm_util.h
@@ -67,6 +67,7 @@ enum vm_mem_backing_src_type {
 
 int kvm_check_cap(long cap);
 int vm_enable_cap(struct kvm_vm *vm, struct kvm_enable_cap *cap);
+void vm_enable_dirty_ring(struct kvm_vm *vm, uint32_t ring_size);
 
 struct kvm_vm *vm_create(enum vm_guest_mode mode, uint64_t phy_pages, int perm);
 struct kvm_vm *_vm_create(enum vm_guest_mode mode, uint64_t phy_pages, int perm);
@@ -76,6 +77,7 @@ void kvm_vm_release(struct kvm_vm *vmp);
 void kvm_vm_get_dirty_log(struct kvm_vm *vm, int slot, void *log);
 void kvm_vm_clear_dirty_log(struct kvm_vm *vm, int slot, void *log,
 			    uint64_t first_page, uint32_t num_pages);
+uint32_t kvm_vm_reset_dirty_ring(struct kvm_vm *vm);
 
 int kvm_memcmp_hva_gva(void *hva, struct kvm_vm *vm, const vm_vaddr_t gva,
 		       size_t len);
@@ -111,6 +113,7 @@ vm_paddr_t addr_hva2gpa(struct kvm_vm *vm, void *hva);
 vm_paddr_t addr_gva2gpa(struct kvm_vm *vm, vm_vaddr_t gva);
 
 struct kvm_run *vcpu_state(struct kvm_vm *vm, uint32_t vcpuid);
+struct kvm_vm_run *vm_state(struct kvm_vm *vm);
 void vcpu_run(struct kvm_vm *vm, uint32_t vcpuid);
 int _vcpu_run(struct kvm_vm *vm, uint32_t vcpuid);
 void vcpu_run_complete_io(struct kvm_vm *vm, uint32_t vcpuid);
@@ -137,6 +140,8 @@ void vcpu_nested_state_get(struct kvm_vm *vm, uint32_t vcpuid,
 int vcpu_nested_state_set(struct kvm_vm *vm, uint32_t vcpuid,
 			  struct kvm_nested_state *state, bool ignore_error);
 #endif
+void *vcpu_map_dirty_ring(struct kvm_vm *vm, uint32_t vcpuid);
+void *kvm_map_dirty_ring(struct kvm_vm *vm);
 
 const char *exit_reason_str(unsigned int exit_reason);
 
diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c b/tools/testing/selftests/kvm/lib/kvm_util.c
index 41cf45416060..3a71e66a0b58 100644
--- a/tools/testing/selftests/kvm/lib/kvm_util.c
+++ b/tools/testing/selftests/kvm/lib/kvm_util.c
@@ -85,6 +85,26 @@ int vm_enable_cap(struct kvm_vm *vm, struct kvm_enable_cap *cap)
 	return ret;
 }
 
+void vm_enable_dirty_ring(struct kvm_vm *vm, uint32_t ring_size)
+{
+	struct kvm_enable_cap cap = {};
+	int ret;
+
+	ret = kvm_check_cap(KVM_CAP_DIRTY_LOG_RING);
+
+	TEST_ASSERT(ret >= 0, "KVM_CAP_DIRTY_LOG_RING");
+
+	if (ret == 0) {
+		fprintf(stderr, "KVM does not support dirty ring, skipping tests\n");
+		exit(KSFT_SKIP);
+	}
+
+	cap.cap = KVM_CAP_DIRTY_LOG_RING;
+	cap.args[0] = ring_size;
+	vm_enable_cap(vm, &cap);
+	vm->dirty_ring_size = ring_size;
+}
+
 static void vm_open(struct kvm_vm *vm, int perm)
 {
 	vm->kvm_fd = open(KVM_DEV_PATH, perm);
@@ -297,6 +317,11 @@ void kvm_vm_clear_dirty_log(struct kvm_vm *vm, int slot, void *log,
 		    strerror(-ret));
 }
 
+uint32_t kvm_vm_reset_dirty_ring(struct kvm_vm *vm)
+{
+	return ioctl(vm->fd, KVM_RESET_DIRTY_RINGS);
+}
+
 /*
  * Userspace Memory Region Find
  *
@@ -408,6 +433,13 @@ static void vm_vcpu_rm(struct kvm_vm *vm, uint32_t vcpuid)
 	struct vcpu *vcpu = vcpu_find(vm, vcpuid);
 	int ret;
 
+	if (vcpu->dirty_gfns) {
+		ret = munmap(vcpu->dirty_gfns, vm->dirty_ring_size);
+		TEST_ASSERT(ret == 0, "munmap of VCPU dirty ring failed, "
+			    "rc: %i errno: %i", ret, errno);
+		vcpu->dirty_gfns = NULL;
+	}
+
 	ret = munmap(vcpu->state, sizeof(*vcpu->state));
 	TEST_ASSERT(ret == 0, "munmap of VCPU fd failed, rc: %i "
 		"errno: %i", ret, errno);
@@ -447,6 +479,16 @@ void kvm_vm_free(struct kvm_vm *vmp)
 {
 	int ret;
 
 	if (vmp == NULL)
 		return;
 
+	if (vmp->vm_run) {
+		munmap(vmp->vm_run, sizeof(struct kvm_vm_run));
+		vmp->vm_run = NULL;
+	}
+
+	if (vmp->vm_dirty_gfns) {
+		munmap(vmp->vm_dirty_gfns, vmp->dirty_ring_size);
+		vmp->vm_dirty_gfns = NULL;
+	}
+
@@ -1122,6 +1164,18 @@ struct kvm_run *vcpu_state(struct kvm_vm *vm, uint32_t vcpuid)
 	return vcpu->state;
 }
 
+struct kvm_vm_run *vm_state(struct kvm_vm *vm)
+{
+	if (!vm->vm_run) {
+		vm->vm_run = (struct kvm_vm_run *)
+		    mmap(NULL, sizeof(struct kvm_vm_run),
+			 PROT_READ | PROT_WRITE, MAP_SHARED, vm->fd, 0);
+		TEST_ASSERT(vm->vm_run != MAP_FAILED,
+			    "kvm vm run mapping failed");
+	}
+	return vm->vm_run;
+}
+
 /*
  * VM VCPU Run
  *
@@ -1409,6 +1463,46 @@ int _vcpu_ioctl(struct kvm_vm *vm, uint32_t vcpuid,
 	return ret;
 }
 
+void *vcpu_map_dirty_ring(struct kvm_vm *vm, uint32_t vcpuid)
+{
+	struct vcpu *vcpu;
+	uint32_t size = vm->dirty_ring_size;
+
+	TEST_ASSERT(size > 0, "Should enable dirty ring first");
+
+	vcpu = vcpu_find(vm, vcpuid);
+
+	TEST_ASSERT(vcpu, "Cannot find vcpu %u", vcpuid);
+
+	if (!vcpu->dirty_gfns) {
+		vcpu->dirty_gfns_count = size / sizeof(struct kvm_dirty_gfn);
+		vcpu->dirty_gfns = mmap(NULL, size, PROT_READ | PROT_WRITE,
+					MAP_SHARED, vcpu->fd, vm->page_size *
+					KVM_DIRTY_LOG_PAGE_OFFSET);
+		TEST_ASSERT(vcpu->dirty_gfns != MAP_FAILED,
+			    "Dirty ring map failed");
+	}
+
+	return vcpu->dirty_gfns;
+}
+
+void *kvm_map_dirty_ring(struct kvm_vm *vm)
+{
+	uint32_t size = vm->dirty_ring_size;
+
+	TEST_ASSERT(size > 0, "Should enable dirty ring first");
+
+	if (!vm->vm_dirty_gfns) {
+		vm->vm_dirty_gfns = mmap(NULL, size, PROT_READ | PROT_WRITE,
+					 MAP_SHARED, vm->fd, vm->page_size *
+					 KVM_DIRTY_LOG_PAGE_OFFSET);
+		TEST_ASSERT(vm->vm_dirty_gfns != MAP_FAILED,
+			    "Dirty ring map failed");
+	}
+
+	return vm->vm_dirty_gfns;
+}
+
 /*
  * VM Ioctl
  *
@@ -1503,6 +1597,7 @@ static struct exit_reason {
 	{KVM_EXIT_INTERNAL_ERROR, "INTERNAL_ERROR"},
 	{KVM_EXIT_OSI, "OSI"},
 	{KVM_EXIT_PAPR_HCALL, "PAPR_HCALL"},
+	{KVM_EXIT_DIRTY_RING_FULL, "DIRTY_RING_FULL"},
 #ifdef KVM_EXIT_MEMORY_NOT_PRESENT
 	{KVM_EXIT_MEMORY_NOT_PRESENT, "MEMORY_NOT_PRESENT"},
 #endif
diff --git a/tools/testing/selftests/kvm/lib/kvm_util_internal.h b/tools/testing/selftests/kvm/lib/kvm_util_internal.h
index ac50c42750cf..3423d78d7993 100644
--- a/tools/testing/selftests/kvm/lib/kvm_util_internal.h
+++ b/tools/testing/selftests/kvm/lib/kvm_util_internal.h
@@ -39,6 +39,8 @@ struct vcpu {
 	uint32_t id;
 	int fd;
 	struct kvm_run *state;
+	struct kvm_dirty_gfn *dirty_gfns;
+	uint32_t dirty_gfns_count;
 };
 
 struct kvm_vm {
@@ -61,6 +63,9 @@ struct kvm_vm {
 	vm_paddr_t pgd;
 	vm_vaddr_t gdt;
 	vm_vaddr_t tss;
+	uint32_t dirty_ring_size;
+	struct kvm_vm_run *vm_run;
+	struct kvm_dirty_gfn *vm_dirty_gfns;
 };
 
 struct vcpu *vcpu_find(struct kvm_vm *vm, uint32_t vcpuid);
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 123+ messages in thread

* [PATCH RFC 13/15] KVM: selftests: Let dirty_log_test async for dirty ring test
  2019-11-29 21:34 [PATCH RFC 00/15] KVM: Dirty ring interface Peter Xu
                   ` (11 preceding siblings ...)
  2019-11-29 21:35 ` [PATCH RFC 12/15] KVM: selftests: Add dirty ring buffer test Peter Xu
@ 2019-11-29 21:35 ` Peter Xu
  2019-11-29 21:35 ` [PATCH RFC 14/15] KVM: selftests: Add "-c" parameter to dirty log test Peter Xu
                   ` (4 subsequent siblings)
  17 siblings, 0 replies; 123+ messages in thread
From: Peter Xu @ 2019-11-29 21:35 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Sean Christopherson, Paolo Bonzini, Dr . David Alan Gilbert,
	peterx, Vitaly Kuznetsov

Previously the dirty ring test worked synchronously, because only on
a vmexit (in that case, the ring-full event) do we know that the
hardware dirty bits have been flushed to the dirty ring.

This patch introduces a vcpu kick mechanism based on SIGUSR1, which
guarantees a vmexit and therefore a flush of the hardware dirty bits.
With that, the vcpu dirtying work can run asynchronously to the whole
collection procedure.

Also increase the dirty ring size to the current maximum, to make sure
we torture the no-ring-full case more, since that should be the major
scenario when hypervisors like QEMU use this feature.
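
The kick relies on installing the SIGUSR1 handler without SA_RESTART,
so that a signal delivered to the vcpu thread forces a vmexit and makes
the blocking KVM_RUN ioctl return -1 with errno == EINTR.  A condensed,
self-contained sketch of the pattern (the test below uses globals
instead of parameters):

        #include <signal.h>
        #include <string.h>
        #include <pthread.h>

        static void vcpu_sig_handler(int sig)
        {
                /* Nothing to do: interrupting KVM_RUN with EINTR is the point. */
                (void)sig;
        }

        /* Run once in the vcpu thread: no SA_RESTART, so KVM_RUN sees EINTR. */
        static void install_kick_handler(void)
        {
                struct sigaction sigact;

                memset(&sigact, 0, sizeof(sigact));
                sigact.sa_handler = vcpu_sig_handler;
                sigaction(SIGUSR1, &sigact, NULL);
        }

        /* Called from the collector thread to force a vmexit on the vcpu. */
        static void vcpu_kick(pthread_t vcpu_thread)
        {
                pthread_kill(vcpu_thread, SIGUSR1);
        }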

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 tools/testing/selftests/kvm/dirty_log_test.c  | 74 ++++++++++++-------
 .../testing/selftests/kvm/include/kvm_util.h  |  1 +
 tools/testing/selftests/kvm/lib/kvm_util.c    |  8 ++
 3 files changed, 57 insertions(+), 26 deletions(-)

diff --git a/tools/testing/selftests/kvm/dirty_log_test.c b/tools/testing/selftests/kvm/dirty_log_test.c
index 968e35c5d380..4799db91e919 100644
--- a/tools/testing/selftests/kvm/dirty_log_test.c
+++ b/tools/testing/selftests/kvm/dirty_log_test.c
@@ -13,6 +13,9 @@
 #include <time.h>
 #include <pthread.h>
 #include <semaphore.h>
+#include <sys/types.h>
+#include <signal.h>
+#include <errno.h>
 #include <linux/bitmap.h>
 #include <linux/bitops.h>
 #include <asm/barrier.h>
@@ -59,7 +62,9 @@
 # define test_and_clear_bit_le	test_and_clear_bit
 #endif
 
-#define TEST_DIRTY_RING_COUNT		1024
+#define TEST_DIRTY_RING_COUNT		65536
+
+#define SIG_IPI SIGUSR1
 
 /*
  * Guest/Host shared variables. Ensure addr_gva2hva() and/or
@@ -151,6 +156,20 @@ enum log_mode_t {
 
 /* Mode of logging.  Default is LOG_MODE_DIRTY_LOG */
 static enum log_mode_t host_log_mode;
+pthread_t vcpu_thread;
+
+/* Only way to pass this to the signal handler */
+struct kvm_vm *current_vm;
+
+static void vcpu_sig_handler(int sig)
+{
+	TEST_ASSERT(sig == SIG_IPI, "unknown signal: %d", sig);
+}
+
+static void vcpu_kick(void)
+{
+	pthread_kill(vcpu_thread, SIG_IPI);
+}
 
 static void clear_log_create_vm_done(struct kvm_vm *vm)
 {
@@ -179,10 +198,13 @@ static void clear_log_collect_dirty_pages(struct kvm_vm *vm, int slot,
 	kvm_vm_clear_dirty_log(vm, slot, bitmap, 0, num_pages);
 }
 
-static void default_after_vcpu_run(struct kvm_vm *vm)
+static void default_after_vcpu_run(struct kvm_vm *vm, int ret, int err)
 {
 	struct kvm_run *run = vcpu_state(vm, VCPU_ID);
 
+	TEST_ASSERT(ret == 0 || (ret == -1 && err == EINTR),
+		    "vcpu run failed: errno=%d", err);
+
 	TEST_ASSERT(get_ucall(vm, VCPU_ID, NULL) == UCALL_SYNC,
 		    "Invalid guest sync status: exit_reason=%s\n",
 		    exit_reason_str(run->exit_reason));
@@ -244,19 +266,15 @@ static void dirty_ring_collect_dirty_pages(struct kvm_vm *vm, int slot,
 	uint32_t count = 0, cleared;
 
 	/*
-	 * Before fetching the dirty pages, we need a vmexit of the
-	 * worker vcpu to make sure the hardware dirty buffers were
-	 * flushed.  This is not needed for dirty-log/clear-log tests
-	 * because get dirty log will naturally do so.
-	 *
-	 * For now we do it in the simple way - we simply wait until
-	 * the vcpu uses up the soft dirty ring, then it'll always
-	 * do a vmexit to make sure that PML buffers will be flushed.
-	 * In real hypervisors, we probably need a vcpu kick or to
-	 * stop the vcpus (before the final sync) to make sure we'll
-	 * get all the existing dirty PFNs even cached in hardware.
+	 * These steps make sure the hardware buffers are flushed to
+	 * the dirty ring.  With the vcpu kick mechanism we can now
+	 * keep the vcpu running while collecting dirty bits, without
+	 * waiting for the ring to become full.
 	 */
+	vcpu_kick();
 	sem_wait(&dirty_ring_vcpu_stop);
+	DEBUG("Notifying vcpu to continue\n");
+	sem_post(&dirty_ring_vcpu_cont);
 
 	count += dirty_ring_collect_one(kvm_map_dirty_ring(vm),
 					&vm_run->vm_ring_indexes,
@@ -273,13 +291,10 @@ static void dirty_ring_collect_dirty_pages(struct kvm_vm *vm, int slot,
 	TEST_ASSERT(cleared == count, "Reset dirty pages (%u) mismatch "
 		    "with collected (%u)", cleared, count);
 
-	DEBUG("Notifying vcpu to continue\n");
-	sem_post(&dirty_ring_vcpu_cont);
-
 	DEBUG("Iteration %ld collected %u pages\n", iteration, count);
 }
 
-static void dirty_ring_after_vcpu_run(struct kvm_vm *vm)
+static void dirty_ring_after_vcpu_run(struct kvm_vm *vm, int ret, int err)
 {
 	struct kvm_run *run = vcpu_state(vm, VCPU_ID);
 
@@ -287,9 +302,11 @@ static void dirty_ring_after_vcpu_run(struct kvm_vm *vm)
 	if (get_ucall(vm, VCPU_ID, NULL) == UCALL_SYNC) {
 		/* We should allow this to continue */
 		;
-	} else if (run->exit_reason == KVM_EXIT_DIRTY_RING_FULL) {
+	} else if (run->exit_reason == KVM_EXIT_DIRTY_RING_FULL ||
+		   (ret == -1 && err == EINTR)) {
+		/* Either ring full, or we're probably kicked out */
 		sem_post(&dirty_ring_vcpu_stop);
-		DEBUG("vcpu stops because dirty ring full...\n");
+		DEBUG("vcpu stops because dirty ring full or kicked...\n");
 		sem_wait(&dirty_ring_vcpu_cont);
 		DEBUG("vcpu continues now.\n");
 	} else {
@@ -313,7 +330,7 @@ struct log_mode {
 	void (*collect_dirty_pages) (struct kvm_vm *vm, int slot,
 				     void *bitmap, uint32_t num_pages);
 	/* Hook to call after each vcpu run */
-	void (*after_vcpu_run)(struct kvm_vm *vm);
+	void (*after_vcpu_run)(struct kvm_vm *vm, int ret, int err);
 	void (*before_vcpu_join) (void);
 } log_modes[LOG_MODE_NUM] = {
 	{
@@ -373,12 +390,12 @@ static void log_mode_collect_dirty_pages(struct kvm_vm *vm, int slot,
 	mode->collect_dirty_pages(vm, slot, bitmap, num_pages);
 }
 
-static void log_mode_after_vcpu_run(struct kvm_vm *vm)
+static void log_mode_after_vcpu_run(struct kvm_vm *vm, int ret, int err)
 {
 	struct log_mode *mode = &log_modes[host_log_mode];
 
 	if (mode->after_vcpu_run)
-		mode->after_vcpu_run(vm);
+		mode->after_vcpu_run(vm, ret, err);
 }
 
 static void log_mode_before_vcpu_join(void)
@@ -402,15 +419,21 @@ static void *vcpu_worker(void *data)
 	int ret;
 	struct kvm_vm *vm = data;
 	uint64_t *guest_array;
+	struct sigaction sigact;
+
+	current_vm = vm;
+	memset(&sigact, 0, sizeof(sigact));
+	sigact.sa_handler = vcpu_sig_handler;
+	sigaction(SIG_IPI, &sigact, NULL);
 
 	guest_array = addr_gva2hva(vm, (vm_vaddr_t)random_array);
 
 	while (!READ_ONCE(host_quit)) {
+		/* Refresh the random pages for the guest to dirty */
 		generate_random_array(guest_array, TEST_PAGES_PER_LOOP);
 		/* Let the guest dirty the random pages */
-		ret = _vcpu_run(vm, VCPU_ID);
-		TEST_ASSERT(ret == 0, "vcpu_run failed: %d\n", ret);
-		log_mode_after_vcpu_run(vm);
+		ret = __vcpu_run(vm, VCPU_ID);
+		log_mode_after_vcpu_run(vm, ret, errno);
 	}
 
 	return NULL;
@@ -506,7 +529,6 @@ static struct kvm_vm *create_vm(enum vm_guest_mode mode, uint32_t vcpuid,
 static void run_test(enum vm_guest_mode mode, unsigned long iterations,
 		     unsigned long interval, uint64_t phys_offset)
 {
-	pthread_t vcpu_thread;
 	struct kvm_vm *vm;
 	unsigned long *bmap;
 
diff --git a/tools/testing/selftests/kvm/include/kvm_util.h b/tools/testing/selftests/kvm/include/kvm_util.h
index 5ad52f38af8d..fe5db2da7e73 100644
--- a/tools/testing/selftests/kvm/include/kvm_util.h
+++ b/tools/testing/selftests/kvm/include/kvm_util.h
@@ -116,6 +116,7 @@ struct kvm_run *vcpu_state(struct kvm_vm *vm, uint32_t vcpuid);
 struct kvm_vm_run *vm_state(struct kvm_vm *vm);
 void vcpu_run(struct kvm_vm *vm, uint32_t vcpuid);
 int _vcpu_run(struct kvm_vm *vm, uint32_t vcpuid);
+int __vcpu_run(struct kvm_vm *vm, uint32_t vcpuid);
 void vcpu_run_complete_io(struct kvm_vm *vm, uint32_t vcpuid);
 void vcpu_set_mp_state(struct kvm_vm *vm, uint32_t vcpuid,
 		       struct kvm_mp_state *mp_state);
diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c b/tools/testing/selftests/kvm/lib/kvm_util.c
index 3a71e66a0b58..2addd0a7310f 100644
--- a/tools/testing/selftests/kvm/lib/kvm_util.c
+++ b/tools/testing/selftests/kvm/lib/kvm_util.c
@@ -1209,6 +1209,14 @@ int _vcpu_run(struct kvm_vm *vm, uint32_t vcpuid)
 	return rc;
 }
 
+int __vcpu_run(struct kvm_vm *vm, uint32_t vcpuid)
+{
+	struct vcpu *vcpu = vcpu_find(vm, vcpuid);
+
+	TEST_ASSERT(vcpu != NULL, "vcpu not found, vcpuid: %u", vcpuid);
+	return ioctl(vcpu->fd, KVM_RUN, NULL);
+}
+
 void vcpu_run_complete_io(struct kvm_vm *vm, uint32_t vcpuid)
 {
 	struct vcpu *vcpu = vcpu_find(vm, vcpuid);
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 123+ messages in thread

* [PATCH RFC 14/15] KVM: selftests: Add "-c" parameter to dirty log test
  2019-11-29 21:34 [PATCH RFC 00/15] KVM: Dirty ring interface Peter Xu
                   ` (12 preceding siblings ...)
  2019-11-29 21:35 ` [PATCH RFC 13/15] KVM: selftests: Let dirty_log_test async for dirty ring test Peter Xu
@ 2019-11-29 21:35 ` Peter Xu
  2019-11-29 21:35 ` [PATCH RFC 15/15] KVM: selftests: Test dirty ring waitqueue Peter Xu
                   ` (3 subsequent siblings)
  17 siblings, 0 replies; 123+ messages in thread
From: Peter Xu @ 2019-11-29 21:35 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Sean Christopherson, Paolo Bonzini, Dr . David Alan Gilbert,
	peterx, Vitaly Kuznetsov

The "-c" parameter overrides the default dirty ring size/count.  With
a bigger ring count we test the async path of the dirty ring; with a
smaller ring count we test the ring-full code path.

It has no effect on the non-dirty-ring tests.
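
For example, an illustrative invocation that shrinks the ring to
stress the ring-full path:

        ./dirty_log_test -M dirty-ring -c 1024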

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 tools/testing/selftests/kvm/dirty_log_test.c | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/kvm/dirty_log_test.c b/tools/testing/selftests/kvm/dirty_log_test.c
index 4799db91e919..c9db136a1f12 100644
--- a/tools/testing/selftests/kvm/dirty_log_test.c
+++ b/tools/testing/selftests/kvm/dirty_log_test.c
@@ -157,6 +157,7 @@ enum log_mode_t {
 /* Mode of logging.  Default is LOG_MODE_DIRTY_LOG */
 static enum log_mode_t host_log_mode;
 pthread_t vcpu_thread;
+static uint32_t test_dirty_ring_count = TEST_DIRTY_RING_COUNT;
 
 /* Only way to pass this to the signal handler */
 struct kvm_vm *current_vm;
@@ -216,7 +217,7 @@ static void dirty_ring_create_vm_done(struct kvm_vm *vm)
 	 * Switch to dirty ring mode after VM creation but before any
 	 * of the vcpu creation.
 	 */
-	vm_enable_dirty_ring(vm, TEST_DIRTY_RING_COUNT *
+	vm_enable_dirty_ring(vm, test_dirty_ring_count *
 			     sizeof(struct kvm_dirty_gfn));
 }
 
@@ -658,6 +659,9 @@ static void help(char *name)
 	printf("usage: %s [-h] [-i iterations] [-I interval] "
 	       "[-p offset] [-m mode]\n", name);
 	puts("");
+	printf(" -c: specify dirty ring size, in number of entries\n");
+	printf("     (only useful for dirty-ring test; default: %"PRIu32")\n",
+	       TEST_DIRTY_RING_COUNT);
 	printf(" -i: specify iteration counts (default: %"PRIu64")\n",
 	       TEST_HOST_LOOP_N);
 	printf(" -I: specify interval in ms (default: %"PRIu64" ms)\n",
@@ -713,8 +717,11 @@ int main(int argc, char *argv[])
 	vm_guest_mode_params_init(VM_MODE_P40V48_4K, true, true);
 #endif
 
-	while ((opt = getopt(argc, argv, "hi:I:p:m:M:")) != -1) {
+	while ((opt = getopt(argc, argv, "c:hi:I:p:m:M:")) != -1) {
 		switch (opt) {
+		case 'c':
+			test_dirty_ring_count = strtol(optarg, NULL, 10);
+			break;
 		case 'i':
 			iterations = strtol(optarg, NULL, 10);
 			break;
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 123+ messages in thread

* [PATCH RFC 15/15] KVM: selftests: Test dirty ring waitqueue
  2019-11-29 21:34 [PATCH RFC 00/15] KVM: Dirty ring interface Peter Xu
                   ` (13 preceding siblings ...)
  2019-11-29 21:35 ` [PATCH RFC 14/15] KVM: selftests: Add "-c" parameter to dirty log test Peter Xu
@ 2019-11-29 21:35 ` Peter Xu
  2019-11-30  8:29 ` [PATCH RFC 00/15] KVM: Dirty ring interface Paolo Bonzini
                   ` (2 subsequent siblings)
  17 siblings, 0 replies; 123+ messages in thread
From: Peter Xu @ 2019-11-29 21:35 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Sean Christopherson, Paolo Bonzini, Dr . David Alan Gilbert,
	peterx, Vitaly Kuznetsov

This is a bit tricky, but should still be reasonable.

First we introduce a new dirty log test type, because we need to force
the vcpu into a blocked state by dead-looping on vcpu_run even when it
wants to exit to userspace.

The tricky part is that we need to read procfs to make sure the vcpu
thread is in TASK_UNINTERRUPTIBLE.

After that, we reset the ring, and the reset should kick the vcpu out
of that state again.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 tools/testing/selftests/kvm/dirty_log_test.c | 101 +++++++++++++++++++
 1 file changed, 101 insertions(+)

diff --git a/tools/testing/selftests/kvm/dirty_log_test.c b/tools/testing/selftests/kvm/dirty_log_test.c
index c9db136a1f12..41bc015131e1 100644
--- a/tools/testing/selftests/kvm/dirty_log_test.c
+++ b/tools/testing/selftests/kvm/dirty_log_test.c
@@ -16,6 +16,7 @@
 #include <sys/types.h>
 #include <signal.h>
 #include <errno.h>
+#include <sys/syscall.h>
 #include <linux/bitmap.h>
 #include <linux/bitops.h>
 #include <asm/barrier.h>
@@ -151,12 +152,16 @@ enum log_mode_t {
 	/* Use dirty ring for logging */
 	LOG_MODE_DIRTY_RING = 2,
 
+	/* Dirty ring test but tailored for the waitqueue */
+	LOG_MODE_DIRTY_RING_WP = 3,
+
 	LOG_MODE_NUM,
 };
 
 /* Mode of logging.  Default is LOG_MODE_DIRTY_LOG */
 static enum log_mode_t host_log_mode;
 pthread_t vcpu_thread;
+pid_t vcpu_thread_tid;
 static uint32_t test_dirty_ring_count = TEST_DIRTY_RING_COUNT;
 
 /* Only way to pass this to the signal handler */
@@ -221,6 +226,18 @@ static void dirty_ring_create_vm_done(struct kvm_vm *vm)
 			     sizeof(struct kvm_dirty_gfn));
 }
 
+static void dirty_ring_wq_create_vm_done(struct kvm_vm *vm)
+{
+	/*
+	 * Force to use a relatively small ring size, so easier to get
+	 * full.  Better bigger than PML size, hence 1024.
+	 */
+	test_dirty_ring_count = 1024;
+	DEBUG("Forcing ring size: %u\n", test_dirty_ring_count);
+	vm_enable_dirty_ring(vm, test_dirty_ring_count *
+			     sizeof(struct kvm_dirty_gfn));
+}
+
 static uint32_t dirty_ring_collect_one(struct kvm_dirty_gfn *dirty_gfns,
 				       struct kvm_dirty_ring_indexes *indexes,
 				       int slot, void *bitmap,
@@ -295,6 +312,81 @@ static void dirty_ring_collect_dirty_pages(struct kvm_vm *vm, int slot,
 	DEBUG("Iteration %ld collected %u pages\n", iteration, count);
 }
 
+/*
+ * Return 'D' for uninterruptible, 'R' for running, 'S' for
+ * interruptible, etc.
+ */
+static char read_tid_status_char(unsigned int tid)
+{
+	int fd, ret, line = 0;
+	char buf[128], *c;
+
+	snprintf(buf, sizeof(buf) - 1, "/proc/%u/status", tid);
+	fd = open(buf, O_RDONLY);
+	TEST_ASSERT(fd >= 0, "open status file failed: %s", buf);
+	ret = read(fd, buf, sizeof(buf) - 1);
+	TEST_ASSERT(ret > 0, "read status file failed: %d, %d", ret, errno);
+	close(fd);
+
+	/* Skip 2 lines */
+	for (c = buf; c < buf + sizeof(buf) && line < 2; c++) {
+		if (*c == '\n') {
+			line++;
+			continue;
+		}
+	}
+
+	/* Skip "Status:  " */
+	while (*c != ':') c++;
+	c++;
+	while (*c == ' ') c++;
+	c++;
+
+	return *c;
+}
+
+static void dirty_ring_wq_collect_dirty_pages(struct kvm_vm *vm, int slot,
+					      void *bitmap, uint32_t num_pages)
+{
+	uint32_t count = test_dirty_ring_count;
+	struct kvm_run *state = vcpu_state(vm, VCPU_ID);
+	struct kvm_dirty_ring_indexes *indexes = &state->vcpu_ring_indexes;
+	uint32_t avail;
+
+	while (count--) {
+		/*
+		 * Force vcpu to run enough time to make sure we
+		 * trigger the ring full case
+		 */
+		sem_post(&dirty_ring_vcpu_cont);
+	}
+
+	/* Make sure it's stuck */
+	TEST_ASSERT(vcpu_thread_tid, "TID not inited");
+        /*
+	 * Wait for /proc/pid/status "Status:" changes to "D". "D"
+	 * stands for "D (disk sleep)", TASK_UNINTERRUPTIBLE
+	 */
+	while (read_tid_status_char(vcpu_thread_tid) != 'D') {
+		usleep(1000);
+	}
+	DEBUG("Now VCPU thread dirty ring full\n");
+
+	avail = READ_ONCE(indexes->avail_index);
+	/* Assuming we've consumed all */
+	WRITE_ONCE(indexes->fetch_index, avail);
+
+	kvm_vm_reset_dirty_ring(vm);
+
+	/* Wait for it to be awake */
+	while (read_tid_status_char(vcpu_thread_tid) == 'D') {
+		usleep(1000);
+	}
+	DEBUG("VCPU Thread is successfully waked up\n");
+
+	exit(0);
+}
+
 static void dirty_ring_after_vcpu_run(struct kvm_vm *vm, int ret, int err)
 {
 	struct kvm_run *run = vcpu_state(vm, VCPU_ID);
@@ -353,6 +445,12 @@ struct log_mode {
 		.before_vcpu_join = dirty_ring_before_vcpu_join,
 		.after_vcpu_run = dirty_ring_after_vcpu_run,
 	},
+	{
+		.name = "dirty-ring-wait-queue",
+		.create_vm_done = dirty_ring_wq_create_vm_done,
+		.collect_dirty_pages = dirty_ring_wq_collect_dirty_pages,
+		.after_vcpu_run = dirty_ring_after_vcpu_run,
+	},
 };
 
 /*
@@ -422,6 +520,9 @@ static void *vcpu_worker(void *data)
 	uint64_t *guest_array;
 	struct sigaction sigact;
 
+	vcpu_thread_tid = syscall(SYS_gettid);
+	printf("VCPU Thread ID: %u\n", vcpu_thread_tid);
+
 	current_vm = vm;
 	memset(&sigact, 0, sizeof(sigact));
 	sigact.sa_handler = vcpu_sig_handler;
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 123+ messages in thread

* Re: [PATCH RFC 00/15] KVM: Dirty ring interface
  2019-11-29 21:34 [PATCH RFC 00/15] KVM: Dirty ring interface Peter Xu
                   ` (14 preceding siblings ...)
  2019-11-29 21:35 ` [PATCH RFC 15/15] KVM: selftests: Test dirty ring waitqueue Peter Xu
@ 2019-11-30  8:29 ` Paolo Bonzini
  2019-12-02  2:13   ` Peter Xu
  2019-12-02 20:21   ` Sean Christopherson
  2019-12-04 10:39 ` Jason Wang
  2019-12-11 13:41 ` Christophe de Dinechin
  17 siblings, 2 replies; 123+ messages in thread
From: Paolo Bonzini @ 2019-11-30  8:29 UTC (permalink / raw)
  To: Peter Xu, linux-kernel, kvm
  Cc: Sean Christopherson, Dr . David Alan Gilbert, Vitaly Kuznetsov

Hi Peter,

thanks for the RFC!  Just a couple comments before I look at the series
(for which I don't expect many surprises).

On 29/11/19 22:34, Peter Xu wrote:
> I marked this series as RFC because I'm at least uncertain on this
> change of vcpu_enter_guest():
> 
>         if (kvm_check_request(KVM_REQ_DIRTY_RING_FULL, vcpu)) {
>                 vcpu->run->exit_reason = KVM_EXIT_DIRTY_RING_FULL;
>                 /*
>                         * If this is requested, it means that we've
>                         * marked the dirty bit in the dirty ring BUT
>                         * we've not written the date.  Do it now.
>                         */
>                 r = kvm_emulate_instruction(vcpu, 0);
>                 r = r >= 0 ? 0 : r;
>                 goto out;
>         }

This is not needed, it will just be a false negative (dirty page that
actually isn't dirty).  The dirty bit will be cleared when userspace
resets the ring buffer; then the instruction will be executed again and
mark the page dirty again.  Since ring full is not a common condition,
it's not a big deal.
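
For illustration only, a minimal sketch of the simplified path, reusing
the names from the snippet quoted above (not the final code):

        if (kvm_check_request(KVM_REQ_DIRTY_RING_FULL, vcpu)) {
                /*
                 * Sketch: just exit.  The entry that was pushed to the
                 * ring before the write completed is a harmless
                 * spurious dirty page; after userspace resets the ring
                 * the instruction re-executes and dirties the page for
                 * real.
                 */
                vcpu->run->exit_reason = KVM_EXIT_DIRTY_RING_FULL;
                r = 0;
                goto out;
        }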

> I did a kvm_emulate_instruction() when dirty ring reaches softlimit
> and want to exit to userspace, however I'm not really sure whether
> there could have any side effect.  I'd appreciate any comment of
> above, or anything else.
> 
> Tests
> ===========
> 
> I wanted to continue work on the QEMU part, but after I noticed that
> the interface might still prone to change, I posted this series first.
> However to make sure it's at least working, I've provided unit tests
> together with the series.  The unit tests should be able to test the
> series in at least three major paths:
> 
>   (1) ./dirty_log_test -M dirty-ring
> 
>       This tests async ring operations: this should be the major work
>       mode for the dirty ring interface, say, when the kernel is
>       queuing more data, the userspace is collecting too.  Ring can
>       hardly reaches full when working like this, because in most
>       cases the collection could be fast.
> 
>   (2) ./dirty_log_test -M dirty-ring -c 1024
> 
>       This set the ring size to be very small so that ring soft-full
>       always triggers (soft-full is a soft limit of the ring state,
>       when the dirty ring reaches the soft limit it'll do a userspace
>       exit and let the userspace to collect the data).
> 
>   (3) ./dirty_log_test -M dirty-ring-wait-queue
> 
>       This sololy test the extreme case where ring is full.  When the
>       ring is completely full, the thread (no matter vcpu or not) will
>       be put onto a per-vm waitqueue, and KVM_RESET_DIRTY_RINGS will
>       wake the threads up (assuming until which the ring will not be
>       full any more).

One question about this testcase: why does the task get into
uninterruptible wait?

Paolo

> 
> Thanks,
> 
> Cao, Lei (2):
>   KVM: Add kvm/vcpu argument to mark_dirty_page_in_slot
>   KVM: X86: Implement ring-based dirty memory tracking
> 
> Paolo Bonzini (1):
>   KVM: Move running VCPU from ARM to common code
> 
> Peter Xu (12):
>   KVM: Add build-time error check on kvm_run size
>   KVM: Implement ring-based dirty memory tracking
>   KVM: Make dirty ring exclusive to dirty bitmap log
>   KVM: Introduce dirty ring wait queue
>   KVM: selftests: Always clear dirty bitmap after iteration
>   KVM: selftests: Sync uapi/linux/kvm.h to tools/
>   KVM: selftests: Use a single binary for dirty/clear log test
>   KVM: selftests: Introduce after_vcpu_run hook for dirty log test
>   KVM: selftests: Add dirty ring buffer test
>   KVM: selftests: Let dirty_log_test async for dirty ring test
>   KVM: selftests: Add "-c" parameter to dirty log test
>   KVM: selftests: Test dirty ring waitqueue
> 
>  Documentation/virt/kvm/api.txt                | 116 +++++
>  arch/arm/include/asm/kvm_host.h               |   2 -
>  arch/arm64/include/asm/kvm_host.h             |   2 -
>  arch/x86/include/asm/kvm_host.h               |   5 +
>  arch/x86/include/uapi/asm/kvm.h               |   1 +
>  arch/x86/kvm/Makefile                         |   3 +-
>  arch/x86/kvm/mmu/mmu.c                        |   6 +
>  arch/x86/kvm/vmx/vmx.c                        |   7 +
>  arch/x86/kvm/x86.c                            |  12 +
>  include/linux/kvm_dirty_ring.h                |  67 +++
>  include/linux/kvm_host.h                      |  37 ++
>  include/linux/kvm_types.h                     |   1 +
>  include/uapi/linux/kvm.h                      |  36 ++
>  tools/include/uapi/linux/kvm.h                |  47 ++
>  tools/testing/selftests/kvm/Makefile          |   2 -
>  .../selftests/kvm/clear_dirty_log_test.c      |   2 -
>  tools/testing/selftests/kvm/dirty_log_test.c  | 452 ++++++++++++++++--
>  .../testing/selftests/kvm/include/kvm_util.h  |   6 +
>  tools/testing/selftests/kvm/lib/kvm_util.c    | 103 ++++
>  .../selftests/kvm/lib/kvm_util_internal.h     |   5 +
>  virt/kvm/arm/arm.c                            |  29 --
>  virt/kvm/arm/perf.c                           |   6 +-
>  virt/kvm/arm/vgic/vgic-mmio.c                 |  15 +-
>  virt/kvm/dirty_ring.c                         | 156 ++++++
>  virt/kvm/kvm_main.c                           | 315 +++++++++++-
>  25 files changed, 1329 insertions(+), 104 deletions(-)
>  create mode 100644 include/linux/kvm_dirty_ring.h
>  delete mode 100644 tools/testing/selftests/kvm/clear_dirty_log_test.c
>  create mode 100644 virt/kvm/dirty_ring.c
> 


^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH RFC 00/15] KVM: Dirty ring interface
  2019-11-30  8:29 ` [PATCH RFC 00/15] KVM: Dirty ring interface Paolo Bonzini
@ 2019-12-02  2:13   ` Peter Xu
  2019-12-03 13:59     ` Paolo Bonzini
  2019-12-02 20:21   ` Sean Christopherson
  1 sibling, 1 reply; 123+ messages in thread
From: Peter Xu @ 2019-12-02  2:13 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: linux-kernel, kvm, Sean Christopherson, Dr . David Alan Gilbert,
	Vitaly Kuznetsov

On Sat, Nov 30, 2019 at 09:29:42AM +0100, Paolo Bonzini wrote:
> Hi Peter,
> 
> thanks for the RFC!  Just a couple comments before I look at the series
> (for which I don't expect many surprises).
> 
> On 29/11/19 22:34, Peter Xu wrote:
> > I marked this series as RFC because I'm at least uncertain on this
> > change of vcpu_enter_guest():
> > 
> >         if (kvm_check_request(KVM_REQ_DIRTY_RING_FULL, vcpu)) {
> >                 vcpu->run->exit_reason = KVM_EXIT_DIRTY_RING_FULL;
> >                 /*
> >                         * If this is requested, it means that we've
> >                         * marked the dirty bit in the dirty ring BUT
> >                         * we've not written the date.  Do it now.
> >                         */
> >                 r = kvm_emulate_instruction(vcpu, 0);
> >                 r = r >= 0 ? 0 : r;
> >                 goto out;
> >         }
> 
> This is not needed, it will just be a false negative (dirty page that
> actually isn't dirty).  The dirty bit will be cleared when userspace
> resets the ring buffer; then the instruction will be executed again and
> mark the page dirty again.  Since ring full is not a common condition,
> it's not a big deal.

Actually I added this only because it failed one of the unit tests
when verifying the dirty bits...  But now, on second thought, I
probably agree with you that we can change the userspace side to fix
this instead.

I think the steps of the failed test case can be simplified into
something like this (assuming the QEMU migration context, which might
be easier to follow):
easier to understand):

  1. page P has data P1
  2. vcpu writes to page P, with data P2
  3. vmexit (P is still with data P1)
  4. mark P as dirty, ring full, user exit
  5. collect dirty bit P, migrate P with data P1
  6. vcpu runs again for some reason, P is written with P2, user exit
     again (because the ring has already reached the soft limit)
  7. do KVM_RESET_DIRTY_RINGS
  8. never write to P again

Then P will be P1 always on destination, while it'll be P2 on source.

I think maybe that's why we need to be very sure that when the
userspace exit happens (soft limit reached), we kick all the vcpus
out, and more importantly we must _not_ let them run again before
KVM_RESET_DIRTY_RINGS, otherwise we might face data corruption.  I'm
not sure whether we should mention this in the documentation so that
userspace is aware of the issue.
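
To be concrete, the userspace sequence I have in mind is roughly the
sketch below (against the uapi proposed in this series; the
pause/resume and collect helpers are hypothetical VMM code, and
READ_ONCE()/WRITE_ONCE() are as used in the selftest):

  #include <stdint.h>
  #include <sys/ioctl.h>
  #include <linux/kvm.h>   /* kvm_dirty_gfn, kvm_dirty_ring_indexes,
                              KVM_RESET_DIRTY_RINGS from this series */

  extern void pause_all_vcpus(void);   /* hypothetical VMM helpers */
  extern void resume_all_vcpus(void);
  extern void collect_one_gfn(struct kvm_dirty_gfn *gfn);

  static void harvest_and_reset(int vm_fd,
                                struct kvm_dirty_ring_indexes *ix,
                                struct kvm_dirty_gfn *gfns,
                                uint32_t ring_size)
  {
          uint32_t avail, fetch;

          /* (a) Kick every vcpu out and keep them out. */
          pause_all_vcpus();

          /* (b) Drain everything the kernel has published. */
          avail = READ_ONCE(ix->avail_index);
          fetch = READ_ONCE(ix->fetch_index);
          while (fetch != avail) {
                  collect_one_gfn(&gfns[fetch % ring_size]);
                  fetch++;
          }
          WRITE_ONCE(ix->fetch_index, fetch);

          /* (c) Let KVM re-protect the collected pages... */
          ioctl(vm_fd, KVM_RESET_DIRTY_RINGS, 0);

          /* (d) ...and only then resume the vcpus. */
          resume_all_vcpus();
  }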

On the other hand, I tried to remove the emulate_instruction() above
and fix the test case, but I found that the last address before the
user exit is not really written again after the next vmenter right
after KVM_RESET_DIRTY_RINGS, so the dirty bit was truly lost...  I'm
pasting some traces below (I added some tracepoints too; I think I'll
just keep them for v2):

  ...
  dirty_log_test-29003 [001] 184503.384328: kvm_entry:            vcpu 1
  dirty_log_test-29003 [001] 184503.384329: kvm_exit:             reason EPT_VIOLATION rip 0x40359f info 582 0
  dirty_log_test-29003 [001] 184503.384329: kvm_page_fault:       address 7fc036d000 error_code 582
  dirty_log_test-29003 [001] 184503.384331: kvm_entry:            vcpu 1
  dirty_log_test-29003 [001] 184503.384332: kvm_exit: reason EPT_VIOLATION rip 0x40359f info 582 0
  dirty_log_test-29003 [001] 184503.384332: kvm_page_fault:       address 7fc036d000 error_code 582
  dirty_log_test-29003 [001] 184503.384332: kvm_dirty_ring_push:  ring 1: dirty 0x37f reset 0x1c0 slot 1 offset 0x37e ret 0 (used 447)
  dirty_log_test-29003 [001] 184503.384333: kvm_entry:            vcpu 1
  dirty_log_test-29003 [001] 184503.384334: kvm_exit:             reason EPT_VIOLATION rip 0x40359f info 582 0
  dirty_log_test-29003 [001] 184503.384334: kvm_page_fault:       address 7fc036e000 error_code 582
  dirty_log_test-29003 [001] 184503.384336: kvm_entry:            vcpu 1
  dirty_log_test-29003 [001] 184503.384336: kvm_exit:             reason EPT_VIOLATION rip 0x40359f info 582 0
  dirty_log_test-29003 [001] 184503.384336: kvm_page_fault:       address 7fc036e000 error_code 582
  dirty_log_test-29003 [001] 184503.384337: kvm_dirty_ring_push:  ring 1: dirty 0x380 reset 0x1c0 slot 1 offset 0x37f ret 1 (used 448)
  dirty_log_test-29003 [001] 184503.384337: kvm_dirty_ring_exit:  vcpu 1
  dirty_log_test-29003 [001] 184503.384338: kvm_fpu:              unload
  dirty_log_test-29003 [001] 184503.384340: kvm_userspace_exit:   reason 0x1d (29)
  dirty_log_test-29000 [006] 184503.505103: kvm_dirty_ring_reset: ring 1: dirty 0x380 reset 0x380 (used 0)
  dirty_log_test-29003 [001] 184503.505184: kvm_fpu:              load
  dirty_log_test-29003 [001] 184503.505187: kvm_entry:            vcpu 1
  dirty_log_test-29003 [001] 184503.505193: kvm_exit:             reason EPT_VIOLATION rip 0x40359f info 582 0
  dirty_log_test-29003 [001] 184503.505194: kvm_page_fault:       address 7fc036f000 error_code 582              <-------- [1]
  dirty_log_test-29003 [001] 184503.505206: kvm_entry:            vcpu 1
  dirty_log_test-29003 [001] 184503.505207: kvm_exit:             reason EPT_VIOLATION rip 0x40359f info 582 0
  dirty_log_test-29003 [001] 184503.505207: kvm_page_fault:       address 7fc036f000 error_code 582
  dirty_log_test-29003 [001] 184503.505226: kvm_dirty_ring_push:  ring 1: dirty 0x381 reset 0x380 slot 1 offset 0x380 ret 0 (used 1)
  dirty_log_test-29003 [001] 184503.505226: kvm_entry:            vcpu 1
  dirty_log_test-29003 [001] 184503.505227: kvm_exit:             reason EPT_VIOLATION rip 0x40359f info 582 0
  dirty_log_test-29003 [001] 184503.505228: kvm_page_fault:       address 7fc0370000 error_code 582
  dirty_log_test-29003 [001] 184503.505231: kvm_entry:            vcpu 1
  ...

The test was continuously writing to pages, in the above log starting
from 7fc036d000.  The reason 0x1d (29) is the new dirty-ring-full exit
reason.

So far I'm still unsure of two things:

  1. Why we faulted twice for each page rather than once.  Take the
     page at 7fc036e000 above as an example: the first fault didn't
     trigger the mark-dirty path, and only on the 2nd EPT violation
     did we trigger kvm_dirty_ring_push.

  2. Why the last page was not written again after
     kvm_userspace_exit (the last page was 7fc036e000, and the test
     failed because 7fc036e000 changed while its dirty bit stayed
     unset).  In this case the first write after
     KVM_RESET_DIRTY_RINGS is the line marked by [1]; I thought it
     should be a rewrite of page 7fc036e000, because when the user
     exit happens the write should logically not have happened yet
     and the instruction pointer should be kept.  However at [1] it's
     already writing to a new page.

I'll continue to dig tomorrow; quick answers would be greatly
welcome too. :)

> 
> > I did a kvm_emulate_instruction() when dirty ring reaches softlimit
> > and want to exit to userspace, however I'm not really sure whether
> > there could have any side effect.  I'd appreciate any comment of
> > above, or anything else.
> > 
> > Tests
> > ===========
> > 
> > I wanted to continue work on the QEMU part, but after I noticed that
> > the interface might still prone to change, I posted this series first.
> > However to make sure it's at least working, I've provided unit tests
> > together with the series.  The unit tests should be able to test the
> > series in at least three major paths:
> > 
> >   (1) ./dirty_log_test -M dirty-ring
> > 
> >       This tests async ring operations: this should be the major work
> >       mode for the dirty ring interface, say, when the kernel is
> >       queuing more data, the userspace is collecting too.  Ring can
> >       hardly reaches full when working like this, because in most
> >       cases the collection could be fast.
> > 
> >   (2) ./dirty_log_test -M dirty-ring -c 1024
> > 
> >       This set the ring size to be very small so that ring soft-full
> >       always triggers (soft-full is a soft limit of the ring state,
> >       when the dirty ring reaches the soft limit it'll do a userspace
> >       exit and let the userspace to collect the data).
> > 
> >   (3) ./dirty_log_test -M dirty-ring-wait-queue
> > 
> >       This sololy test the extreme case where ring is full.  When the
> >       ring is completely full, the thread (no matter vcpu or not) will
> >       be put onto a per-vm waitqueue, and KVM_RESET_DIRTY_RINGS will
> >       wake the threads up (assuming until which the ring will not be
> >       full any more).
> 
> One question about this testcase: why does the task get into
> uninterruptible wait?

Because I'm using wait_event_killable() to wait when the ring is
completely full.  I thought we should be strict there because it's
after all rare (even rarer than reaching the soft limit), and with
that we will never have a chance to lose a dirty bit accidentally.
Or do you think we should still respond to non-fatal signals for some
reason even during that wait period?
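
For reference, the shape of that wait is roughly the below (a
simplified sketch; the waitqueue field and dirty_ring_full() are
placeholders, not the exact names in the patch):

  /*
   * wait_event_killable() puts the task into TASK_KILLABLE, which is
   * what shows up as "D" in /proc/<tid>/status and what the selftest
   * polls for.  Only fatal signals interrupt the wait.
   */
  static int dirty_ring_wait_if_full(struct kvm *kvm,
                                     struct kvm_dirty_ring *ring)
  {
          return wait_event_killable(kvm->dirty_ring_waitq,
                                     !dirty_ring_full(ring));
  }

  /* The KVM_RESET_DIRTY_RINGS path wakes the waiters back up once
   * the rings have space again. */
  static void dirty_ring_reset_wake(struct kvm *kvm)
  {
          wake_up_all(&kvm->dirty_ring_waitq);
  }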

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH RFC 03/15] KVM: Add build-time error check on kvm_run size
  2019-11-29 21:34 ` [PATCH RFC 03/15] KVM: Add build-time error check on kvm_run size Peter Xu
@ 2019-12-02 19:30   ` Sean Christopherson
  2019-12-02 20:53     ` Peter Xu
  0 siblings, 1 reply; 123+ messages in thread
From: Sean Christopherson @ 2019-12-02 19:30 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-kernel, kvm, Paolo Bonzini, Dr . David Alan Gilbert,
	Vitaly Kuznetsov

On Fri, Nov 29, 2019 at 04:34:53PM -0500, Peter Xu wrote:
> It's already going to reach 2400 Bytes (which is over half of page
> size on 4K page archs), so maybe it's good to have this build-time
> check in case it overflows when adding new fields.

Please explain why exceeding PAGE_SIZE is a bad thing.  I realize it's
almost absurdly obvious when looking at the code, but a) the patch itself
does not provide that context and b) the changelog should hold up on its
own, e.g. in a mostly hypothetical case where the allocation of vcpu->run
were changed to something else.

> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  virt/kvm/kvm_main.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 8f8940cc4b84..681452d288cd 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -352,6 +352,8 @@ int kvm_vcpu_init(struct kvm_vcpu *vcpu, struct kvm *kvm, unsigned id)
>  	}
>  	vcpu->run = page_address(page);
>  
> +	BUILD_BUG_ON(sizeof(struct kvm_run) > PAGE_SIZE);
> +
>  	kvm_vcpu_set_in_spin_loop(vcpu, false);
>  	kvm_vcpu_set_dy_eligible(vcpu, false);
>  	vcpu->preempted = false;
> -- 
> 2.21.0
> 

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH RFC 02/15] KVM: Add kvm/vcpu argument to mark_dirty_page_in_slot
  2019-11-29 21:34 ` [PATCH RFC 02/15] KVM: Add kvm/vcpu argument to mark_dirty_page_in_slot Peter Xu
@ 2019-12-02 19:32   ` Sean Christopherson
  2019-12-02 20:49     ` Peter Xu
  0 siblings, 1 reply; 123+ messages in thread
From: Sean Christopherson @ 2019-12-02 19:32 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-kernel, kvm, Paolo Bonzini, Dr . David Alan Gilbert,
	Vitaly Kuznetsov

On Fri, Nov 29, 2019 at 04:34:52PM -0500, Peter Xu wrote:

Why?

> From: "Cao, Lei" <Lei.Cao@stratus.com>
> 
> Signed-off-by: Cao, Lei <Lei.Cao@stratus.com>
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  virt/kvm/kvm_main.c | 26 +++++++++++++++++---------
>  1 file changed, 17 insertions(+), 9 deletions(-)
> 
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index fac0760c870e..8f8940cc4b84 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -145,7 +145,10 @@ static void hardware_disable_all(void);
>  
>  static void kvm_io_bus_destroy(struct kvm_io_bus *bus);
>  
> -static void mark_page_dirty_in_slot(struct kvm_memory_slot *memslot, gfn_t gfn);
> +static void mark_page_dirty_in_slot(struct kvm *kvm,
> +				    struct kvm_vcpu *vcpu,
> +				    struct kvm_memory_slot *memslot,
> +				    gfn_t gfn);

Why both?  Passing @vcpu gets you @kvm.

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH RFC 04/15] KVM: Implement ring-based dirty memory tracking
  2019-11-29 21:34 ` [PATCH RFC 04/15] KVM: Implement ring-based dirty memory tracking Peter Xu
@ 2019-12-02 20:10   ` Sean Christopherson
  2019-12-02 21:16     ` Peter Xu
  2019-12-03 19:13   ` Sean Christopherson
                     ` (3 subsequent siblings)
  4 siblings, 1 reply; 123+ messages in thread
From: Sean Christopherson @ 2019-12-02 20:10 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-kernel, kvm, Paolo Bonzini, Dr . David Alan Gilbert,
	Vitaly Kuznetsov

On Fri, Nov 29, 2019 at 04:34:54PM -0500, Peter Xu wrote:
> This patch is heavily based on previous work from Lei Cao
> <lei.cao@stratus.com> and Paolo Bonzini <pbonzini@redhat.com>. [1]
> 
> KVM currently uses large bitmaps to track dirty memory.  These bitmaps
> are copied to userspace when userspace queries KVM for its dirty page
> information.  The use of bitmaps is mostly sufficient for live
> migration, as large parts of memory are be dirtied from one log-dirty
> pass to another.  However, in a checkpointing system, the number of
> dirty pages is small and in fact it is often bounded---the VM is
> paused when it has dirtied a pre-defined number of pages. Traversing a
> large, sparsely populated bitmap to find set bits is time-consuming,
> as is copying the bitmap to user-space.
> 
> A similar issue will be there for live migration when the guest memory
> is huge while the page dirty procedure is trivial.  In that case for
> each dirty sync we need to pull the whole dirty bitmap to userspace
> and analyse every bit even if it's mostly zeros.
> 
> The preferred data structure for above scenarios is a dense list of
> guest frame numbers (GFN).  This patch series stores the dirty list in
> kernel memory that can be memory mapped into userspace to allow speedy
> harvesting.
> 
> We defined two new data structures:
> 
>   struct kvm_dirty_ring;
>   struct kvm_dirty_ring_indexes;
> 
> Firstly, kvm_dirty_ring is defined to represent a ring of dirty
> pages.  When dirty tracking is enabled, we can push dirty gfn onto the
> ring.
> 
> Secondly, kvm_dirty_ring_indexes is defined to represent the
> user/kernel interface of each ring.  Currently it contains two
> indexes: (1) avail_index represents where we should push our next
> PFN (written by kernel), while (2) fetch_index represents where the
> userspace should fetch the next dirty PFN (written by userspace).
> 
> One complete ring is composed by one kvm_dirty_ring plus its
> corresponding kvm_dirty_ring_indexes.
> 
> Currently, we have N+1 rings for each VM of N vcpus:
> 
>   - for each vcpu, we have 1 per-vcpu dirty ring,
>   - for each vm, we have 1 per-vm dirty ring

Why?  I assume the purpose of per-vcpu rings is to avoid contention between
threads, but the motivation needs to be explicitly stated.  And why is a
per-vm fallback ring needed?

If my assumption is correct, have other approaches been tried/profiled?
E.g. using cmpxchg to reserve N number of entries in a shared ring.  IMO,
adding kvm_get_running_vcpu() is a hack that is just asking for future
abuse and the vcpu/vm/as_id interactions in mark_page_dirty_in_ring()
look extremely fragile.  I also dislike having two different mechanisms
for accessing the ring (lock for per-vm, something else for per-vcpu).

> Please refer to the documentation update in this patch for more
> details.
> 
> Note that this patch implements the core logic of dirty ring buffer.
> It's still disabled for all archs for now.  Also, we'll address some
> of the other issues in follow up patches before it's firstly enabled
> on x86.
> 
> [1] https://patchwork.kernel.org/patch/10471409/
> 
> Signed-off-by: Lei Cao <lei.cao@stratus.com>
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---

...

> diff --git a/virt/kvm/dirty_ring.c b/virt/kvm/dirty_ring.c
> new file mode 100644
> index 000000000000..9264891f3c32
> --- /dev/null
> +++ b/virt/kvm/dirty_ring.c
> @@ -0,0 +1,156 @@
> +#include <linux/kvm_host.h>
> +#include <linux/kvm.h>
> +#include <linux/vmalloc.h>
> +#include <linux/kvm_dirty_ring.h>
> +
> +u32 kvm_dirty_ring_get_rsvd_entries(void)
> +{
> +	return KVM_DIRTY_RING_RSVD_ENTRIES + kvm_cpu_dirty_log_size();
> +}
> +
> +int kvm_dirty_ring_alloc(struct kvm *kvm, struct kvm_dirty_ring *ring)
> +{
> +	u32 size = kvm->dirty_ring_size;

Just pass in @size, that way you don't need @kvm.  And the callers will be
less ugly, e.g. the initial allocation won't need to speculatively set
kvm->dirty_ring_size.

> +
> +	ring->dirty_gfns = vmalloc(size);
> +	if (!ring->dirty_gfns)
> +		return -ENOMEM;
> +	memset(ring->dirty_gfns, 0, size);
> +
> +	ring->size = size / sizeof(struct kvm_dirty_gfn);
> +	ring->soft_limit =
> +	    (kvm->dirty_ring_size / sizeof(struct kvm_dirty_gfn)) -

And passing @size avoids issues like this where a local var is ignored.

> +	    kvm_dirty_ring_get_rsvd_entries();
> +	ring->dirty_index = 0;
> +	ring->reset_index = 0;
> +	spin_lock_init(&ring->lock);
> +
> +	return 0;
> +}
> +

...

> +void kvm_dirty_ring_free(struct kvm_dirty_ring *ring)
> +{
> +	if (ring->dirty_gfns) {

Why condition freeing the dirty ring on kvm->dirty_ring_size?  This
obviously protects itself.  Not to mention vfree() also plays nice with a
NULL input.

> +		vfree(ring->dirty_gfns);
> +		ring->dirty_gfns = NULL;
> +	}
> +}
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 681452d288cd..8642c977629b 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -64,6 +64,8 @@
>  #define CREATE_TRACE_POINTS
>  #include <trace/events/kvm.h>
>  
> +#include <linux/kvm_dirty_ring.h>
> +
>  /* Worst case buffer size needed for holding an integer. */
>  #define ITOA_MAX_LEN 12
>  
> @@ -149,6 +151,10 @@ static void mark_page_dirty_in_slot(struct kvm *kvm,
>  				    struct kvm_vcpu *vcpu,
>  				    struct kvm_memory_slot *memslot,
>  				    gfn_t gfn);
> +static void mark_page_dirty_in_ring(struct kvm *kvm,
> +				    struct kvm_vcpu *vcpu,
> +				    struct kvm_memory_slot *slot,
> +				    gfn_t gfn);
>  
>  __visible bool kvm_rebooting;
>  EXPORT_SYMBOL_GPL(kvm_rebooting);
> @@ -359,11 +365,22 @@ int kvm_vcpu_init(struct kvm_vcpu *vcpu, struct kvm *kvm, unsigned id)
>  	vcpu->preempted = false;
>  	vcpu->ready = false;
>  
> +	if (kvm->dirty_ring_size) {
> +		r = kvm_dirty_ring_alloc(vcpu->kvm, &vcpu->dirty_ring);
> +		if (r) {
> +			kvm->dirty_ring_size = 0;
> +			goto fail_free_run;

This looks wrong, kvm->dirty_ring_size is used to free allocations, i.e.
previous allocations will leak if a vcpu allocation fails.

> +		}
> +	}
> +
>  	r = kvm_arch_vcpu_init(vcpu);
>  	if (r < 0)
> -		goto fail_free_run;
> +		goto fail_free_ring;
>  	return 0;
>  
> +fail_free_ring:
> +	if (kvm->dirty_ring_size)
> +		kvm_dirty_ring_free(&vcpu->dirty_ring);
>  fail_free_run:
>  	free_page((unsigned long)vcpu->run);
>  fail:
> @@ -381,6 +398,8 @@ void kvm_vcpu_uninit(struct kvm_vcpu *vcpu)
>  	put_pid(rcu_dereference_protected(vcpu->pid, 1));
>  	kvm_arch_vcpu_uninit(vcpu);
>  	free_page((unsigned long)vcpu->run);
> +	if (vcpu->kvm->dirty_ring_size)
> +		kvm_dirty_ring_free(&vcpu->dirty_ring);
>  }
>  EXPORT_SYMBOL_GPL(kvm_vcpu_uninit);
>  
> @@ -690,6 +709,7 @@ static struct kvm *kvm_create_vm(unsigned long type)
>  	struct kvm *kvm = kvm_arch_alloc_vm();
>  	int r = -ENOMEM;
>  	int i;
> +	struct page *page;
>  
>  	if (!kvm)
>  		return ERR_PTR(-ENOMEM);
> @@ -705,6 +725,14 @@ static struct kvm *kvm_create_vm(unsigned long type)
>  
>  	BUILD_BUG_ON(KVM_MEM_SLOTS_NUM > SHRT_MAX);
>  
> +	page = alloc_page(GFP_KERNEL | __GFP_ZERO);
> +	if (!page) {
> +		r = -ENOMEM;
> +		goto out_err_alloc_page;
> +	}
> +	kvm->vm_run = page_address(page);
> +	BUILD_BUG_ON(sizeof(struct kvm_vm_run) > PAGE_SIZE);
> +
>  	if (init_srcu_struct(&kvm->srcu))
>  		goto out_err_no_srcu;
>  	if (init_srcu_struct(&kvm->irq_srcu))
> @@ -775,6 +803,9 @@ static struct kvm *kvm_create_vm(unsigned long type)
>  out_err_no_irq_srcu:
>  	cleanup_srcu_struct(&kvm->srcu);
>  out_err_no_srcu:
> +	free_page((unsigned long)page);
> +	kvm->vm_run = NULL;

No need to nullify vm_run.

> +out_err_alloc_page:
>  	kvm_arch_free_vm(kvm);
>  	mmdrop(current->mm);
>  	return ERR_PTR(r);
> @@ -800,6 +831,15 @@ static void kvm_destroy_vm(struct kvm *kvm)
>  	int i;
>  	struct mm_struct *mm = kvm->mm;
>  
> +	if (kvm->dirty_ring_size) {
> +		kvm_dirty_ring_free(&kvm->vm_dirty_ring);
> +	}

Unnecessary parantheses.

> +
> +	if (kvm->vm_run) {
> +		free_page((unsigned long)kvm->vm_run);
> +		kvm->vm_run = NULL;
> +	}
> +
>  	kvm_uevent_notify_change(KVM_EVENT_DESTROY_VM, kvm);
>  	kvm_destroy_vm_debugfs(kvm);
>  	kvm_arch_sync_events(kvm);
> @@ -2301,7 +2341,7 @@ static void mark_page_dirty_in_slot(struct kvm *kvm,
>  {
>  	if (memslot && memslot->dirty_bitmap) {
>  		unsigned long rel_gfn = gfn - memslot->base_gfn;
> -
> +		mark_page_dirty_in_ring(kvm, vcpu, memslot, gfn);
>  		set_bit_le(rel_gfn, memslot->dirty_bitmap);
>  	}
>  }
> @@ -2649,6 +2689,13 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me, bool yield_to_kernel_mode)
>  }
>  EXPORT_SYMBOL_GPL(kvm_vcpu_on_spin);

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH RFC 00/15] KVM: Dirty ring interface
  2019-11-30  8:29 ` [PATCH RFC 00/15] KVM: Dirty ring interface Paolo Bonzini
  2019-12-02  2:13   ` Peter Xu
@ 2019-12-02 20:21   ` Sean Christopherson
  2019-12-02 20:43     ` Peter Xu
  1 sibling, 1 reply; 123+ messages in thread
From: Sean Christopherson @ 2019-12-02 20:21 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Peter Xu, linux-kernel, kvm, Dr . David Alan Gilbert, Vitaly Kuznetsov

On Sat, Nov 30, 2019 at 09:29:42AM +0100, Paolo Bonzini wrote:
> Hi Peter,
> 
> thanks for the RFC!  Just a couple comments before I look at the series
> (for which I don't expect many surprises).
> 
> On 29/11/19 22:34, Peter Xu wrote:
> > I marked this series as RFC because I'm at least uncertain on this
> > change of vcpu_enter_guest():
> > 
> >         if (kvm_check_request(KVM_REQ_DIRTY_RING_FULL, vcpu)) {
> >                 vcpu->run->exit_reason = KVM_EXIT_DIRTY_RING_FULL;
> >                 /*
> >                         * If this is requested, it means that we've
> >                         * marked the dirty bit in the dirty ring BUT
> >                         * we've not written the date.  Do it now.
> >                         */
> >                 r = kvm_emulate_instruction(vcpu, 0);
> >                 r = r >= 0 ? 0 : r;
> >                 goto out;
> >         }
> 
> This is not needed, it will just be a false negative (dirty page that
> actually isn't dirty).  The dirty bit will be cleared when userspace
> resets the ring buffer; then the instruction will be executed again and
> mark the page dirty again.  Since ring full is not a common condition,
> it's not a big deal.

Side topic, KVM_REQ_DIRTY_RING_FULL is misnamed, it's set when a ring goes
above its soft limit, not when the ring is actually full.  It took quite a
bit of digging to figure out whether or not PML was broken...

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH RFC 00/15] KVM: Dirty ring interface
  2019-12-02 20:21   ` Sean Christopherson
@ 2019-12-02 20:43     ` Peter Xu
  0 siblings, 0 replies; 123+ messages in thread
From: Peter Xu @ 2019-12-02 20:43 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, linux-kernel, kvm, Dr . David Alan Gilbert,
	Vitaly Kuznetsov

On Mon, Dec 02, 2019 at 12:21:19PM -0800, Sean Christopherson wrote:
> On Sat, Nov 30, 2019 at 09:29:42AM +0100, Paolo Bonzini wrote:
> > Hi Peter,
> > 
> > thanks for the RFC!  Just a couple comments before I look at the series
> > (for which I don't expect many surprises).
> > 
> > On 29/11/19 22:34, Peter Xu wrote:
> > > I marked this series as RFC because I'm at least uncertain on this
> > > change of vcpu_enter_guest():
> > > 
> > >         if (kvm_check_request(KVM_REQ_DIRTY_RING_FULL, vcpu)) {
> > >                 vcpu->run->exit_reason = KVM_EXIT_DIRTY_RING_FULL;
> > >                 /*
> > >                         * If this is requested, it means that we've
> > >                         * marked the dirty bit in the dirty ring BUT
> > >                         * we've not written the date.  Do it now.
> > >                         */
> > >                 r = kvm_emulate_instruction(vcpu, 0);
> > >                 r = r >= 0 ? 0 : r;
> > >                 goto out;
> > >         }
> > 
> > This is not needed, it will just be a false negative (dirty page that
> > actually isn't dirty).  The dirty bit will be cleared when userspace
> > resets the ring buffer; then the instruction will be executed again and
> > mark the page dirty again.  Since ring full is not a common condition,
> > it's not a big deal.
> 
> Side topic, KVM_REQ_DIRTY_RING_FULL is misnamed, it's set when a ring goes
> above its soft limit, not when the ring is actually full.  It took quite a
> bit of digging to figure out whether or not PML was broken...

Yeah it's indeed a bit confusing.

Do you like KVM_REQ_DIRTY_RING_COLLECT?  Pair with
KVM_EXIT_DIRTY_RING_COLLECT.  Or, suggestions?

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH RFC 02/15] KVM: Add kvm/vcpu argument to mark_dirty_page_in_slot
  2019-12-02 19:32   ` Sean Christopherson
@ 2019-12-02 20:49     ` Peter Xu
  0 siblings, 0 replies; 123+ messages in thread
From: Peter Xu @ 2019-12-02 20:49 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: linux-kernel, kvm, Paolo Bonzini, Dr . David Alan Gilbert,
	Vitaly Kuznetsov

On Mon, Dec 02, 2019 at 11:32:22AM -0800, Sean Christopherson wrote:
> On Fri, Nov 29, 2019 at 04:34:52PM -0500, Peter Xu wrote:
> 
> Why?

[1]

> 
> > From: "Cao, Lei" <Lei.Cao@stratus.com>
> > 
> > Signed-off-by: Cao, Lei <Lei.Cao@stratus.com>
> > Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> >  virt/kvm/kvm_main.c | 26 +++++++++++++++++---------
> >  1 file changed, 17 insertions(+), 9 deletions(-)
> > 
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index fac0760c870e..8f8940cc4b84 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -145,7 +145,10 @@ static void hardware_disable_all(void);
> >  
> >  static void kvm_io_bus_destroy(struct kvm_io_bus *bus);
> >  
> > -static void mark_page_dirty_in_slot(struct kvm_memory_slot *memslot, gfn_t gfn);
> > +static void mark_page_dirty_in_slot(struct kvm *kvm,
> > +				    struct kvm_vcpu *vcpu,
> > +				    struct kvm_memory_slot *memslot,
> > +				    gfn_t gfn);
> 
> Why both?  Passing @vcpu gets you @kvm.

You are right that I should fill in something at [1]...

Because @vcpu can be NULL (if you continue reading this patch, you'll
see that NULL is sometimes passed in), and we at least need a context
to mark the dirty ring.  That's also why we need a per-vm dirty ring
as the fallback for the cases where we don't have a vcpu context.
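
For context, the ring selection then roughly looks like the sketch
below (simplified: as_id handling and the ring-full request are
omitted, and the push helper plus the per-vm indexes field are
approximations of what the series does):

  static void mark_page_dirty_in_ring(struct kvm *kvm,
                                      struct kvm_vcpu *vcpu,
                                      struct kvm_memory_slot *slot,
                                      gfn_t gfn)
  {
          struct kvm_dirty_ring *ring;
          struct kvm_dirty_ring_indexes *indexes;

          if (!kvm->dirty_ring_size)
                  return;

          if (vcpu) {
                  /* Normal case: vcpu context, use the per-vcpu ring. */
                  ring = &vcpu->dirty_ring;
                  indexes = &vcpu->run->vcpu_ring_indexes;
          } else {
                  /* No vcpu context: fall back to the per-vm ring. */
                  ring = &kvm->vm_dirty_ring;
                  indexes = &kvm->vm_run->vm_ring_indexes;
          }

          kvm_dirty_ring_push(ring, indexes, slot->id, gfn - slot->base_gfn);
  }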

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH RFC 03/15] KVM: Add build-time error check on kvm_run size
  2019-12-02 19:30   ` Sean Christopherson
@ 2019-12-02 20:53     ` Peter Xu
  2019-12-02 22:19       ` Sean Christopherson
  0 siblings, 1 reply; 123+ messages in thread
From: Peter Xu @ 2019-12-02 20:53 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: linux-kernel, kvm, Paolo Bonzini, Dr . David Alan Gilbert,
	Vitaly Kuznetsov

On Mon, Dec 02, 2019 at 11:30:27AM -0800, Sean Christopherson wrote:
> On Fri, Nov 29, 2019 at 04:34:53PM -0500, Peter Xu wrote:
> > It's already going to reach 2400 Bytes (which is over half of page
> > size on 4K page archs), so maybe it's good to have this build-time
> > check in case it overflows when adding new fields.
> 
> Please explain why exceeding PAGE_SIZE is a bad thing.  I realize it's
> almost absurdly obvious when looking at the code, but a) the patch itself
> does not provide that context and b) the changelog should hold up on its
> own,

Right, I'll enhance the commit message.

> e.g. in a mostly hypothetical case where the allocation of vcpu->run
> were changed to something else.

And that's why I added BUILD_BUG_ON right beneath that allocation. :)

It's just a helper for developers when adding new kvm_run fields, not
a risk for anyone who wants to start allocating more pages for it.

Thanks,

> 
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> >  virt/kvm/kvm_main.c | 2 ++
> >  1 file changed, 2 insertions(+)
> > 
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index 8f8940cc4b84..681452d288cd 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -352,6 +352,8 @@ int kvm_vcpu_init(struct kvm_vcpu *vcpu, struct kvm *kvm, unsigned id)
> >  	}
> >  	vcpu->run = page_address(page);
> >  
> > +	BUILD_BUG_ON(sizeof(struct kvm_run) > PAGE_SIZE);
> > +
> >  	kvm_vcpu_set_in_spin_loop(vcpu, false);
> >  	kvm_vcpu_set_dy_eligible(vcpu, false);
> >  	vcpu->preempted = false;
> > -- 
> > 2.21.0
> > 
> 

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH RFC 04/15] KVM: Implement ring-based dirty memory tracking
  2019-12-02 20:10   ` Sean Christopherson
@ 2019-12-02 21:16     ` Peter Xu
  2019-12-02 21:50       ` Sean Christopherson
  0 siblings, 1 reply; 123+ messages in thread
From: Peter Xu @ 2019-12-02 21:16 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: linux-kernel, kvm, Paolo Bonzini, Dr . David Alan Gilbert,
	Vitaly Kuznetsov

On Mon, Dec 02, 2019 at 12:10:36PM -0800, Sean Christopherson wrote:
> On Fri, Nov 29, 2019 at 04:34:54PM -0500, Peter Xu wrote:
> > This patch is heavily based on previous work from Lei Cao
> > <lei.cao@stratus.com> and Paolo Bonzini <pbonzini@redhat.com>. [1]
> > 
> > KVM currently uses large bitmaps to track dirty memory.  These bitmaps
> > are copied to userspace when userspace queries KVM for its dirty page
> > information.  The use of bitmaps is mostly sufficient for live
> > migration, as large parts of memory are be dirtied from one log-dirty
> > pass to another.  However, in a checkpointing system, the number of
> > dirty pages is small and in fact it is often bounded---the VM is
> > paused when it has dirtied a pre-defined number of pages. Traversing a
> > large, sparsely populated bitmap to find set bits is time-consuming,
> > as is copying the bitmap to user-space.
> > 
> > A similar issue will be there for live migration when the guest memory
> > is huge while the page dirty procedure is trivial.  In that case for
> > each dirty sync we need to pull the whole dirty bitmap to userspace
> > and analyse every bit even if it's mostly zeros.
> > 
> > The preferred data structure for above scenarios is a dense list of
> > guest frame numbers (GFN).  This patch series stores the dirty list in
> > kernel memory that can be memory mapped into userspace to allow speedy
> > harvesting.
> > 
> > We defined two new data structures:
> > 
> >   struct kvm_dirty_ring;
> >   struct kvm_dirty_ring_indexes;
> > 
> > Firstly, kvm_dirty_ring is defined to represent a ring of dirty
> > pages.  When dirty tracking is enabled, we can push dirty gfn onto the
> > ring.
> > 
> > Secondly, kvm_dirty_ring_indexes is defined to represent the
> > user/kernel interface of each ring.  Currently it contains two
> > indexes: (1) avail_index represents where we should push our next
> > PFN (written by kernel), while (2) fetch_index represents where the
> > userspace should fetch the next dirty PFN (written by userspace).
> > 
> > One complete ring is composed by one kvm_dirty_ring plus its
> > corresponding kvm_dirty_ring_indexes.
> > 
> > Currently, we have N+1 rings for each VM of N vcpus:
> > 
> >   - for each vcpu, we have 1 per-vcpu dirty ring,
> >   - for each vm, we have 1 per-vm dirty ring
> 
> Why?  I assume the purpose of per-vcpu rings is to avoid contention between
> threads, but the motiviation needs to be explicitly stated.  And why is a
> per-vm fallback ring needed?

Yes, as explained in the previous reply, the problem is that there
can be guest memory writes without a vcpu context.

> 
> If my assumption is correct, have other approaches been tried/profiled?
> E.g. using cmpxchg to reserve N number of entries in a shared ring.

Not yet, but I'd be fine trying anything if there are better
alternatives.  Besides, could you help explain why sharing one ring
and letting each vcpu reserve a region in it could be helpful from a
performance point of view?

> IMO,
> adding kvm_get_running_vcpu() is a hack that is just asking for future
> abuse and the vcpu/vm/as_id interactions in mark_page_dirty_in_ring()
> look extremely fragile.

I agree.  Another way is to put the heavier traffic onto the per-vm
ring, but the downside could be that the per-vm ring would get full
more easily (though I haven't tested that).

> I also dislike having two different mechanisms
> for accessing the ring (lock for per-vm, something else for per-vcpu).

Actually I proposed dropping the per-vm ring (I had a version that
implemented this... and I just changed it back to the per-vm ring
later on, see below), and for the case where there's no vcpu context
I thought about:

  (1) using the vcpu0 ring, or

  (2) a better algorithm to pick a per-vcpu ring (say, the least full
      ring; we can do many things here, e.g., we can easily maintain
      a structure to track this so we get O(1) search, I think)

I discussed this with Paolo, but I think Paolo preferred the per-vm
ring because there's no good reason to choose vcpu0 as (1) suggests.
While if we choose (2) we'd probably need locking even for the
per-vcpu rings, so it could be a bit slower.

Since this is still an RFC, I think we still have a chance to change
this, depending on how the discussion goes.

> 
> > Please refer to the documentation update in this patch for more
> > details.
> > 
> > Note that this patch implements the core logic of dirty ring buffer.
> > It's still disabled for all archs for now.  Also, we'll address some
> > of the other issues in follow up patches before it's firstly enabled
> > on x86.
> > 
> > [1] https://patchwork.kernel.org/patch/10471409/
> > 
> > Signed-off-by: Lei Cao <lei.cao@stratus.com>
> > Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> 
> ...
> 
> > diff --git a/virt/kvm/dirty_ring.c b/virt/kvm/dirty_ring.c
> > new file mode 100644
> > index 000000000000..9264891f3c32
> > --- /dev/null
> > +++ b/virt/kvm/dirty_ring.c
> > @@ -0,0 +1,156 @@
> > +#include <linux/kvm_host.h>
> > +#include <linux/kvm.h>
> > +#include <linux/vmalloc.h>
> > +#include <linux/kvm_dirty_ring.h>
> > +
> > +u32 kvm_dirty_ring_get_rsvd_entries(void)
> > +{
> > +	return KVM_DIRTY_RING_RSVD_ENTRIES + kvm_cpu_dirty_log_size();
> > +}
> > +
> > +int kvm_dirty_ring_alloc(struct kvm *kvm, struct kvm_dirty_ring *ring)
> > +{
> > +	u32 size = kvm->dirty_ring_size;
> 
> Just pass in @size, that way you don't need @kvm.  And the callers will be
> less ugly, e.g. the initial allocation won't need to speculatively set
> kvm->dirty_ring_size.

Sure.

> 
> > +
> > +	ring->dirty_gfns = vmalloc(size);
> > +	if (!ring->dirty_gfns)
> > +		return -ENOMEM;
> > +	memset(ring->dirty_gfns, 0, size);
> > +
> > +	ring->size = size / sizeof(struct kvm_dirty_gfn);
> > +	ring->soft_limit =
> > +	    (kvm->dirty_ring_size / sizeof(struct kvm_dirty_gfn)) -
> 
> And passing @size avoids issues like this where a local var is ignored.
> 
> > +	    kvm_dirty_ring_get_rsvd_entries();
> > +	ring->dirty_index = 0;
> > +	ring->reset_index = 0;
> > +	spin_lock_init(&ring->lock);
> > +
> > +	return 0;
> > +}
> > +
> 
> ...
> 
> > +void kvm_dirty_ring_free(struct kvm_dirty_ring *ring)
> > +{
> > +	if (ring->dirty_gfns) {
> 
> Why condition freeing the dirty ring on kvm->dirty_ring_size, this
> obviously protects itself.  Not to mention vfree() also plays nice with a
> NULL input.

Ok I can drop this check.

> 
> > +		vfree(ring->dirty_gfns);
> > +		ring->dirty_gfns = NULL;
> > +	}
> > +}
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index 681452d288cd..8642c977629b 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -64,6 +64,8 @@
> >  #define CREATE_TRACE_POINTS
> >  #include <trace/events/kvm.h>
> >  
> > +#include <linux/kvm_dirty_ring.h>
> > +
> >  /* Worst case buffer size needed for holding an integer. */
> >  #define ITOA_MAX_LEN 12
> >  
> > @@ -149,6 +151,10 @@ static void mark_page_dirty_in_slot(struct kvm *kvm,
> >  				    struct kvm_vcpu *vcpu,
> >  				    struct kvm_memory_slot *memslot,
> >  				    gfn_t gfn);
> > +static void mark_page_dirty_in_ring(struct kvm *kvm,
> > +				    struct kvm_vcpu *vcpu,
> > +				    struct kvm_memory_slot *slot,
> > +				    gfn_t gfn);
> >  
> >  __visible bool kvm_rebooting;
> >  EXPORT_SYMBOL_GPL(kvm_rebooting);
> > @@ -359,11 +365,22 @@ int kvm_vcpu_init(struct kvm_vcpu *vcpu, struct kvm *kvm, unsigned id)
> >  	vcpu->preempted = false;
> >  	vcpu->ready = false;
> >  
> > +	if (kvm->dirty_ring_size) {
> > +		r = kvm_dirty_ring_alloc(vcpu->kvm, &vcpu->dirty_ring);
> > +		if (r) {
> > +			kvm->dirty_ring_size = 0;
> > +			goto fail_free_run;
> 
> This looks wrong, kvm->dirty_ring_size is used to free allocations, i.e.
> previous allocations will leak if a vcpu allocation fails.

You are right.  That's overkill.

> 
> > +		}
> > +	}
> > +
> >  	r = kvm_arch_vcpu_init(vcpu);
> >  	if (r < 0)
> > -		goto fail_free_run;
> > +		goto fail_free_ring;
> >  	return 0;
> >  
> > +fail_free_ring:
> > +	if (kvm->dirty_ring_size)
> > +		kvm_dirty_ring_free(&vcpu->dirty_ring);
> >  fail_free_run:
> >  	free_page((unsigned long)vcpu->run);
> >  fail:
> > @@ -381,6 +398,8 @@ void kvm_vcpu_uninit(struct kvm_vcpu *vcpu)
> >  	put_pid(rcu_dereference_protected(vcpu->pid, 1));
> >  	kvm_arch_vcpu_uninit(vcpu);
> >  	free_page((unsigned long)vcpu->run);
> > +	if (vcpu->kvm->dirty_ring_size)
> > +		kvm_dirty_ring_free(&vcpu->dirty_ring);
> >  }
> >  EXPORT_SYMBOL_GPL(kvm_vcpu_uninit);
> >  
> > @@ -690,6 +709,7 @@ static struct kvm *kvm_create_vm(unsigned long type)
> >  	struct kvm *kvm = kvm_arch_alloc_vm();
> >  	int r = -ENOMEM;
> >  	int i;
> > +	struct page *page;
> >  
> >  	if (!kvm)
> >  		return ERR_PTR(-ENOMEM);
> > @@ -705,6 +725,14 @@ static struct kvm *kvm_create_vm(unsigned long type)
> >  
> >  	BUILD_BUG_ON(KVM_MEM_SLOTS_NUM > SHRT_MAX);
> >  
> > +	page = alloc_page(GFP_KERNEL | __GFP_ZERO);
> > +	if (!page) {
> > +		r = -ENOMEM;
> > +		goto out_err_alloc_page;
> > +	}
> > +	kvm->vm_run = page_address(page);
> > +	BUILD_BUG_ON(sizeof(struct kvm_vm_run) > PAGE_SIZE);
> > +
> >  	if (init_srcu_struct(&kvm->srcu))
> >  		goto out_err_no_srcu;
> >  	if (init_srcu_struct(&kvm->irq_srcu))
> > @@ -775,6 +803,9 @@ static struct kvm *kvm_create_vm(unsigned long type)
> >  out_err_no_irq_srcu:
> >  	cleanup_srcu_struct(&kvm->srcu);
> >  out_err_no_srcu:
> > +	free_page((unsigned long)page);
> > +	kvm->vm_run = NULL;
> 
> No need to nullify vm_run.

Ok.

> 
> > +out_err_alloc_page:
> >  	kvm_arch_free_vm(kvm);
> >  	mmdrop(current->mm);
> >  	return ERR_PTR(r);
> > @@ -800,6 +831,15 @@ static void kvm_destroy_vm(struct kvm *kvm)
> >  	int i;
> >  	struct mm_struct *mm = kvm->mm;
> >  
> > +	if (kvm->dirty_ring_size) {
> > +		kvm_dirty_ring_free(&kvm->vm_dirty_ring);
> > +	}
> 
> Unnecessary parantheses.

True.

Thanks,

> 
> > +
> > +	if (kvm->vm_run) {
> > +		free_page((unsigned long)kvm->vm_run);
> > +		kvm->vm_run = NULL;
> > +	}
> > +
> >  	kvm_uevent_notify_change(KVM_EVENT_DESTROY_VM, kvm);
> >  	kvm_destroy_vm_debugfs(kvm);
> >  	kvm_arch_sync_events(kvm);
> > @@ -2301,7 +2341,7 @@ static void mark_page_dirty_in_slot(struct kvm *kvm,
> >  {
> >  	if (memslot && memslot->dirty_bitmap) {
> >  		unsigned long rel_gfn = gfn - memslot->base_gfn;
> > -
> > +		mark_page_dirty_in_ring(kvm, vcpu, memslot, gfn);
> >  		set_bit_le(rel_gfn, memslot->dirty_bitmap);
> >  	}
> >  }
> > @@ -2649,6 +2689,13 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me, bool yield_to_kernel_mode)
> >  }
> >  EXPORT_SYMBOL_GPL(kvm_vcpu_on_spin);
> 

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH RFC 04/15] KVM: Implement ring-based dirty memory tracking
  2019-12-02 21:16     ` Peter Xu
@ 2019-12-02 21:50       ` Sean Christopherson
  2019-12-02 23:09         ` Peter Xu
  2019-12-03 13:48         ` Paolo Bonzini
  0 siblings, 2 replies; 123+ messages in thread
From: Sean Christopherson @ 2019-12-02 21:50 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-kernel, kvm, Paolo Bonzini, Dr . David Alan Gilbert,
	Vitaly Kuznetsov

On Mon, Dec 02, 2019 at 04:16:40PM -0500, Peter Xu wrote:
> On Mon, Dec 02, 2019 at 12:10:36PM -0800, Sean Christopherson wrote:
> > On Fri, Nov 29, 2019 at 04:34:54PM -0500, Peter Xu wrote:
> > > Currently, we have N+1 rings for each VM of N vcpus:
> > > 
> > >   - for each vcpu, we have 1 per-vcpu dirty ring,
> > >   - for each vm, we have 1 per-vm dirty ring
> > 
> > Why?  I assume the purpose of per-vcpu rings is to avoid contention between
> > threads, but the motiviation needs to be explicitly stated.  And why is a
> > per-vm fallback ring needed?
> 
> Yes, as explained in previous reply, the problem is there could have
> guest memory writes without vcpu contexts.
> 
> > 
> > If my assumption is correct, have other approaches been tried/profiled?
> > E.g. using cmpxchg to reserve N number of entries in a shared ring.
> 
> Not yet, but I'd be fine to try anything if there's better
> alternatives.  Besides, could you help explain why sharing one ring
> and let each vcpu to reserve a region in the ring could be helpful in
> the pov of performance?

The goal would be to avoid taking a lock, or at least to avoid holding a
lock for an extended duration, e.g. some sort of multi-step process where
entries in the ring are first reserved, then filled, and finally marked
valid.  That'd allow the "fill" action to be done in parallel.
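
Purely as an illustration of the shape (not from the series, not
profiled, and assuming a power-of-two ring plus the kvm_dirty_gfn
layout from this series):

  struct shared_dirty_ring {
          u32 size;                  /* power of two */
          atomic_t reserved;         /* next sequence number handed out */
          atomic_t published;        /* entries visible to userspace */
          struct kvm_dirty_gfn *gfns;
  };

  static void shared_ring_push(struct shared_dirty_ring *ring,
                               u32 slot, u64 offset)
  {
          u32 seq = (u32)atomic_fetch_inc(&ring->reserved);
          struct kvm_dirty_gfn *e = &ring->gfns[seq & (ring->size - 1)];

          /* Fill: runs on all vcpus in parallel, no lock held. */
          e->slot = slot;
          e->offset = offset;

          /* Publish in order so userspace never sees a half-filled
           * entry.  (Full-ring throttling intentionally not shown.) */
          while (atomic_cmpxchg(&ring->published, seq, seq + 1) != seq)
                  cpu_relax();
  }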

In case it isn't clear, I haven't thought through an actual solution :-).

My point is that I think it's worth exploring and profiling other
implementations, because the dual per-vm/per-vcpu ring approach has a
few warts that we'd be stuck with forever.

> > IMO,
> > adding kvm_get_running_vcpu() is a hack that is just asking for future
> > abuse and the vcpu/vm/as_id interactions in mark_page_dirty_in_ring()
> > look extremely fragile.
> 
> I agree.  Another way is to put heavier traffic to the per-vm ring,
> but the downside could be that the per-vm ring could get full easier
> (but I haven't tested).

There's nothing that prevents increasing the size of the common ring each
time a new vCPU is added.  Alternatively, userspace could explicitly
request or hint the desired ring size.

> > I also dislike having two different mechanisms
> > for accessing the ring (lock for per-vm, something else for per-vcpu).
> 
> Actually I proposed to drop the per-vm ring (actually I had a version
> that implemented this.. and I just changed it back to the per-vm ring
> later on, see below) and when there's no vcpu context I thought about:
> 
>   (1) use vcpu0 ring
> 
>   (2) or a better algo to pick up a per-vcpu ring (like, the less full
>       ring, we can do many things here, e.g., we can easily maintain a
>       structure track this so we can get O(1) search, I think)
> 
> I discussed this with Paolo, but I think Paolo preferred the per-vm
> ring because there's no good reason to choose vcpu0 as what (1)
> suggested.  While if to choose (2) we probably need to lock even for
> per-cpu ring, so could be a bit slower.

Ya, per-vm is definitely better than dumping on vcpu0.  I'm hoping we can
find a third option that provides comparable performance without using any
per-vcpu rings.

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH RFC 03/15] KVM: Add build-time error check on kvm_run size
  2019-12-02 20:53     ` Peter Xu
@ 2019-12-02 22:19       ` Sean Christopherson
  2019-12-02 22:40         ` Peter Xu
  2019-12-03 13:41         ` Paolo Bonzini
  0 siblings, 2 replies; 123+ messages in thread
From: Sean Christopherson @ 2019-12-02 22:19 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-kernel, kvm, Paolo Bonzini, Dr . David Alan Gilbert,
	Vitaly Kuznetsov

On Mon, Dec 02, 2019 at 03:53:15PM -0500, Peter Xu wrote:
> On Mon, Dec 02, 2019 at 11:30:27AM -0800, Sean Christopherson wrote:
> > On Fri, Nov 29, 2019 at 04:34:53PM -0500, Peter Xu wrote:
> > > It's already going to reach 2400 Bytes (which is over half of page
> > > size on 4K page archs), so maybe it's good to have this build-time
> > > check in case it overflows when adding new fields.
> > 
> > Please explain why exceeding PAGE_SIZE is a bad thing.  I realize it's
> > almost absurdly obvious when looking at the code, but a) the patch itself
> > does not provide that context and b) the changelog should hold up on its
> > own,
> 
> Right, I'll enhance the commit message.
> 
> > e.g. in a mostly hypothetical case where the allocation of vcpu->run
> > were changed to something else.
> 
> And that's why I added BUILD_BUG_ON right beneath that allocation. :)

My point is that if the allocation were changed to no longer be a
straightforward alloc_page() then someone reading the combined code would
have no idea why the BUILD_BUG_ON() exists.  It's a bit ridiculous for
this case because the specific constraints of vcpu->run make it highly
unlikely to use anything else, but that's beside the point.

> It's just a helper for developers when adding new kvm_run fields, not
> a risk for anyone who wants to start allocating more pages for it.

But by adding a BUILD_BUG_ON without explaining *why*, you're placing an
extra burden on someone that wants to increase the size of kvm->run, e.g.
it's not at all obvious from the changelog whether this patch is adding
the BUILD_BUG_ON purely because the code allocates memory for vcpu->run
via alloc_page(), or if there is some fundamental aspect of vcpu->run that
requires it to never span multiple pages.

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH RFC 03/15] KVM: Add build-time error check on kvm_run size
  2019-12-02 22:19       ` Sean Christopherson
@ 2019-12-02 22:40         ` Peter Xu
  2019-12-03  5:50           ` Sean Christopherson
  2019-12-03 13:41         ` Paolo Bonzini
  1 sibling, 1 reply; 123+ messages in thread
From: Peter Xu @ 2019-12-02 22:40 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: linux-kernel, kvm, Paolo Bonzini, Dr . David Alan Gilbert,
	Vitaly Kuznetsov

On Mon, Dec 02, 2019 at 02:19:49PM -0800, Sean Christopherson wrote:
> On Mon, Dec 02, 2019 at 03:53:15PM -0500, Peter Xu wrote:
> > On Mon, Dec 02, 2019 at 11:30:27AM -0800, Sean Christopherson wrote:
> > > On Fri, Nov 29, 2019 at 04:34:53PM -0500, Peter Xu wrote:
> > > > It's already going to reach 2400 Bytes (which is over half of page
> > > > size on 4K page archs), so maybe it's good to have this build-time
> > > > check in case it overflows when adding new fields.
> > > 
> > > Please explain why exceeding PAGE_SIZE is a bad thing.  I realize it's
> > > almost absurdly obvious when looking at the code, but a) the patch itself
> > > does not provide that context and b) the changelog should hold up on its
> > > own,
> > 
> > Right, I'll enhance the commit message.
> > 
> > > e.g. in a mostly hypothetical case where the allocation of vcpu->run
> > > were changed to something else.
> > 
> > And that's why I added BUILD_BUG_ON right beneath that allocation. :)
> 
> My point is that if the allocation were changed to no longer be a
> straightforward alloc_page() then someone reading the combined code would
> have no idea why the BUILD_BUG_ON() exists.  It's a bit ridiculous for
> this case because the specific constraints of vcpu->run make it highly
> unlikely to use anything else, but that's beside the point.
> 
> > It's just a helper for developers when adding new kvm_run fields, not
> > a risk for anyone who wants to start allocating more pages for it.
> 
> But by adding a BUILD_BUG_ON without explaining *why*, you're placing an
> extra burden on someone that wants to increase the size of kvm->run, e.g.
> it's not at all obvious from the changelog whether this patch is adding
> the BUILD_BUG_ON purely because the code allocates memory for vcpu->run
> via alloc_page(), or if there is some fundamental aspect of vcpu->run that
> requires it to never span multiple pages.

How about I add a comment above it?

  /*
   * Currently kvm_run only uses one physical page.  Warn the developer
   * if kvm_run accidentally grows more than that.
   */
  BUILD_BUG_ON(...);

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH RFC 04/15] KVM: Implement ring-based dirty memory tracking
  2019-12-02 21:50       ` Sean Christopherson
@ 2019-12-02 23:09         ` Peter Xu
  2019-12-03 13:48         ` Paolo Bonzini
  1 sibling, 0 replies; 123+ messages in thread
From: Peter Xu @ 2019-12-02 23:09 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: linux-kernel, kvm, Paolo Bonzini, Dr . David Alan Gilbert,
	Vitaly Kuznetsov

On Mon, Dec 02, 2019 at 01:50:49PM -0800, Sean Christopherson wrote:
> On Mon, Dec 02, 2019 at 04:16:40PM -0500, Peter Xu wrote:
> > On Mon, Dec 02, 2019 at 12:10:36PM -0800, Sean Christopherson wrote:
> > > On Fri, Nov 29, 2019 at 04:34:54PM -0500, Peter Xu wrote:
> > > > Currently, we have N+1 rings for each VM of N vcpus:
> > > > 
> > > >   - for each vcpu, we have 1 per-vcpu dirty ring,
> > > >   - for each vm, we have 1 per-vm dirty ring
> > > 
> > > Why?  I assume the purpose of per-vcpu rings is to avoid contention between
> > > threads, but the motiviation needs to be explicitly stated.  And why is a
> > > per-vm fallback ring needed?
> > 
> > Yes, as explained in previous reply, the problem is there could have
> > guest memory writes without vcpu contexts.
> > 
> > > 
> > > If my assumption is correct, have other approaches been tried/profiled?
> > > E.g. using cmpxchg to reserve N number of entries in a shared ring.
> > 
> > Not yet, but I'd be fine to try anything if there's better
> > alternatives.  Besides, could you help explain why sharing one ring
> > and let each vcpu to reserve a region in the ring could be helpful in
> > the pov of performance?
> 
> The goal would be to avoid taking a lock, or at least to avoid holding a
> lock for an extended duration, e.g. some sort of multi-step process where
> entries in the ring are first reserved, then filled, and finally marked
> valid.  That'd allow the "fill" action to be done in parallel.

Considering that a per-vcpu ring should be no worse than this, iiuc you
prefer a single per-vm ring here, without any per-vcpu ring.  However
I don't see a good reason to manually split a per-vm resource into
per-vcpu regions, instead of using the per-vcpu structure directly
like what this series does...  Or could you show me what I've missed?

IMHO it's really a natural choice to use kvm_vcpu to split the ring,
as long as we still want the vcpus to work in parallel.

> 
> In case it isn't clear, I haven't thought through an actual solution :-).

Feel free to shoot when the ideas come. :) I'd be glad to test your
idea, especially where it could be better!

> 
> My point is that I think it's worth exploring and profiling other
> implementations because the dual per-vm and per-vcpu rings have a few warts
> that we'd be stuck with forever.

I do agree that keeping these two rings makes the interface a bit
awkward.  Besides this, do you have any other concerns?

And when you mention profiling, I hope I understand it right that it
should be something unrelated to the specific issue that we're
discussing (say, whether to use a per-vm ring only, or per-vm +
per-vcpu rings), because for performance IMHO it's really the layout
of the ring that matters more, and how the ring is shared and accessed
between userspace and the kernel.

For the current implementation (I'm not sure whether that's the
initial version from Lei, or from Paolo, anyway...), IMHO it's good
enough from a perf pov in that it at least supports:

  (1) zero copy
  (2) a completely async model
  (3) per-vcpu isolation

None of these is there for KVM_GET_DIRTY_LOG.  Not to mention that
tracking dirty bits is not really that "performance critical" - if you
look at QEMU, we have plenty of ways to explicitly throttle the CPU
(like cpu-throttle), exactly because dirtying pages, even with the
whole tracking overhead of KVM_GET_DIRTY_LOG, is already too fast, and
the slow part is QEMU collecting and sending the pages! :)

> 
> > > IMO,
> > > adding kvm_get_running_vcpu() is a hack that is just asking for future
> > > abuse and the vcpu/vm/as_id interactions in mark_page_dirty_in_ring()
> > > look extremely fragile.
> > 
> > I agree.  Another way is to put heavier traffic to the per-vm ring,
> > but the downside could be that the per-vm ring could get full easier
> > (but I haven't tested).
> 
> There's nothing that prevents increasing the size of the common ring each
> time a new vCPU is added.  Alternatively, userspace could explicitly
> request or hint the desired ring size.

Yeah I don't have a strong opinion on this, but I just don't see it as
greatly helpful to explicitly expose this to userspace.  IMHO for now
a global ring size should be good enough.  If userspace wants to be
fast, the ring can hardly get full (because collecting the dirty ring
can be really, really fast if userspace wants it to be).

> 
> > > I also dislike having two different mechanisms
> > > for accessing the ring (lock for per-vm, something else for per-vcpu).
> > 
> > Actually I proposed to drop the per-vm ring (actually I had a version
> > that implemented this.. and I just changed it back to the per-vm ring
> > later on, see below) and when there's no vcpu context I thought about:
> > 
> >   (1) use vcpu0 ring
> > 
> >   (2) or a better algo to pick up a per-vcpu ring (like, the less full
> >       ring, we can do many things here, e.g., we can easily maintain a
> >       structure track this so we can get O(1) search, I think)
> > 
> > I discussed this with Paolo, but I think Paolo preferred the per-vm
> > ring because there's no good reason to choose vcpu0 as what (1)
> > suggested.  While if to choose (2) we probably need to lock even for
> > per-cpu ring, so could be a bit slower.
> 
> Ya, per-vm is definitely better than dumping on vcpu0.  I'm hoping we can
> find a third option that provides comparable performance without using any
> per-vcpu rings.

I'm still uncertain whether it's a good idea to drop the per-vcpu
rings (as stated above).  But I'm still open to any further thoughts,
as long as I can start to understand when a per-vm-only ring would be
better.

Thanks!

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH RFC 03/15] KVM: Add build-time error check on kvm_run size
  2019-12-02 22:40         ` Peter Xu
@ 2019-12-03  5:50           ` Sean Christopherson
  0 siblings, 0 replies; 123+ messages in thread
From: Sean Christopherson @ 2019-12-03  5:50 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-kernel, kvm, Paolo Bonzini, Dr . David Alan Gilbert,
	Vitaly Kuznetsov

On Mon, Dec 02, 2019 at 05:40:34PM -0500, Peter Xu wrote:
> On Mon, Dec 02, 2019 at 02:19:49PM -0800, Sean Christopherson wrote:
> > On Mon, Dec 02, 2019 at 03:53:15PM -0500, Peter Xu wrote:
> > > On Mon, Dec 02, 2019 at 11:30:27AM -0800, Sean Christopherson wrote:
> > > > On Fri, Nov 29, 2019 at 04:34:53PM -0500, Peter Xu wrote:
> > > > > It's already going to reach 2400 Bytes (which is over half of page
> > > > > size on 4K page archs), so maybe it's good to have this build-time
> > > > > check in case it overflows when adding new fields.
> > > > 
> > > > Please explain why exceeding PAGE_SIZE is a bad thing.  I realize it's
> > > > almost absurdly obvious when looking at the code, but a) the patch itself
> > > > does not provide that context and b) the changelog should hold up on its
> > > > own,
> > > 
> > > Right, I'll enhance the commit message.
> > > 
> > > > e.g. in a mostly hypothetical case where the allocation of vcpu->run
> > > > were changed to something else.
> > > 
> > > And that's why I added BUILD_BUG_ON right beneath that allocation. :)
> > 
> > My point is that if the allocation were changed to no longer be a
> > straightforward alloc_page() then someone reading the combined code would
> > have no idea why the BUILD_BUG_ON() exists.  It's a bit ridiculous for
> > this case because the specific constraints of vcpu->run make it highly
> > unlikely to use anything else, but that's beside the point.
> > 
> > > It's just a helper for developers when adding new kvm_run fields, not
> > > a risk for anyone who wants to start allocating more pages for it.
> > 
> > But by adding a BUILD_BUG_ON without explaining *why*, you're placing an
> > extra burden on someone that wants to increase the size of kvm->run, e.g.
> > it's not at all obvious from the changelog whether this patch is adding
> > the BUILD_BUG_ON purely because the code allocates memory for vcpu->run
> > via alloc_page(), or if there is some fundamental aspect of vcpu->run that
> > requires it to never span multiple pages.
> 
> How about I add a comment above it?
> 
>   /*
>    * Currently kvm_run only uses one physical page.  Warn the developer
>    * if kvm_run accidentally grows more than that.
>    */
>   BUILD_BUG_ON(...);

No need for a comment, adding a blurb in the changelog is sufficient.

The lengthy response was just trying to explain why it's helpful to
explicitly justify a change that may seem obvious in the current codebase.
Apologies if it only confused things.

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH RFC 03/15] KVM: Add build-time error check on kvm_run size
  2019-12-02 22:19       ` Sean Christopherson
  2019-12-02 22:40         ` Peter Xu
@ 2019-12-03 13:41         ` Paolo Bonzini
  2019-12-03 17:04           ` Peter Xu
  1 sibling, 1 reply; 123+ messages in thread
From: Paolo Bonzini @ 2019-12-03 13:41 UTC (permalink / raw)
  To: Sean Christopherson, Peter Xu
  Cc: linux-kernel, kvm, Dr . David Alan Gilbert, Vitaly Kuznetsov

On 02/12/19 23:19, Sean Christopherson wrote:
>>> e.g. in a mostly hypothetical case where the allocation of vcpu->run
>>> were changed to something else.
>> And that's why I added BUILD_BUG_ON right beneath that allocation. :)

It's not exactly beneath it (it's out of the patch context at least).  I
think a comment is not strictly necessary, but a better commit message
is and, since you are at it, I would put the BUILD_BUG_ON *before* the
allocation.  That makes it more obvious that you are checking the
invariant before allocating.
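
I.e., roughly this shape in the vCPU setup path (sketch only, the exact
context depends on the patch):

	/*
	 * kvm_run is exposed to userspace as a single mmap()ed page, so
	 * check the invariant at build time, right before relying on it.
	 */
	BUILD_BUG_ON(sizeof(struct kvm_run) > PAGE_SIZE);

	page = alloc_page(GFP_KERNEL | __GFP_ZERO);
	if (!page)
		return -ENOMEM;
	vcpu->run = page_address(page);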

Paolo


^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH RFC 04/15] KVM: Implement ring-based dirty memory tracking
  2019-12-02 21:50       ` Sean Christopherson
  2019-12-02 23:09         ` Peter Xu
@ 2019-12-03 13:48         ` Paolo Bonzini
  2019-12-03 18:46           ` Sean Christopherson
  1 sibling, 1 reply; 123+ messages in thread
From: Paolo Bonzini @ 2019-12-03 13:48 UTC (permalink / raw)
  To: Sean Christopherson, Peter Xu
  Cc: linux-kernel, kvm, Dr . David Alan Gilbert, Vitaly Kuznetsov

On 02/12/19 22:50, Sean Christopherson wrote:
>>
>> I discussed this with Paolo, but I think Paolo preferred the per-vm
>> ring because there's no good reason to choose vcpu0 as what (1)
>> suggested.  While if to choose (2) we probably need to lock even for
>> per-cpu ring, so could be a bit slower.
> Ya, per-vm is definitely better than dumping on vcpu0.  I'm hoping we can
> find a third option that provides comparable performance without using any
> per-vcpu rings.
> 

The advantage of per-vCPU rings is that they naturally: 1) parallelize
the processing of dirty pages; 2) make the userspace vCPU thread do more
work on vCPUs that dirty more pages.

I agree that on the producer side we could reserve multiple entries in
the case of PML (and without PML only one entry should be added at a
time).  But I'm afraid that things get ugly when the ring is full,
because you'd have to wait for all vCPUs to finish publishing the
entries they have reserved.

It's ugly that we _also_ need a per-VM ring, but unfortunately some
operations do not really have a vCPU that they can refer to.

Paolo


^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH RFC 00/15] KVM: Dirty ring interface
  2019-12-02  2:13   ` Peter Xu
@ 2019-12-03 13:59     ` Paolo Bonzini
  2019-12-05 19:30       ` Peter Xu
  0 siblings, 1 reply; 123+ messages in thread
From: Paolo Bonzini @ 2019-12-03 13:59 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-kernel, kvm, Sean Christopherson, Dr . David Alan Gilbert,
	Vitaly Kuznetsov

On 02/12/19 03:13, Peter Xu wrote:
>> This is not needed, it will just be a false negative (dirty page that
>> actually isn't dirty).  The dirty bit will be cleared when userspace
>> resets the ring buffer; then the instruction will be executed again and
>> mark the page dirty again.  Since ring full is not a common condition,
>> it's not a big deal.
> 
> Actually I added this only because it failed one of the unit tests
> when verifying the dirty bits..  But now after a second thought, I
> probably agree with you that we can change the userspace too to fix
> this.

I think there is already a similar case in dirty_log_test when a page is
dirty but we called KVM_GET_DIRTY_LOG just before it got written to.

> I think the steps of the failed test case could be simplified into
> something like this (assuming the QEMU migration context, might be
> easier to understand):
> 
>   1. page P has data P1
>   2. vcpu writes to page P, with data P2
>   3. vmexit (P is still with data P1)
>   4. mark P as dirty, ring full, user exit
>   5. collect dirty bit P, migrate P with data P1
>   6. vcpu run due to some reason, P was written with P2, user exit again
>      (because ring is already reaching soft limit)
>   7. do KVM_RESET_DIRTY_RINGS

Migration should only be done after KVM_RESET_DIRTY_RINGS (think of
KVM_RESET_DIRTY_RINGS as the equivalent of KVM_CLEAR_DIRTY_LOG).
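
In other words, the userspace loop would look roughly like this
(hypothetical helper names, per-ring details omitted):

/* Collect everything first, re-protect via the reset, and only then
 * treat the collected pages as sent; anything dirtied after the reset
 * shows up again in a later pass. */
static void sync_dirty_rings(int vm_fd, struct ring_view *rings, int nr)
{
	int i;

	for (i = 0; i < nr; i++)
		collect_ring(&rings[i]);	/* read gfns up to avail_index */

	ioctl(vm_fd, KVM_RESET_DIRTY_RINGS);	/* like KVM_CLEAR_DIRTY_LOG */

	send_collected_pages();			/* migrate the collected data */
}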

>   dirty_log_test-29003 [001] 184503.384328: kvm_entry:            vcpu 1
>   dirty_log_test-29003 [001] 184503.384329: kvm_exit:             reason EPT_VIOLATION rip 0x40359f info 582 0
>   dirty_log_test-29003 [001] 184503.384329: kvm_page_fault:       address 7fc036d000 error_code 582
>   dirty_log_test-29003 [001] 184503.384331: kvm_entry:            vcpu 1
>   dirty_log_test-29003 [001] 184503.384332: kvm_exit: reason EPT_VIOLATION rip 0x40359f info 582 0
>   dirty_log_test-29003 [001] 184503.384332: kvm_page_fault:       address 7fc036d000 error_code 582
>   dirty_log_test-29003 [001] 184503.384332: kvm_dirty_ring_push:  ring 1: dirty 0x37f reset 0x1c0 slot 1 offset 0x37e ret 0 (used 447)
>   dirty_log_test-29003 [001] 184503.384333: kvm_entry:            vcpu 1
>   dirty_log_test-29003 [001] 184503.384334: kvm_exit:             reason EPT_VIOLATION rip 0x40359f info 582 0
>   dirty_log_test-29003 [001] 184503.384334: kvm_page_fault:       address 7fc036e000 error_code 582
>   dirty_log_test-29003 [001] 184503.384336: kvm_entry:            vcpu 1
>   dirty_log_test-29003 [001] 184503.384336: kvm_exit:             reason EPT_VIOLATION rip 0x40359f info 582 0
>   dirty_log_test-29003 [001] 184503.384336: kvm_page_fault:       address 7fc036e000 error_code 582
>   dirty_log_test-29003 [001] 184503.384337: kvm_dirty_ring_push:  ring 1: dirty 0x380 reset 0x1c0 slot 1 offset 0x37f ret 1 (used 448)
>   dirty_log_test-29003 [001] 184503.384337: kvm_dirty_ring_exit:  vcpu 1
>   dirty_log_test-29003 [001] 184503.384338: kvm_fpu:              unload
>   dirty_log_test-29003 [001] 184503.384340: kvm_userspace_exit:   reason 0x1d (29)
>   dirty_log_test-29000 [006] 184503.505103: kvm_dirty_ring_reset: ring 1: dirty 0x380 reset 0x380 (used 0)
>   dirty_log_test-29003 [001] 184503.505184: kvm_fpu:              load
>   dirty_log_test-29003 [001] 184503.505187: kvm_entry:            vcpu 1
>   dirty_log_test-29003 [001] 184503.505193: kvm_exit:             reason EPT_VIOLATION rip 0x40359f info 582 0
>   dirty_log_test-29003 [001] 184503.505194: kvm_page_fault:       address 7fc036f000 error_code 582              <-------- [1]
>   dirty_log_test-29003 [001] 184503.505206: kvm_entry:            vcpu 1
>   dirty_log_test-29003 [001] 184503.505207: kvm_exit:             reason EPT_VIOLATION rip 0x40359f info 582 0
>   dirty_log_test-29003 [001] 184503.505207: kvm_page_fault:       address 7fc036f000 error_code 582
>   dirty_log_test-29003 [001] 184503.505226: kvm_dirty_ring_push:  ring 1: dirty 0x381 reset 0x380 slot 1 offset 0x380 ret 0 (used 1)
>   dirty_log_test-29003 [001] 184503.505226: kvm_entry:            vcpu 1
>   dirty_log_test-29003 [001] 184503.505227: kvm_exit:             reason EPT_VIOLATION rip 0x40359f info 582 0
>   dirty_log_test-29003 [001] 184503.505228: kvm_page_fault:       address 7fc0370000 error_code 582
>   dirty_log_test-29003 [001] 184503.505231: kvm_entry:            vcpu 1
>   ...
> 
> The test was trying to continuously write to pages, from above log
> starting from 7fc036d000. The reason 0x1d (29) is the new dirty ring
> full exit reason.
> 
> So far I'm still unsure of two things:
> 
>   1. Why for each page we faulted twice rather than once.  Take the
>      example of page at 7fc036e000 above, the first fault didn't
>      trigger the marking dirty path, while only until the 2nd ept
>      violation did we trigger kvm_dirty_ring_push.

Not sure about that.  Try enabling kvmmmu tracepoints too, it will tell
you more of the path that was taken while processing the EPT violation.

If your machine has PML, what you're seeing is likely not-present
violation, not dirty-protect violation.  Try disabling pml and see if
the trace changes.

>   2. Why we didn't get the last page written again after
>      kvm_userspace_exit (last page was 7fc036e000, and the test failed
>      because 7fc036e000 detected change however dirty bit unset).  In
>      this case the first write after KVM_RESET_DIRTY_RINGS is the line
>      pointed by [1]; I thought it should be a rewrite of page
>      7fc036e000 because when the user exit happens the write logically
>      should not have happened yet and the eip should stay the same.
>      However at [1] it's already writing to a new page.

IIUC you should get, with PML enabled:

- guest writes to page
- PML marks dirty bit, causes vmexit
- host copies PML log to ring, causes userspace exit
- userspace calls KVM_RESET_DIRTY_RINGS
  - host marks page as clean
- userspace calls KVM_RUN
  - guest writes again to page

but the page won't be in the ring until after another vmexit happens.
Therefore, it's okay to reap the pages in the ring asynchronously, but
there must be a synchronization point in the testcase sooner or later,
where all CPUs are kicked out of KVM_RUN.  This synchronization point
corresponds to the migration downtime.

Thanks,

Paolo


^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH RFC 03/15] KVM: Add build-time error check on kvm_run size
  2019-12-03 13:41         ` Paolo Bonzini
@ 2019-12-03 17:04           ` Peter Xu
  0 siblings, 0 replies; 123+ messages in thread
From: Peter Xu @ 2019-12-03 17:04 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sean Christopherson, linux-kernel, kvm, Dr . David Alan Gilbert,
	Vitaly Kuznetsov

On Tue, Dec 03, 2019 at 02:41:58PM +0100, Paolo Bonzini wrote:
> On 02/12/19 23:19, Sean Christopherson wrote:
> >>> e.g. in a mostly hypothetical case where the allocation of vcpu->run
> >>> were changed to something else.
> >> And that's why I added BUILD_BUG_ON right beneath that allocation. :)
> 
> It's not exactly beneath it (it's out of the patch context at least).  I
> think a comment is not strictly necessary, but a better commit message
> is and, since you are at it, I would put the BUILD_BUG_ON *before* the
> allocation.  That makes it more obvious that you are checking the
> invariant before allocating.

Makes sense, will do.  Thanks for both of your reviews.

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH RFC 04/15] KVM: Implement ring-based dirty memory tracking
  2019-12-03 13:48         ` Paolo Bonzini
@ 2019-12-03 18:46           ` Sean Christopherson
  2019-12-04 10:05             ` Paolo Bonzini
  0 siblings, 1 reply; 123+ messages in thread
From: Sean Christopherson @ 2019-12-03 18:46 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Peter Xu, linux-kernel, kvm, Dr . David Alan Gilbert, Vitaly Kuznetsov

On Tue, Dec 03, 2019 at 02:48:10PM +0100, Paolo Bonzini wrote:
> On 02/12/19 22:50, Sean Christopherson wrote:
> >>
> >> I discussed this with Paolo, but I think Paolo preferred the per-vm
> >> ring because there's no good reason to choose vcpu0 as what (1)
> >> suggested.  While if to choose (2) we probably need to lock even for
> >> per-cpu ring, so could be a bit slower.
> > Ya, per-vm is definitely better than dumping on vcpu0.  I'm hoping we can
> > find a third option that provides comparable performance without using any
> > per-vcpu rings.
> > 
> 
> The advantage of per-vCPU rings is that it naturally: 1) parallelizes
> the processing of dirty pages; 2) makes userspace vCPU thread do more
> work on vCPUs that dirty more pages.
> 
> I agree that on the producer side we could reserve multiple entries in
> the case of PML (and without PML only one entry should be added at a
> time).  But I'm afraid that things get ugly when the ring is full,
> because you'd have to wait for all vCPUs to finish publishing the
> entries they have reserved.

Ah, I take it the intended model is that userspace will only start pulling
entries off the ring when KVM explicitly signals that the ring is "full"?

Rather than reserve entries, what if vCPUs reserved an entire ring?  Create
a pool of N=nr_vcpus rings that are shared by all vCPUs.  To mark pages
dirty, a vCPU claims a ring, pushes the pages into the ring, and then
returns the ring to the pool.  If pushing pages hits the soft limit, a
request is made to drain the ring and the ring is not returned to the pool
until it is drained.

Except for acquiring a ring, which likely can be heavily optimized, that'd
allow parallel processing (#1), and would provide a facsimile of #2 as
pushing more pages onto a ring would naturally increase the likelihood of
triggering a drain.  And it might be interesting to see the effect of using
different methods of ring selection, e.g. pure round robin, LRU, last used
on the current vCPU, etc...
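
Roughly (made-up types, no attempt to optimize the claim path):

/* Claim any free ring, push, and only return it to the pool if it
 * doesn't need to be drained. */
struct dirty_ring_pool {
	spinlock_t lock;
	unsigned long free_mask;	/* bit N set => rings[N] available */
	struct kvm_dirty_ring rings[BITS_PER_LONG];
};

static struct kvm_dirty_ring *pool_claim(struct dirty_ring_pool *pool)
{
	struct kvm_dirty_ring *ring = NULL;
	int idx;

	spin_lock(&pool->lock);
	idx = find_first_bit(&pool->free_mask, BITS_PER_LONG);
	if (idx < BITS_PER_LONG) {
		__clear_bit(idx, &pool->free_mask);
		ring = &pool->rings[idx];
	}
	spin_unlock(&pool->lock);
	return ring;	/* NULL means every ring is busy or being drained */
}

static void pool_return(struct dirty_ring_pool *pool,
			struct kvm_dirty_ring *ring, bool needs_drain)
{
	if (needs_drain)
		return;	/* held back until userspace drains it */

	spin_lock(&pool->lock);
	__set_bit(ring - pool->rings, &pool->free_mask);
	spin_unlock(&pool->lock);
}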

> It's ugly that we _also_ need a per-VM ring, but unfortunately some
> operations do not really have a vCPU that they can refer to.

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH RFC 01/15] KVM: Move running VCPU from ARM to common code
  2019-11-29 21:34 ` [PATCH RFC 01/15] KVM: Move running VCPU from ARM to common code Peter Xu
@ 2019-12-03 19:01   ` Sean Christopherson
  2019-12-04  9:42     ` Paolo Bonzini
  0 siblings, 1 reply; 123+ messages in thread
From: Sean Christopherson @ 2019-12-03 19:01 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-kernel, kvm, Paolo Bonzini, Dr . David Alan Gilbert,
	Vitaly Kuznetsov

On Fri, Nov 29, 2019 at 04:34:51PM -0500, Peter Xu wrote:
> From: Paolo Bonzini <pbonzini@redhat.com>
> 
> For ring-based dirty log tracking, it will be more efficient to account
> writes during schedule-out or schedule-in to the currently running VCPU.
> We would like to do it even if the write doesn't use the current VCPU's
> address space, as is the case for cached writes (see commit 4e335d9e7ddb,
> "Revert "KVM: Support vCPU-based gfn->hva cache"", 2017-05-02).
> 
> Therefore, add a mechanism to track the currently-loaded kvm_vcpu struct.
> There is already something similar in KVM/ARM; one important difference
> is that kvm_arch_vcpu_{load,put} have two callers in virt/kvm/kvm_main.c:
> we have to update both the architecture-independent vcpu_{load,put} and
> the preempt notifiers.
> 
> Another change made in the process is to allow using kvm_get_running_vcpu()
> in preemptible code.  This is allowed because preempt notifiers ensure
> that the value does not change even after the VCPU thread is migrated.

In case it wasn't clear, I strongly dislike adding kvm_get_running_vcpu().
IMO, it's an unnecessary hack.  The proper change to ensure a valid vCPU is
seen by mark_page_dirty_in_ring() when there is a current vCPU is to
plumb the vCPU down through the various call stacks.  Looking up the call
stacks for mark_page_dirty() and mark_page_dirty_in_slot(), they all
originate with a vcpu->kvm within a few functions, except for the rare
case where the write is coming from a non-vcpu ioctl(), in which case
there is no current vCPU.

The proper change is obviously much bigger in scope and would require
touching gobs of arch specific code, but IMO the end result would be worth
the effort.  E.g. there's a decent chance it would reduce the API between
common KVM and arch specific code by eliminating the exports of variants
that take "struct kvm *" instead of "struct kvm_vcpu *".

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH RFC 04/15] KVM: Implement ring-based dirty memory tracking
  2019-11-29 21:34 ` [PATCH RFC 04/15] KVM: Implement ring-based dirty memory tracking Peter Xu
  2019-12-02 20:10   ` Sean Christopherson
@ 2019-12-03 19:13   ` Sean Christopherson
  2019-12-04 10:14     ` Paolo Bonzini
  2019-12-04 10:38   ` Jason Wang
                     ` (2 subsequent siblings)
  4 siblings, 1 reply; 123+ messages in thread
From: Sean Christopherson @ 2019-12-03 19:13 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-kernel, kvm, Paolo Bonzini, Dr . David Alan Gilbert,
	Vitaly Kuznetsov

On Fri, Nov 29, 2019 at 04:34:54PM -0500, Peter Xu wrote:
> +static void mark_page_dirty_in_ring(struct kvm *kvm,
> +				    struct kvm_vcpu *vcpu,
> +				    struct kvm_memory_slot *slot,
> +				    gfn_t gfn)
> +{
> +	u32 as_id = 0;

Redundant initialization of as_id.

> +	u64 offset;
> +	int ret;
> +	struct kvm_dirty_ring *ring;
> +	struct kvm_dirty_ring_indexes *indexes;
> +	bool is_vm_ring;
> +
> +	if (!kvm->dirty_ring_size)
> +		return;
> +
> +	offset = gfn - slot->base_gfn;
> +
> +	if (vcpu) {
> +		as_id = kvm_arch_vcpu_memslots_id(vcpu);
> +	} else {
> +		as_id = 0;

The setting of as_id is wrong, both with and without a vCPU.  as_id should
come from slot->as_id.  It may not be actually broken in the current code
base, but at best it's fragile, e.g. Ben's TDP MMU rewrite[*] adds a call
to mark_page_dirty_in_slot() with a potentially non-zero as_id.

[*] https://lkml.kernel.org/r/20190926231824.149014-25-bgardon@google.com

> +		vcpu = kvm_get_running_vcpu();
> +	}
> +
> +	if (vcpu) {
> +		ring = &vcpu->dirty_ring;
> +		indexes = &vcpu->run->vcpu_ring_indexes;
> +		is_vm_ring = false;
> +	} else {
> +		/*
> +		 * Put onto per vm ring because no vcpu context.  Kick
> +		 * vcpu0 if ring is full.
> +		 */
> +		vcpu = kvm->vcpus[0];

Is this a rare event?

> +		ring = &kvm->vm_dirty_ring;
> +		indexes = &kvm->vm_run->vm_ring_indexes;
> +		is_vm_ring = true;
> +	}
> +
> +	ret = kvm_dirty_ring_push(ring, indexes,
> +				  (as_id << 16)|slot->id, offset,
> +				  is_vm_ring);
> +	if (ret < 0) {
> +		if (is_vm_ring)
> +			pr_warn_once("vcpu %d dirty log overflow\n",
> +				     vcpu->vcpu_id);
> +		else
> +			pr_warn_once("per-vm dirty log overflow\n");
> +		return;
> +	}
> +
> +	if (ret)
> +		kvm_make_request(KVM_REQ_DIRTY_RING_FULL, vcpu);
> +}

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH RFC 01/15] KVM: Move running VCPU from ARM to common code
  2019-12-03 19:01   ` Sean Christopherson
@ 2019-12-04  9:42     ` Paolo Bonzini
  2019-12-09 22:05       ` Peter Xu
  0 siblings, 1 reply; 123+ messages in thread
From: Paolo Bonzini @ 2019-12-04  9:42 UTC (permalink / raw)
  To: Sean Christopherson, Peter Xu
  Cc: linux-kernel, kvm, Dr . David Alan Gilbert, Vitaly Kuznetsov

On 03/12/19 20:01, Sean Christopherson wrote:
> In case it wasn't clear, I strongly dislike adding kvm_get_running_vcpu().
> IMO, it's an unnecessary hack.  The proper change to ensure a valid vCPU is
> seen by mark_page_dirty_in_ring() when there is a current vCPU is to
> plumb the vCPU down through the various call stacks.  Looking up the call
> stacks for mark_page_dirty() and mark_page_dirty_in_slot(), they all
> originate with a vcpu->kvm within a few functions, except for the rare
> case where the write is coming from a non-vcpu ioctl(), in which case
> there is no current vCPU.
> 
> The proper change is obviously much bigger in scope and would require
> touching gobs of arch specific code, but IMO the end result would be worth
> the effort.  E.g. there's a decent chance it would reduce the API between
> common KVM and arch specific code by eliminating the exports of variants
> that take "struct kvm *" instead of "struct kvm_vcpu *".

It's not that simple.  In some cases, the "struct kvm *" cannot be
easily replaced with a "struct kvm_vcpu *" without making the API less
intuitive; for example think of a function that takes a kvm_vcpu pointer
but then calls gfn_to_hva(vcpu->kvm) instead of the expected
kvm_vcpu_gfn_to_hva(vcpu).

That said, looking at the code again after a couple years I agree that
the usage of kvm_get_running_vcpu() is ugly.  But I don't think it's
kvm_get_running_vcpu()'s fault, rather it's the vCPU argument in
mark_page_dirty_in_slot and mark_page_dirty_in_ring that is confusing
and we should not be adding.

kvm_get_running_vcpu() basically means "you can use the per-vCPU ring
and avoid locking", nothing more.  Right now we need the vCPU argument
in mark_page_dirty_in_ring for kvm_arch_vcpu_memslots_id(vcpu), but that
is unnecessary and is the real source of confusion (possibly bugs too)
if it gets out of sync.

Instead, let's add an as_id field to struct kvm_memory_slot (which is
trivial to initialize in __kvm_set_memory_region), and just do

	as_id = slot->as_id;
	vcpu = kvm_get_running_vcpu();

in mark_page_dirty_in_ring.
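
The memslot side would then just be (sketch only):

/* In struct kvm_memory_slot, remember the address space it belongs to: */
	short as_id;

/* And in __kvm_set_memory_region(), where as_id is already known: */
	new.as_id = as_id;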

Paolo


^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH RFC 04/15] KVM: Implement ring-based dirty memory tracking
  2019-12-03 18:46           ` Sean Christopherson
@ 2019-12-04 10:05             ` Paolo Bonzini
  2019-12-07  0:29               ` Sean Christopherson
  2019-12-09 21:54               ` Peter Xu
  0 siblings, 2 replies; 123+ messages in thread
From: Paolo Bonzini @ 2019-12-04 10:05 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Peter Xu, linux-kernel, kvm, Dr . David Alan Gilbert, Vitaly Kuznetsov

On 03/12/19 19:46, Sean Christopherson wrote:
> On Tue, Dec 03, 2019 at 02:48:10PM +0100, Paolo Bonzini wrote:
>> On 02/12/19 22:50, Sean Christopherson wrote:
>>>>
>>>> I discussed this with Paolo, but I think Paolo preferred the per-vm
>>>> ring because there's no good reason to choose vcpu0 as what (1)
>>>> suggested.  While if to choose (2) we probably need to lock even for
>>>> per-cpu ring, so could be a bit slower.
>>> Ya, per-vm is definitely better than dumping on vcpu0.  I'm hoping we can
>>> find a third option that provides comparable performance without using any
>>> per-vcpu rings.
>>>
>>
>> The advantage of per-vCPU rings is that it naturally: 1) parallelizes
>> the processing of dirty pages; 2) makes userspace vCPU thread do more
>> work on vCPUs that dirty more pages.
>>
>> I agree that on the producer side we could reserve multiple entries in
>> the case of PML (and without PML only one entry should be added at a
>> time).  But I'm afraid that things get ugly when the ring is full,
>> because you'd have to wait for all vCPUs to finish publishing the
>> entries they have reserved.
> 
> Ah, I take it the intended model is that userspace will only start pulling
> entries off the ring when KVM explicitly signals that the ring is "full"?

No, it's not.  But perhaps in the asynchronous case you can delay
pushing the reserved entries to the consumer until a moment where no
CPUs have left empty slots in the ring buffer (somebody must have done
multi-producer ring buffers before).  In the ring-full case that is
harder because it requires synchronization.

> Rather than reserve entries, what if vCPUs reserved an entire ring?  Create
> a pool of N=nr_vcpus rings that are shared by all vCPUs.  To mark pages
> dirty, a vCPU claims a ring, pushes the pages into the ring, and then
> returns the ring to the pool.  If pushing pages hits the soft limit, a
> request is made to drain the ring and the ring is not returned to the pool
> until it is drained.
> 
> Except for acquiring a ring, which likely can be heavily optimized, that'd
> allow parallel processing (#1), and would provide a facsimile of #2 as
> pushing more pages onto a ring would naturally increase the likelihood of
> triggering a drain.  And it might be interesting to see the effect of using
> different methods of ring selection, e.g. pure round robin, LRU, last used
> on the current vCPU, etc...

If you are creating nr_vcpus rings, and draining is done on the vCPU
thread that has filled the ring, why not create nr_vcpus+1?  The current
code then is exactly the same as pre-claiming a ring per vCPU and never
releasing it, and using a spinlock to claim the per-VM ring.

However, we could build on top of my other suggestion to add
slot->as_id, and wrap kvm_get_running_vcpu() with a nice API, mimicking
exactly what you've suggested.  Maybe even add a scary comment around
kvm_get_running_vcpu() suggesting that users only call it to avoid
locking, and wrap it with a nice API similar to what get_cpu/put_cpu do
with smp_processor_id.

1) Add a pointer from struct kvm_dirty_ring to struct
kvm_dirty_ring_indexes:

vcpu->dirty_ring->data = &vcpu->run->vcpu_ring_indexes;
kvm->vm_dirty_ring->data = &kvm->vm_run->vm_ring_indexes;

2) push the ring choice and locking to two new functions

struct kvm_ring *kvm_get_dirty_ring(struct kvm *kvm)
{
	struct kvm_vcpu *vcpu = kvm_get_running_vcpu();

	if (vcpu && !WARN_ON_ONCE(vcpu->kvm != kvm)) {
		return &vcpu->dirty_ring;
	} else {
		/*
		 * Put onto per vm ring because no vcpu context.
		 * We'll kick vcpu0 if ring is full.
		 */
		spin_lock(&kvm->vm_dirty_ring->lock);
		return &kvm->vm_dirty_ring;
	}
}

void kvm_put_dirty_ring(struct kvm *kvm,
			struct kvm_dirty_ring *ring)
{
	struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
	bool full = kvm_dirty_ring_used(ring) >= ring->soft_limit;

	if (ring == &kvm->vm_dirty_ring) {
		if (vcpu == NULL)
			vcpu = kvm->vcpus[0];
		spin_unlock(&kvm->vm_dirty_ring->lock);
	}

	if (full)
		kvm_make_request(KVM_REQ_DIRTY_RING_FULL, vcpu);
}

3) simplify kvm_dirty_ring_push to

void kvm_dirty_ring_push(struct kvm_dirty_ring *ring,
			 u32 slot, u64 offset)
{
	/* left as an exercise to the reader */
}

and mark_page_dirty_in_ring to

static void mark_page_dirty_in_ring(struct kvm *kvm,
				    struct kvm_memory_slot *slot,
				    gfn_t gfn)
{
	struct kvm_dirty_ring *ring;

	if (!kvm->dirty_ring_size)
		return;

	ring = kvm_get_dirty_ring(kvm);
	kvm_dirty_ring_push(ring, (slot->as_id << 16) | slot->id,
			    gfn - slot->base_gfn);
	kvm_put_dirty_ring(kvm, ring);
}
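
(The "exercise" would presumably end up very close to the RFC's push,
minus the locking and the return value, e.g.:

void kvm_dirty_ring_push(struct kvm_dirty_ring *ring, u32 slot, u64 offset)
{
	struct kvm_dirty_gfn *entry;

	/*
	 * Locking and the soft-limit request live in kvm_get/put_dirty_ring;
	 * this sketch also ignores the hard-full case.
	 */
	entry = &ring->dirty_gfns[ring->dirty_index & (ring->size - 1)];
	entry->slot = slot;
	entry->offset = offset;
	/* Make the entry visible before advancing the shared index. */
	smp_wmb();
	ring->dirty_index++;
	WRITE_ONCE(ring->data->avail_index, ring->dirty_index);
}
)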

Paolo

>> It's ugly that we _also_ need a per-VM ring, but unfortunately some
>> operations do not really have a vCPU that they can refer to.
> 


^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH RFC 04/15] KVM: Implement ring-based dirty memory tracking
  2019-12-03 19:13   ` Sean Christopherson
@ 2019-12-04 10:14     ` Paolo Bonzini
  2019-12-04 14:33       ` Sean Christopherson
  0 siblings, 1 reply; 123+ messages in thread
From: Paolo Bonzini @ 2019-12-04 10:14 UTC (permalink / raw)
  To: Sean Christopherson, Peter Xu
  Cc: linux-kernel, kvm, Dr . David Alan Gilbert, Vitaly Kuznetsov

On 03/12/19 20:13, Sean Christopherson wrote:
> The setting of as_id is wrong, both with and without a vCPU.  as_id should
> come from slot->as_id.

Which doesn't exist, but is an excellent suggestion nevertheless.

>> +		/*
>> +		 * Put onto per vm ring because no vcpu context.  Kick
>> +		 * vcpu0 if ring is full.
>> +		 */
>> +		vcpu = kvm->vcpus[0];
> 
> Is this a rare event?

Yes, every time a vCPU exit happens, the vCPU is supposed to reap the VM
ring as well.  (Most of the time it will be empty, and while the reaping
of VM ring entries needs locking, the emptiness check doesn't).
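
I.e. something like the following on the reap side (hypothetical names):

/* A plain load decides whether reaping is needed at all; the lock is
 * only taken when the ring is non-empty. */
static bool vm_ring_has_entries(struct kvm_dirty_ring_indexes *indexes,
				u32 last_fetched)
{
	return READ_ONCE(indexes->avail_index) != last_fetched;
}

and only when it returns true does the caller take the VM ring lock and
actually collect entries.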

Paolo

>> +		ring = &kvm->vm_dirty_ring;
>> +		indexes = &kvm->vm_run->vm_ring_indexes;
>> +		is_vm_ring = true;
>> +	}
>> +
>> +	ret = kvm_dirty_ring_push(ring, indexes,
>> +				  (as_id << 16)|slot->id, offset,
>> +				  is_vm_ring);
>> +	if (ret < 0) {
>> +		if (is_vm_ring)
>> +			pr_warn_once("vcpu %d dirty log overflow\n",
>> +				     vcpu->vcpu_id);
>> +		else
>> +			pr_warn_once("per-vm dirty log overflow\n");
>> +		return;
>> +	}
>> +
>> +	if (ret)
>> +		kvm_make_request(KVM_REQ_DIRTY_RING_FULL, vcpu);
>> +}
> 


^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH RFC 04/15] KVM: Implement ring-based dirty memory tracking
  2019-11-29 21:34 ` [PATCH RFC 04/15] KVM: Implement ring-based dirty memory tracking Peter Xu
  2019-12-02 20:10   ` Sean Christopherson
  2019-12-03 19:13   ` Sean Christopherson
@ 2019-12-04 10:38   ` Jason Wang
  2019-12-04 11:04     ` Paolo Bonzini
  2019-12-11 12:53   ` Michael S. Tsirkin
  2019-12-11 17:24   ` Christophe de Dinechin
  4 siblings, 1 reply; 123+ messages in thread
From: Jason Wang @ 2019-12-04 10:38 UTC (permalink / raw)
  To: Peter Xu, linux-kernel, kvm
  Cc: Sean Christopherson, Paolo Bonzini, Dr . David Alan Gilbert,
	Vitaly Kuznetsov, Michael S. Tsirkin


On 2019/11/30 5:34 AM, Peter Xu wrote:
> +int kvm_dirty_ring_push(struct kvm_dirty_ring *ring,
> +			struct kvm_dirty_ring_indexes *indexes,
> +			u32 slot, u64 offset, bool lock)
> +{
> +	int ret;
> +	struct kvm_dirty_gfn *entry;
> +
> +	if (lock)
> +		spin_lock(&ring->lock);
> +
> +	if (kvm_dirty_ring_full(ring)) {
> +		ret = -EBUSY;
> +		goto out;
> +	}
> +
> +	entry = &ring->dirty_gfns[ring->dirty_index & (ring->size - 1)];
> +	entry->slot = slot;
> +	entry->offset = offset;


Haven't gone through the whole series, sorry if this is a silly question,
but I wonder whether things like this will suffer from a similar issue on
virtually tagged archs as mentioned in [1].

Would it be better to allocate the ring from userspace and pass it to KVM
instead?  Then we can use the copy_to/from_user() friends (a little bit
slow on recent CPUs).

[1] https://lkml.org/lkml/2019/4/9/5

Thanks


> +	smp_wmb();
> +	ring->dirty_index++;
> +	WRITE_ONCE(indexes->avail_index, ring->dirty_index);
> +	ret = kvm_dirty_ring_used(ring) >= ring->soft_limit;
> +	pr_info("%s: slot %u offset %llu used %u\n",
> +		__func__, slot, offset, kvm_dirty_ring_used(ring));
> +
> +out:


^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH RFC 00/15] KVM: Dirty ring interface
  2019-11-29 21:34 [PATCH RFC 00/15] KVM: Dirty ring interface Peter Xu
                   ` (15 preceding siblings ...)
  2019-11-30  8:29 ` [PATCH RFC 00/15] KVM: Dirty ring interface Paolo Bonzini
@ 2019-12-04 10:39 ` Jason Wang
  2019-12-04 19:33   ` Peter Xu
  2019-12-11 13:41 ` Christophe de Dinechin
  17 siblings, 1 reply; 123+ messages in thread
From: Jason Wang @ 2019-12-04 10:39 UTC (permalink / raw)
  To: Peter Xu, linux-kernel, kvm
  Cc: Sean Christopherson, Paolo Bonzini, Dr . David Alan Gilbert,
	Vitaly Kuznetsov, Michael S. Tsirkin


On 2019/11/30 5:34 AM, Peter Xu wrote:
> Branch is here:https://github.com/xzpeter/linux/tree/kvm-dirty-ring
>
> Overview
> ============
>
> This is a continued work from Lei Cao<lei.cao@stratus.com>  and Paolo
> on the KVM dirty ring interface.  To make it simple, I'll still start
> with version 1 as RFC.
>
> The new dirty ring interface is another way to collect dirty pages for
> the virtual machine, but it is different from the existing dirty
> logging interface in a few ways, majorly:
>
>    - Data format: The dirty data was in a ring format rather than a
>      bitmap format, so the size of data to sync for dirty logging does
>      not depend on the size of guest memory any more, but speed of
>      dirtying.  Also, the dirty ring is per-vcpu (currently plus
>      another per-vm ring, so total ring number is N+1), while the dirty
>      bitmap is per-vm.
>
>    - Data copy: The sync of dirty pages does not need data copy any more,
>      but instead the ring is shared between the userspace and kernel by
>      page sharings (mmap() on either the vm fd or vcpu fd)
>
>    - Interface: Instead of using the old KVM_GET_DIRTY_LOG,
>      KVM_CLEAR_DIRTY_LOG interfaces, the new ring uses a new interface
>      called KVM_RESET_DIRTY_RINGS when we want to reset the collected
>      dirty pages to protected mode again (works like
>      KVM_CLEAR_DIRTY_LOG, but ring based)
>
> And more.


Looks really interesting; I wonder if we can turn this into a library so
that we can reuse it for vhost.

Thanks


^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH RFC 04/15] KVM: Implement ring-based dirty memory tracking
  2019-12-04 10:38   ` Jason Wang
@ 2019-12-04 11:04     ` Paolo Bonzini
  2019-12-04 19:52       ` Peter Xu
  2019-12-10 13:25       ` Michael S. Tsirkin
  0 siblings, 2 replies; 123+ messages in thread
From: Paolo Bonzini @ 2019-12-04 11:04 UTC (permalink / raw)
  To: Jason Wang, Peter Xu, linux-kernel, kvm
  Cc: Sean Christopherson, Dr . David Alan Gilbert, Vitaly Kuznetsov,
	Michael S. Tsirkin

On 04/12/19 11:38, Jason Wang wrote:
>>
>> +    entry = &ring->dirty_gfns[ring->dirty_index & (ring->size - 1)];
>> +    entry->slot = slot;
>> +    entry->offset = offset;
> 
> 
> Haven't gone through the whole series, sorry if it was a silly question
> but I wonder things like this will suffer from similar issue on
> virtually tagged archs as mentioned in [1].

There is no new infrastructure to track the dirty pages---it's just a
different way to pass them to userspace.

> Is this better to allocate the ring from userspace and set to KVM
> instead? Then we can use copy_to/from_user() friends (a little bit slow
> on recent CPUs).

Yeah, I don't think that would be better than mmap.

Paolo


> [1] https://lkml.org/lkml/2019/4/9/5


^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH RFC 04/15] KVM: Implement ring-based dirty memory tracking
  2019-12-04 10:14     ` Paolo Bonzini
@ 2019-12-04 14:33       ` Sean Christopherson
  0 siblings, 0 replies; 123+ messages in thread
From: Sean Christopherson @ 2019-12-04 14:33 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Peter Xu, linux-kernel, kvm, Dr . David Alan Gilbert, Vitaly Kuznetsov

On Wed, Dec 04, 2019 at 11:14:19AM +0100, Paolo Bonzini wrote:
> On 03/12/19 20:13, Sean Christopherson wrote:
> > The setting of as_id is wrong, both with and without a vCPU.  as_id should
> > come from slot->as_id.
> 
> Which doesn't exist, but is an excellent suggestion nevertheless.

Huh, I explicitly looked at the code to make sure as_id existed before
making this suggestion.  No idea what code I actually pulled up.

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH RFC 00/15] KVM: Dirty ring interface
  2019-12-04 10:39 ` Jason Wang
@ 2019-12-04 19:33   ` Peter Xu
  2019-12-05  6:49     ` Jason Wang
  0 siblings, 1 reply; 123+ messages in thread
From: Peter Xu @ 2019-12-04 19:33 UTC (permalink / raw)
  To: Jason Wang, Paolo Bonzini
  Cc: linux-kernel, kvm, Sean Christopherson, Paolo Bonzini,
	Dr . David Alan Gilbert, Vitaly Kuznetsov, Michael S. Tsirkin

On Wed, Dec 04, 2019 at 06:39:48PM +0800, Jason Wang wrote:
> 
> On 2019/11/30 5:34 AM, Peter Xu wrote:
> > Branch is here:https://github.com/xzpeter/linux/tree/kvm-dirty-ring
> > 
> > Overview
> > ============
> > 
> > This is a continued work from Lei Cao<lei.cao@stratus.com>  and Paolo
> > on the KVM dirty ring interface.  To make it simple, I'll still start
> > with version 1 as RFC.
> > 
> > The new dirty ring interface is another way to collect dirty pages for
> > the virtual machine, but it is different from the existing dirty
> > logging interface in a few ways, majorly:
> > 
> >    - Data format: The dirty data was in a ring format rather than a
> >      bitmap format, so the size of data to sync for dirty logging does
> >      not depend on the size of guest memory any more, but speed of
> >      dirtying.  Also, the dirty ring is per-vcpu (currently plus
> >      another per-vm ring, so total ring number is N+1), while the dirty
> >      bitmap is per-vm.
> > 
> >    - Data copy: The sync of dirty pages does not need data copy any more,
> >      but instead the ring is shared between the userspace and kernel by
> >      page sharings (mmap() on either the vm fd or vcpu fd)
> > 
> >    - Interface: Instead of using the old KVM_GET_DIRTY_LOG,
> >      KVM_CLEAR_DIRTY_LOG interfaces, the new ring uses a new interface
> >      called KVM_RESET_DIRTY_RINGS when we want to reset the collected
> >      dirty pages to protected mode again (works like
> >      KVM_CLEAR_DIRTY_LOG, but ring based)
> > 
> > And more.
> 
> 
> Looks really interesting, I wonder if we can make this as a library then we
> can reuse it for vhost.

So iiuc this ring is mainly for (1) data exchange between kernel
and user, and (2) shared memory.  I think from that pov, yeah, it
should work even for vhost.

It shouldn't be hard to refactor the interfaces to avoid kvm elements,
however I'm not sure how to do that best.  Maybe like irqbypass and
put it into virt/lib/ as a standalone module?  Would it be worth it?

Paolo, what's your take?

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH RFC 04/15] KVM: Implement ring-based dirty memory tracking
  2019-12-04 11:04     ` Paolo Bonzini
@ 2019-12-04 19:52       ` Peter Xu
  2019-12-05  6:51         ` Jason Wang
  2019-12-10 13:25       ` Michael S. Tsirkin
  1 sibling, 1 reply; 123+ messages in thread
From: Peter Xu @ 2019-12-04 19:52 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Jason Wang, linux-kernel, kvm, Sean Christopherson,
	Dr . David Alan Gilbert, Vitaly Kuznetsov, Michael S. Tsirkin

On Wed, Dec 04, 2019 at 12:04:53PM +0100, Paolo Bonzini wrote:
> On 04/12/19 11:38, Jason Wang wrote:
> >>
> >> +    entry = &ring->dirty_gfns[ring->dirty_index & (ring->size - 1)];
> >> +    entry->slot = slot;
> >> +    entry->offset = offset;
> > 
> > 
> > Haven't gone through the whole series, sorry if it was a silly question
> > but I wonder things like this will suffer from similar issue on
> > virtually tagged archs as mentioned in [1].
> 
> There is no new infrastructure to track the dirty pages---it's just a
> different way to pass them to userspace.
> 
> > Is this better to allocate the ring from userspace and set to KVM
> > instead? Then we can use copy_to/from_user() friends (a little bit slow
> > on recent CPUs).
> 
> Yeah, I don't think that would be better than mmap.

Yeah I agree, because I didn't see how copy_to/from_user() would help
with the icache/dcache flushing...

Some context here: Jason first raised this question offlist, on whether
we also need the flush_dcache_page() helpers for operations like kvm
dirty ring accesses.  I feel like we do, however I've got two other
questions, on:

  - if we need to do flush_dcache_page() on kernel-modified pages
    (assuming the same page is also mapped to userspace), then why don't
    we need flush_cache_page() too on the page, where
    flush_cache_page() is defined as not-a-nop on those archs?

  - assuming an arch has a not-a-nop impl for flush_[d]cache_page(),
    would atomic operations like cmpxchg really work for them
    (assuming that ISAs like cmpxchg should depend on cache
    consistency)?

Sorry, these are for sure a bit off topic for the kvm dirty ring
patchset, but since we're at it, I'm raising the questions in case
there are answers..

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH RFC 00/15] KVM: Dirty ring interface
  2019-12-04 19:33   ` Peter Xu
@ 2019-12-05  6:49     ` Jason Wang
  0 siblings, 0 replies; 123+ messages in thread
From: Jason Wang @ 2019-12-05  6:49 UTC (permalink / raw)
  To: Peter Xu, Paolo Bonzini
  Cc: linux-kernel, kvm, Sean Christopherson, Dr . David Alan Gilbert,
	Vitaly Kuznetsov, Michael S. Tsirkin


On 2019/12/5 3:33 AM, Peter Xu wrote:
> On Wed, Dec 04, 2019 at 06:39:48PM +0800, Jason Wang wrote:
>> On 2019/11/30 5:34 AM, Peter Xu wrote:
>>> Branch is here:https://github.com/xzpeter/linux/tree/kvm-dirty-ring
>>>
>>> Overview
>>> ============
>>>
>>> This is a continued work from Lei Cao<lei.cao@stratus.com>  and Paolo
>>> on the KVM dirty ring interface.  To make it simple, I'll still start
>>> with version 1 as RFC.
>>>
>>> The new dirty ring interface is another way to collect dirty pages for
>>> the virtual machine, but it is different from the existing dirty
>>> logging interface in a few ways, majorly:
>>>
>>>     - Data format: The dirty data was in a ring format rather than a
>>>       bitmap format, so the size of data to sync for dirty logging does
>>>       not depend on the size of guest memory any more, but speed of
>>>       dirtying.  Also, the dirty ring is per-vcpu (currently plus
>>>       another per-vm ring, so total ring number is N+1), while the dirty
>>>       bitmap is per-vm.
>>>
>>>     - Data copy: The sync of dirty pages does not need data copy any more,
>>>       but instead the ring is shared between the userspace and kernel by
>>>       page sharings (mmap() on either the vm fd or vcpu fd)
>>>
>>>     - Interface: Instead of using the old KVM_GET_DIRTY_LOG,
>>>       KVM_CLEAR_DIRTY_LOG interfaces, the new ring uses a new interface
>>>       called KVM_RESET_DIRTY_RINGS when we want to reset the collected
>>>       dirty pages to protected mode again (works like
>>>       KVM_CLEAR_DIRTY_LOG, but ring based)
>>>
>>> And more.
>>
>> Looks really interesting, I wonder if we can make this as a library then we
>> can reuse it for vhost.
> So iiuc this ring will majorly for (1) data exchange between kernel
> and user, and (2) shared memory.  I think from that pov yeh it should
> work even for vhost.
>
> It shouldn't be hard to refactor the interfaces to avoid kvm elements,
> however I'm not sure how to do that best.  Maybe like irqbypass and
> put it into virt/lib/ as a standalone module?  Would it be worth it?


Maybe, and it looks to me like a dirty pages reporting API for VFIO is
being proposed at the same time.  It would be helpful to unify them (or
at least leave a chance for other users).

Thanks


>
> Paolo, what's your take?
>


^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH RFC 04/15] KVM: Implement ring-based dirty memory tracking
  2019-12-04 19:52       ` Peter Xu
@ 2019-12-05  6:51         ` Jason Wang
  2019-12-05 12:08           ` Peter Xu
  0 siblings, 1 reply; 123+ messages in thread
From: Jason Wang @ 2019-12-05  6:51 UTC (permalink / raw)
  To: Peter Xu, Paolo Bonzini
  Cc: linux-kernel, kvm, Sean Christopherson, Dr . David Alan Gilbert,
	Vitaly Kuznetsov, Michael S. Tsirkin


On 2019/12/5 3:52 AM, Peter Xu wrote:
> On Wed, Dec 04, 2019 at 12:04:53PM +0100, Paolo Bonzini wrote:
>> On 04/12/19 11:38, Jason Wang wrote:
>>>> +    entry = &ring->dirty_gfns[ring->dirty_index & (ring->size - 1)];
>>>> +    entry->slot = slot;
>>>> +    entry->offset = offset;
>>>
>>> Haven't gone through the whole series, sorry if it was a silly question
>>> but I wonder things like this will suffer from similar issue on
>>> virtually tagged archs as mentioned in [1].
>> There is no new infrastructure to track the dirty pages---it's just a
>> different way to pass them to userspace.
>>
>>> Is this better to allocate the ring from userspace and set to KVM
>>> instead? Then we can use copy_to/from_user() friends (a little bit slow
>>> on recent CPUs).
>> Yeah, I don't think that would be better than mmap.
> Yeah I agree, because I didn't see how copy_to/from_user() helped to
> do icache/dcache flushings...


It looks to me like one advantage is that exactly the same VA is used by
both userspace and kernel, so there will be no aliasing.

Thanks


>
> Some context here: Jason raised this question offlist first on whether
> we should also need these flush_dcache_cache() helpers for operations
> like kvm dirty ring accesses.  I feel like it should, however I've got
> two other questions, on:
>
>    - if we need to do flush_dcache_page() on kernel modified pages
>      (assuming the same page has mapped to userspace), then why don't
>      we need flush_cache_page() too on the page, where
>      flush_cache_page() is defined not-a-nop on those archs?
>
>    - assuming an arch has not-a-nop impl for flush_[d]cache_page(),
>      would atomic operations like cmpxchg really work for them
>      (assuming that ISAs like cmpxchg should depend on cache
>      consistency).
>
> Sorry I think these are for sure a bit out of topic for kvm dirty ring
> patchset, but since we're at it, I'm raising the questions up in case
> there're answers..
>
> Thanks,
>


^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH RFC 04/15] KVM: Implement ring-based dirty memory tracking
  2019-12-05  6:51         ` Jason Wang
@ 2019-12-05 12:08           ` Peter Xu
  2019-12-05 13:12             ` Jason Wang
  0 siblings, 1 reply; 123+ messages in thread
From: Peter Xu @ 2019-12-05 12:08 UTC (permalink / raw)
  To: Jason Wang
  Cc: Paolo Bonzini, linux-kernel, kvm, Sean Christopherson,
	Dr . David Alan Gilbert, Vitaly Kuznetsov, Michael S. Tsirkin

On Thu, Dec 05, 2019 at 02:51:15PM +0800, Jason Wang wrote:
> 
> On 2019/12/5 3:52 AM, Peter Xu wrote:
> > On Wed, Dec 04, 2019 at 12:04:53PM +0100, Paolo Bonzini wrote:
> > > On 04/12/19 11:38, Jason Wang wrote:
> > > > > +    entry = &ring->dirty_gfns[ring->dirty_index & (ring->size - 1)];
> > > > > +    entry->slot = slot;
> > > > > +    entry->offset = offset;
> > > > 
> > > > Haven't gone through the whole series, sorry if it was a silly question
> > > > but I wonder things like this will suffer from similar issue on
> > > > virtually tagged archs as mentioned in [1].
> > > There is no new infrastructure to track the dirty pages---it's just a
> > > different way to pass them to userspace.
> > > 
> > > > Is this better to allocate the ring from userspace and set to KVM
> > > > instead? Then we can use copy_to/from_user() friends (a little bit slow
> > > > on recent CPUs).
> > > Yeah, I don't think that would be better than mmap.
> > Yeah I agree, because I didn't see how copy_to/from_user() helped to
> > do icache/dcache flushings...
> 
> 
> It looks to me one advantage is that exact the same VA is used by both
> userspace and kernel so there will be no alias.

Hmm.. but what if the page is mapped more than once in userspace?  Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH RFC 04/15] KVM: Implement ring-based dirty memory tracking
  2019-12-05 12:08           ` Peter Xu
@ 2019-12-05 13:12             ` Jason Wang
  0 siblings, 0 replies; 123+ messages in thread
From: Jason Wang @ 2019-12-05 13:12 UTC (permalink / raw)
  To: Peter Xu
  Cc: Paolo Bonzini, linux-kernel, kvm, Sean Christopherson,
	Dr . David Alan Gilbert, Vitaly Kuznetsov, Michael S. Tsirkin


On 2019/12/5 8:08 PM, Peter Xu wrote:
> On Thu, Dec 05, 2019 at 02:51:15PM +0800, Jason Wang wrote:
>> On 2019/12/5 3:52 AM, Peter Xu wrote:
>>> On Wed, Dec 04, 2019 at 12:04:53PM +0100, Paolo Bonzini wrote:
>>>> On 04/12/19 11:38, Jason Wang wrote:
>>>>>> +    entry = &ring->dirty_gfns[ring->dirty_index & (ring->size - 1)];
>>>>>> +    entry->slot = slot;
>>>>>> +    entry->offset = offset;
>>>>> Haven't gone through the whole series, sorry if it was a silly question
>>>>> but I wonder things like this will suffer from similar issue on
>>>>> virtually tagged archs as mentioned in [1].
>>>> There is no new infrastructure to track the dirty pages---it's just a
>>>> different way to pass them to userspace.
>>>>
>>>>> Is this better to allocate the ring from userspace and set to KVM
>>>>> instead? Then we can use copy_to/from_user() friends (a little bit slow
>>>>> on recent CPUs).
>>>> Yeah, I don't think that would be better than mmap.
>>> Yeah I agree, because I didn't see how copy_to/from_user() helped to
>>> do icache/dcache flushings...
>>
>> It looks to me one advantage is that exact the same VA is used by both
>> userspace and kernel so there will be no alias.
> Hmm.. but what if the page is mapped more than once in user?  Thanks,


Then it's the responsibility of the userspace program to do the flush, I think.

Thanks

>


^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH RFC 00/15] KVM: Dirty ring interface
  2019-12-03 13:59     ` Paolo Bonzini
@ 2019-12-05 19:30       ` Peter Xu
  2019-12-05 19:59         ` Paolo Bonzini
  0 siblings, 1 reply; 123+ messages in thread
From: Peter Xu @ 2019-12-05 19:30 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: linux-kernel, kvm, Sean Christopherson, Dr . David Alan Gilbert,
	Vitaly Kuznetsov

On Tue, Dec 03, 2019 at 02:59:14PM +0100, Paolo Bonzini wrote:
> On 02/12/19 03:13, Peter Xu wrote:
> >> This is not needed, it will just be a false negative (dirty page that
> >> actually isn't dirty).  The dirty bit will be cleared when userspace
> >> resets the ring buffer; then the instruction will be executed again and
> >> mark the page dirty again.  Since ring full is not a common condition,
> >> it's not a big deal.
> > 
> > Actually I added this only because it failed one of the unit tests
> > when verifying the dirty bits..  But now after a second thought, I
> > probably agree with you that we can change the userspace too to fix
> > this.
> 
> I think there is already a similar case in dirty_log_test when a page is
> dirty but we called KVM_GET_DIRTY_LOG just before it got written to.

If you mean the host_bmap_track (in dirty_log_test.c), that should be
a reversed version of this race (that's where the data is written,
while we didn't see the dirty bit set).  But yes I think I can
probably use the same bitmap to fix the test case, because in both
cases what we want to do is to make sure "the dirty bit of this page
should be set in the next round".

> 
> > I think the steps of the failed test case could be simplified into
> > something like this (assuming the QEMU migration context, might be
> > easier to understand):
> > 
> >   1. page P has data P1
> >   2. vcpu writes to page P, with data P2
> >   3. vmexit (P is still with data P1)
> >   4. mark P as dirty, ring full, user exit
> >   5. collect dirty bit P, migrate P with data P1
> >   6. vcpu run due to some reason, P was written with P2, user exit again
> >      (because ring is already reaching soft limit)
> >   7. do KVM_RESET_DIRTY_RINGS
> 
> Migration should only be done after KVM_RESET_DIRTY_RINGS (think of
> KVM_RESET_DIRTY_RINGS as the equivalent of KVM_CLEAR_DIRTY_LOG).

Totally agree for migration.  It's probably just that the test case
needs fixing.

> 
> >   dirty_log_test-29003 [001] 184503.384328: kvm_entry:            vcpu 1
> >   dirty_log_test-29003 [001] 184503.384329: kvm_exit:             reason EPT_VIOLATION rip 0x40359f info 582 0
> >   dirty_log_test-29003 [001] 184503.384329: kvm_page_fault:       address 7fc036d000 error_code 582
> >   dirty_log_test-29003 [001] 184503.384331: kvm_entry:            vcpu 1
> >   dirty_log_test-29003 [001] 184503.384332: kvm_exit: reason EPT_VIOLATION rip 0x40359f info 582 0
> >   dirty_log_test-29003 [001] 184503.384332: kvm_page_fault:       address 7fc036d000 error_code 582
> >   dirty_log_test-29003 [001] 184503.384332: kvm_dirty_ring_push:  ring 1: dirty 0x37f reset 0x1c0 slot 1 offset 0x37e ret 0 (used 447)
> >   dirty_log_test-29003 [001] 184503.384333: kvm_entry:            vcpu 1
> >   dirty_log_test-29003 [001] 184503.384334: kvm_exit:             reason EPT_VIOLATION rip 0x40359f info 582 0
> >   dirty_log_test-29003 [001] 184503.384334: kvm_page_fault:       address 7fc036e000 error_code 582
> >   dirty_log_test-29003 [001] 184503.384336: kvm_entry:            vcpu 1
> >   dirty_log_test-29003 [001] 184503.384336: kvm_exit:             reason EPT_VIOLATION rip 0x40359f info 582 0
> >   dirty_log_test-29003 [001] 184503.384336: kvm_page_fault:       address 7fc036e000 error_code 582
> >   dirty_log_test-29003 [001] 184503.384337: kvm_dirty_ring_push:  ring 1: dirty 0x380 reset 0x1c0 slot 1 offset 0x37f ret 1 (used 448)
> >   dirty_log_test-29003 [001] 184503.384337: kvm_dirty_ring_exit:  vcpu 1
> >   dirty_log_test-29003 [001] 184503.384338: kvm_fpu:              unload
> >   dirty_log_test-29003 [001] 184503.384340: kvm_userspace_exit:   reason 0x1d (29)
> >   dirty_log_test-29000 [006] 184503.505103: kvm_dirty_ring_reset: ring 1: dirty 0x380 reset 0x380 (used 0)
> >   dirty_log_test-29003 [001] 184503.505184: kvm_fpu:              load
> >   dirty_log_test-29003 [001] 184503.505187: kvm_entry:            vcpu 1
> >   dirty_log_test-29003 [001] 184503.505193: kvm_exit:             reason EPT_VIOLATION rip 0x40359f info 582 0
> >   dirty_log_test-29003 [001] 184503.505194: kvm_page_fault:       address 7fc036f000 error_code 582              <-------- [1]
> >   dirty_log_test-29003 [001] 184503.505206: kvm_entry:            vcpu 1
> >   dirty_log_test-29003 [001] 184503.505207: kvm_exit:             reason EPT_VIOLATION rip 0x40359f info 582 0
> >   dirty_log_test-29003 [001] 184503.505207: kvm_page_fault:       address 7fc036f000 error_code 582
> >   dirty_log_test-29003 [001] 184503.505226: kvm_dirty_ring_push:  ring 1: dirty 0x381 reset 0x380 slot 1 offset 0x380 ret 0 (used 1)
> >   dirty_log_test-29003 [001] 184503.505226: kvm_entry:            vcpu 1
> >   dirty_log_test-29003 [001] 184503.505227: kvm_exit:             reason EPT_VIOLATION rip 0x40359f info 582 0
> >   dirty_log_test-29003 [001] 184503.505228: kvm_page_fault:       address 7fc0370000 error_code 582
> >   dirty_log_test-29003 [001] 184503.505231: kvm_entry:            vcpu 1
> >   ...
> > 
> > The test was trying to continuously write to pages, from above log
> > starting from 7fc036d000. The reason 0x1d (29) is the new dirty ring
> > full exit reason.
> > 
> > So far I'm still unsure of two things:
> > 
> >   1. Why for each page we faulted twice rather than once.  Take the
> >      example of page at 7fc036e000 above, the first fault didn't
> >      trigger the marking dirty path, while only until the 2nd ept
> >      violation did we trigger kvm_dirty_ring_push.
> 
> Not sure about that.  Try enabling kvmmmu tracepoints too, it will tell
> you more of the path that was taken while processing the EPT violation.

These new tracepoints are extremely useful (which I didn't notice
before).

So here's the final culprit...

void kvm_reset_dirty_gfn(struct kvm *kvm, u32 slot, u64 offset, u64 mask)
{
        ...
	spin_lock(&kvm->mmu_lock);
	/* FIXME: we should use a single AND operation, but there is no
	 * applicable atomic API.
	 */
	while (mask) {
		clear_bit_le(offset + __ffs(mask), memslot->dirty_bitmap);
		mask &= mask - 1;
	}

	kvm_arch_mmu_enable_log_dirty_pt_masked(kvm, memslot, offset, mask);
	spin_unlock(&kvm->mmu_lock);
}

The mask is cleared before reaching
kvm_arch_mmu_enable_log_dirty_pt_masked()..
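
(For the record, a minimal sketch of one possible fix is to simply walk
a copy so that the original mask survives for the arch hook - not
necessarily how v2 will do it:)

	u64 tmp = mask;

	spin_lock(&kvm->mmu_lock);
	/* Walk the copy; keep the original mask intact for the arch hook. */
	while (tmp) {
		clear_bit_le(offset + __ffs(tmp), memslot->dirty_bitmap);
		tmp &= tmp - 1;
	}

	kvm_arch_mmu_enable_log_dirty_pt_masked(kvm, memslot, offset, mask);
	spin_unlock(&kvm->mmu_lock);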

The funny thing is that I did have a few more patches to even skip
allocating the dirty_bitmap when the dirty ring is enabled (hence in
that tree I removed this while loop too, so it has no such problem).
However I dropped those patches when I posted the RFC because I didn't
think they were mature, and the selftest didn't complain about that
either..  Though, I do plan to redo that in v2 if you don't disagree.
The major question would be whether the dirty_bitmap could still be of
any use if the dirty ring is enabled.

> 
> If your machine has PML, what you're seeing is likely not-present
> violation, not dirty-protect violation.  Try disabling pml and see if
> the trace changes.
> 
> >   2. Why we didn't get the last page written again after
> >      kvm_userspace_exit (last page was 7fc036e000, and the test failed
> >      because 7fc036e000 detected change however dirty bit unset).  In
> >      this case the first write after KVM_RESET_DIRTY_RINGS is the line
> >      pointed by [1], I thought it should be a rewritten of page
> >      7fc036e000 because when the user exit happens logically the write
> >      should not happen yet and eip should keep.  However at [1] it's
> >      already writting to a new page.
> 
> IIUC you should get, with PML enabled:
> 
> - guest writes to page
> - PML marks dirty bit, causes vmexit
> - host copies PML log to ring, causes userspace exit
> - userspace calls KVM_RESET_DIRTY_RINGS
>   - host marks page as clean
> - userspace calls KVM_RUN
>   - guest writes again to page
> 
> but the page won't be in the ring until after another vmexit happens.
> Therefore, it's okay to reap the pages in the ring asynchronously, but
> there must be a synchronization point in the testcase sooner or later,
> where all CPUs are kicked out of KVM_RUN.  This synchronization point
> corresponds to the migration downtime.

Yep, currently in the test case I used the same signal trick to kick
the vcpu out to make sure PML buffers are flushed during the vmexit,
before the main thread starts to collect dirty bits.

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH RFC 00/15] KVM: Dirty ring interface
  2019-12-05 19:30       ` Peter Xu
@ 2019-12-05 19:59         ` Paolo Bonzini
  2019-12-05 20:52           ` Peter Xu
  0 siblings, 1 reply; 123+ messages in thread
From: Paolo Bonzini @ 2019-12-05 19:59 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-kernel, kvm, Sean Christopherson, Dr . David Alan Gilbert,
	Vitaly Kuznetsov

On 05/12/19 20:30, Peter Xu wrote:
>> Try enabling kvmmmu tracepoints too, it will tell
>> you more of the path that was taken while processing the EPT violation.
>
> These new tracepoints are extremely useful (which I didn't notice
> before).

Yes, they are!

> So here's the final culprit...
> 
> void kvm_reset_dirty_gfn(struct kvm *kvm, u32 slot, u64 offset, u64 mask)
> {
>         ...
> 	spin_lock(&kvm->mmu_lock);
> 	/* FIXME: we should use a single AND operation, but there is no
> 	 * applicable atomic API.
> 	 */
> 	while (mask) {
> 		clear_bit_le(offset + __ffs(mask), memslot->dirty_bitmap);
> 		mask &= mask - 1;
> 	}
> 
> 	kvm_arch_mmu_enable_log_dirty_pt_masked(kvm, memslot, offset, mask);
> 	spin_unlock(&kvm->mmu_lock);
> }
> 
> The mask is cleared before reaching
> kvm_arch_mmu_enable_log_dirty_pt_masked()..

I'm not sure why that results in two vmexits?  (clearing before
kvm_arch_mmu_enable_log_dirty_pt_masked is also what
KVM_{GET,CLEAR}_DIRTY_LOG does).

> The funny thing is that I did have a few more patches to even skip
> allocate the dirty_bitmap when dirty ring is enabled (hence in that
> tree I removed this while loop too, so that has no such problem).
> However I dropped those patches when I posted the RFC because I don't
> think it's mature, and the selftest didn't complain about that
> either..  Though, I do plan to redo that in v2 if you don't disagree.
> The major question would be whether the dirty_bitmap could still be
> for any use if dirty ring is enabled.

Userspace may want a dirty bitmap in addition to a list (for example:
list for migration, bitmap for framebuffer update), but it can also do a
pass over the dirty rings in order to update an internal bitmap.

So I think it makes sense to make it either one or the other.
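
(A rough sketch of that "pass over the dirty rings" on the userspace
side - field and variable names here are only illustrative, not the
series' API:)

	/* Fold harvested ring entries into an internal per-slot bitmap. */
	while (fetch_index != avail_index) {
		struct kvm_dirty_gfn *e = &gfns[fetch_index++ & (size - 1)];

		bitmap[e->slot][e->offset / 64] |= 1ULL << (e->offset % 64);
	}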

Paolo


^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH RFC 00/15] KVM: Dirty ring interface
  2019-12-05 19:59         ` Paolo Bonzini
@ 2019-12-05 20:52           ` Peter Xu
  0 siblings, 0 replies; 123+ messages in thread
From: Peter Xu @ 2019-12-05 20:52 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: linux-kernel, kvm, Sean Christopherson, Dr . David Alan Gilbert,
	Vitaly Kuznetsov

On Thu, Dec 05, 2019 at 08:59:33PM +0100, Paolo Bonzini wrote:
> On 05/12/19 20:30, Peter Xu wrote:
> >> Try enabling kvmmmu tracepoints too, it will tell
> >> you more of the path that was taken while processing the EPT violation.
> >
> > These new tracepoints are extremely useful (which I didn't notice
> > before).
> 
> Yes, they are!

(I forgot to say thanks for teaching me that! :)

> 
> > So here's the final culprit...
> > 
> > void kvm_reset_dirty_gfn(struct kvm *kvm, u32 slot, u64 offset, u64 mask)
> > {
> >         ...
> > 	spin_lock(&kvm->mmu_lock);
> > 	/* FIXME: we should use a single AND operation, but there is no
> > 	 * applicable atomic API.
> > 	 */
> > 	while (mask) {
> > 		clear_bit_le(offset + __ffs(mask), memslot->dirty_bitmap);
> > 		mask &= mask - 1;
> > 	}
> > 
> > 	kvm_arch_mmu_enable_log_dirty_pt_masked(kvm, memslot, offset, mask);
> > 	spin_unlock(&kvm->mmu_lock);
> > }
> > 
> > The mask is cleared before reaching
> > kvm_arch_mmu_enable_log_dirty_pt_masked()..
> 
> I'm not sure why that results in two vmexits?  (clearing before
> kvm_arch_mmu_enable_log_dirty_pt_masked is also what
> KVM_{GET,CLEAR}_DIRTY_LOG does).

Sorry, my fault for not being clear on this.

The kvm_arch_mmu_enable_log_dirty_pt_masked() issue only explains why the
same page is not written again after the ring-full userspace exit (which
is what triggered the real missing dirty bit): because the write bit is
not removed during KVM_RESET_DIRTY_RINGS, the next vmenter will write
directly to the previous page without a vmexit.

The two vmexits are another story - I tracked it down to the fault being
retried because mmu_notifier_seq has changed, hence it goes through this path:

	if (mmu_notifier_retry(vcpu->kvm, mmu_seq))
		goto out_unlock;

It's because try_async_pf() does a writable user page fault, which
probably triggers both the invalidate_range_end and change_pte
notifiers.  A reference trace with EPT enabled:

        kvm_mmu_notifier_change_pte+1
        __mmu_notifier_change_pte+82
        wp_page_copy+1907
        do_wp_page+478
        __handle_mm_fault+3395
        handle_mm_fault+196
        __get_user_pages+681
        get_user_pages_unlocked+172
        __gfn_to_pfn_memslot+290
        try_async_pf+141
        tdp_page_fault+326
        kvm_mmu_page_fault+115
        kvm_arch_vcpu_ioctl_run+2675
        kvm_vcpu_ioctl+536
        do_vfs_ioctl+1029
        ksys_ioctl+94
        __x64_sys_ioctl+22
        do_syscall_64+91

I'm not sure whether that's ideal, but it makes sense to me.
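
(For reference, the surrounding pattern in tdp_page_fault() is roughly
the following - reconstructed here as a sketch, not the exact code:)

	mmu_seq = vcpu->kvm->mmu_notifier_seq;
	smp_rmb();

	/* Faults the page in from user context; this is what can fire the
	 * invalidate/change_pte notifiers and bump mmu_notifier_seq. */
	if (try_async_pf(vcpu, prefault, gfn, gpa, &pfn, write, &map_writable))
		return RET_PF_RETRY;

	spin_lock(&vcpu->kvm->mmu_lock);
	if (mmu_notifier_retry(vcpu->kvm, mmu_seq))
		goto out_unlock;	/* seq changed: drop and fault again */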

> 
> > The funny thing is that I did have a few more patches to even skip
> > allocate the dirty_bitmap when dirty ring is enabled (hence in that
> > tree I removed this while loop too, so that has no such problem).
> > However I dropped those patches when I posted the RFC because I don't
> > think it's mature, and the selftest didn't complain about that
> > either..  Though, I do plan to redo that in v2 if you don't disagree.
> > The major question would be whether the dirty_bitmap could still be
> > for any use if dirty ring is enabled.
> 
> Userspace may want a dirty bitmap in addition to a list (for example:
> list for migration, bitmap for framebuffer update), but it can also do a
> pass over the dirty rings in order to update an internal bitmap.
> 
> So I think it make sense to make it either one or the other.

OK, then I'll do that.

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH RFC 04/15] KVM: Implement ring-based dirty memory tracking
  2019-12-04 10:05             ` Paolo Bonzini
@ 2019-12-07  0:29               ` Sean Christopherson
  2019-12-09  9:37                 ` Paolo Bonzini
  2019-12-09 21:54               ` Peter Xu
  1 sibling, 1 reply; 123+ messages in thread
From: Sean Christopherson @ 2019-12-07  0:29 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Peter Xu, linux-kernel, kvm, Dr . David Alan Gilbert, Vitaly Kuznetsov

On Wed, Dec 04, 2019 at 11:05:47AM +0100, Paolo Bonzini wrote:
> On 03/12/19 19:46, Sean Christopherson wrote:
> > Rather than reserve entries, what if vCPUs reserved an entire ring?  Create
> > a pool of N=nr_vcpus rings that are shared by all vCPUs.  To mark pages
> > dirty, a vCPU claims a ring, pushes the pages into the ring, and then
> > returns the ring to the pool.  If pushing pages hits the soft limit, a
> > request is made to drain the ring and the ring is not returned to the pool
> > until it is drained.
> > 
> > Except for acquiring a ring, which likely can be heavily optimized, that'd
> > allow parallel processing (#1), and would provide a facsimile of #2 as
> > pushing more pages onto a ring would naturally increase the likelihood of
> > triggering a drain.  And it might be interesting to see the effect of using
> > different methods of ring selection, e.g. pure round robin, LRU, last used
> > on the current vCPU, etc...
> 
> If you are creating nr_vcpus rings, and draining is done on the vCPU
> thread that has filled the ring, why not create nr_vcpus+1?  The current
> code then is exactly the same as pre-claiming a ring per vCPU and never
> releasing it, and using a spinlock to claim the per-VM ring.

Because I really don't like kvm_get_running_vcpu() :-)

Binding the rings to vCPUs also makes for an inflexible API, e.g. the
amount of memory required for the rings scales linearly with the number of
vCPUs, or maybe there's a use case for having M:N vCPUs:rings.

That being said, I'm pretty clueless when it comes to implementing and
tuning the userspace side of this type of stuff, so feel free to ignore my
thoughts on the API.
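
A minimal sketch of the pool idea quoted above, just to make the shape
concrete (all names are invented here for illustration; this is not
code from the series):

	/* A spinlock-protected pool of dirty rings shared by all vCPUs.
	 * Assumes a list_head "pool_node" member in struct kvm_dirty_ring. */
	struct dirty_ring_pool {
		spinlock_t lock;
		struct list_head free_rings;
	};

	static struct kvm_dirty_ring *dirty_ring_claim(struct dirty_ring_pool *pool)
	{
		struct kvm_dirty_ring *ring = NULL;

		spin_lock(&pool->lock);
		if (!list_empty(&pool->free_rings)) {
			ring = list_first_entry(&pool->free_rings,
						struct kvm_dirty_ring, pool_node);
			list_del(&ring->pool_node);
		}
		spin_unlock(&pool->lock);

		/* Caller pushes gfns, then returns the ring to the pool (or
		 * requests a drain first if it crossed the soft limit). */
		return ring;
	}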

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH RFC 04/15] KVM: Implement ring-based dirty memory tracking
  2019-12-07  0:29               ` Sean Christopherson
@ 2019-12-09  9:37                 ` Paolo Bonzini
  0 siblings, 0 replies; 123+ messages in thread
From: Paolo Bonzini @ 2019-12-09  9:37 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Peter Xu, linux-kernel, kvm, Dr . David Alan Gilbert, Vitaly Kuznetsov

On 07/12/19 01:29, Sean Christopherson wrote:
> On Wed, Dec 04, 2019 at 11:05:47AM +0100, Paolo Bonzini wrote:
>> On 03/12/19 19:46, Sean Christopherson wrote:
>>> Rather than reserve entries, what if vCPUs reserved an entire ring?  Create
>>> a pool of N=nr_vcpus rings that are shared by all vCPUs.  To mark pages
>>> dirty, a vCPU claims a ring, pushes the pages into the ring, and then
>>> returns the ring to the pool.  If pushing pages hits the soft limit, a
>>> request is made to drain the ring and the ring is not returned to the pool
>>> until it is drained.
>>>
>>> Except for acquiring a ring, which likely can be heavily optimized, that'd
>>> allow parallel processing (#1), and would provide a facsimile of #2 as
>>> pushing more pages onto a ring would naturally increase the likelihood of
>>> triggering a drain.  And it might be interesting to see the effect of using
>>> different methods of ring selection, e.g. pure round robin, LRU, last used
>>> on the current vCPU, etc...
>>
>> If you are creating nr_vcpus rings, and draining is done on the vCPU
>> thread that has filled the ring, why not create nr_vcpus+1?  The current
>> code then is exactly the same as pre-claiming a ring per vCPU and never
>> releasing it, and using a spinlock to claim the per-VM ring.
> 
> Because I really don't like kvm_get_running_vcpu() :-)

I also don't like it particularly, but I think it's okay to wrap it into
a nicer API.

> Binding the rings to vCPUs also makes for an inflexible API, e.g. the
> amount of memory required for the rings scales linearly with the number of
> vCPUs, or maybe there's a use case for having M:N vCPUs:rings.

If we can get rid of the dirty bitmap, the amount of memory is probably
going to be smaller anyway.  For example at 64k per ring, 256 rings
occupy 16 MiB of memory, and that is the cost of dirty bitmaps for 512
GiB of guest memory, and that's probably what you can expect for the
memory of a 256-vCPU guest (at least roughly: if the memory is 128 GiB,
the extra 12 MiB for dirty page rings don't really matter).
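
To spell out the arithmetic behind those numbers (assuming 4 KiB pages
and one dirty bit per page):

    rings:  256 rings * 64 KiB each                    = 16 MiB
    bitmap: 512 GiB / 4 KiB per page / 8 bits per byte = 16 MiB
    bitmap for 128 GiB                                 =  4 MiB  (rings add ~12 MiB)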

Paolo

> That being said, I'm pretty clueless when it comes to implementing and
> tuning the userspace side of this type of stuff, so feel free to ignore my
> thoughts on the API.
> 


^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH RFC 04/15] KVM: Implement ring-based dirty memory tracking
  2019-12-04 10:05             ` Paolo Bonzini
  2019-12-07  0:29               ` Sean Christopherson
@ 2019-12-09 21:54               ` Peter Xu
  2019-12-10 10:07                 ` Paolo Bonzini
  1 sibling, 1 reply; 123+ messages in thread
From: Peter Xu @ 2019-12-09 21:54 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sean Christopherson, linux-kernel, kvm, Dr . David Alan Gilbert,
	Vitaly Kuznetsov

On Wed, Dec 04, 2019 at 11:05:47AM +0100, Paolo Bonzini wrote:
> On 03/12/19 19:46, Sean Christopherson wrote:
> > On Tue, Dec 03, 2019 at 02:48:10PM +0100, Paolo Bonzini wrote:
> >> On 02/12/19 22:50, Sean Christopherson wrote:
> >>>>
> >>>> I discussed this with Paolo, but I think Paolo preferred the per-vm
> >>>> ring because there's no good reason to choose vcpu0 as what (1)
> >>>> suggested.  While if to choose (2) we probably need to lock even for
> >>>> per-cpu ring, so could be a bit slower.
> >>> Ya, per-vm is definitely better than dumping on vcpu0.  I'm hoping we can
> >>> find a third option that provides comparable performance without using any
> >>> per-vcpu rings.
> >>>
> >>
> >> The advantage of per-vCPU rings is that it naturally: 1) parallelizes
> >> the processing of dirty pages; 2) makes userspace vCPU thread do more
> >> work on vCPUs that dirty more pages.
> >>
> >> I agree that on the producer side we could reserve multiple entries in
> >> the case of PML (and without PML only one entry should be added at a
> >> time).  But I'm afraid that things get ugly when the ring is full,
> >> because you'd have to wait for all vCPUs to finish publishing the
> >> entries they have reserved.
> > 
> > Ah, I take it the intended model is that userspace will only start pulling
> > entries off the ring when KVM explicitly signals that the ring is "full"?
> 
> No, it's not.  But perhaps in the asynchronous case you can delay
> pushing the reserved entries to the consumer until a moment where no
> CPUs have left empty slots in the ring buffer (somebody must have done
> multi-producer ring buffers before).  In the ring-full case that is
> harder because it requires synchronization.
> 
> > Rather than reserve entries, what if vCPUs reserved an entire ring?  Create
> > a pool of N=nr_vcpus rings that are shared by all vCPUs.  To mark pages
> > dirty, a vCPU claims a ring, pushes the pages into the ring, and then
> > returns the ring to the pool.  If pushing pages hits the soft limit, a
> > request is made to drain the ring and the ring is not returned to the pool
> > until it is drained.
> > 
> > Except for acquiring a ring, which likely can be heavily optimized, that'd
> > allow parallel processing (#1), and would provide a facsimile of #2 as
> > pushing more pages onto a ring would naturally increase the likelihood of
> > triggering a drain.  And it might be interesting to see the effect of using
> > different methods of ring selection, e.g. pure round robin, LRU, last used
> > on the current vCPU, etc...
> 
> If you are creating nr_vcpus rings, and draining is done on the vCPU
> thread that has filled the ring, why not create nr_vcpus+1?  The current
> code then is exactly the same as pre-claiming a ring per vCPU and never
> releasing it, and using a spinlock to claim the per-VM ring.
> 
> However, we could build on top of my other suggestion to add
> slot->as_id, and wrap kvm_get_running_vcpu() with a nice API, mimicking
> exactly what you've suggested.  Maybe even add a scary comment around
> kvm_get_running_vcpu() suggesting that users only do so to avoid locking
> and wrap it with a nice API.  Similar to what get_cpu/put_cpu do with
> smp_processor_id.
> 
> 1) Add a pointer from struct kvm_dirty_ring to struct
> kvm_dirty_ring_indexes:
> 
> vcpu->dirty_ring->data = &vcpu->run->vcpu_ring_indexes;
> kvm->vm_dirty_ring->data = &kvm->vm_run->vm_ring_indexes;
> 
> 2) push the ring choice and locking to two new functions
> 
> struct kvm_ring *kvm_get_dirty_ring(struct kvm *kvm)
> {
> 	struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
> 
> 	if (vcpu && !WARN_ON_ONCE(vcpu->kvm != kvm)) {
> 		return &vcpu->dirty_ring;
> 	} else {
> 		/*
> 		 * Put onto per vm ring because no vcpu context.
> 		 * We'll kick vcpu0 if ring is full.
> 		 */
> 		spin_lock(&kvm->vm_dirty_ring->lock);
> 		return &kvm->vm_dirty_ring;
> 	}
> }
> 
> void kvm_put_dirty_ring(struct kvm *kvm,
> 			struct kvm_dirty_ring *ring)
> {
> 	struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
> 	bool full = kvm_dirty_ring_used(ring) >= ring->soft_limit;
> 
> 	if (ring == &kvm->vm_dirty_ring) {
> 		if (vcpu == NULL)
> 			vcpu = kvm->vcpus[0];
> 		spin_unlock(&kvm->vm_dirty_ring->lock);
> 	}
> 
> 	if (full)
> 		kvm_make_request(KVM_REQ_DIRTY_RING_FULL, vcpu);
> }
> 
> 3) simplify kvm_dirty_ring_push to
> 
> void kvm_dirty_ring_push(struct kvm_dirty_ring *ring,
> 			 u32 slot, u64 offset)
> {
> 	/* left as an exercise to the reader */
> }
> 
> and mark_page_dirty_in_ring to
> 
> static void mark_page_dirty_in_ring(struct kvm *kvm,
> 				    struct kvm_memory_slot *slot,
> 				    gfn_t gfn)
> {
> 	struct kvm_dirty_ring *ring;
> 
> 	if (!kvm->dirty_ring_size)
> 		return;
> 
> 	ring = kvm_get_dirty_ring(kvm);
> 	kvm_dirty_ring_push(ring, (slot->as_id << 16) | slot->id,
> 			    gfn - slot->base_gfn);
> 	kvm_put_dirty_ring(kvm, ring);
> }

I think I got the major point here.  Unless Sean has some better idea
in the future I'll go with this.

Just recently I noticed that kvm_get_running_vcpu() actually has a
real benefit, in that it gives a very solid answer on whether we're in
a vcpu context - even more accurate than passing vcpu pointers around
(because sometimes we just pass the kvm pointer along the stack even
when we're in a vcpu context, just like what we did with
mark_page_dirty_in_slot).  I'm thinking whether I can start to use
this information in the next post to solve an issue I encountered
with the waitqueue.

The current waitqueue is still problematic in that it could wait even
with the mmu lock held when in a vcpu context.

The issue is that KVM_RESET_DIRTY_RINGS needs the mmu lock to manipulate
the write bits, while it's also the only interface that wakes up the
dirty ring sleepers.  They could deadlock like this:

      main thread                            vcpu thread
      ===========                            ===========
                                             kvm page fault
                                               mark_page_dirty_in_slot
                                               mmu lock taken
                                               mark dirty, ring full
                                               queue on waitqueue
                                               (with mmu lock)
      KVM_RESET_DIRTY_RINGS
        take mmu lock               <------------ deadlock here
        reset ring gfns
        wakeup dirty ring sleepers

And if we can tell that mark_page_dirty_in_slot() is not called in a
vcpu context (e.g. kvm_mmu_page_fault) but in an ioctl context (in
those cases we'll use the per-vm dirty ring), then it's probably fine.

My planned solution:

- When kvm_get_running_vcpu() != NULL, we postpone the waitqueue waits
  until we have finished handling this page fault, probably somewhere
  around vcpu_enter_guest, so that we can do wait_event() after the
  mmu lock is released

- When the per-vm ring is full, I'll do what we do now (wait_event()
  directly in mark_page_dirty_in_ring), assuming it should not be
  called with the mmu lock held

To achieve the above, I think I really need to know exactly whether
we're in a vcpu context, where I suppose kvm_get_running_vcpu() would
work for me, rather than checking against a vcpu pointer passed in.
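
(A very rough sketch of the first bullet above, with invented helper
names, just to show the shape - not code from the series:)

	/* In mark_page_dirty_in_ring(), possibly with mmu_lock held: only
	 * record that we have to wait; the actual sleep happens later,
	 * after mmu_lock is dropped (e.g. around vcpu_enter_guest). */
	if (kvm_dirty_ring_soft_full(ring)) {
		struct kvm_vcpu *vcpu = kvm_get_running_vcpu();

		if (vcpu)
			kvm_make_request(KVM_REQ_DIRTY_RING_FULL, vcpu);
		else
			/* per-vm ring, no mmu_lock held: ok to sleep here */
			wait_event_killable(ring->wq,
					    !kvm_dirty_ring_soft_full(ring));
	}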

I also wanted to let KVM_RUN always return immediately if either the
per-vm ring or the per-vcpu ring reaches the softlimit, instead of
continuing execution until the next dirty-ring-full event.

I'd be glad to receive any early comment before I move on to these.

Thanks!

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH RFC 01/15] KVM: Move running VCPU from ARM to common code
  2019-12-04  9:42     ` Paolo Bonzini
@ 2019-12-09 22:05       ` Peter Xu
  0 siblings, 0 replies; 123+ messages in thread
From: Peter Xu @ 2019-12-09 22:05 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sean Christopherson, linux-kernel, kvm, Dr . David Alan Gilbert,
	Vitaly Kuznetsov

On Wed, Dec 04, 2019 at 10:42:27AM +0100, Paolo Bonzini wrote:
> On 03/12/19 20:01, Sean Christopherson wrote:
> > In case it was clear, I strongly dislike adding kvm_get_running_vcpu().
> > IMO, it's a unnecessary hack.  The proper change to ensure a valid vCPU is
> > seen by mark_page_dirty_in_ring() when there is a current vCPU is to
> > plumb the vCPU down through the various call stacks.  Looking up the call
> > stacks for mark_page_dirty() and mark_page_dirty_in_slot(), they all
> > originate with a vcpu->kvm within a few functions, except for the rare
> > case where the write is coming from a non-vcpu ioctl(), in which case
> > there is no current vCPU.
> > 
> > The proper change is obviously much bigger in scope and would require
> > touching gobs of arch specific code, but IMO the end result would be worth
> > the effort.  E.g. there's a decent chance it would reduce the API between
> > common KVM and arch specific code by eliminating the exports of variants
> > that take "struct kvm *" instead of "struct kvm_vcpu *".
> 
> It's not that simple.  In some cases, the "struct kvm *" cannot be
> easily replaced with a "struct kvm_vcpu *" without making the API less
> intuitive; for example think of a function that takes a kvm_vcpu pointer
> but then calls gfn_to_hva(vcpu->kvm) instead of the expected
> kvm_vcpu_gfn_to_hva(vcpu).
> 
> That said, looking at the code again after a couple years I agree that
> the usage of kvm_get_running_vcpu() is ugly.  But I don't think it's
> kvm_get_running_vcpu()'s fault, rather it's the vCPU argument in
> mark_page_dirty_in_slot and mark_page_dirty_in_ring that is confusing
> and we should not be adding.
> 
> kvm_get_running_vcpu() basically means "you can use the per-vCPU ring
> and avoid locking", nothing more.  Right now we need the vCPU argument
> in mark_page_dirty_in_ring for kvm_arch_vcpu_memslots_id(vcpu), but that
> is unnecessary and is the real source of confusion (possibly bugs too)
> if it gets out of sync.
> 
> Instead, let's add an as_id field to struct kvm_memory_slot (which is
> trivial to initialize in __kvm_set_memory_region), and just do
> 
> 	as_id = slot->as_id;
> 	vcpu = kvm_get_running_vcpu();
> 
> in mark_page_dirty_in_ring.

Looks good.  I'm adding another patch for it, and dropping patch 2 then.

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH RFC 04/15] KVM: Implement ring-based dirty memory tracking
  2019-12-09 21:54               ` Peter Xu
@ 2019-12-10 10:07                 ` Paolo Bonzini
  2019-12-10 15:52                   ` Peter Xu
  0 siblings, 1 reply; 123+ messages in thread
From: Paolo Bonzini @ 2019-12-10 10:07 UTC (permalink / raw)
  To: Peter Xu
  Cc: Sean Christopherson, linux-kernel, kvm, Dr . David Alan Gilbert,
	Vitaly Kuznetsov

On 09/12/19 22:54, Peter Xu wrote:
> Just until recently I noticed that actually kvm_get_running_vcpu() has
> a real benefit in that it gives a very solid result on whether we're
> with the vcpu context, even more accurate than when we pass vcpu
> pointers around (because sometimes we just passed the kvm pointer
> along the stack even if we're with a vcpu context, just like what we
> did with mark_page_dirty_in_slot).

Right, that's the point.

> I'm thinking whether I can start
> to use this information in the next post on solving an issue I
> encountered with the waitqueue.
> 
> Current waitqueue is still problematic in that it could wait even with
> the mmu lock held when with vcpu context.

I think the idea of the soft limit is that the waiting just cannot
happen.  That is, the number of dirtied pages _outside_ the guest (guest
accesses are taken care of by PML, and are subtracted from the soft
limit) cannot exceed hard_limit - (soft_limit + pml_size).

> The issue is KVM_RESET_DIRTY_RINGS needs the mmu lock to manipulate
> the write bits, while it's the only interface to also wake up the
> dirty ring sleepers.  They could dead lock like this:
> 
>       main thread                            vcpu thread
>       ===========                            ===========
>                                              kvm page fault
>                                                mark_page_dirty_in_slot
>                                                mmu lock taken
>                                                mark dirty, ring full
>                                                queue on waitqueue
>                                                (with mmu lock)
>       KVM_RESET_DIRTY_RINGS
>         take mmu lock               <------------ deadlock here
>         reset ring gfns
>         wakeup dirty ring sleepers
> 
> And if we see if the mark_page_dirty_in_slot() is not with a vcpu
> context (e.g. kvm_mmu_page_fault) but with an ioctl context (those
> cases we'll use per-vm dirty ring) then it's probably fine.
> 
> My planned solution:
> 
> - When kvm_get_running_vcpu() != NULL, we postpone the waitqueue waits
>   until we finished handling this page fault, probably in somewhere
>   around vcpu_enter_guest, so that we can do wait_event() after the
>   mmu lock released

I think this can cause a race:

	vCPU 1			vCPU 2		host
	---------------------------------------------------------------
	mark page dirty
				write to page
						treat page as not dirty
	add page to ring

where vCPU 2 skips the clean-page slow path entirely.

Paolo


^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH RFC 04/15] KVM: Implement ring-based dirty memory tracking
  2019-12-04 11:04     ` Paolo Bonzini
  2019-12-04 19:52       ` Peter Xu
@ 2019-12-10 13:25       ` Michael S. Tsirkin
  2019-12-10 13:31         ` Paolo Bonzini
  1 sibling, 1 reply; 123+ messages in thread
From: Michael S. Tsirkin @ 2019-12-10 13:25 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Jason Wang, Peter Xu, linux-kernel, kvm, Sean Christopherson,
	Dr . David Alan Gilbert, Vitaly Kuznetsov

On Wed, Dec 04, 2019 at 12:04:53PM +0100, Paolo Bonzini wrote:
> On 04/12/19 11:38, Jason Wang wrote:
> >>
> >> +    entry = &ring->dirty_gfns[ring->dirty_index & (ring->size - 1)];
> >> +    entry->slot = slot;
> >> +    entry->offset = offset;
> > 
> > 
> > Haven't gone through the whole series, sorry if it was a silly question
> > but I wonder things like this will suffer from similar issue on
> > virtually tagged archs as mentioned in [1].
> 
> There is no new infrastructure to track the dirty pages---it's just a
> different way to pass them to userspace.

Did you guys consider using one of the virtio ring formats?
Maybe reusing vhost code?

If you did and it's not a good fit, this is something good to mention
in the commit log.

I also wonder about performance numbers - any data here?


> > Is this better to allocate the ring from userspace and set to KVM
> > instead? Then we can use copy_to/from_user() friends (a little bit slow
> > on recent CPUs).
> 
> Yeah, I don't think that would be better than mmap.
> 
> Paolo
> 
> 
> > [1] https://lkml.org/lkml/2019/4/9/5


^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH RFC 04/15] KVM: Implement ring-based dirty memory tracking
  2019-12-10 13:25       ` Michael S. Tsirkin
@ 2019-12-10 13:31         ` Paolo Bonzini
  2019-12-10 16:02           ` Peter Xu
  2019-12-10 21:48           ` Michael S. Tsirkin
  0 siblings, 2 replies; 123+ messages in thread
From: Paolo Bonzini @ 2019-12-10 13:31 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Jason Wang, Peter Xu, linux-kernel, kvm, Sean Christopherson,
	Dr . David Alan Gilbert, Vitaly Kuznetsov

On 10/12/19 14:25, Michael S. Tsirkin wrote:
>> There is no new infrastructure to track the dirty pages---it's just a
>> different way to pass them to userspace.
> Did you guys consider using one of the virtio ring formats?
> Maybe reusing vhost code?

There are no used/available entries here, it's unidirectional
(kernel->user).

> If you did and it's not a good fit, this is something good to mention
> in the commit log.
> 
> I also wonder about performance numbers - any data here?

Yes, some numbers would be useful.  Note however that the improvement is
asymptotic, O(#dirtied pages) vs O(#total pages), so it may differ
depending on the workload.

Paolo


^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH RFC 04/15] KVM: Implement ring-based dirty memory tracking
  2019-12-10 10:07                 ` Paolo Bonzini
@ 2019-12-10 15:52                   ` Peter Xu
  2019-12-10 17:09                     ` Paolo Bonzini
  0 siblings, 1 reply; 123+ messages in thread
From: Peter Xu @ 2019-12-10 15:52 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sean Christopherson, linux-kernel, kvm, Dr . David Alan Gilbert,
	Vitaly Kuznetsov

On Tue, Dec 10, 2019 at 11:07:31AM +0100, Paolo Bonzini wrote:
> > I'm thinking whether I can start
> > to use this information in the next post on solving an issue I
> > encountered with the waitqueue.
> > 
> > Current waitqueue is still problematic in that it could wait even with
> > the mmu lock held when with vcpu context.
> 
> I think the idea of the soft limit is that the waiting just cannot
> happen.  That is, the number of dirtied pages _outside_ the guest (guest
> accesses are taken care of by PML, and are subtracted from the soft
> limit) cannot exceed hard_limit - (soft_limit + pml_size).

So the question goes back to: is this guaranteed somehow?  Or do you
prefer us to keep the warn_on_once until it triggers, so that we can
analyze it then (which I doubt will happen..)?

One thing to mention is that for the with-vcpu cases, we can probably
even stop KVM_RUN immediately as soon as either the per-vm or the
per-vcpu ring reaches the softlimit, so for the vcpu case it should be
easier to guarantee that.  What I want to know is about the rest of
the cases, like ioctls or even something not coming from userspace
(which I think I should read up on more later..).

If the answer is yes, I'd be more than glad to drop the waitqueue.

> 
> > The issue is KVM_RESET_DIRTY_RINGS needs the mmu lock to manipulate
> > the write bits, while it's the only interface to also wake up the
> > dirty ring sleepers.  They could dead lock like this:
> > 
> >       main thread                            vcpu thread
> >       ===========                            ===========
> >                                              kvm page fault
> >                                                mark_page_dirty_in_slot
> >                                                mmu lock taken
> >                                                mark dirty, ring full
> >                                                queue on waitqueue
> >                                                (with mmu lock)
> >       KVM_RESET_DIRTY_RINGS
> >         take mmu lock               <------------ deadlock here
> >         reset ring gfns
> >         wakeup dirty ring sleepers
> > 
> > And if we see if the mark_page_dirty_in_slot() is not with a vcpu
> > context (e.g. kvm_mmu_page_fault) but with an ioctl context (those
> > cases we'll use per-vm dirty ring) then it's probably fine.
> > 
> > My planned solution:
> > 
> > - When kvm_get_running_vcpu() != NULL, we postpone the waitqueue waits
> >   until we finished handling this page fault, probably in somewhere
> >   around vcpu_enter_guest, so that we can do wait_event() after the
> >   mmu lock released
> 
> I think this can cause a race:
> 
> 	vCPU 1			vCPU 2		host
> 	---------------------------------------------------------------
> 	mark page dirty
> 				write to page
> 						treat page as not dirty
> 	add page to ring
> 
> where vCPU 2 skips the clean-page slow path entirely.

If we're still following the rule in userspace that we first do RESET
and then collect and send the pages (just like what we've discussed
before), then IMHO it's fine to have vcpu2 skip the slow path?  Because
RESET happens at "treat page as not dirty", so if we are sure that we
only collect and send pages after that point, then the latest "write
to page" data from vcpu2 won't be lost even if vcpu2 is not blocked by
vcpu1's ring being full?

Maybe we can also consider letting mark_page_dirty_in_slot() return a
value, so the upper layer could have a chance to skip the spte update
if mark_page_dirty_in_slot() fails to mark the dirty bit, and return
directly with RET_PF_RETRY.

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH RFC 04/15] KVM: Implement ring-based dirty memory tracking
  2019-12-10 13:31         ` Paolo Bonzini
@ 2019-12-10 16:02           ` Peter Xu
  2019-12-10 21:53             ` Michael S. Tsirkin
  2019-12-10 21:48           ` Michael S. Tsirkin
  1 sibling, 1 reply; 123+ messages in thread
From: Peter Xu @ 2019-12-10 16:02 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Michael S. Tsirkin, Jason Wang, linux-kernel, kvm,
	Sean Christopherson, Dr . David Alan Gilbert, Vitaly Kuznetsov

On Tue, Dec 10, 2019 at 02:31:54PM +0100, Paolo Bonzini wrote:
> On 10/12/19 14:25, Michael S. Tsirkin wrote:
> >> There is no new infrastructure to track the dirty pages---it's just a
> >> different way to pass them to userspace.
> > Did you guys consider using one of the virtio ring formats?
> > Maybe reusing vhost code?
> 
> There are no used/available entries here, it's unidirectional
> (kernel->user).

Agreed.  Vring could be overkill IMHO (the whole dirty_ring.c is only
100+ LOC).

> 
> > If you did and it's not a good fit, this is something good to mention
> > in the commit log.
> > 
> > I also wonder about performance numbers - any data here?
> 
> Yes some numbers would be useful.  Note however that the improvement is
> asymptotical, O(#dirtied pages) vs O(#total pages) so it may differ
> depending on the workload.

Yes.  I plan to give some numbers when I start to work on the QEMU
series (after this lands).  However as Paolo said, those numbers would
probably only be for some special cases where I know the dirty ring
could win.  Frankly speaking I don't even know whether we should
change the default logging mode when the QEMU work is done - I feel
like the old logging interface is still good in many major cases
(small vms, or high dirty rates).  It could be that we just offer
another option that the user could consider for solving specific
problems.

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH RFC 04/15] KVM: Implement ring-based dirty memory tracking
  2019-12-10 15:52                   ` Peter Xu
@ 2019-12-10 17:09                     ` Paolo Bonzini
  2019-12-15 17:21                       ` Peter Xu
  0 siblings, 1 reply; 123+ messages in thread
From: Paolo Bonzini @ 2019-12-10 17:09 UTC (permalink / raw)
  To: Peter Xu
  Cc: Sean Christopherson, linux-kernel, kvm, Dr . David Alan Gilbert,
	Vitaly Kuznetsov

On 10/12/19 16:52, Peter Xu wrote:
> On Tue, Dec 10, 2019 at 11:07:31AM +0100, Paolo Bonzini wrote:
>>> I'm thinking whether I can start
>>> to use this information in the next post on solving an issue I
>>> encountered with the waitqueue.
>>>
>>> Current waitqueue is still problematic in that it could wait even with
>>> the mmu lock held when with vcpu context.
>>
>> I think the idea of the soft limit is that the waiting just cannot
>> happen.  That is, the number of dirtied pages _outside_ the guest (guest
>> accesses are taken care of by PML, and are subtracted from the soft
>> limit) cannot exceed hard_limit - (soft_limit + pml_size).
> 
> So the question go backs to, whether this is guaranteed somehow?  Or
> do you prefer us to keep the warn_on_once until it triggers then we
> can analyze (which I doubt..)?

Yes, I would like to keep the WARN_ON_ONCE just because you never know.

Of course it would be much better to audit the calls to kvm_write_guest
and figure out how many could trigger (e.g. two from the operands of an
emulated instruction, 5 from a nested EPT walk, 1 from a page walk, etc.).

> One thing to mention is that for with-vcpu cases, we probably can even
> stop KVM_RUN immediately as long as either the per-vm or per-vcpu ring
> reaches the softlimit, then for vcpu case it should be easier to
> guarantee that.  What I want to know is the rest of cases like ioctls
> or even something not from the userspace (which I think I should read
> more later..).

Which ioctls?  Most ioctls shouldn't dirty memory at all.

>>> And if we see if the mark_page_dirty_in_slot() is not with a vcpu
>>> context (e.g. kvm_mmu_page_fault) but with an ioctl context (those
>>> cases we'll use per-vm dirty ring) then it's probably fine.
>>>
>>> My planned solution:
>>>
>>> - When kvm_get_running_vcpu() != NULL, we postpone the waitqueue waits
>>>   until we finished handling this page fault, probably in somewhere
>>>   around vcpu_enter_guest, so that we can do wait_event() after the
>>>   mmu lock released
>>
>> I think this can cause a race:
>>
>> 	vCPU 1			vCPU 2		host
>> 	---------------------------------------------------------------
>> 	mark page dirty
>> 				write to page
>> 						treat page as not dirty
>> 	add page to ring
>>
>> where vCPU 2 skips the clean-page slow path entirely.
> 
> If we're still with the rule in userspace that we first do RESET then
> collect and send the pages (just like what we've discussed before),
> then IMHO it's fine to have vcpu2 to skip the slow path?  Because
> RESET happens at "treat page as not dirty", then if we are sure that
> we only collect and send pages after that point, then the latest
> "write to page" data from vcpu2 won't be lost even if vcpu2 is not
> blocked by vcpu1's ring full?

Good point, the race would become

 	vCPU 1			vCPU 2		host
 	---------------------------------------------------------------
 	mark page dirty
 				write to page
						reset rings
						  wait for mmu lock
 	add page to ring
	release mmu lock
						  ...do reset...
						  release mmu lock
						page is now dirty

> Maybe we can also consider to let mark_page_dirty_in_slot() return a
> value, then the upper layer could have a chance to skip the spte
> update if mark_page_dirty_in_slot() fails to mark the dirty bit, so it
> can return directly with RET_PF_RETRY.

I don't think that's possible, most writes won't come from a page fault
path and cannot retry.

Paolo


^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH RFC 04/15] KVM: Implement ring-based dirty memory tracking
  2019-12-10 13:31         ` Paolo Bonzini
  2019-12-10 16:02           ` Peter Xu
@ 2019-12-10 21:48           ` Michael S. Tsirkin
  1 sibling, 0 replies; 123+ messages in thread
From: Michael S. Tsirkin @ 2019-12-10 21:48 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Jason Wang, Peter Xu, linux-kernel, kvm, Sean Christopherson,
	Dr . David Alan Gilbert, Vitaly Kuznetsov

On Tue, Dec 10, 2019 at 02:31:54PM +0100, Paolo Bonzini wrote:
> On 10/12/19 14:25, Michael S. Tsirkin wrote:
> >> There is no new infrastructure to track the dirty pages---it's just a
> >> different way to pass them to userspace.
> > Did you guys consider using one of the virtio ring formats?
> > Maybe reusing vhost code?
> 
> There are no used/available entries here, it's unidirectional
> (kernel->user).

Didn't look at the design yet, but flow control (to prevent overflow)
goes the other way, doesn't it?  That's what "used" is, essentially.

> > If you did and it's not a good fit, this is something good to mention
> > in the commit log.
> > 
> > I also wonder about performance numbers - any data here?
> 
> Yes some numbers would be useful.  Note however that the improvement is
> asymptotical, O(#dirtied pages) vs O(#total pages) so it may differ
> depending on the workload.
> 
> Paolo


^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH RFC 04/15] KVM: Implement ring-based dirty memory tracking
  2019-12-10 16:02           ` Peter Xu
@ 2019-12-10 21:53             ` Michael S. Tsirkin
  2019-12-11  9:05               ` Paolo Bonzini
  0 siblings, 1 reply; 123+ messages in thread
From: Michael S. Tsirkin @ 2019-12-10 21:53 UTC (permalink / raw)
  To: Peter Xu
  Cc: Paolo Bonzini, Jason Wang, linux-kernel, kvm,
	Sean Christopherson, Dr . David Alan Gilbert, Vitaly Kuznetsov

On Tue, Dec 10, 2019 at 11:02:11AM -0500, Peter Xu wrote:
> On Tue, Dec 10, 2019 at 02:31:54PM +0100, Paolo Bonzini wrote:
> > On 10/12/19 14:25, Michael S. Tsirkin wrote:
> > >> There is no new infrastructure to track the dirty pages---it's just a
> > >> different way to pass them to userspace.
> > > Did you guys consider using one of the virtio ring formats?
> > > Maybe reusing vhost code?
> > 
> > There are no used/available entries here, it's unidirectional
> > (kernel->user).
> 
> Agreed.  Vring could be an overkill IMHO (the whole dirty_ring.c is
> 100+ LOC only).


I guess you don't do polling/event suppression and the other tricks
that virtio came up with for speed then? Why wouldn't they be helpful
for kvm?  To put it another way, LOC is irrelevant; virtio is already
in the kernel.

Anyway, this is something to be discussed in the cover letter.

> > 
> > > If you did and it's not a good fit, this is something good to mention
> > > in the commit log.
> > > 
> > > I also wonder about performance numbers - any data here?
> > 
> > Yes some numbers would be useful.  Note however that the improvement is
> > asymptotical, O(#dirtied pages) vs O(#total pages) so it may differ
> > depending on the workload.
> 
> Yes.  I plan to give some numbers when start to work on the QEMU
> series (after this lands).  However as Paolo said, those numbers would
> probably only be with some special case where I know the dirty ring
> could win.  Frankly speaking I don't even know whether we should
> change the default logging mode when the QEMU work is done - I feel
> like the old logging interface is still good in many major cases
> (small vms, or high dirty rates).  It could be that we just offer
> another option when the user could consider to solve specific problems.
> 
> Thanks,
> 
> -- 
> Peter Xu


^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH RFC 04/15] KVM: Implement ring-based dirty memory tracking
  2019-12-10 21:53             ` Michael S. Tsirkin
@ 2019-12-11  9:05               ` Paolo Bonzini
  2019-12-11 13:04                 ` Michael S. Tsirkin
  0 siblings, 1 reply; 123+ messages in thread
From: Paolo Bonzini @ 2019-12-11  9:05 UTC (permalink / raw)
  To: Michael S. Tsirkin, Peter Xu
  Cc: Jason Wang, linux-kernel, kvm, Sean Christopherson,
	Dr . David Alan Gilbert, Vitaly Kuznetsov

On 10/12/19 22:53, Michael S. Tsirkin wrote:
> On Tue, Dec 10, 2019 at 11:02:11AM -0500, Peter Xu wrote:
>> On Tue, Dec 10, 2019 at 02:31:54PM +0100, Paolo Bonzini wrote:
>>> On 10/12/19 14:25, Michael S. Tsirkin wrote:
>>>>> There is no new infrastructure to track the dirty pages---it's just a
>>>>> different way to pass them to userspace.
>>>> Did you guys consider using one of the virtio ring formats?
>>>> Maybe reusing vhost code?
>>>
>>> There are no used/available entries here, it's unidirectional
>>> (kernel->user).
>>
>> Agreed.  Vring could be an overkill IMHO (the whole dirty_ring.c is
>> 100+ LOC only).
> 
> I guess you don't do polling/ event suppression and other tricks that
> virtio came up with for speed then?

There are no interrupts either, so no need for event suppression.  You
have vmexits when the ring gets full (and that needs to be synchronous),
but apart from that the migration thread will poll the rings once when
it needs to send more pages.

Paolo


^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH RFC 04/15] KVM: Implement ring-based dirty memory tracking
  2019-11-29 21:34 ` [PATCH RFC 04/15] KVM: Implement ring-based dirty memory tracking Peter Xu
                     ` (2 preceding siblings ...)
  2019-12-04 10:38   ` Jason Wang
@ 2019-12-11 12:53   ` Michael S. Tsirkin
  2019-12-11 14:14     ` Paolo Bonzini
  2019-12-11 20:59     ` Peter Xu
  2019-12-11 17:24   ` Christophe de Dinechin
  4 siblings, 2 replies; 123+ messages in thread
From: Michael S. Tsirkin @ 2019-12-11 12:53 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-kernel, kvm, Sean Christopherson, Paolo Bonzini,
	Dr . David Alan Gilbert, Vitaly Kuznetsov

On Fri, Nov 29, 2019 at 04:34:54PM -0500, Peter Xu wrote:
> This patch is heavily based on previous work from Lei Cao
> <lei.cao@stratus.com> and Paolo Bonzini <pbonzini@redhat.com>. [1]
> 
> KVM currently uses large bitmaps to track dirty memory.  These bitmaps
> are copied to userspace when userspace queries KVM for its dirty page
> information.  The use of bitmaps is mostly sufficient for live
> migration, as large parts of memory are dirtied from one log-dirty
> pass to another.  However, in a checkpointing system, the number of
> dirty pages is small and in fact it is often bounded---the VM is
> paused when it has dirtied a pre-defined number of pages. Traversing a
> large, sparsely populated bitmap to find set bits is time-consuming,
> as is copying the bitmap to user-space.
> 
> A similar issue will be there for live migration when the guest memory
> is huge while the page dirty procedure is trivial.  In that case for
> each dirty sync we need to pull the whole dirty bitmap to userspace
> and analyse every bit even if it's mostly zeros.
> 
> The preferred data structure for above scenarios is a dense list of
> guest frame numbers (GFN).  This patch series stores the dirty list in
> kernel memory that can be memory mapped into userspace to allow speedy
> harvesting.
> 
> We defined two new data structures:
> 
>   struct kvm_dirty_ring;
>   struct kvm_dirty_ring_indexes;
> 
> Firstly, kvm_dirty_ring is defined to represent a ring of dirty
> pages.  When dirty tracking is enabled, we can push dirty gfn onto the
> ring.
> 
> Secondly, kvm_dirty_ring_indexes is defined to represent the
> user/kernel interface of each ring.  Currently it contains two
> indexes: (1) avail_index represents where we should push our next
> PFN (written by kernel), while (2) fetch_index represents where the
> userspace should fetch the next dirty PFN (written by userspace).
> 
> One complete ring is composed by one kvm_dirty_ring plus its
> corresponding kvm_dirty_ring_indexes.
> 
> Currently, we have N+1 rings for each VM of N vcpus:
> 
>   - for each vcpu, we have 1 per-vcpu dirty ring,
>   - for each vm, we have 1 per-vm dirty ring
> 
> Please refer to the documentation update in this patch for more
> details.
> 
> Note that this patch implements the core logic of dirty ring buffer.
> It's still disabled for all archs for now.  Also, we'll address some
> of the other issues in follow up patches before it's firstly enabled
> on x86.
> 
> [1] https://patchwork.kernel.org/patch/10471409/
> 
> Signed-off-by: Lei Cao <lei.cao@stratus.com>
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>


Thanks, that's interesting.

> ---
>  Documentation/virt/kvm/api.txt | 109 +++++++++++++++
>  arch/x86/kvm/Makefile          |   3 +-
>  include/linux/kvm_dirty_ring.h |  67 +++++++++
>  include/linux/kvm_host.h       |  33 +++++
>  include/linux/kvm_types.h      |   1 +
>  include/uapi/linux/kvm.h       |  36 +++++
>  virt/kvm/dirty_ring.c          | 156 +++++++++++++++++++++
>  virt/kvm/kvm_main.c            | 240 ++++++++++++++++++++++++++++++++-
>  8 files changed, 642 insertions(+), 3 deletions(-)
>  create mode 100644 include/linux/kvm_dirty_ring.h
>  create mode 100644 virt/kvm/dirty_ring.c
> 
> diff --git a/Documentation/virt/kvm/api.txt b/Documentation/virt/kvm/api.txt
> index 49183add44e7..fa622c9a2eb8 100644
> --- a/Documentation/virt/kvm/api.txt
> +++ b/Documentation/virt/kvm/api.txt
> @@ -231,6 +231,7 @@ Based on their initialization different VMs may have different capabilities.
>  It is thus encouraged to use the vm ioctl to query for capabilities (available
>  with KVM_CAP_CHECK_EXTENSION_VM on the vm fd)
>  
> +
>  4.5 KVM_GET_VCPU_MMAP_SIZE
>  
>  Capability: basic
> @@ -243,6 +244,18 @@ The KVM_RUN ioctl (cf.) communicates with userspace via a shared
>  memory region.  This ioctl returns the size of that region.  See the
>  KVM_RUN documentation for details.
>  
> +Besides the size of the KVM_RUN communication region, other areas of
> +the VCPU file descriptor can be mmap-ed, including:
> +
> +- if KVM_CAP_COALESCED_MMIO is available, a page at
> +  KVM_COALESCED_MMIO_PAGE_OFFSET * PAGE_SIZE; for historical reasons,
> +  this page is included in the result of KVM_GET_VCPU_MMAP_SIZE.
> +  KVM_CAP_COALESCED_MMIO is not documented yet.
> +
> +- if KVM_CAP_DIRTY_LOG_RING is available, a number of pages at
> +  KVM_DIRTY_LOG_PAGE_OFFSET * PAGE_SIZE.  For more information on
> +  KVM_CAP_DIRTY_LOG_RING, see section 8.3.
> +
>  
>  4.6 KVM_SET_MEMORY_REGION
>  

PAGE_SIZE being which value?  It's not always trivial for
userspace to know what the kernel's PAGE_SIZE is ...
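
(The best userspace can do is ask libc, e.g. the trivial sketch below --
sysconf() reports the page size the process sees, and one has to assume
that this matches the unit the mmap offsets are expressed in:)

    #include <unistd.h>

    /* On Linux this returns the base page size the kernel exposes to
     * the process, which mmap() offsets are normally expressed in. */
    long page_size = sysconf(_SC_PAGESIZE);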


> @@ -5358,6 +5371,7 @@ CPU when the exception is taken. If this virtual SError is taken to EL1 using
>  AArch64, this value will be reported in the ISS field of ESR_ELx.
>  
>  See KVM_CAP_VCPU_EVENTS for more details.
> +
>  8.20 KVM_CAP_HYPERV_SEND_IPI
>  
>  Architectures: x86
> @@ -5365,6 +5379,7 @@ Architectures: x86
>  This capability indicates that KVM supports paravirtualized Hyper-V IPI send
>  hypercalls:
>  HvCallSendSyntheticClusterIpi, HvCallSendSyntheticClusterIpiEx.
> +
>  8.21 KVM_CAP_HYPERV_DIRECT_TLBFLUSH
>  
>  Architecture: x86
> @@ -5378,3 +5393,97 @@ handling by KVM (as some KVM hypercall may be mistakenly treated as TLB
>  flush hypercalls by Hyper-V) so userspace should disable KVM identification
>  in CPUID and only exposes Hyper-V identification. In this case, guest
>  thinks it's running on Hyper-V and only use Hyper-V hypercalls.
> +
> +8.22 KVM_CAP_DIRTY_LOG_RING
> +
> +Architectures: x86
> +Parameters: args[0] - size of the dirty log ring
> +
> +KVM is capable of tracking dirty memory using ring buffers that are
> +mmaped into userspace; there is one dirty ring per vcpu and one global
> +ring per vm.
> +
> +One dirty ring has the following two major structures:
> +
> +struct kvm_dirty_ring {
> +	u16 dirty_index;
> +	u16 reset_index;
> +	u32 size;
> +	u32 soft_limit;
> +	spinlock_t lock;
> +	struct kvm_dirty_gfn *dirty_gfns;
> +};
> +
> +struct kvm_dirty_ring_indexes {
> +	__u32 avail_index; /* set by kernel */
> +	__u32 fetch_index; /* set by userspace */

Sticking these next to each other seems to guarantee cache conflicts.

Avail/Fetch seems to mimic Virtio's avail/used exactly.  I am not saying
you must reuse the code really, but I think you should take a hard look
at e.g. the virtio packed ring structure. We spent a bunch of time
optimizing it for cache utilization. It seems kernel is the driver,
making entries available, and userspace the device, using them.
Again let's not develop a thread about this, but I think
this is something to consider and discuss in future versions
of the patches.


> +};
> +
> +While for each of the dirty entry it's defined as:
> +
> +struct kvm_dirty_gfn {

What does GFN stand for?

> +        __u32 pad;
> +        __u32 slot; /* as_id | slot_id */
> +        __u64 offset;
> +};

offset of what? a 4K page right? Seems like a waste e.g. for
hugetlbfs... How about replacing pad with size instead?

> +
> +The fields in kvm_dirty_ring will be only internal to KVM itself,
> +while the fields in kvm_dirty_ring_indexes will be exposed to
> +userspace to be either read or written.

I'm not sure what you are trying to say here. kvm_dirty_gfn
seems to be part of UAPI.

> +
> +The two indices in the ring buffer are free running counters.
> +
> +In pseudocode, processing the ring buffer looks like this:
> +
> +	idx = load-acquire(&ring->fetch_index);
> +	while (idx != ring->avail_index) {
> +		struct kvm_dirty_gfn *entry;
> +		entry = &ring->dirty_gfns[idx & (size - 1)];
> +		...
> +
> +		idx++;
> +	}
> +	ring->fetch_index = idx;
> +
> +Userspace calls KVM_ENABLE_CAP ioctl right after KVM_CREATE_VM ioctl
> +to enable this capability for the new guest and set the size of the
> +rings.  It is only allowed before creating any vCPU, and the size of
> +the ring must be a power of two.

All these seem like arbitrary limitations to me.

Sizing the ring correctly might prove to be a challenge.

Thus I think there's value in being able to resize the rings
without destroying VCPUs.

Also, a power-of-two size just saves a branch here and there,
but wastes lots of memory.  Just wrap the index around to
0 and then users can select any size?
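
For example, a sketch of the index handling without the power-of-two
requirement (explicit wrap-around instead of the free-running counters
used in this patch; none of this is in the series):

    #include <stdint.h>

    /* Advance an index with explicit wrap-around so any ring size
     * works.  The trade-off: full vs. empty can no longer be told
     * apart by subtracting the two free-running counters, so one slot
     * stays unused (or a separate count is kept). */
    static inline uint32_t ring_advance(uint32_t idx, uint32_t size)
    {
            return (idx + 1 == size) ? 0 : idx + 1;
    }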



>  The larger the ring buffer, the less
> +likely the ring is full and the VM is forced to exit to userspace. The
> +optimal size depends on the workload, but it is recommended that it be
> +at least 64 KiB (4096 entries).

OTOH larger buffers put lots of pressure on the system cache.


> +
> +After the capability is enabled, userspace can mmap the global ring
> +buffer (kvm_dirty_gfn[], offset KVM_DIRTY_LOG_PAGE_OFFSET) and the
> +indexes (kvm_dirty_ring_indexes, offset 0) from the VM file
> +descriptor.  The per-vcpu dirty ring instead is mmapped when the vcpu
> +is created, similar to the kvm_run struct (kvm_dirty_ring_indexes
> +locates inside kvm_run, while kvm_dirty_gfn[] at offset
> +KVM_DIRTY_LOG_PAGE_OFFSET).
> +
> +Just like for dirty page bitmaps, the buffer tracks writes to
> +all user memory regions for which the KVM_MEM_LOG_DIRTY_PAGES flag was
> +set in KVM_SET_USER_MEMORY_REGION.  Once a memory region is registered
> +with the flag set, userspace can start harvesting dirty pages from the
> +ring buffer.
> +
> +To harvest the dirty pages, userspace accesses the mmaped ring buffer
> +to read the dirty GFNs up to avail_index, and sets the fetch_index
> +accordingly.  This can be done when the guest is running or paused,
> +and dirty pages need not be collected all at once.  After processing
> +one or more entries in the ring buffer, userspace calls the VM ioctl
> +KVM_RESET_DIRTY_RINGS to notify the kernel that it has updated
> +fetch_index and to mark those pages clean.  Therefore, the ioctl
> +must be called *before* reading the content of the dirty pages.
> +
> +However, there is a major difference comparing to the
> +KVM_GET_DIRTY_LOG interface in that when reading the dirty ring from
> +userspace it's still possible that the kernel has not yet flushed the
> +hardware dirty buffers into the kernel buffer.  To achieve that, one
> +needs to kick the vcpu out for a hardware buffer flush (vmexit).
> +
> +If one of the ring buffers is full, the guest will exit to userspace
> +with the exit reason set to KVM_EXIT_DIRTY_LOG_FULL, and the
> +KVM_RUN ioctl will return -EINTR. Once that happens, userspace
> +should pause all the vcpus, then harvest all the dirty pages and
> +rearm the dirty traps. It can unpause the guest after that.

This last item means that the performance impact of the feature is
really hard to predict.  It can improve some workloads drastically, or
slow some down.


One solution could be to actually allow using this together with the
existing bitmap.  Userspace can then decide whether it wants to block
the VCPU on ring full, or just record the ring-full condition and
recover by bitmap scanning.
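
Roughly, the userspace policy could be as simple as the sketch below
(KVM_EXIT_DIRTY_RING_FULL and KVM_RESET_DIRTY_RINGS are from this
series; harvest_all_rings(), ring_overflowed and the
'use_bitmap_fallback' knob are hypothetical):

    /* In the per-vcpu run loop, after KVM_RUN returns: */
    switch (run->exit_reason) {
    case KVM_EXIT_DIRTY_RING_FULL:
            if (!use_bitmap_fallback) {
                    /* conservative: drain the rings before re-entering */
                    harvest_all_rings();
                    ioctl(vm_fd, KVM_RESET_DIRTY_RINGS);
            } else {
                    /* optimistic: note that it happened and recover
                     * later by scanning the dirty bitmap as today */
                    ring_overflowed = true;
            }
            break;
    /* ... other exit reasons ... */
    }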


> diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile
> index b19ef421084d..0acee817adfb 100644
> --- a/arch/x86/kvm/Makefile
> +++ b/arch/x86/kvm/Makefile
> @@ -5,7 +5,8 @@ ccflags-y += -Iarch/x86/kvm
>  KVM := ../../../virt/kvm
>  
>  kvm-y			+= $(KVM)/kvm_main.o $(KVM)/coalesced_mmio.o \
> -				$(KVM)/eventfd.o $(KVM)/irqchip.o $(KVM)/vfio.o
> +				$(KVM)/eventfd.o $(KVM)/irqchip.o $(KVM)/vfio.o \
> +				$(KVM)/dirty_ring.o
>  kvm-$(CONFIG_KVM_ASYNC_PF)	+= $(KVM)/async_pf.o
>  
>  kvm-y			+= x86.o emulate.o i8259.o irq.o lapic.o \
> diff --git a/include/linux/kvm_dirty_ring.h b/include/linux/kvm_dirty_ring.h
> new file mode 100644
> index 000000000000..8335635b7ff7
> --- /dev/null
> +++ b/include/linux/kvm_dirty_ring.h
> @@ -0,0 +1,67 @@
> +#ifndef KVM_DIRTY_RING_H
> +#define KVM_DIRTY_RING_H
> +
> +/*
> + * struct kvm_dirty_ring is defined in include/uapi/linux/kvm.h.
> + *
> + * dirty_ring:  shared with userspace via mmap. It is the compact list
> + *              that holds the dirty pages.
> + * dirty_index: free running counter that points to the next slot in
> + *              dirty_ring->dirty_gfns  where a new dirty page should go.
> + * reset_index: free running counter that points to the next dirty page
> + *              in dirty_ring->dirty_gfns for which dirty trap needs to
> + *              be reenabled
> + * size:        size of the compact list, dirty_ring->dirty_gfns
> + * soft_limit:  when the number of dirty pages in the list reaches this
> + *              limit, vcpu that owns this ring should exit to userspace
> + *              to allow userspace to harvest all the dirty pages
> + * lock:        protects dirty_ring, only in use if this is the global
> + *              ring
> + *
> + * The number of dirty pages in the ring is calculated by,
> + * dirty_index - reset_index
> + *
> + * kernel increments dirty_ring->indices.avail_index after dirty index
> + * is incremented. When userspace harvests the dirty pages, it increments
> + * dirty_ring->indices.fetch_index up to dirty_ring->indices.avail_index.
> + * When kernel reenables dirty traps for the dirty pages, it increments
> + * reset_index up to dirty_ring->indices.fetch_index.
> + *
> + */
> +struct kvm_dirty_ring {
> +	u32 dirty_index;
> +	u32 reset_index;
> +	u32 size;
> +	u32 soft_limit;
> +	spinlock_t lock;
> +	struct kvm_dirty_gfn *dirty_gfns;
> +};
> +
> +u32 kvm_dirty_ring_get_rsvd_entries(void);
> +int kvm_dirty_ring_alloc(struct kvm *kvm, struct kvm_dirty_ring *ring);
> +
> +/*
> + * called with kvm->slots_lock held, returns the number of
> + * processed pages.
> + */
> +int kvm_dirty_ring_reset(struct kvm *kvm,
> +			 struct kvm_dirty_ring *ring,
> +			 struct kvm_dirty_ring_indexes *indexes);
> +
> +/*
> + * returns 0: successfully pushed
> + *         1: successfully pushed, soft limit reached,
> + *            vcpu should exit to userspace
> + *         -EBUSY: unable to push, dirty ring full.
> + */
> +int kvm_dirty_ring_push(struct kvm_dirty_ring *ring,
> +			struct kvm_dirty_ring_indexes *indexes,
> +			u32 slot, u64 offset, bool lock);
> +
> +/* for use in vm_operations_struct */
> +struct page *kvm_dirty_ring_get_page(struct kvm_dirty_ring *ring, u32 i);
> +
> +void kvm_dirty_ring_free(struct kvm_dirty_ring *ring);
> +bool kvm_dirty_ring_full(struct kvm_dirty_ring *ring);
> +
> +#endif
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 498a39462ac1..7b747bc9ff3e 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -34,6 +34,7 @@
>  #include <linux/kvm_types.h>
>  
>  #include <asm/kvm_host.h>
> +#include <linux/kvm_dirty_ring.h>
>  
>  #ifndef KVM_MAX_VCPU_ID
>  #define KVM_MAX_VCPU_ID KVM_MAX_VCPUS
> @@ -146,6 +147,7 @@ static inline bool is_error_page(struct page *page)
>  #define KVM_REQ_MMU_RELOAD        (1 | KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
>  #define KVM_REQ_PENDING_TIMER     2
>  #define KVM_REQ_UNHALT            3
> +#define KVM_REQ_DIRTY_RING_FULL   4
>  #define KVM_REQUEST_ARCH_BASE     8
>  
>  #define KVM_ARCH_REQ_FLAGS(nr, flags) ({ \
> @@ -321,6 +323,7 @@ struct kvm_vcpu {
>  	bool ready;
>  	struct kvm_vcpu_arch arch;
>  	struct dentry *debugfs_dentry;
> +	struct kvm_dirty_ring dirty_ring;
>  };
>  
>  static inline int kvm_vcpu_exiting_guest_mode(struct kvm_vcpu *vcpu)
> @@ -501,6 +504,10 @@ struct kvm {
>  	struct srcu_struct srcu;
>  	struct srcu_struct irq_srcu;
>  	pid_t userspace_pid;
> +	/* Data structure to be exported by mmap(kvm->fd, 0) */
> +	struct kvm_vm_run *vm_run;
> +	u32 dirty_ring_size;
> +	struct kvm_dirty_ring vm_dirty_ring;
>  };
>  
>  #define kvm_err(fmt, ...) \
> @@ -832,6 +839,8 @@ void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm *kvm,
>  					gfn_t gfn_offset,
>  					unsigned long mask);
>  
> +void kvm_reset_dirty_gfn(struct kvm *kvm, u32 slot, u64 offset, u64 mask);
> +
>  int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm,
>  				struct kvm_dirty_log *log);
>  int kvm_vm_ioctl_clear_dirty_log(struct kvm *kvm,
> @@ -1411,4 +1420,28 @@ int kvm_vm_create_worker_thread(struct kvm *kvm, kvm_vm_thread_fn_t thread_fn,
>  				uintptr_t data, const char *name,
>  				struct task_struct **thread_ptr);
>  
> +/*
> + * This defines how many reserved entries we want to keep before we
> + * kick the vcpu to the userspace to avoid dirty ring full.  This
> + * value can be tuned to higher if e.g. PML is enabled on the host.
> + */
> +#define  KVM_DIRTY_RING_RSVD_ENTRIES  64
> +
> +/* Max number of entries allowed for each kvm dirty ring */
> +#define  KVM_DIRTY_RING_MAX_ENTRIES  65536
> +
> +/*
> + * Arch needs to define these macro after implementing the dirty ring
> + * feature.  KVM_DIRTY_LOG_PAGE_OFFSET should be defined as the
> + * starting page offset of the dirty ring structures,

Confused.  Offset where?  You set a default for everyone - where does
an arch want to override it?

> while
> + * KVM_DIRTY_RING_VERSION should be defined as >=1.  By default, this
> + * feature is off on all archs.
> + */
> +#ifndef KVM_DIRTY_LOG_PAGE_OFFSET
> +#define KVM_DIRTY_LOG_PAGE_OFFSET 0
> +#endif
> +#ifndef KVM_DIRTY_RING_VERSION
> +#define KVM_DIRTY_RING_VERSION 0
> +#endif

One-way versioning, with no feature bits and no negotiation,
will make it hard to change things down the road.
What's wrong with the existing KVM capabilities, that you feel
there's a need for dedicated versioning for this?

> +
>  #endif
> diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h
> index 1c88e69db3d9..d9d03eea145a 100644
> --- a/include/linux/kvm_types.h
> +++ b/include/linux/kvm_types.h
> @@ -11,6 +11,7 @@ struct kvm_irq_routing_table;
>  struct kvm_memory_slot;
>  struct kvm_one_reg;
>  struct kvm_run;
> +struct kvm_vm_run;
>  struct kvm_userspace_memory_region;
>  struct kvm_vcpu;
>  struct kvm_vcpu_init;
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index e6f17c8e2dba..0b88d76d6215 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -236,6 +236,7 @@ struct kvm_hyperv_exit {
>  #define KVM_EXIT_IOAPIC_EOI       26
>  #define KVM_EXIT_HYPERV           27
>  #define KVM_EXIT_ARM_NISV         28
> +#define KVM_EXIT_DIRTY_RING_FULL  29
>  
>  /* For KVM_EXIT_INTERNAL_ERROR */
>  /* Emulate instruction failed. */
> @@ -247,6 +248,11 @@ struct kvm_hyperv_exit {
>  /* Encounter unexpected vm-exit reason */
>  #define KVM_INTERNAL_ERROR_UNEXPECTED_EXIT_REASON	4
>  
> +struct kvm_dirty_ring_indexes {
> +	__u32 avail_index; /* set by kernel */
> +	__u32 fetch_index; /* set by userspace */
> +};
> +
>  /* for KVM_RUN, returned by mmap(vcpu_fd, offset=0) */
>  struct kvm_run {
>  	/* in */
> @@ -421,6 +427,13 @@ struct kvm_run {
>  		struct kvm_sync_regs regs;
>  		char padding[SYNC_REGS_SIZE_BYTES];
>  	} s;
> +
> +	struct kvm_dirty_ring_indexes vcpu_ring_indexes;
> +};
> +
> +/* Returned by mmap(kvm->fd, offset=0) */
> +struct kvm_vm_run {
> +	struct kvm_dirty_ring_indexes vm_ring_indexes;
>  };
>  
>  /* for KVM_REGISTER_COALESCED_MMIO / KVM_UNREGISTER_COALESCED_MMIO */
> @@ -1009,6 +1022,7 @@ struct kvm_ppc_resize_hpt {
>  #define KVM_CAP_PPC_GUEST_DEBUG_SSTEP 176
>  #define KVM_CAP_ARM_NISV_TO_USER 177
>  #define KVM_CAP_ARM_INJECT_EXT_DABT 178
> +#define KVM_CAP_DIRTY_LOG_RING 179
>  
>  #ifdef KVM_CAP_IRQ_ROUTING
>  
> @@ -1472,6 +1486,9 @@ struct kvm_enc_region {
>  /* Available with KVM_CAP_ARM_SVE */
>  #define KVM_ARM_VCPU_FINALIZE	  _IOW(KVMIO,  0xc2, int)
>  
> +/* Available with KVM_CAP_DIRTY_LOG_RING */
> +#define KVM_RESET_DIRTY_RINGS     _IO(KVMIO, 0xc3)
> +
>  /* Secure Encrypted Virtualization command */
>  enum sev_cmd_id {
>  	/* Guest initialization commands */
> @@ -1622,4 +1639,23 @@ struct kvm_hyperv_eventfd {
>  #define KVM_HYPERV_CONN_ID_MASK		0x00ffffff
>  #define KVM_HYPERV_EVENTFD_DEASSIGN	(1 << 0)
>  
> +/*
> + * The following are the requirements for supporting dirty log ring
> + * (by enabling KVM_DIRTY_LOG_PAGE_OFFSET).
> + *
> + * 1. Memory accesses by KVM should call kvm_vcpu_write_* instead
> + *    of kvm_write_* so that the global dirty ring is not filled up
> + *    too quickly.
> + * 2. kvm_arch_mmu_enable_log_dirty_pt_masked should be defined for
> + *    enabling dirty logging.
> + * 3. There should not be a separate step to synchronize hardware
> + *    dirty bitmap with KVM's.
> + */
> +
> +struct kvm_dirty_gfn {
> +	__u32 pad;
> +	__u32 slot;
> +	__u64 offset;
> +};
> +
>  #endif /* __LINUX_KVM_H */
> diff --git a/virt/kvm/dirty_ring.c b/virt/kvm/dirty_ring.c
> new file mode 100644
> index 000000000000..9264891f3c32
> --- /dev/null
> +++ b/virt/kvm/dirty_ring.c
> @@ -0,0 +1,156 @@
> +#include <linux/kvm_host.h>
> +#include <linux/kvm.h>
> +#include <linux/vmalloc.h>
> +#include <linux/kvm_dirty_ring.h>
> +
> +u32 kvm_dirty_ring_get_rsvd_entries(void)
> +{
> +	return KVM_DIRTY_RING_RSVD_ENTRIES + kvm_cpu_dirty_log_size();
> +}
> +
> +int kvm_dirty_ring_alloc(struct kvm *kvm, struct kvm_dirty_ring *ring)
> +{
> +	u32 size = kvm->dirty_ring_size;
> +
> +	ring->dirty_gfns = vmalloc(size);

So that's half a megabyte of kernel memory per VM that userspace locks up.
Do we really have to, though?  Why not get a userspace pointer,
write to it with copy_to_user(), and sidestep all this?
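
Not tested, but to illustrate: assuming userspace handed KVM user
pointers ('udirty' and 'uindexes' below, which this patch does not have)
when enabling the capability, the push path could become roughly:

    /* 'udirty' and 'uindexes' would be __user pointers saved at
     * KVM_ENABLE_CAP time (hypothetical), replacing the vmalloc()ed
     * array and the shared page. */
    struct kvm_dirty_gfn entry = {
            .slot   = slot,
            .offset = offset,
    };

    if (copy_to_user(&udirty[ring->dirty_index & (ring->size - 1)],
                     &entry, sizeof(entry)))
            return -EFAULT;
    ring->dirty_index++;
    if (put_user(ring->dirty_index, &uindexes->avail_index))
            return -EFAULT;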

> +	if (!ring->dirty_gfns)
> +		return -ENOMEM;
> +	memset(ring->dirty_gfns, 0, size);
> +
> +	ring->size = size / sizeof(struct kvm_dirty_gfn);
> +	ring->soft_limit =
> +	    (kvm->dirty_ring_size / sizeof(struct kvm_dirty_gfn)) -
> +	    kvm_dirty_ring_get_rsvd_entries();
> +	ring->dirty_index = 0;
> +	ring->reset_index = 0;
> +	spin_lock_init(&ring->lock);
> +
> +	return 0;
> +}
> +
> +int kvm_dirty_ring_reset(struct kvm *kvm,
> +			 struct kvm_dirty_ring *ring,
> +			 struct kvm_dirty_ring_indexes *indexes)
> +{
> +	u32 cur_slot, next_slot;
> +	u64 cur_offset, next_offset;
> +	unsigned long mask;
> +	u32 fetch;
> +	int count = 0;
> +	struct kvm_dirty_gfn *entry;
> +
> +	fetch = READ_ONCE(indexes->fetch_index);
> +	if (fetch == ring->reset_index)
> +		return 0;
> +
> +	entry = &ring->dirty_gfns[ring->reset_index & (ring->size - 1)];
> +	/*
> +	 * The ring buffer is shared with userspace, which might mmap
> +	 * it and concurrently modify slot and offset.  Userspace must
> +	 * not be trusted!  READ_ONCE prevents the compiler from changing
> +	 * the values after they've been range-checked (the checks are
> +	 * in kvm_reset_dirty_gfn).

What it doesn't do is prevent speculative attacks.  That's why things
like copy_from_user() have a speculation barrier.  Instead of worrying
about that, unless it's really critical, I think you'd do well to just
use copy to/from user.
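
For reference, the usual pattern for an index that userspace can
influence is to clamp it after the range check, e.g. (a sketch against
the checks already present in kvm_reset_dirty_gfn, not part of this
patch):

    #include <linux/nospec.h>

    /* after: if (as_id >= KVM_ADDRESS_SPACE_NUM ||
     *             id >= KVM_USER_MEM_SLOTS) return; */
    as_id = array_index_nospec(as_id, KVM_ADDRESS_SPACE_NUM);
    id = array_index_nospec(id, KVM_USER_MEM_SLOTS);
    memslot = id_to_memslot(__kvm_memslots(kvm, as_id), id);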

> +	 */
> +	smp_read_barrier_depends();

What depends on what here? Looks suspicious ...

> +	cur_slot = READ_ONCE(entry->slot);
> +	cur_offset = READ_ONCE(entry->offset);
> +	mask = 1;
> +	count++;
> +	ring->reset_index++;
> +	while (ring->reset_index != fetch) {
> +		entry = &ring->dirty_gfns[ring->reset_index & (ring->size - 1)];
> +		smp_read_barrier_depends();

same concerns here

> +		next_slot = READ_ONCE(entry->slot);
> +		next_offset = READ_ONCE(entry->offset);
> +		ring->reset_index++;
> +		count++;
> +		/*
> +		 * Try to coalesce the reset operations when the guest is
> +		 * scanning pages in the same slot.

What does "guest scanning" mean?

> +		 */
> +		if (next_slot == cur_slot) {
> +			int delta = next_offset - cur_offset;
> +
> +			if (delta >= 0 && delta < BITS_PER_LONG) {
> +				mask |= 1ull << delta;
> +				continue;
> +			}
> +
> +			/* Backwards visit, careful about overflows!  */
> +			if (delta > -BITS_PER_LONG && delta < 0 &&
> +			    (mask << -delta >> -delta) == mask) {
> +				cur_offset = next_offset;
> +				mask = (mask << -delta) | 1;
> +				continue;
> +			}
> +		}
> +		kvm_reset_dirty_gfn(kvm, cur_slot, cur_offset, mask);
> +		cur_slot = next_slot;
> +		cur_offset = next_offset;
> +		mask = 1;
> +	}
> +	kvm_reset_dirty_gfn(kvm, cur_slot, cur_offset, mask);
> +
> +	return count;
> +}
> +
> +static inline u32 kvm_dirty_ring_used(struct kvm_dirty_ring *ring)
> +{
> +	return ring->dirty_index - ring->reset_index;
> +}
> +
> +bool kvm_dirty_ring_full(struct kvm_dirty_ring *ring)
> +{
> +	return kvm_dirty_ring_used(ring) >= ring->size;
> +}
> +
> +/*
> + * Returns:
> + *   >0 if we should kick the vcpu out,
> + *   =0 if the gfn pushed successfully, or,
> + *   <0 if error (e.g. ring full)
> + */
> +int kvm_dirty_ring_push(struct kvm_dirty_ring *ring,
> +			struct kvm_dirty_ring_indexes *indexes,
> +			u32 slot, u64 offset, bool lock)
> +{
> +	int ret;
> +	struct kvm_dirty_gfn *entry;
> +
> +	if (lock)
> +		spin_lock(&ring->lock);

What's the story around locking here?  Why is it safe
not to take the lock sometimes?

> +
> +	if (kvm_dirty_ring_full(ring)) {
> +		ret = -EBUSY;
> +		goto out;
> +	}
> +
> +	entry = &ring->dirty_gfns[ring->dirty_index & (ring->size - 1)];
> +	entry->slot = slot;
> +	entry->offset = offset;
> +	smp_wmb();
> +	ring->dirty_index++;
> +	WRITE_ONCE(indexes->avail_index, ring->dirty_index);
> +	ret = kvm_dirty_ring_used(ring) >= ring->soft_limit;
> +	pr_info("%s: slot %u offset %llu used %u\n",
> +		__func__, slot, offset, kvm_dirty_ring_used(ring));
> +
> +out:
> +	if (lock)
> +		spin_unlock(&ring->lock);
> +
> +	return ret;
> +}
> +
> +struct page *kvm_dirty_ring_get_page(struct kvm_dirty_ring *ring, u32 i)
> +{
> +	return vmalloc_to_page((void *)ring->dirty_gfns + i * PAGE_SIZE);
> +}
> +
> +void kvm_dirty_ring_free(struct kvm_dirty_ring *ring)
> +{
> +	if (ring->dirty_gfns) {
> +		vfree(ring->dirty_gfns);
> +		ring->dirty_gfns = NULL;
> +	}
> +}
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 681452d288cd..8642c977629b 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -64,6 +64,8 @@
>  #define CREATE_TRACE_POINTS
>  #include <trace/events/kvm.h>
>  
> +#include <linux/kvm_dirty_ring.h>
> +
>  /* Worst case buffer size needed for holding an integer. */
>  #define ITOA_MAX_LEN 12
>  
> @@ -149,6 +151,10 @@ static void mark_page_dirty_in_slot(struct kvm *kvm,
>  				    struct kvm_vcpu *vcpu,
>  				    struct kvm_memory_slot *memslot,
>  				    gfn_t gfn);
> +static void mark_page_dirty_in_ring(struct kvm *kvm,
> +				    struct kvm_vcpu *vcpu,
> +				    struct kvm_memory_slot *slot,
> +				    gfn_t gfn);
>  
>  __visible bool kvm_rebooting;
>  EXPORT_SYMBOL_GPL(kvm_rebooting);
> @@ -359,11 +365,22 @@ int kvm_vcpu_init(struct kvm_vcpu *vcpu, struct kvm *kvm, unsigned id)
>  	vcpu->preempted = false;
>  	vcpu->ready = false;
>  
> +	if (kvm->dirty_ring_size) {
> +		r = kvm_dirty_ring_alloc(vcpu->kvm, &vcpu->dirty_ring);
> +		if (r) {
> +			kvm->dirty_ring_size = 0;
> +			goto fail_free_run;
> +		}
> +	}
> +
>  	r = kvm_arch_vcpu_init(vcpu);
>  	if (r < 0)
> -		goto fail_free_run;
> +		goto fail_free_ring;
>  	return 0;
>  
> +fail_free_ring:
> +	if (kvm->dirty_ring_size)
> +		kvm_dirty_ring_free(&vcpu->dirty_ring);
>  fail_free_run:
>  	free_page((unsigned long)vcpu->run);
>  fail:
> @@ -381,6 +398,8 @@ void kvm_vcpu_uninit(struct kvm_vcpu *vcpu)
>  	put_pid(rcu_dereference_protected(vcpu->pid, 1));
>  	kvm_arch_vcpu_uninit(vcpu);
>  	free_page((unsigned long)vcpu->run);
> +	if (vcpu->kvm->dirty_ring_size)
> +		kvm_dirty_ring_free(&vcpu->dirty_ring);
>  }
>  EXPORT_SYMBOL_GPL(kvm_vcpu_uninit);
>  
> @@ -690,6 +709,7 @@ static struct kvm *kvm_create_vm(unsigned long type)
>  	struct kvm *kvm = kvm_arch_alloc_vm();
>  	int r = -ENOMEM;
>  	int i;
> +	struct page *page;
>  
>  	if (!kvm)
>  		return ERR_PTR(-ENOMEM);
> @@ -705,6 +725,14 @@ static struct kvm *kvm_create_vm(unsigned long type)
>  
>  	BUILD_BUG_ON(KVM_MEM_SLOTS_NUM > SHRT_MAX);
>  
> +	page = alloc_page(GFP_KERNEL | __GFP_ZERO);
> +	if (!page) {
> +		r = -ENOMEM;
> +		goto out_err_alloc_page;
> +	}
> +	kvm->vm_run = page_address(page);

So that's 4K with just 8 bytes used.  Not as bad as half a megabyte for
the ring, but still.  What is wrong with just a pointer and calling
put_user()?

> +	BUILD_BUG_ON(sizeof(struct kvm_vm_run) > PAGE_SIZE);
> +
>  	if (init_srcu_struct(&kvm->srcu))
>  		goto out_err_no_srcu;
>  	if (init_srcu_struct(&kvm->irq_srcu))
> @@ -775,6 +803,9 @@ static struct kvm *kvm_create_vm(unsigned long type)
>  out_err_no_irq_srcu:
>  	cleanup_srcu_struct(&kvm->srcu);
>  out_err_no_srcu:
> +	free_page((unsigned long)page);
> +	kvm->vm_run = NULL;
> +out_err_alloc_page:
>  	kvm_arch_free_vm(kvm);
>  	mmdrop(current->mm);
>  	return ERR_PTR(r);
> @@ -800,6 +831,15 @@ static void kvm_destroy_vm(struct kvm *kvm)
>  	int i;
>  	struct mm_struct *mm = kvm->mm;
>  
> +	if (kvm->dirty_ring_size) {
> +		kvm_dirty_ring_free(&kvm->vm_dirty_ring);
> +	}
> +
> +	if (kvm->vm_run) {
> +		free_page((unsigned long)kvm->vm_run);
> +		kvm->vm_run = NULL;
> +	}
> +
>  	kvm_uevent_notify_change(KVM_EVENT_DESTROY_VM, kvm);
>  	kvm_destroy_vm_debugfs(kvm);
>  	kvm_arch_sync_events(kvm);
> @@ -2301,7 +2341,7 @@ static void mark_page_dirty_in_slot(struct kvm *kvm,
>  {
>  	if (memslot && memslot->dirty_bitmap) {
>  		unsigned long rel_gfn = gfn - memslot->base_gfn;
> -
> +		mark_page_dirty_in_ring(kvm, vcpu, memslot, gfn);
>  		set_bit_le(rel_gfn, memslot->dirty_bitmap);
>  	}
>  }
> @@ -2649,6 +2689,13 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me, bool yield_to_kernel_mode)
>  }
>  EXPORT_SYMBOL_GPL(kvm_vcpu_on_spin);
>  
> +static bool kvm_fault_in_dirty_ring(struct kvm *kvm, struct vm_fault *vmf)
> +{
> +	return (vmf->pgoff >= KVM_DIRTY_LOG_PAGE_OFFSET) &&
> +	    (vmf->pgoff < KVM_DIRTY_LOG_PAGE_OFFSET +
> +	     kvm->dirty_ring_size / PAGE_SIZE);
> +}
> +
>  static vm_fault_t kvm_vcpu_fault(struct vm_fault *vmf)
>  {
>  	struct kvm_vcpu *vcpu = vmf->vma->vm_file->private_data;
> @@ -2664,6 +2711,10 @@ static vm_fault_t kvm_vcpu_fault(struct vm_fault *vmf)
>  	else if (vmf->pgoff == KVM_COALESCED_MMIO_PAGE_OFFSET)
>  		page = virt_to_page(vcpu->kvm->coalesced_mmio_ring);
>  #endif
> +	else if (kvm_fault_in_dirty_ring(vcpu->kvm, vmf))
> +		page = kvm_dirty_ring_get_page(
> +		    &vcpu->dirty_ring,
> +		    vmf->pgoff - KVM_DIRTY_LOG_PAGE_OFFSET);
>  	else
>  		return kvm_arch_vcpu_fault(vcpu, vmf);
>  	get_page(page);
> @@ -3259,12 +3310,162 @@ static long kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
>  #endif
>  	case KVM_CAP_NR_MEMSLOTS:
>  		return KVM_USER_MEM_SLOTS;
> +	case KVM_CAP_DIRTY_LOG_RING:
> +		/* Version will be zero if arch didn't implement it */
> +		return KVM_DIRTY_RING_VERSION;
>  	default:
>  		break;
>  	}
>  	return kvm_vm_ioctl_check_extension(kvm, arg);
>  }
>  
> +static void mark_page_dirty_in_ring(struct kvm *kvm,
> +				    struct kvm_vcpu *vcpu,
> +				    struct kvm_memory_slot *slot,
> +				    gfn_t gfn)
> +{
> +	u32 as_id = 0;
> +	u64 offset;
> +	int ret;
> +	struct kvm_dirty_ring *ring;
> +	struct kvm_dirty_ring_indexes *indexes;
> +	bool is_vm_ring;
> +
> +	if (!kvm->dirty_ring_size)
> +		return;
> +
> +	offset = gfn - slot->base_gfn;
> +
> +	if (vcpu) {
> +		as_id = kvm_arch_vcpu_memslots_id(vcpu);
> +	} else {
> +		as_id = 0;
> +		vcpu = kvm_get_running_vcpu();
> +	}
> +
> +	if (vcpu) {
> +		ring = &vcpu->dirty_ring;
> +		indexes = &vcpu->run->vcpu_ring_indexes;
> +		is_vm_ring = false;
> +	} else {
> +		/*
> +		 * Put onto per vm ring because no vcpu context.  Kick
> +		 * vcpu0 if ring is full.

What about tasks on vcpu 0?  Do guests realize it's a bad idea to put
critical tasks there, since they will be penalized disproportionately?

> +		 */
> +		vcpu = kvm->vcpus[0];
> +		ring = &kvm->vm_dirty_ring;
> +		indexes = &kvm->vm_run->vm_ring_indexes;
> +		is_vm_ring = true;
> +	}
> +
> +	ret = kvm_dirty_ring_push(ring, indexes,
> +				  (as_id << 16)|slot->id, offset,
> +				  is_vm_ring);
> +	if (ret < 0) {
> +		if (is_vm_ring)
> +			pr_warn_once("vcpu %d dirty log overflow\n",
> +				     vcpu->vcpu_id);
> +		else
> +			pr_warn_once("per-vm dirty log overflow\n");
> +		return;
> +	}
> +
> +	if (ret)
> +		kvm_make_request(KVM_REQ_DIRTY_RING_FULL, vcpu);
> +}
> +
> +void kvm_reset_dirty_gfn(struct kvm *kvm, u32 slot, u64 offset, u64 mask)
> +{
> +	struct kvm_memory_slot *memslot;
> +	int as_id, id;
> +
> +	as_id = slot >> 16;
> +	id = (u16)slot;
> +	if (as_id >= KVM_ADDRESS_SPACE_NUM || id >= KVM_USER_MEM_SLOTS)
> +		return;
> +
> +	memslot = id_to_memslot(__kvm_memslots(kvm, as_id), id);
> +	if (offset >= memslot->npages)
> +		return;
> +
> +	spin_lock(&kvm->mmu_lock);
> +	/* FIXME: we should use a single AND operation, but there is no
> +	 * applicable atomic API.
> +	 */
> +	while (mask) {
> +		clear_bit_le(offset + __ffs(mask), memslot->dirty_bitmap);
> +		mask &= mask - 1;
> +	}
> +
> +	kvm_arch_mmu_enable_log_dirty_pt_masked(kvm, memslot, offset, mask);
> +	spin_unlock(&kvm->mmu_lock);
> +}
> +
> +static int kvm_vm_ioctl_enable_dirty_log_ring(struct kvm *kvm, u32 size)
> +{
> +	int r;
> +
> +	/* the size should be power of 2 */
> +	if (!size || (size & (size - 1)))
> +		return -EINVAL;
> +
> +	/* Should be bigger to keep the reserved entries, or a page */
> +	if (size < kvm_dirty_ring_get_rsvd_entries() *
> +	    sizeof(struct kvm_dirty_gfn) || size < PAGE_SIZE)
> +		return -EINVAL;
> +
> +	if (size > KVM_DIRTY_RING_MAX_ENTRIES *
> +	    sizeof(struct kvm_dirty_gfn))
> +		return -E2BIG;

KVM_DIRTY_RING_MAX_ENTRIES is not part of the UAPI.
So how does userspace know what's legal?
Do you expect it to just try?
More likely it will just copy the number from the kernel, and then it
can never ever be made smaller.
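
Without the limit in the UAPI, the best userspace can do is probe, e.g.
the sketch below (KVM_CAP_DIRTY_LOG_RING is from this series, and it
assumes -E2BIG stays the "too big" error):

    #include <errno.h>
    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <linux/kvm.h>

    /* vm_fd and desired_bytes come from the caller. */
    struct kvm_enable_cap cap = { .cap = KVM_CAP_DIRTY_LOG_RING };
    uint64_t size = desired_bytes;

    while (size >= 4096) {
            cap.args[0] = size;
            if (ioctl(vm_fd, KVM_ENABLE_CAP, &cap) == 0)
                    break;                  /* accepted */
            if (errno != E2BIG)
                    break;                  /* some other failure */
            size >>= 1;                     /* too big, try smaller */
    }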

> +
> +	/* We only allow it to set once */
> +	if (kvm->dirty_ring_size)
> +		return -EINVAL;
> +
> +	mutex_lock(&kvm->lock);
> +
> +	if (kvm->created_vcpus) {
> +		/* We don't allow to change this value after vcpu created */
> +		r = -EINVAL;
> +	} else {
> +		kvm->dirty_ring_size = size;
> +		r = kvm_dirty_ring_alloc(kvm, &kvm->vm_dirty_ring);
> +		if (r) {
> +			/* Unset dirty ring */
> +			kvm->dirty_ring_size = 0;
> +		}
> +	}
> +
> +	mutex_unlock(&kvm->lock);
> +	return r;
> +}
> +
> +static int kvm_vm_ioctl_reset_dirty_pages(struct kvm *kvm)
> +{
> +	int i;
> +	struct kvm_vcpu *vcpu;
> +	int cleared = 0;
> +
> +	if (!kvm->dirty_ring_size)
> +		return -EINVAL;
> +
> +	mutex_lock(&kvm->slots_lock);
> +
> +	cleared += kvm_dirty_ring_reset(kvm, &kvm->vm_dirty_ring,
> +					&kvm->vm_run->vm_ring_indexes);
> +
> +	kvm_for_each_vcpu(i, vcpu, kvm)
> +		cleared += kvm_dirty_ring_reset(vcpu->kvm, &vcpu->dirty_ring,
> +						&vcpu->run->vcpu_ring_indexes);
> +
> +	mutex_unlock(&kvm->slots_lock);
> +
> +	if (cleared)
> +		kvm_flush_remote_tlbs(kvm);
> +
> +	return cleared;
> +}
> +
>  int __attribute__((weak)) kvm_vm_ioctl_enable_cap(struct kvm *kvm,
>  						  struct kvm_enable_cap *cap)
>  {
> @@ -3282,6 +3483,8 @@ static int kvm_vm_ioctl_enable_cap_generic(struct kvm *kvm,
>  		kvm->manual_dirty_log_protect = cap->args[0];
>  		return 0;
>  #endif
> +	case KVM_CAP_DIRTY_LOG_RING:
> +		return kvm_vm_ioctl_enable_dirty_log_ring(kvm, cap->args[0]);
>  	default:
>  		return kvm_vm_ioctl_enable_cap(kvm, cap);
>  	}
> @@ -3469,6 +3672,9 @@ static long kvm_vm_ioctl(struct file *filp,
>  	case KVM_CHECK_EXTENSION:
>  		r = kvm_vm_ioctl_check_extension_generic(kvm, arg);
>  		break;
> +	case KVM_RESET_DIRTY_RINGS:
> +		r = kvm_vm_ioctl_reset_dirty_pages(kvm);
> +		break;
>  	default:
>  		r = kvm_arch_vm_ioctl(filp, ioctl, arg);
>  	}
> @@ -3517,9 +3723,39 @@ static long kvm_vm_compat_ioctl(struct file *filp,
>  }
>  #endif
>  
> +static vm_fault_t kvm_vm_fault(struct vm_fault *vmf)
> +{
> +	struct kvm *kvm = vmf->vma->vm_file->private_data;
> +	struct page *page = NULL;
> +
> +	if (vmf->pgoff == 0)
> +		page = virt_to_page(kvm->vm_run);
> +	else if (kvm_fault_in_dirty_ring(kvm, vmf))
> +		page = kvm_dirty_ring_get_page(
> +		    &kvm->vm_dirty_ring,
> +		    vmf->pgoff - KVM_DIRTY_LOG_PAGE_OFFSET);
> +	else
> +		return VM_FAULT_SIGBUS;
> +
> +	get_page(page);
> +	vmf->page = page;
> +	return 0;
> +}
> +
> +static const struct vm_operations_struct kvm_vm_vm_ops = {
> +	.fault = kvm_vm_fault,
> +};
> +
> +static int kvm_vm_mmap(struct file *file, struct vm_area_struct *vma)
> +{
> +	vma->vm_ops = &kvm_vm_vm_ops;
> +	return 0;
> +}
> +
>  static struct file_operations kvm_vm_fops = {
>  	.release        = kvm_vm_release,
>  	.unlocked_ioctl = kvm_vm_ioctl,
> +	.mmap           = kvm_vm_mmap,
>  	.llseek		= noop_llseek,
>  	KVM_COMPAT(kvm_vm_compat_ioctl),
>  };
> -- 
> 2.21.0


^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH RFC 04/15] KVM: Implement ring-based dirty memory tracking
  2019-12-11  9:05               ` Paolo Bonzini
@ 2019-12-11 13:04                 ` Michael S. Tsirkin
  2019-12-11 14:54                   ` Peter Xu
  0 siblings, 1 reply; 123+ messages in thread
From: Michael S. Tsirkin @ 2019-12-11 13:04 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Peter Xu, Jason Wang, linux-kernel, kvm, Sean Christopherson,
	Dr . David Alan Gilbert, Vitaly Kuznetsov

On Wed, Dec 11, 2019 at 10:05:28AM +0100, Paolo Bonzini wrote:
> On 10/12/19 22:53, Michael S. Tsirkin wrote:
> > On Tue, Dec 10, 2019 at 11:02:11AM -0500, Peter Xu wrote:
> >> On Tue, Dec 10, 2019 at 02:31:54PM +0100, Paolo Bonzini wrote:
> >>> On 10/12/19 14:25, Michael S. Tsirkin wrote:
> >>>>> There is no new infrastructure to track the dirty pages---it's just a
> >>>>> different way to pass them to userspace.
> >>>> Did you guys consider using one of the virtio ring formats?
> >>>> Maybe reusing vhost code?
> >>>
> >>> There are no used/available entries here, it's unidirectional
> >>> (kernel->user).
> >>
> >> Agreed.  Vring could be an overkill IMHO (the whole dirty_ring.c is
> >> 100+ LOC only).
> > 
> > I guess you don't do polling/ event suppression and other tricks that
> > virtio came up with for speed then?

I finally looked at the code; there's actually an avail index, and
fetch works exactly like used.  I'm not saying the existing code is a
great fit for you, as you have an extra slot parameter to pass and the
roles are reversed compared to vhost, with the kernel being the driver
and userspace the device (vringh might fit, though it would need to be
updated to support packed rings).  But sticking to an existing format
is a good idea IMHO, or if not, I think it's not a bad idea to add some
justification.

> There are no interrupts either, so no need for event suppression.  You
> have vmexits when the ring gets full (and that needs to be synchronous),
> but apart from that the migration thread will poll the rings once when
> it needs to send more pages.
> 
> Paolo

OK don't use that then.

-- 
MST


^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH RFC 00/15] KVM: Dirty ring interface
  2019-11-29 21:34 [PATCH RFC 00/15] KVM: Dirty ring interface Peter Xu
                   ` (16 preceding siblings ...)
  2019-12-04 10:39 ` Jason Wang
@ 2019-12-11 13:41 ` Christophe de Dinechin
  2019-12-11 14:16   ` Paolo Bonzini
  17 siblings, 1 reply; 123+ messages in thread
From: Christophe de Dinechin @ 2019-12-11 13:41 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-kernel, kvm, Sean Christopherson, Paolo Bonzini,
	Dr . David Alan Gilbert, Vitaly Kuznetsov


Peter Xu writes:

> Branch is here: https://github.com/xzpeter/linux/tree/kvm-dirty-ring
>
> Overview
> ============
>
> This is a continued work from Lei Cao <lei.cao@stratus.com> and Paolo
> on the KVM dirty ring interface.  To make it simple, I'll still start
> with version 1 as RFC.
>
> The new dirty ring interface is another way to collect dirty pages for
> the virtual machine, but it is different from the existing dirty
> logging interface in a few ways, majorly:
>
>   - Data format: The dirty data was in a ring format rather than a
>     bitmap format, so the size of data to sync for dirty logging does
>     not depend on the size of guest memory any more, but speed of
>     dirtying.  Also, the dirty ring is per-vcpu (currently plus
>     another per-vm ring, so total ring number is N+1), while the dirty
>     bitmap is per-vm.

I like Sean's suggestion to fetch rings when dirtying. That could reduce
the number of dirty rings to examine.

Also, as is, this means that the same gfn may be present in multiple
rings, right?

>
>   - Data copy: The sync of dirty pages does not need data copy any more,
>     but instead the ring is shared between the userspace and kernel by
>     page sharings (mmap() on either the vm fd or vcpu fd)
>
>   - Interface: Instead of using the old KVM_GET_DIRTY_LOG,
>     KVM_CLEAR_DIRTY_LOG interfaces, the new ring uses a new interface
>     called KVM_RESET_DIRTY_RINGS when we want to reset the collected
>     dirty pages to protected mode again (works like
>     KVM_CLEAR_DIRTY_LOG, but ring based)
>
> And more.
>
> I would appreciate if the reviewers can start with patch "KVM:
> Implement ring-based dirty memory tracking", especially the document
> update part for the big picture.  Then I'll avoid copying into most of
> them into cover letter again.
>
> I marked this series as RFC because I'm at least uncertain on this
> change of vcpu_enter_guest():
>
>         if (kvm_check_request(KVM_REQ_DIRTY_RING_FULL, vcpu)) {
>                 vcpu->run->exit_reason = KVM_EXIT_DIRTY_RING_FULL;
>                 /*
>                         * If this is requested, it means that we've
>                         * marked the dirty bit in the dirty ring BUT
>                         * we've not written the date.  Do it now.

not written the "data" ?

>                         */
>                 r = kvm_emulate_instruction(vcpu, 0);
>                 r = r >= 0 ? 0 : r;
>                 goto out;
>         }
>
> I did a kvm_emulate_instruction() when dirty ring reaches softlimit
> and want to exit to userspace, however I'm not really sure whether
> there could have any side effect.  I'd appreciate any comment of
> above, or anything else.
>
> Tests
> ===========
>
> I wanted to continue work on the QEMU part, but after I noticed that
> the interface might still prone to change, I posted this series first.
> However to make sure it's at least working, I've provided unit tests
> together with the series.  The unit tests should be able to test the
> series in at least three major paths:
>
>   (1) ./dirty_log_test -M dirty-ring
>
>       This tests async ring operations: this should be the major work
>       mode for the dirty ring interface, say, when the kernel is
>       queuing more data, the userspace is collecting too.  Ring can
>       hardly reaches full when working like this, because in most
>       cases the collection could be fast.
>
>   (2) ./dirty_log_test -M dirty-ring -c 1024
>
>       This set the ring size to be very small so that ring soft-full
>       always triggers (soft-full is a soft limit of the ring state,
>       when the dirty ring reaches the soft limit it'll do a userspace
>       exit and let the userspace to collect the data).
>
>   (3) ./dirty_log_test -M dirty-ring-wait-queue
>
>       This sololy test the extreme case where ring is full.  When the
>       ring is completely full, the thread (no matter vcpu or not) will
>       be put onto a per-vm waitqueue, and KVM_RESET_DIRTY_RINGS will
>       wake the threads up (assuming until which the ring will not be
>       full any more).

Am I correct assuming that guest memory can be dirtied by DMA operations?
Should

Not being that familiar with the current implementation of dirty page
tracking, I wonder who marks the pages dirty in that case, and when?
If the VM ring is used for I/O threads, isn't it possible that a large
DMA could dirty a sufficiently large number of GFNs to overflow the
associated ring? Does this case need a separate way to queue the
dirtying I/O thread?

>
> Thanks,
>
> Cao, Lei (2):
>   KVM: Add kvm/vcpu argument to mark_dirty_page_in_slot
>   KVM: X86: Implement ring-based dirty memory tracking
>
> Paolo Bonzini (1):
>   KVM: Move running VCPU from ARM to common code
>
> Peter Xu (12):
>   KVM: Add build-time error check on kvm_run size
>   KVM: Implement ring-based dirty memory tracking
>   KVM: Make dirty ring exclusive to dirty bitmap log
>   KVM: Introduce dirty ring wait queue
>   KVM: selftests: Always clear dirty bitmap after iteration
>   KVM: selftests: Sync uapi/linux/kvm.h to tools/
>   KVM: selftests: Use a single binary for dirty/clear log test
>   KVM: selftests: Introduce after_vcpu_run hook for dirty log test
>   KVM: selftests: Add dirty ring buffer test
>   KVM: selftests: Let dirty_log_test async for dirty ring test
>   KVM: selftests: Add "-c" parameter to dirty log test
>   KVM: selftests: Test dirty ring waitqueue
>
>  Documentation/virt/kvm/api.txt                | 116 +++++
>  arch/arm/include/asm/kvm_host.h               |   2 -
>  arch/arm64/include/asm/kvm_host.h             |   2 -
>  arch/x86/include/asm/kvm_host.h               |   5 +
>  arch/x86/include/uapi/asm/kvm.h               |   1 +
>  arch/x86/kvm/Makefile                         |   3 +-
>  arch/x86/kvm/mmu/mmu.c                        |   6 +
>  arch/x86/kvm/vmx/vmx.c                        |   7 +
>  arch/x86/kvm/x86.c                            |  12 +
>  include/linux/kvm_dirty_ring.h                |  67 +++
>  include/linux/kvm_host.h                      |  37 ++
>  include/linux/kvm_types.h                     |   1 +
>  include/uapi/linux/kvm.h                      |  36 ++
>  tools/include/uapi/linux/kvm.h                |  47 ++
>  tools/testing/selftests/kvm/Makefile          |   2 -
>  .../selftests/kvm/clear_dirty_log_test.c      |   2 -
>  tools/testing/selftests/kvm/dirty_log_test.c  | 452 ++++++++++++++++--
>  .../testing/selftests/kvm/include/kvm_util.h  |   6 +
>  tools/testing/selftests/kvm/lib/kvm_util.c    | 103 ++++
>  .../selftests/kvm/lib/kvm_util_internal.h     |   5 +
>  virt/kvm/arm/arm.c                            |  29 --
>  virt/kvm/arm/perf.c                           |   6 +-
>  virt/kvm/arm/vgic/vgic-mmio.c                 |  15 +-
>  virt/kvm/dirty_ring.c                         | 156 ++++++
>  virt/kvm/kvm_main.c                           | 315 +++++++++++-
>  25 files changed, 1329 insertions(+), 104 deletions(-)
>  create mode 100644 include/linux/kvm_dirty_ring.h
>  delete mode 100644 tools/testing/selftests/kvm/clear_dirty_log_test.c
>  create mode 100644 virt/kvm/dirty_ring.c


--
Cheers,
Christophe de Dinechin (IRC c3d)


^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH RFC 04/15] KVM: Implement ring-based dirty memory tracking
  2019-12-11 12:53   ` Michael S. Tsirkin
@ 2019-12-11 14:14     ` Paolo Bonzini
  2019-12-11 20:59     ` Peter Xu
  1 sibling, 0 replies; 123+ messages in thread
From: Paolo Bonzini @ 2019-12-11 14:14 UTC (permalink / raw)
  To: Michael S. Tsirkin, Peter Xu
  Cc: linux-kernel, kvm, Sean Christopherson, Dr . David Alan Gilbert,
	Vitaly Kuznetsov

On 11/12/19 13:53, Michael S. Tsirkin wrote:
>> +
>> +struct kvm_dirty_ring_indexes {
>> +	__u32 avail_index; /* set by kernel */
>> +	__u32 fetch_index; /* set by userspace */
>
> Sticking these next to each other seems to guarantee cache conflicts.

I don't think that's an issue, because you'd have a conflict on the
actual entry anyway; userspace has to read the kernel-written index,
which will cause cache traffic.

> Avail/Fetch seems to mimic Virtio's avail/used exactly.

No, avail_index/fetch_index is just the producer and consumer indices
respectively.  There is only one ring buffer, not two as in virtio.

> I am not saying
> you must reuse the code really, but I think you should take a hard look
> at e.g. the virtio packed ring structure. We spent a bunch of time
> optimizing it for cache utilization. It seems kernel is the driver,
> making entries available, and userspace the device, using them.
> Again let's not develop a thread about this, but I think
> this is something to consider and discuss in future versions
> of the patches.

Even in the packed ring you have two cache lines accessed, one for the
index and one for the descriptor.  Here you have one, because the data
is embedded in the ring buffer.

> 
>> +};
>> +
>> +While for each of the dirty entry it's defined as:
>> +
>> +struct kvm_dirty_gfn {
> 
> What does GFN stand for?
> 
>> +        __u32 pad;
>> +        __u32 slot; /* as_id | slot_id */
>> +        __u64 offset;
>> +};
> 
> offset of what? a 4K page right? Seems like a waste e.g. for
> hugetlbfs... How about replacing pad with size instead?

No, it's an offset in the memslot (which will usually be >4GB for any VM
with bigger memory than that).

Paolo


^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH RFC 00/15] KVM: Dirty ring interface
  2019-12-11 13:41 ` Christophe de Dinechin
@ 2019-12-11 14:16   ` Paolo Bonzini
  2019-12-11 17:15     ` Peter Xu
  0 siblings, 1 reply; 123+ messages in thread
From: Paolo Bonzini @ 2019-12-11 14:16 UTC (permalink / raw)
  To: Christophe de Dinechin, Peter Xu
  Cc: linux-kernel, kvm, Sean Christopherson, Dr . David Alan Gilbert,
	Vitaly Kuznetsov

On 11/12/19 14:41, Christophe de Dinechin wrote:
> 
> Peter Xu writes:
> 
>> Branch is here: https://github.com/xzpeter/linux/tree/kvm-dirty-ring
>>
>> Overview
>> ============
>>
>> This is a continued work from Lei Cao <lei.cao@stratus.com> and Paolo
>> on the KVM dirty ring interface.  To make it simple, I'll still start
>> with version 1 as RFC.
>>
>> The new dirty ring interface is another way to collect dirty pages for
>> the virtual machine, but it is different from the existing dirty
>> logging interface in a few ways, majorly:
>>
>>   - Data format: The dirty data was in a ring format rather than a
>>     bitmap format, so the size of data to sync for dirty logging does
>>     not depend on the size of guest memory any more, but speed of
>>     dirtying.  Also, the dirty ring is per-vcpu (currently plus
>>     another per-vm ring, so total ring number is N+1), while the dirty
>>     bitmap is per-vm.
> 
> I like Sean's suggestion to fetch rings when dirtying. That could reduce
> the number of dirty rings to examine.

What do you mean by "fetch rings"?

> Also, as is, this means that the same gfn may be present in multiple
> rings, right?

I think the actual marking of a page as dirty is protected by a spinlock
but I will defer to Peter on this.

Paolo

>>
>>   - Data copy: The sync of dirty pages does not need data copy any more,
>>     but instead the ring is shared between the userspace and kernel by
>>     page sharings (mmap() on either the vm fd or vcpu fd)
>>
>>   - Interface: Instead of using the old KVM_GET_DIRTY_LOG,
>>     KVM_CLEAR_DIRTY_LOG interfaces, the new ring uses a new interface
>>     called KVM_RESET_DIRTY_RINGS when we want to reset the collected
>>     dirty pages to protected mode again (works like
>>     KVM_CLEAR_DIRTY_LOG, but ring based)
>>
>> And more.
>>
>> I would appreciate if the reviewers can start with patch "KVM:
>> Implement ring-based dirty memory tracking", especially the document
>> update part for the big picture.  Then I'll avoid copying into most of
>> them into cover letter again.
>>
>> I marked this series as RFC because I'm at least uncertain on this
>> change of vcpu_enter_guest():
>>
>>         if (kvm_check_request(KVM_REQ_DIRTY_RING_FULL, vcpu)) {
>>                 vcpu->run->exit_reason = KVM_EXIT_DIRTY_RING_FULL;
>>                 /*
>>                         * If this is requested, it means that we've
>>                         * marked the dirty bit in the dirty ring BUT
>>                         * we've not written the date.  Do it now.
> 
> not written the "data" ?
> 
>>                         */
>>                 r = kvm_emulate_instruction(vcpu, 0);
>>                 r = r >= 0 ? 0 : r;
>>                 goto out;
>>         }
>>
>> I did a kvm_emulate_instruction() when dirty ring reaches softlimit
>> and want to exit to userspace, however I'm not really sure whether
>> there could have any side effect.  I'd appreciate any comment of
>> above, or anything else.
>>
>> Tests
>> ===========
>>
>> I wanted to continue work on the QEMU part, but after I noticed that
>> the interface might still prone to change, I posted this series first.
>> However to make sure it's at least working, I've provided unit tests
>> together with the series.  The unit tests should be able to test the
>> series in at least three major paths:
>>
>>   (1) ./dirty_log_test -M dirty-ring
>>
>>       This tests async ring operations: this should be the major work
>>       mode for the dirty ring interface, say, when the kernel is
>>       queuing more data, the userspace is collecting too.  Ring can
>>       hardly reaches full when working like this, because in most
>>       cases the collection could be fast.
>>
>>   (2) ./dirty_log_test -M dirty-ring -c 1024
>>
>>       This set the ring size to be very small so that ring soft-full
>>       always triggers (soft-full is a soft limit of the ring state,
>>       when the dirty ring reaches the soft limit it'll do a userspace
>>       exit and let the userspace to collect the data).
>>
>>   (3) ./dirty_log_test -M dirty-ring-wait-queue
>>
>>       This sololy test the extreme case where ring is full.  When the
>>       ring is completely full, the thread (no matter vcpu or not) will
>>       be put onto a per-vm waitqueue, and KVM_RESET_DIRTY_RINGS will
>>       wake the threads up (assuming until which the ring will not be
>>       full any more).
> 
> Am I correct assuming that guest memory can be dirtied by DMA operations?
> Should
> 
> Not being that familiar with the current implementation of dirty page
> tracking, I wonder who marks the pages dirty in that case, and when?
> If the VM ring is used for I/O threads, isn't it possible that a large
> DMA could dirty a sufficiently large number of GFNs to overflow the
> associated ring? Does this case need a separate way to queue the
> dirtying I/O thread?
> 
>>
>> Thanks,
>>
>> Cao, Lei (2):
>>   KVM: Add kvm/vcpu argument to mark_dirty_page_in_slot
>>   KVM: X86: Implement ring-based dirty memory tracking
>>
>> Paolo Bonzini (1):
>>   KVM: Move running VCPU from ARM to common code
>>
>> Peter Xu (12):
>>   KVM: Add build-time error check on kvm_run size
>>   KVM: Implement ring-based dirty memory tracking
>>   KVM: Make dirty ring exclusive to dirty bitmap log
>>   KVM: Introduce dirty ring wait queue
>>   KVM: selftests: Always clear dirty bitmap after iteration
>>   KVM: selftests: Sync uapi/linux/kvm.h to tools/
>>   KVM: selftests: Use a single binary for dirty/clear log test
>>   KVM: selftests: Introduce after_vcpu_run hook for dirty log test
>>   KVM: selftests: Add dirty ring buffer test
>>   KVM: selftests: Let dirty_log_test async for dirty ring test
>>   KVM: selftests: Add "-c" parameter to dirty log test
>>   KVM: selftests: Test dirty ring waitqueue
>>
>>  Documentation/virt/kvm/api.txt                | 116 +++++
>>  arch/arm/include/asm/kvm_host.h               |   2 -
>>  arch/arm64/include/asm/kvm_host.h             |   2 -
>>  arch/x86/include/asm/kvm_host.h               |   5 +
>>  arch/x86/include/uapi/asm/kvm.h               |   1 +
>>  arch/x86/kvm/Makefile                         |   3 +-
>>  arch/x86/kvm/mmu/mmu.c                        |   6 +
>>  arch/x86/kvm/vmx/vmx.c                        |   7 +
>>  arch/x86/kvm/x86.c                            |  12 +
>>  include/linux/kvm_dirty_ring.h                |  67 +++
>>  include/linux/kvm_host.h                      |  37 ++
>>  include/linux/kvm_types.h                     |   1 +
>>  include/uapi/linux/kvm.h                      |  36 ++
>>  tools/include/uapi/linux/kvm.h                |  47 ++
>>  tools/testing/selftests/kvm/Makefile          |   2 -
>>  .../selftests/kvm/clear_dirty_log_test.c      |   2 -
>>  tools/testing/selftests/kvm/dirty_log_test.c  | 452 ++++++++++++++++--
>>  .../testing/selftests/kvm/include/kvm_util.h  |   6 +
>>  tools/testing/selftests/kvm/lib/kvm_util.c    | 103 ++++
>>  .../selftests/kvm/lib/kvm_util_internal.h     |   5 +
>>  virt/kvm/arm/arm.c                            |  29 --
>>  virt/kvm/arm/perf.c                           |   6 +-
>>  virt/kvm/arm/vgic/vgic-mmio.c                 |  15 +-
>>  virt/kvm/dirty_ring.c                         | 156 ++++++
>>  virt/kvm/kvm_main.c                           | 315 +++++++++++-
>>  25 files changed, 1329 insertions(+), 104 deletions(-)
>>  create mode 100644 include/linux/kvm_dirty_ring.h
>>  delete mode 100644 tools/testing/selftests/kvm/clear_dirty_log_test.c
>>  create mode 100644 virt/kvm/dirty_ring.c
> 
> 
> --
> Cheers,
> Christophe de Dinechin (IRC c3d)
> 


^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH RFC 04/15] KVM: Implement ring-based dirty memory tracking
  2019-12-11 13:04                 ` Michael S. Tsirkin
@ 2019-12-11 14:54                   ` Peter Xu
  0 siblings, 0 replies; 123+ messages in thread
From: Peter Xu @ 2019-12-11 14:54 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Paolo Bonzini, Jason Wang, linux-kernel, kvm,
	Sean Christopherson, Dr . David Alan Gilbert, Vitaly Kuznetsov

On Wed, Dec 11, 2019 at 08:04:36AM -0500, Michael S. Tsirkin wrote:
> On Wed, Dec 11, 2019 at 10:05:28AM +0100, Paolo Bonzini wrote:
> > On 10/12/19 22:53, Michael S. Tsirkin wrote:
> > > On Tue, Dec 10, 2019 at 11:02:11AM -0500, Peter Xu wrote:
> > >> On Tue, Dec 10, 2019 at 02:31:54PM +0100, Paolo Bonzini wrote:
> > >>> On 10/12/19 14:25, Michael S. Tsirkin wrote:
> > >>>>> There is no new infrastructure to track the dirty pages---it's just a
> > >>>>> different way to pass them to userspace.
> > >>>> Did you guys consider using one of the virtio ring formats?
> > >>>> Maybe reusing vhost code?
> > >>>
> > >>> There are no used/available entries here, it's unidirectional
> > >>> (kernel->user).
> > >>
> > >> Agreed.  Vring could be an overkill IMHO (the whole dirty_ring.c is
> > >> 100+ LOC only).
> > > 
> > > I guess you don't do polling/ event suppression and other tricks that
> > > virtio came up with for speed then?
> 
> I finally looked at the code; there's actually an avail index, and
> fetch works exactly like used.  I'm not saying the existing code is a
> great fit for you, as you have an extra slot parameter to pass and the
> roles are reversed compared to vhost, with the kernel being the driver
> and userspace the device (vringh might fit, though it would need to be
> updated to support packed rings).  But sticking to an existing format
> is a good idea IMHO, or if not, I think it's not a bad idea to add
> some justification.

Right, I'll add a small paragraph in the next cover letter to justify.

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH RFC 00/15] KVM: Dirty ring interface
  2019-12-11 14:16   ` Paolo Bonzini
@ 2019-12-11 17:15     ` Peter Xu
  0 siblings, 0 replies; 123+ messages in thread
From: Peter Xu @ 2019-12-11 17:15 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Christophe de Dinechin, linux-kernel, kvm, Sean Christopherson,
	Dr . David Alan Gilbert, Vitaly Kuznetsov

On Wed, Dec 11, 2019 at 03:16:30PM +0100, Paolo Bonzini wrote:
> On 11/12/19 14:41, Christophe de Dinechin wrote:
> > 
> > Peter Xu writes:
> > 
> >> Branch is here: https://github.com/xzpeter/linux/tree/kvm-dirty-ring
> >>
> >> Overview
> >> ============
> >>
> >> This is a continued work from Lei Cao <lei.cao@stratus.com> and Paolo
> >> on the KVM dirty ring interface.  To make it simple, I'll still start
> >> with version 1 as RFC.
> >>
> >> The new dirty ring interface is another way to collect dirty pages for
> >> the virtual machine, but it is different from the existing dirty
> >> logging interface in a few ways, majorly:
> >>
> >>   - Data format: The dirty data was in a ring format rather than a
> >>     bitmap format, so the size of data to sync for dirty logging does
> >>     not depend on the size of guest memory any more, but speed of
> >>     dirtying.  Also, the dirty ring is per-vcpu (currently plus
> >>     another per-vm ring, so total ring number is N+1), while the dirty
> >>     bitmap is per-vm.
> > 
> > I like Sean's suggestion to fetch rings when dirtying. That could reduce
> > the number of dirty rings to examine.
> 
> What do you mean by "fetch rings"?

I'd wildly guess Christophe means something like creating a ring pool
and trying to find a ring to push the dirty gfn onto when it comes.

OK, should I count it as another vote for Sean's? :)

I agree, but imho a larger number of rings won't really be a problem
as long as they're still per-vcpu (after all we have a vcpu number
limitation which is harder to break...).  What attracted me most about
Sean's suggestion is that the interface is cleaner, in that we don't
need to expose the ring in two places any more.  In the meantime, I
won't care too much about the perf issue here because after all it's
dirty logging.  If perf were critical, then I'd certainly choose the
per-vcpu ring without doubt even if it complicates the interface,
because it would certainly help make some paths lockless.

> 
> > Also, as is, this means that the same gfn may be present in multiple
> > rings, right?
> 
> I think the actual marking of a page as dirty is protected by a spinlock
> but I will defer to Peter on this.

In most cases imho we should be holding the mmu lock iiuc, because the
general mmu page fault path will take it.  However I think there are
special cases:

  - when the spte was already populated and just write protected, it's
    very possible we go via the quick page fault path
    (fast_page_fault()).  That is lockless (no mmu lock taken).

  - when there's no vcpu context, we'll use the per-vm ring.  Though
    the per-vm ring is locked (the per-vcpu ring is not!), I don't see
    how it would prevent two callers from inserting two identical gfns
    sequentially..  It can also happen between the per-vm and per-vcpu
    rings.

So I think gfn duplication could happen, but it should be rare.  Even
if it happens, it won't hurt much, because the 2nd/3rd/... dirty bit
for the same gfn will simply be skipped by userspace when harvesting.
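
For example, a userspace harvest loop naturally folds the duplicates;
just an illustrative sketch (not the QEMU code - dirty_bitmap and
set_bit here are placeholders for whatever userspace uses to
accumulate the dirty pages):

	idx = fetch_index;
	while (idx != avail_index) {
		struct kvm_dirty_gfn *e = &dirty_gfns[idx & (size - 1)];

		/* setting the same bit twice is a harmless no-op */
		set_bit(e->offset, dirty_bitmap[e->slot]);
		idx++;
	}
	fetch_index = idx;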

> 
> Paolo
> 
> >>
> >>   - Data copy: The sync of dirty pages does not need data copy any more,
> >>     but instead the ring is shared between the userspace and kernel by
> >>     page sharings (mmap() on either the vm fd or vcpu fd)
> >>
> >>   - Interface: Instead of using the old KVM_GET_DIRTY_LOG,
> >>     KVM_CLEAR_DIRTY_LOG interfaces, the new ring uses a new interface
> >>     called KVM_RESET_DIRTY_RINGS when we want to reset the collected
> >>     dirty pages to protected mode again (works like
> >>     KVM_CLEAR_DIRTY_LOG, but ring based)
> >>
> >> And more.
> >>
> >> I would appreciate if the reviewers can start with patch "KVM:
> >> Implement ring-based dirty memory tracking", especially the document
> >> update part for the big picture.  Then I'll avoid copying into most of
> >> them into cover letter again.
> >>
> >> I marked this series as RFC because I'm at least uncertain on this
> >> change of vcpu_enter_guest():
> >>
> >>         if (kvm_check_request(KVM_REQ_DIRTY_RING_FULL, vcpu)) {
> >>                 vcpu->run->exit_reason = KVM_EXIT_DIRTY_RING_FULL;
> >>                 /*
> >>                         * If this is requested, it means that we've
> >>                         * marked the dirty bit in the dirty ring BUT
> >>                         * we've not written the date.  Do it now.
> > 
> > not written the "data" ?

Yep, though I'll drop these lines altogether so we'll be fine.. :)

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH RFC 04/15] KVM: Implement ring-based dirty memory tracking
  2019-11-29 21:34 ` [PATCH RFC 04/15] KVM: Implement ring-based dirty memory tracking Peter Xu
                     ` (3 preceding siblings ...)
  2019-12-11 12:53   ` Michael S. Tsirkin
@ 2019-12-11 17:24   ` Christophe de Dinechin
  2019-12-13 20:23     ` Peter Xu
  4 siblings, 1 reply; 123+ messages in thread
From: Christophe de Dinechin @ 2019-12-11 17:24 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-kernel, kvm, Sean Christopherson, Paolo Bonzini,
	Dr . David Alan Gilbert, Vitaly Kuznetsov

Peter Xu writes:

> This patch is heavily based on previous work from Lei Cao
> <lei.cao@stratus.com> and Paolo Bonzini <pbonzini@redhat.com>. [1]
>
> KVM currently uses large bitmaps to track dirty memory.  These bitmaps
> are copied to userspace when userspace queries KVM for its dirty page
> information.  The use of bitmaps is mostly sufficient for live
> migration, as large parts of memory are be dirtied from one log-dirty
> pass to another.

That statement sort of concerns me. If large parts of memory are
dirtied, won't this cause the rings to fill up quickly enough to cause a
lot of churn between user-space and kernel?

See a possible suggestion to address that below.

> However, in a checkpointing system, the number of
> dirty pages is small and in fact it is often bounded---the VM is
> paused when it has dirtied a pre-defined number of pages. Traversing a
> large, sparsely populated bitmap to find set bits is time-consuming,
> as is copying the bitmap to user-space.
>
> A similar issue will be there for live migration when the guest memory
> is huge while the page dirty procedure is trivial.  In that case for
> each dirty sync we need to pull the whole dirty bitmap to userspace
> and analyse every bit even if it's mostly zeros.
>
> The preferred data structure for above scenarios is a dense list of
> guest frame numbers (GFN).  This patch series stores the dirty list in
> kernel memory that can be memory mapped into userspace to allow speedy
> harvesting.
>
> We defined two new data structures:
>
>   struct kvm_dirty_ring;
>   struct kvm_dirty_ring_indexes;
>
> Firstly, kvm_dirty_ring is defined to represent a ring of dirty
> pages.  When dirty tracking is enabled, we can push dirty gfn onto the
> ring.
>
> Secondly, kvm_dirty_ring_indexes is defined to represent the
> user/kernel interface of each ring.  Currently it contains two
> indexes: (1) avail_index represents where we should push our next
> PFN (written by kernel), while (2) fetch_index represents where the
> userspace should fetch the next dirty PFN (written by userspace).
>
> One complete ring is composed by one kvm_dirty_ring plus its
> corresponding kvm_dirty_ring_indexes.
>
> Currently, we have N+1 rings for each VM of N vcpus:
>
>   - for each vcpu, we have 1 per-vcpu dirty ring,
>   - for each vm, we have 1 per-vm dirty ring
>
> Please refer to the documentation update in this patch for more
> details.
>
> Note that this patch implements the core logic of dirty ring buffer.
> It's still disabled for all archs for now.  Also, we'll address some
> of the other issues in follow up patches before it's firstly enabled
> on x86.
>
> [1] https://patchwork.kernel.org/patch/10471409/
>
> Signed-off-by: Lei Cao <lei.cao@stratus.com>
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  Documentation/virt/kvm/api.txt | 109 +++++++++++++++
>  arch/x86/kvm/Makefile          |   3 +-
>  include/linux/kvm_dirty_ring.h |  67 +++++++++
>  include/linux/kvm_host.h       |  33 +++++
>  include/linux/kvm_types.h      |   1 +
>  include/uapi/linux/kvm.h       |  36 +++++
>  virt/kvm/dirty_ring.c          | 156 +++++++++++++++++++++
>  virt/kvm/kvm_main.c            | 240 ++++++++++++++++++++++++++++++++-
>  8 files changed, 642 insertions(+), 3 deletions(-)
>  create mode 100644 include/linux/kvm_dirty_ring.h
>  create mode 100644 virt/kvm/dirty_ring.c
>
> diff --git a/Documentation/virt/kvm/api.txt b/Documentation/virt/kvm/api.txt
> index 49183add44e7..fa622c9a2eb8 100644
> --- a/Documentation/virt/kvm/api.txt
> +++ b/Documentation/virt/kvm/api.txt
> @@ -231,6 +231,7 @@ Based on their initialization different VMs may have different capabilities.
>  It is thus encouraged to use the vm ioctl to query for capabilities (available
>  with KVM_CAP_CHECK_EXTENSION_VM on the vm fd)
>
> +
>  4.5 KVM_GET_VCPU_MMAP_SIZE
>
>  Capability: basic
> @@ -243,6 +244,18 @@ The KVM_RUN ioctl (cf.) communicates with userspace via a shared
>  memory region.  This ioctl returns the size of that region.  See the
>  KVM_RUN documentation for details.
>
> +Besides the size of the KVM_RUN communication region, other areas of
> +the VCPU file descriptor can be mmap-ed, including:
> +
> +- if KVM_CAP_COALESCED_MMIO is available, a page at
> +  KVM_COALESCED_MMIO_PAGE_OFFSET * PAGE_SIZE; for historical reasons,
> +  this page is included in the result of KVM_GET_VCPU_MMAP_SIZE.
> +  KVM_CAP_COALESCED_MMIO is not documented yet.

Does the above really belong to this patch?

> +
> +- if KVM_CAP_DIRTY_LOG_RING is available, a number of pages at
> +  KVM_DIRTY_LOG_PAGE_OFFSET * PAGE_SIZE.  For more information on
> +  KVM_CAP_DIRTY_LOG_RING, see section 8.3.
> +
>
>  4.6 KVM_SET_MEMORY_REGION
>
> @@ -5358,6 +5371,7 @@ CPU when the exception is taken. If this virtual SError is taken to EL1 using
>  AArch64, this value will be reported in the ISS field of ESR_ELx.
>
>  See KVM_CAP_VCPU_EVENTS for more details.
> +
>  8.20 KVM_CAP_HYPERV_SEND_IPI
>
>  Architectures: x86
> @@ -5365,6 +5379,7 @@ Architectures: x86
>  This capability indicates that KVM supports paravirtualized Hyper-V IPI send
>  hypercalls:
>  HvCallSendSyntheticClusterIpi, HvCallSendSyntheticClusterIpiEx.
> +
>  8.21 KVM_CAP_HYPERV_DIRECT_TLBFLUSH
>
>  Architecture: x86
> @@ -5378,3 +5393,97 @@ handling by KVM (as some KVM hypercall may be mistakenly treated as TLB
>  flush hypercalls by Hyper-V) so userspace should disable KVM identification
>  in CPUID and only exposes Hyper-V identification. In this case, guest
>  thinks it's running on Hyper-V and only use Hyper-V hypercalls.
> +
> +8.22 KVM_CAP_DIRTY_LOG_RING
> +
> +Architectures: x86
> +Parameters: args[0] - size of the dirty log ring
> +
> +KVM is capable of tracking dirty memory using ring buffers that are
> +mmaped into userspace; there is one dirty ring per vcpu and one global
> +ring per vm.
> +
> +One dirty ring has the following two major structures:
> +
> +struct kvm_dirty_ring {
> +	u16 dirty_index;
> +	u16 reset_index;

What is the benefit of using u16 for that? That means with 4K pages, you
can share at most 256M of dirty memory each time? That seems low to me,
especially since it's sufficient to touch one byte in a page to dirty it.

Actually, this is not consistent with the definition in the code ;-)
So I'll assume it's actually u32.

> +	u32 size;
> +	u32 soft_limit;
> +	spinlock_t lock;
> +	struct kvm_dirty_gfn *dirty_gfns;
> +};
> +
> +struct kvm_dirty_ring_indexes {
> +	__u32 avail_index; /* set by kernel */
> +	__u32 fetch_index; /* set by userspace */
> +};
> +
> +While for each of the dirty entry it's defined as:
> +
> +struct kvm_dirty_gfn {
> +        __u32 pad;
> +        __u32 slot; /* as_id | slot_id */
> +        __u64 offset;
> +};

Like others have suggested, I think we might use "pad" to store size
information, to be able to dirty large pages more efficiently.
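
Something along these lines, perhaps (purely hypothetical layout, not
what the patch currently defines):

	struct kvm_dirty_gfn {
	        __u32 size;   /* hypothetical: size (or order) of the range */
	        __u32 slot;   /* as_id | slot_id */
	        __u64 offset;
	};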

> +
> +The fields in kvm_dirty_ring will be only internal to KVM itself,
> +while the fields in kvm_dirty_ring_indexes will be exposed to
> +userspace to be either read or written.

The sentence above is confusing when contrasted with the "set by kernel"
comment above.

> +
> +The two indices in the ring buffer are free running counters.

Nit: this patch uses both "indices" and "indexes".
Both are correct, but it would be nice to be consistent.

> +
> +In pseudocode, processing the ring buffer looks like this:
> +
> +	idx = load-acquire(&ring->fetch_index);
> +	while (idx != ring->avail_index) {
> +		struct kvm_dirty_gfn *entry;
> +		entry = &ring->dirty_gfns[idx & (size - 1)];
> +		...
> +
> +		idx++;
> +	}
> +	ring->fetch_index = idx;
> +
> +Userspace calls KVM_ENABLE_CAP ioctl right after KVM_CREATE_VM ioctl
> +to enable this capability for the new guest and set the size of the
> +rings.  It is only allowed before creating any vCPU, and the size of
> +the ring must be a power of two.  The larger the ring buffer, the less
> +likely the ring is full and the VM is forced to exit to userspace. The
> +optimal size depends on the workload, but it is recommended that it be
> +at least 64 KiB (4096 entries).

Is there anything in the design that would preclude resizing the ring
buffer at a later time? Presumably, you'd want a large ring while you
are doing things like migrations, but it's mostly useless when you are
not monitoring memory. So it would be nice to be able to call
KVM_ENABLE_CAP at any time to adjust the size.

As I read the current code, one of the issues would be the mapping of
the rings in case of a later extension where we add something beyond
the rings.  But I'm not sure that's a big deal at the moment.

> +
> +After the capability is enabled, userspace can mmap the global ring
> +buffer (kvm_dirty_gfn[], offset KVM_DIRTY_LOG_PAGE_OFFSET) and the
> +indexes (kvm_dirty_ring_indexes, offset 0) from the VM file
> +descriptor.  The per-vcpu dirty ring instead is mmapped when the vcpu
> +is created, similar to the kvm_run struct (kvm_dirty_ring_indexes
> +locates inside kvm_run, while kvm_dirty_gfn[] at offset
> +KVM_DIRTY_LOG_PAGE_OFFSET).
> +
> +Just like for dirty page bitmaps, the buffer tracks writes to
> +all user memory regions for which the KVM_MEM_LOG_DIRTY_PAGES flag was
> +set in KVM_SET_USER_MEMORY_REGION.  Once a memory region is registered
> +with the flag set, userspace can start harvesting dirty pages from the
> +ring buffer.
> +
> +To harvest the dirty pages, userspace accesses the mmaped ring buffer
> +to read the dirty GFNs up to avail_index, and sets the fetch_index
> +accordingly.  This can be done when the guest is running or paused,
> +and dirty pages need not be collected all at once.  After processing
> +one or more entries in the ring buffer, userspace calls the VM ioctl
> +KVM_RESET_DIRTY_RINGS to notify the kernel that it has updated
> +fetch_index and to mark those pages clean.  Therefore, the ioctl
> +must be called *before* reading the content of the dirty pages.

> +
> +However, there is a major difference comparing to the
> +KVM_GET_DIRTY_LOG interface in that when reading the dirty ring from
> +userspace it's still possible that the kernel has not yet flushed the
> +hardware dirty buffers into the kernel buffer.  To achieve that, one
> +needs to kick the vcpu out for a hardware buffer flush (vmexit).

When you refer to "buffers", are you referring to the cache lines that
contain the ring buffers, or to something else?

I'm a bit confused by this sentence. I think that you mean that a VCPU
may still be running while you read its ring buffer, in which case the
values in the ring buffer are not necessarily in memory yet, so not
visible to a different CPU. But I wonder if you can't make this
requirement to cause a vmexit unnecessary by carefully ordering the
writes, to make sure that the fetch_index is updated only after the
corresponding ring entries have been written to memory.

In other words, as seen by user-space, you would not care that the ring
entries have not been flushed as long as the fetch_index itself is
guaranteed to still be behind the not-flushed-yet entries.

(I would know how to do that on a different architecture, not sure for x86)
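
Roughly something like this, assuming avail_index is the index the
kernel publishes (just a sketch, reusing the kernel's barrier names on
both sides for illustration only):

	/* kernel (producer), for each dirty gfn */
	entry->slot   = slot;
	entry->offset = offset;
	smp_wmb();              /* publish the entry before the index */
	ring->dirty_index++;
	WRITE_ONCE(indexes->avail_index, ring->dirty_index);

	/* userspace (consumer) */
	avail = smp_load_acquire(&indexes->avail_index);
	/* every entry up to avail is now guaranteed to be visible */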

> +
> +If one of the ring buffers is full, the guest will exit to userspace
> +with the exit reason set to KVM_EXIT_DIRTY_LOG_FULL, and the
> +KVM_RUN ioctl will return -EINTR. Once that happens, userspace
> +should pause all the vcpus, then harvest all the dirty pages and
> +rearm the dirty traps. It can unpause the guest after that.

Except for the condition above, why is it necessary to pause other VCPUs
than the one being harvested?


> diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile
> index b19ef421084d..0acee817adfb 100644
> --- a/arch/x86/kvm/Makefile
> +++ b/arch/x86/kvm/Makefile
> @@ -5,7 +5,8 @@ ccflags-y += -Iarch/x86/kvm
>  KVM := ../../../virt/kvm
>
>  kvm-y			+= $(KVM)/kvm_main.o $(KVM)/coalesced_mmio.o \
> -				$(KVM)/eventfd.o $(KVM)/irqchip.o $(KVM)/vfio.o
> +				$(KVM)/eventfd.o $(KVM)/irqchip.o $(KVM)/vfio.o \
> +				$(KVM)/dirty_ring.o
>  kvm-$(CONFIG_KVM_ASYNC_PF)	+= $(KVM)/async_pf.o
>
>  kvm-y			+= x86.o emulate.o i8259.o irq.o lapic.o \
> diff --git a/include/linux/kvm_dirty_ring.h b/include/linux/kvm_dirty_ring.h
> new file mode 100644
> index 000000000000..8335635b7ff7
> --- /dev/null
> +++ b/include/linux/kvm_dirty_ring.h
> @@ -0,0 +1,67 @@
> +#ifndef KVM_DIRTY_RING_H
> +#define KVM_DIRTY_RING_H
> +
> +/*
> + * struct kvm_dirty_ring is defined in include/uapi/linux/kvm.h.
> + *
> + * dirty_ring:  shared with userspace via mmap. It is the compact list
> + *              that holds the dirty pages.
> + * dirty_index: free running counter that points to the next slot in
> + *              dirty_ring->dirty_gfns  where a new dirty page should go.
> + * reset_index: free running counter that points to the next dirty page
> + *              in dirty_ring->dirty_gfns for which dirty trap needs to
> + *              be reenabled
> + * size:        size of the compact list, dirty_ring->dirty_gfns
> + * soft_limit:  when the number of dirty pages in the list reaches this
> + *              limit, vcpu that owns this ring should exit to userspace
> + *              to allow userspace to harvest all the dirty pages
> + * lock:        protects dirty_ring, only in use if this is the global
> + *              ring

If that's not used for vcpu rings, maybe move it out of kvm_dirty_ring?

> + *
> + * The number of dirty pages in the ring is calculated by,
> + * dirty_index - reset_index

Nit: the code calls it "used" (in kvm_dirty_ring_used). Maybe find
unambiguous terminology. What about "posted", as in:

The number of posted dirty pages, i.e. the number of dirty pages in
the ring, is calculated as dirty_index - reset_index by function
kvm_dirty_ring_posted.

(Replace "posted" by any adjective of your liking)

> + *
> + * kernel increments dirty_ring->indices.avail_index after dirty index
> + * is incremented. When userspace harvests the dirty pages, it increments
> + * dirty_ring->indices.fetch_index up to dirty_ring->indices.avail_index.
> + * When kernel reenables dirty traps for the dirty pages, it increments
> + * reset_index up to dirty_ring->indices.fetch_index.

Userspace should not be trusted to be doing this, see below.


> + *
> + */
> +struct kvm_dirty_ring {
> +	u32 dirty_index;
> +	u32 reset_index;
> +	u32 size;
> +	u32 soft_limit;
> +	spinlock_t lock;
> +	struct kvm_dirty_gfn *dirty_gfns;
> +};
> +
> +u32 kvm_dirty_ring_get_rsvd_entries(void);
> +int kvm_dirty_ring_alloc(struct kvm *kvm, struct kvm_dirty_ring *ring);
> +
> +/*
> + * called with kvm->slots_lock held, returns the number of
> + * processed pages.
> + */
> +int kvm_dirty_ring_reset(struct kvm *kvm,
> +			 struct kvm_dirty_ring *ring,
> +			 struct kvm_dirty_ring_indexes *indexes);
> +
> +/*
> + * returns 0: successfully pushed
> + *         1: successfully pushed, soft limit reached,
> + *            vcpu should exit to userspace
> + *         -EBUSY: unable to push, dirty ring full.
> + */
> +int kvm_dirty_ring_push(struct kvm_dirty_ring *ring,
> +			struct kvm_dirty_ring_indexes *indexes,
> +			u32 slot, u64 offset, bool lock);
> +
> +/* for use in vm_operations_struct */
> +struct page *kvm_dirty_ring_get_page(struct kvm_dirty_ring *ring, u32 i);

Not very clear what 'i' means, seems to be a page offset based on call sites?

> +
> +void kvm_dirty_ring_free(struct kvm_dirty_ring *ring);
> +bool kvm_dirty_ring_full(struct kvm_dirty_ring *ring);
> +
> +#endif
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 498a39462ac1..7b747bc9ff3e 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -34,6 +34,7 @@
>  #include <linux/kvm_types.h>
>
>  #include <asm/kvm_host.h>
> +#include <linux/kvm_dirty_ring.h>
>
>  #ifndef KVM_MAX_VCPU_ID
>  #define KVM_MAX_VCPU_ID KVM_MAX_VCPUS
> @@ -146,6 +147,7 @@ static inline bool is_error_page(struct page *page)
>  #define KVM_REQ_MMU_RELOAD        (1 | KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
>  #define KVM_REQ_PENDING_TIMER     2
>  #define KVM_REQ_UNHALT            3
> +#define KVM_REQ_DIRTY_RING_FULL   4
>  #define KVM_REQUEST_ARCH_BASE     8
>
>  #define KVM_ARCH_REQ_FLAGS(nr, flags) ({ \
> @@ -321,6 +323,7 @@ struct kvm_vcpu {
>  	bool ready;
>  	struct kvm_vcpu_arch arch;
>  	struct dentry *debugfs_dentry;
> +	struct kvm_dirty_ring dirty_ring;
>  };
>
>  static inline int kvm_vcpu_exiting_guest_mode(struct kvm_vcpu *vcpu)
> @@ -501,6 +504,10 @@ struct kvm {
>  	struct srcu_struct srcu;
>  	struct srcu_struct irq_srcu;
>  	pid_t userspace_pid;
> +	/* Data structure to be exported by mmap(kvm->fd, 0) */
> +	struct kvm_vm_run *vm_run;
> +	u32 dirty_ring_size;
> +	struct kvm_dirty_ring vm_dirty_ring;

If you remove the lock from struct kvm_dirty_ring, you could just put it there.

>  };
>
>  #define kvm_err(fmt, ...) \
> @@ -832,6 +839,8 @@ void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm *kvm,
>  					gfn_t gfn_offset,
>  					unsigned long mask);
>
> +void kvm_reset_dirty_gfn(struct kvm *kvm, u32 slot, u64 offset, u64 mask);
> +
>  int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm,
>  				struct kvm_dirty_log *log);
>  int kvm_vm_ioctl_clear_dirty_log(struct kvm *kvm,
> @@ -1411,4 +1420,28 @@ int kvm_vm_create_worker_thread(struct kvm *kvm, kvm_vm_thread_fn_t thread_fn,
>  				uintptr_t data, const char *name,
>  				struct task_struct **thread_ptr);
>
> +/*
> + * This defines how many reserved entries we want to keep before we
> + * kick the vcpu to the userspace to avoid dirty ring full.  This
> + * value can be tuned to higher if e.g. PML is enabled on the host.
> + */
> +#define  KVM_DIRTY_RING_RSVD_ENTRIES  64
> +
> +/* Max number of entries allowed for each kvm dirty ring */
> +#define  KVM_DIRTY_RING_MAX_ENTRIES  65536
> +
> +/*
> + * Arch needs to define these macro after implementing the dirty ring
> + * feature.  KVM_DIRTY_LOG_PAGE_OFFSET should be defined as the
> + * starting page offset of the dirty ring structures, while
> + * KVM_DIRTY_RING_VERSION should be defined as >=1.  By default, this
> + * feature is off on all archs.
> + */
> +#ifndef KVM_DIRTY_LOG_PAGE_OFFSET
> +#define KVM_DIRTY_LOG_PAGE_OFFSET 0
> +#endif
> +#ifndef KVM_DIRTY_RING_VERSION
> +#define KVM_DIRTY_RING_VERSION 0
> +#endif
> +
>  #endif
> diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h
> index 1c88e69db3d9..d9d03eea145a 100644
> --- a/include/linux/kvm_types.h
> +++ b/include/linux/kvm_types.h
> @@ -11,6 +11,7 @@ struct kvm_irq_routing_table;
>  struct kvm_memory_slot;
>  struct kvm_one_reg;
>  struct kvm_run;
> +struct kvm_vm_run;
>  struct kvm_userspace_memory_region;
>  struct kvm_vcpu;
>  struct kvm_vcpu_init;
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index e6f17c8e2dba..0b88d76d6215 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -236,6 +236,7 @@ struct kvm_hyperv_exit {
>  #define KVM_EXIT_IOAPIC_EOI       26
>  #define KVM_EXIT_HYPERV           27
>  #define KVM_EXIT_ARM_NISV         28
> +#define KVM_EXIT_DIRTY_RING_FULL  29
>
>  /* For KVM_EXIT_INTERNAL_ERROR */
>  /* Emulate instruction failed. */
> @@ -247,6 +248,11 @@ struct kvm_hyperv_exit {
>  /* Encounter unexpected vm-exit reason */
>  #define KVM_INTERNAL_ERROR_UNEXPECTED_EXIT_REASON	4
>
> +struct kvm_dirty_ring_indexes {
> +	__u32 avail_index; /* set by kernel */
> +	__u32 fetch_index; /* set by userspace */
> +};
> +
>  /* for KVM_RUN, returned by mmap(vcpu_fd, offset=0) */
>  struct kvm_run {
>  	/* in */
> @@ -421,6 +427,13 @@ struct kvm_run {
>  		struct kvm_sync_regs regs;
>  		char padding[SYNC_REGS_SIZE_BYTES];
>  	} s;
> +
> +	struct kvm_dirty_ring_indexes vcpu_ring_indexes;
> +};
> +
> +/* Returned by mmap(kvm->fd, offset=0) */
> +struct kvm_vm_run {
> +	struct kvm_dirty_ring_indexes vm_ring_indexes;
>  };
>
>  /* for KVM_REGISTER_COALESCED_MMIO / KVM_UNREGISTER_COALESCED_MMIO */
> @@ -1009,6 +1022,7 @@ struct kvm_ppc_resize_hpt {
>  #define KVM_CAP_PPC_GUEST_DEBUG_SSTEP 176
>  #define KVM_CAP_ARM_NISV_TO_USER 177
>  #define KVM_CAP_ARM_INJECT_EXT_DABT 178
> +#define KVM_CAP_DIRTY_LOG_RING 179
>
>  #ifdef KVM_CAP_IRQ_ROUTING
>
> @@ -1472,6 +1486,9 @@ struct kvm_enc_region {
>  /* Available with KVM_CAP_ARM_SVE */
>  #define KVM_ARM_VCPU_FINALIZE	  _IOW(KVMIO,  0xc2, int)
>
> +/* Available with KVM_CAP_DIRTY_LOG_RING */
> +#define KVM_RESET_DIRTY_RINGS     _IO(KVMIO, 0xc3)
> +
>  /* Secure Encrypted Virtualization command */
>  enum sev_cmd_id {
>  	/* Guest initialization commands */
> @@ -1622,4 +1639,23 @@ struct kvm_hyperv_eventfd {
>  #define KVM_HYPERV_CONN_ID_MASK		0x00ffffff
>  #define KVM_HYPERV_EVENTFD_DEASSIGN	(1 << 0)
>
> +/*
> + * The following are the requirements for supporting dirty log ring
> + * (by enabling KVM_DIRTY_LOG_PAGE_OFFSET).
> + *
> + * 1. Memory accesses by KVM should call kvm_vcpu_write_* instead
> + *    of kvm_write_* so that the global dirty ring is not filled up
> + *    too quickly.
> + * 2. kvm_arch_mmu_enable_log_dirty_pt_masked should be defined for
> + *    enabling dirty logging.
> + * 3. There should not be a separate step to synchronize hardware
> + *    dirty bitmap with KVM's.
> + */
> +
> +struct kvm_dirty_gfn {
> +	__u32 pad;
> +	__u32 slot;
> +	__u64 offset;
> +};
> +
>  #endif /* __LINUX_KVM_H */
> diff --git a/virt/kvm/dirty_ring.c b/virt/kvm/dirty_ring.c
> new file mode 100644
> index 000000000000..9264891f3c32
> --- /dev/null
> +++ b/virt/kvm/dirty_ring.c
> @@ -0,0 +1,156 @@
> +#include <linux/kvm_host.h>
> +#include <linux/kvm.h>
> +#include <linux/vmalloc.h>
> +#include <linux/kvm_dirty_ring.h>
> +
> +u32 kvm_dirty_ring_get_rsvd_entries(void)
> +{
> +	return KVM_DIRTY_RING_RSVD_ENTRIES + kvm_cpu_dirty_log_size();
> +}
> +
> +int kvm_dirty_ring_alloc(struct kvm *kvm, struct kvm_dirty_ring *ring)
> +{
> +	u32 size = kvm->dirty_ring_size;
> +
> +	ring->dirty_gfns = vmalloc(size);
> +	if (!ring->dirty_gfns)
> +		return -ENOMEM;
> +	memset(ring->dirty_gfns, 0, size);
> +
> +	ring->size = size / sizeof(struct kvm_dirty_gfn);
> +	ring->soft_limit =
> +	    (kvm->dirty_ring_size / sizeof(struct kvm_dirty_gfn)) -
> +	    kvm_dirty_ring_get_rsvd_entries();

Minor, but what about

       ring->soft_limit = ring->size - kvm_dirty_ring_get_rsvd_entries();


> +	ring->dirty_index = 0;
> +	ring->reset_index = 0;
> +	spin_lock_init(&ring->lock);
> +
> +	return 0;
> +}
> +
> +int kvm_dirty_ring_reset(struct kvm *kvm,
> +			 struct kvm_dirty_ring *ring,
> +			 struct kvm_dirty_ring_indexes *indexes)
> +{
> +	u32 cur_slot, next_slot;
> +	u64 cur_offset, next_offset;
> +	unsigned long mask;
> +	u32 fetch;
> +	int count = 0;
> +	struct kvm_dirty_gfn *entry;
> +
> +	fetch = READ_ONCE(indexes->fetch_index);

If I understand correctly, if a malicious user-space writes
ring->reset_index - 1 into fetch_index, the loop below will execute 4
billion times.


> +	if (fetch == ring->reset_index)
> +		return 0;

To protect against scenario above, I would have something like:

	if (fetch - ring->reset_index >= ring->size)
		return -EINVAL;

> +
> +	entry = &ring->dirty_gfns[ring->reset_index & (ring->size - 1)];
> +	/*
> +	 * The ring buffer is shared with userspace, which might mmap
> +	 * it and concurrently modify slot and offset.  Userspace must
> +	 * not be trusted!  READ_ONCE prevents the compiler from changing
> +	 * the values after they've been range-checked (the checks are
> +	 * in kvm_reset_dirty_gfn).
> +	 */
> +	smp_read_barrier_depends();
> +	cur_slot = READ_ONCE(entry->slot);
> +	cur_offset = READ_ONCE(entry->offset);
> +	mask = 1;
> +	count++;
> +	ring->reset_index++;
> +	while (ring->reset_index != fetch) {
> +		entry = &ring->dirty_gfns[ring->reset_index & (ring->size - 1)];
> +		smp_read_barrier_depends();
> +		next_slot = READ_ONCE(entry->slot);
> +		next_offset = READ_ONCE(entry->offset);
> +		ring->reset_index++;
> +		count++;
> +		/*
> +		 * Try to coalesce the reset operations when the guest is
> +		 * scanning pages in the same slot.
> +		 */
> +		if (next_slot == cur_slot) {
> +			int delta = next_offset - cur_offset;

Since you diff two u64, shouldn't that be an i64 rather than int?

> +
> +			if (delta >= 0 && delta < BITS_PER_LONG) {
> +				mask |= 1ull << delta;
> +				continue;
> +			}
> +
> +			/* Backwards visit, careful about overflows!  */
> +			if (delta > -BITS_PER_LONG && delta < 0 &&
> +			    (mask << -delta >> -delta) == mask) {
> +				cur_offset = next_offset;
> +				mask = (mask << -delta) | 1;
> +				continue;
> +			}
> +		}
> +		kvm_reset_dirty_gfn(kvm, cur_slot, cur_offset, mask);
> +		cur_slot = next_slot;
> +		cur_offset = next_offset;
> +		mask = 1;
> +	}
> +	kvm_reset_dirty_gfn(kvm, cur_slot, cur_offset, mask);

So if you did not coalesce the last one, you call kvm_reset_dirty_gfn
twice? Something smells weird about this loop ;-) I have a gut feeling
that it could be done in a single while loop combined with the entry
test, but I may be wrong.


> +
> +	return count;
> +}
> +
> +static inline u32 kvm_dirty_ring_used(struct kvm_dirty_ring *ring)
> +{
> +	return ring->dirty_index - ring->reset_index;
> +}
> +
> +bool kvm_dirty_ring_full(struct kvm_dirty_ring *ring)
> +{
> +	return kvm_dirty_ring_used(ring) >= ring->size;
> +}
> +
> +/*
> + * Returns:
> + *   >0 if we should kick the vcpu out,
> + *   =0 if the gfn pushed successfully, or,
> + *   <0 if error (e.g. ring full)
> + */
> +int kvm_dirty_ring_push(struct kvm_dirty_ring *ring,
> +			struct kvm_dirty_ring_indexes *indexes,
> +			u32 slot, u64 offset, bool lock)

Obviously, if you go with the suggestion to have a "lock" only in struct
kvm, then you'd have to pass a lock ptr instead of a bool.

> +{
> +	int ret;
> +	struct kvm_dirty_gfn *entry;
> +
> +	if (lock)
> +		spin_lock(&ring->lock);
> +
> +	if (kvm_dirty_ring_full(ring)) {
> +		ret = -EBUSY;
> +		goto out;
> +	}
> +
> +	entry = &ring->dirty_gfns[ring->dirty_index & (ring->size - 1)];
> +	entry->slot = slot;
> +	entry->offset = offset;
> +	smp_wmb();
> +	ring->dirty_index++;
> +	WRITE_ONCE(indexes->avail_index, ring->dirty_index);

Following up on the comment about having to vmexit other VCPUs above:
If you have a write barrier for the entry, and then a write once for the
index, isn't that sufficient to ensure that another CPU will pick up the
right values in the right order?


> +	ret = kvm_dirty_ring_used(ring) >= ring->soft_limit;
> +	pr_info("%s: slot %u offset %llu used %u\n",
> +		__func__, slot, offset, kvm_dirty_ring_used(ring));
> +
> +out:
> +	if (lock)
> +		spin_unlock(&ring->lock);
> +
> +	return ret;
> +}
> +
> +struct page *kvm_dirty_ring_get_page(struct kvm_dirty_ring *ring, u32 i)

Still don't like 'i' :-)


(Stopped my review here for lack of time, decided to share what I had so far)

> +{
> +	return vmalloc_to_page((void *)ring->dirty_gfns + i * PAGE_SIZE);
> +}
> +
> +void kvm_dirty_ring_free(struct kvm_dirty_ring *ring)
> +{
> +	if (ring->dirty_gfns) {
> +		vfree(ring->dirty_gfns);
> +		ring->dirty_gfns = NULL;
> +	}
> +}
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 681452d288cd..8642c977629b 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -64,6 +64,8 @@
>  #define CREATE_TRACE_POINTS
>  #include <trace/events/kvm.h>
>
> +#include <linux/kvm_dirty_ring.h>
> +
>  /* Worst case buffer size needed for holding an integer. */
>  #define ITOA_MAX_LEN 12
>
> @@ -149,6 +151,10 @@ static void mark_page_dirty_in_slot(struct kvm *kvm,
>  				    struct kvm_vcpu *vcpu,
>  				    struct kvm_memory_slot *memslot,
>  				    gfn_t gfn);
> +static void mark_page_dirty_in_ring(struct kvm *kvm,
> +				    struct kvm_vcpu *vcpu,
> +				    struct kvm_memory_slot *slot,
> +				    gfn_t gfn);
>
>  __visible bool kvm_rebooting;
>  EXPORT_SYMBOL_GPL(kvm_rebooting);
> @@ -359,11 +365,22 @@ int kvm_vcpu_init(struct kvm_vcpu *vcpu, struct kvm *kvm, unsigned id)
>  	vcpu->preempted = false;
>  	vcpu->ready = false;
>
> +	if (kvm->dirty_ring_size) {
> +		r = kvm_dirty_ring_alloc(vcpu->kvm, &vcpu->dirty_ring);
> +		if (r) {
> +			kvm->dirty_ring_size = 0;
> +			goto fail_free_run;
> +		}
> +	}
> +
>  	r = kvm_arch_vcpu_init(vcpu);
>  	if (r < 0)
> -		goto fail_free_run;
> +		goto fail_free_ring;
>  	return 0;
>
> +fail_free_ring:
> +	if (kvm->dirty_ring_size)
> +		kvm_dirty_ring_free(&vcpu->dirty_ring);
>  fail_free_run:
>  	free_page((unsigned long)vcpu->run);
>  fail:
> @@ -381,6 +398,8 @@ void kvm_vcpu_uninit(struct kvm_vcpu *vcpu)
>  	put_pid(rcu_dereference_protected(vcpu->pid, 1));
>  	kvm_arch_vcpu_uninit(vcpu);
>  	free_page((unsigned long)vcpu->run);
> +	if (vcpu->kvm->dirty_ring_size)
> +		kvm_dirty_ring_free(&vcpu->dirty_ring);
>  }
>  EXPORT_SYMBOL_GPL(kvm_vcpu_uninit);
>
> @@ -690,6 +709,7 @@ static struct kvm *kvm_create_vm(unsigned long type)
>  	struct kvm *kvm = kvm_arch_alloc_vm();
>  	int r = -ENOMEM;
>  	int i;
> +	struct page *page;
>
>  	if (!kvm)
>  		return ERR_PTR(-ENOMEM);
> @@ -705,6 +725,14 @@ static struct kvm *kvm_create_vm(unsigned long type)
>
>  	BUILD_BUG_ON(KVM_MEM_SLOTS_NUM > SHRT_MAX);
>
> +	page = alloc_page(GFP_KERNEL | __GFP_ZERO);
> +	if (!page) {
> +		r = -ENOMEM;
> +		goto out_err_alloc_page;
> +	}
> +	kvm->vm_run = page_address(page);
> +	BUILD_BUG_ON(sizeof(struct kvm_vm_run) > PAGE_SIZE);
> +
>  	if (init_srcu_struct(&kvm->srcu))
>  		goto out_err_no_srcu;
>  	if (init_srcu_struct(&kvm->irq_srcu))
> @@ -775,6 +803,9 @@ static struct kvm *kvm_create_vm(unsigned long type)
>  out_err_no_irq_srcu:
>  	cleanup_srcu_struct(&kvm->srcu);
>  out_err_no_srcu:
> +	free_page((unsigned long)page);
> +	kvm->vm_run = NULL;
> +out_err_alloc_page:
>  	kvm_arch_free_vm(kvm);
>  	mmdrop(current->mm);
>  	return ERR_PTR(r);
> @@ -800,6 +831,15 @@ static void kvm_destroy_vm(struct kvm *kvm)
>  	int i;
>  	struct mm_struct *mm = kvm->mm;
>
> +	if (kvm->dirty_ring_size) {
> +		kvm_dirty_ring_free(&kvm->vm_dirty_ring);
> +	}
> +
> +	if (kvm->vm_run) {
> +		free_page((unsigned long)kvm->vm_run);
> +		kvm->vm_run = NULL;
> +	}
> +
>  	kvm_uevent_notify_change(KVM_EVENT_DESTROY_VM, kvm);
>  	kvm_destroy_vm_debugfs(kvm);
>  	kvm_arch_sync_events(kvm);
> @@ -2301,7 +2341,7 @@ static void mark_page_dirty_in_slot(struct kvm *kvm,
>  {
>  	if (memslot && memslot->dirty_bitmap) {
>  		unsigned long rel_gfn = gfn - memslot->base_gfn;
> -
> +		mark_page_dirty_in_ring(kvm, vcpu, memslot, gfn);
>  		set_bit_le(rel_gfn, memslot->dirty_bitmap);
>  	}
>  }
> @@ -2649,6 +2689,13 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me, bool yield_to_kernel_mode)
>  }
>  EXPORT_SYMBOL_GPL(kvm_vcpu_on_spin);
>
> +static bool kvm_fault_in_dirty_ring(struct kvm *kvm, struct vm_fault *vmf)
> +{
> +	return (vmf->pgoff >= KVM_DIRTY_LOG_PAGE_OFFSET) &&
> +	    (vmf->pgoff < KVM_DIRTY_LOG_PAGE_OFFSET +
> +	     kvm->dirty_ring_size / PAGE_SIZE);
> +}
> +
>  static vm_fault_t kvm_vcpu_fault(struct vm_fault *vmf)
>  {
>  	struct kvm_vcpu *vcpu = vmf->vma->vm_file->private_data;
> @@ -2664,6 +2711,10 @@ static vm_fault_t kvm_vcpu_fault(struct vm_fault *vmf)
>  	else if (vmf->pgoff == KVM_COALESCED_MMIO_PAGE_OFFSET)
>  		page = virt_to_page(vcpu->kvm->coalesced_mmio_ring);
>  #endif
> +	else if (kvm_fault_in_dirty_ring(vcpu->kvm, vmf))
> +		page = kvm_dirty_ring_get_page(
> +		    &vcpu->dirty_ring,
> +		    vmf->pgoff - KVM_DIRTY_LOG_PAGE_OFFSET);
>  	else
>  		return kvm_arch_vcpu_fault(vcpu, vmf);
>  	get_page(page);
> @@ -3259,12 +3310,162 @@ static long kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
>  #endif
>  	case KVM_CAP_NR_MEMSLOTS:
>  		return KVM_USER_MEM_SLOTS;
> +	case KVM_CAP_DIRTY_LOG_RING:
> +		/* Version will be zero if arch didn't implement it */
> +		return KVM_DIRTY_RING_VERSION;
>  	default:
>  		break;
>  	}
>  	return kvm_vm_ioctl_check_extension(kvm, arg);
>  }
>
> +static void mark_page_dirty_in_ring(struct kvm *kvm,
> +				    struct kvm_vcpu *vcpu,
> +				    struct kvm_memory_slot *slot,
> +				    gfn_t gfn)
> +{
> +	u32 as_id = 0;
> +	u64 offset;
> +	int ret;
> +	struct kvm_dirty_ring *ring;
> +	struct kvm_dirty_ring_indexes *indexes;
> +	bool is_vm_ring;
> +
> +	if (!kvm->dirty_ring_size)
> +		return;
> +
> +	offset = gfn - slot->base_gfn;
> +
> +	if (vcpu) {
> +		as_id = kvm_arch_vcpu_memslots_id(vcpu);
> +	} else {
> +		as_id = 0;
> +		vcpu = kvm_get_running_vcpu();
> +	}
> +
> +	if (vcpu) {
> +		ring = &vcpu->dirty_ring;
> +		indexes = &vcpu->run->vcpu_ring_indexes;
> +		is_vm_ring = false;
> +	} else {
> +		/*
> +		 * Put onto per vm ring because no vcpu context.  Kick
> +		 * vcpu0 if ring is full.
> +		 */
> +		vcpu = kvm->vcpus[0];
> +		ring = &kvm->vm_dirty_ring;
> +		indexes = &kvm->vm_run->vm_ring_indexes;
> +		is_vm_ring = true;
> +	}
> +
> +	ret = kvm_dirty_ring_push(ring, indexes,
> +				  (as_id << 16)|slot->id, offset,
> +				  is_vm_ring);
> +	if (ret < 0) {
> +		if (is_vm_ring)
> +			pr_warn_once("vcpu %d dirty log overflow\n",
> +				     vcpu->vcpu_id);
> +		else
> +			pr_warn_once("per-vm dirty log overflow\n");
> +		return;
> +	}
> +
> +	if (ret)
> +		kvm_make_request(KVM_REQ_DIRTY_RING_FULL, vcpu);
> +}
> +
> +void kvm_reset_dirty_gfn(struct kvm *kvm, u32 slot, u64 offset, u64 mask)
> +{
> +	struct kvm_memory_slot *memslot;
> +	int as_id, id;
> +
> +	as_id = slot >> 16;
> +	id = (u16)slot;
> +	if (as_id >= KVM_ADDRESS_SPACE_NUM || id >= KVM_USER_MEM_SLOTS)
> +		return;
> +
> +	memslot = id_to_memslot(__kvm_memslots(kvm, as_id), id);
> +	if (offset >= memslot->npages)
> +		return;
> +
> +	spin_lock(&kvm->mmu_lock);
> +	/* FIXME: we should use a single AND operation, but there is no
> +	 * applicable atomic API.
> +	 */
> +	while (mask) {
> +		clear_bit_le(offset + __ffs(mask), memslot->dirty_bitmap);
> +		mask &= mask - 1;
> +	}
> +
> +	kvm_arch_mmu_enable_log_dirty_pt_masked(kvm, memslot, offset, mask);
> +	spin_unlock(&kvm->mmu_lock);
> +}
> +
> +static int kvm_vm_ioctl_enable_dirty_log_ring(struct kvm *kvm, u32 size)
> +{
> +	int r;
> +
> +	/* the size should be power of 2 */
> +	if (!size || (size & (size - 1)))
> +		return -EINVAL;
> +
> +	/* Should be bigger to keep the reserved entries, or a page */
> +	if (size < kvm_dirty_ring_get_rsvd_entries() *
> +	    sizeof(struct kvm_dirty_gfn) || size < PAGE_SIZE)
> +		return -EINVAL;
> +
> +	if (size > KVM_DIRTY_RING_MAX_ENTRIES *
> +	    sizeof(struct kvm_dirty_gfn))
> +		return -E2BIG;
> +
> +	/* We only allow it to set once */
> +	if (kvm->dirty_ring_size)
> +		return -EINVAL;
> +
> +	mutex_lock(&kvm->lock);
> +
> +	if (kvm->created_vcpus) {
> +		/* We don't allow to change this value after vcpu created */
> +		r = -EINVAL;
> +	} else {
> +		kvm->dirty_ring_size = size;
> +		r = kvm_dirty_ring_alloc(kvm, &kvm->vm_dirty_ring);
> +		if (r) {
> +			/* Unset dirty ring */
> +			kvm->dirty_ring_size = 0;
> +		}
> +	}
> +
> +	mutex_unlock(&kvm->lock);
> +	return r;
> +}
> +
> +static int kvm_vm_ioctl_reset_dirty_pages(struct kvm *kvm)
> +{
> +	int i;
> +	struct kvm_vcpu *vcpu;
> +	int cleared = 0;
> +
> +	if (!kvm->dirty_ring_size)
> +		return -EINVAL;
> +
> +	mutex_lock(&kvm->slots_lock);
> +
> +	cleared += kvm_dirty_ring_reset(kvm, &kvm->vm_dirty_ring,
> +					&kvm->vm_run->vm_ring_indexes);
> +
> +	kvm_for_each_vcpu(i, vcpu, kvm)
> +		cleared += kvm_dirty_ring_reset(vcpu->kvm, &vcpu->dirty_ring,
> +						&vcpu->run->vcpu_ring_indexes);
> +
> +	mutex_unlock(&kvm->slots_lock);
> +
> +	if (cleared)
> +		kvm_flush_remote_tlbs(kvm);
> +
> +	return cleared;
> +}
> +
>  int __attribute__((weak)) kvm_vm_ioctl_enable_cap(struct kvm *kvm,
>  						  struct kvm_enable_cap *cap)
>  {
> @@ -3282,6 +3483,8 @@ static int kvm_vm_ioctl_enable_cap_generic(struct kvm *kvm,
>  		kvm->manual_dirty_log_protect = cap->args[0];
>  		return 0;
>  #endif
> +	case KVM_CAP_DIRTY_LOG_RING:
> +		return kvm_vm_ioctl_enable_dirty_log_ring(kvm, cap->args[0]);
>  	default:
>  		return kvm_vm_ioctl_enable_cap(kvm, cap);
>  	}
> @@ -3469,6 +3672,9 @@ static long kvm_vm_ioctl(struct file *filp,
>  	case KVM_CHECK_EXTENSION:
>  		r = kvm_vm_ioctl_check_extension_generic(kvm, arg);
>  		break;
> +	case KVM_RESET_DIRTY_RINGS:
> +		r = kvm_vm_ioctl_reset_dirty_pages(kvm);
> +		break;
>  	default:
>  		r = kvm_arch_vm_ioctl(filp, ioctl, arg);
>  	}
> @@ -3517,9 +3723,39 @@ static long kvm_vm_compat_ioctl(struct file *filp,
>  }
>  #endif
>
> +static vm_fault_t kvm_vm_fault(struct vm_fault *vmf)
> +{
> +	struct kvm *kvm = vmf->vma->vm_file->private_data;
> +	struct page *page = NULL;
> +
> +	if (vmf->pgoff == 0)
> +		page = virt_to_page(kvm->vm_run);
> +	else if (kvm_fault_in_dirty_ring(kvm, vmf))
> +		page = kvm_dirty_ring_get_page(
> +		    &kvm->vm_dirty_ring,
> +		    vmf->pgoff - KVM_DIRTY_LOG_PAGE_OFFSET);
> +	else
> +		return VM_FAULT_SIGBUS;
> +
> +	get_page(page);
> +	vmf->page = page;
> +	return 0;
> +}
> +
> +static const struct vm_operations_struct kvm_vm_vm_ops = {
> +	.fault = kvm_vm_fault,
> +};
> +
> +static int kvm_vm_mmap(struct file *file, struct vm_area_struct *vma)
> +{
> +	vma->vm_ops = &kvm_vm_vm_ops;
> +	return 0;
> +}
> +
>  static struct file_operations kvm_vm_fops = {
>  	.release        = kvm_vm_release,
>  	.unlocked_ioctl = kvm_vm_ioctl,
> +	.mmap           = kvm_vm_mmap,
>  	.llseek		= noop_llseek,
>  	KVM_COMPAT(kvm_vm_compat_ioctl),
>  };


--
Cheers,
Christophe de Dinechin (IRC c3d)

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH RFC 04/15] KVM: Implement ring-based dirty memory tracking
  2019-12-11 12:53   ` Michael S. Tsirkin
  2019-12-11 14:14     ` Paolo Bonzini
@ 2019-12-11 20:59     ` Peter Xu
  2019-12-11 22:57       ` Michael S. Tsirkin
  1 sibling, 1 reply; 123+ messages in thread
From: Peter Xu @ 2019-12-11 20:59 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: linux-kernel, kvm, Sean Christopherson, Paolo Bonzini,
	Dr . David Alan Gilbert, Vitaly Kuznetsov

On Wed, Dec 11, 2019 at 07:53:48AM -0500, Michael S. Tsirkin wrote:
> On Fri, Nov 29, 2019 at 04:34:54PM -0500, Peter Xu wrote:
> > This patch is heavily based on previous work from Lei Cao
> > <lei.cao@stratus.com> and Paolo Bonzini <pbonzini@redhat.com>. [1]
> > 
> > KVM currently uses large bitmaps to track dirty memory.  These bitmaps
> > are copied to userspace when userspace queries KVM for its dirty page
> > information.  The use of bitmaps is mostly sufficient for live
> > migration, as large parts of memory are be dirtied from one log-dirty
> > pass to another.  However, in a checkpointing system, the number of
> > dirty pages is small and in fact it is often bounded---the VM is
> > paused when it has dirtied a pre-defined number of pages. Traversing a
> > large, sparsely populated bitmap to find set bits is time-consuming,
> > as is copying the bitmap to user-space.
> > 
> > A similar issue will be there for live migration when the guest memory
> > is huge while the page dirty procedure is trivial.  In that case for
> > each dirty sync we need to pull the whole dirty bitmap to userspace
> > and analyse every bit even if it's mostly zeros.
> > 
> > The preferred data structure for above scenarios is a dense list of
> > guest frame numbers (GFN).  This patch series stores the dirty list in
> > kernel memory that can be memory mapped into userspace to allow speedy
> > harvesting.
> > 
> > We defined two new data structures:
> > 
> >   struct kvm_dirty_ring;
> >   struct kvm_dirty_ring_indexes;
> > 
> > Firstly, kvm_dirty_ring is defined to represent a ring of dirty
> > pages.  When dirty tracking is enabled, we can push dirty gfn onto the
> > ring.
> > 
> > Secondly, kvm_dirty_ring_indexes is defined to represent the
> > user/kernel interface of each ring.  Currently it contains two
> > indexes: (1) avail_index represents where we should push our next
> > PFN (written by kernel), while (2) fetch_index represents where the
> > userspace should fetch the next dirty PFN (written by userspace).
> > 
> > One complete ring is composed by one kvm_dirty_ring plus its
> > corresponding kvm_dirty_ring_indexes.
> > 
> > Currently, we have N+1 rings for each VM of N vcpus:
> > 
> >   - for each vcpu, we have 1 per-vcpu dirty ring,
> >   - for each vm, we have 1 per-vm dirty ring
> > 
> > Please refer to the documentation update in this patch for more
> > details.
> > 
> > Note that this patch implements the core logic of dirty ring buffer.
> > It's still disabled for all archs for now.  Also, we'll address some
> > of the other issues in follow up patches before it's firstly enabled
> > on x86.
> > 
> > [1] https://patchwork.kernel.org/patch/10471409/
> > 
> > Signed-off-by: Lei Cao <lei.cao@stratus.com>
> > Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> 
> 
> Thanks, that's interesting.

Hi, Michael,

Thanks for reading the series.

> 
> > ---
> >  Documentation/virt/kvm/api.txt | 109 +++++++++++++++
> >  arch/x86/kvm/Makefile          |   3 +-
> >  include/linux/kvm_dirty_ring.h |  67 +++++++++
> >  include/linux/kvm_host.h       |  33 +++++
> >  include/linux/kvm_types.h      |   1 +
> >  include/uapi/linux/kvm.h       |  36 +++++
> >  virt/kvm/dirty_ring.c          | 156 +++++++++++++++++++++
> >  virt/kvm/kvm_main.c            | 240 ++++++++++++++++++++++++++++++++-
> >  8 files changed, 642 insertions(+), 3 deletions(-)
> >  create mode 100644 include/linux/kvm_dirty_ring.h
> >  create mode 100644 virt/kvm/dirty_ring.c
> > 
> > diff --git a/Documentation/virt/kvm/api.txt b/Documentation/virt/kvm/api.txt
> > index 49183add44e7..fa622c9a2eb8 100644
> > --- a/Documentation/virt/kvm/api.txt
> > +++ b/Documentation/virt/kvm/api.txt
> > @@ -231,6 +231,7 @@ Based on their initialization different VMs may have different capabilities.
> >  It is thus encouraged to use the vm ioctl to query for capabilities (available
> >  with KVM_CAP_CHECK_EXTENSION_VM on the vm fd)
> >  
> > +
> >  4.5 KVM_GET_VCPU_MMAP_SIZE
> >  
> >  Capability: basic
> > @@ -243,6 +244,18 @@ The KVM_RUN ioctl (cf.) communicates with userspace via a shared
> >  memory region.  This ioctl returns the size of that region.  See the
> >  KVM_RUN documentation for details.
> >  
> > +Besides the size of the KVM_RUN communication region, other areas of
> > +the VCPU file descriptor can be mmap-ed, including:
> > +
> > +- if KVM_CAP_COALESCED_MMIO is available, a page at
> > +  KVM_COALESCED_MMIO_PAGE_OFFSET * PAGE_SIZE; for historical reasons,
> > +  this page is included in the result of KVM_GET_VCPU_MMAP_SIZE.
> > +  KVM_CAP_COALESCED_MMIO is not documented yet.
> > +
> > +- if KVM_CAP_DIRTY_LOG_RING is available, a number of pages at
> > +  KVM_DIRTY_LOG_PAGE_OFFSET * PAGE_SIZE.  For more information on
> > +  KVM_CAP_DIRTY_LOG_RING, see section 8.3.
> > +
> >  
> >  4.6 KVM_SET_MEMORY_REGION
> >  
> 
> PAGE_SIZE being which value? It's not always trivial for
> userspace to know what's the PAGE_SIZE for the kernel ...

I thought it could easily be fetched from getpagesize() or
sysconf(_SC_PAGESIZE)?  Especially considering that this document is
for kvm userspace, I'd say it should be common for a hypervisor
process to need this value in tons of other places anyway.. no?
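
E.g. something like this on the userspace side (sketch only, error
handling omitted; ring_bytes and vcpu_fd are whatever the hypervisor
already has at hand):

	#include <unistd.h>
	#include <sys/mman.h>

	long psize = sysconf(_SC_PAGESIZE);
	void *ring = mmap(NULL, ring_bytes, PROT_READ | PROT_WRITE,
			  MAP_SHARED, vcpu_fd,
			  KVM_DIRTY_LOG_PAGE_OFFSET * psize);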

> 
> 
> > @@ -5358,6 +5371,7 @@ CPU when the exception is taken. If this virtual SError is taken to EL1 using
> >  AArch64, this value will be reported in the ISS field of ESR_ELx.
> >  
> >  See KVM_CAP_VCPU_EVENTS for more details.
> > +
> >  8.20 KVM_CAP_HYPERV_SEND_IPI
> >  
> >  Architectures: x86
> > @@ -5365,6 +5379,7 @@ Architectures: x86
> >  This capability indicates that KVM supports paravirtualized Hyper-V IPI send
> >  hypercalls:
> >  HvCallSendSyntheticClusterIpi, HvCallSendSyntheticClusterIpiEx.
> > +
> >  8.21 KVM_CAP_HYPERV_DIRECT_TLBFLUSH
> >  
> >  Architecture: x86
> > @@ -5378,3 +5393,97 @@ handling by KVM (as some KVM hypercall may be mistakenly treated as TLB
> >  flush hypercalls by Hyper-V) so userspace should disable KVM identification
> >  in CPUID and only exposes Hyper-V identification. In this case, guest
> >  thinks it's running on Hyper-V and only use Hyper-V hypercalls.
> > +
> > +8.22 KVM_CAP_DIRTY_LOG_RING
> > +
> > +Architectures: x86
> > +Parameters: args[0] - size of the dirty log ring
> > +
> > +KVM is capable of tracking dirty memory using ring buffers that are
> > +mmaped into userspace; there is one dirty ring per vcpu and one global
> > +ring per vm.
> > +
> > +One dirty ring has the following two major structures:
> > +
> > +struct kvm_dirty_ring {
> > +	u16 dirty_index;
> > +	u16 reset_index;
> > +	u32 size;
> > +	u32 soft_limit;
> > +	spinlock_t lock;
> > +	struct kvm_dirty_gfn *dirty_gfns;
> > +};
> > +
> > +struct kvm_dirty_ring_indexes {
> > +	__u32 avail_index; /* set by kernel */
> > +	__u32 fetch_index; /* set by userspace */
> 
> Sticking these next to each other seems to guarantee cache conflicts.
> 
> Avail/Fetch seems to mimic Virtio's avail/used exactly.  I am not saying
> you must reuse the code really, but I think you should take a hard look
> at e.g. the virtio packed ring structure. We spent a bunch of time
> optimizing it for cache utilization. It seems kernel is the driver,
> making entries available, and userspace the device, using them.
> Again let's not develop a thread about this, but I think
> this is something to consider and discuss in future versions
> of the patches.

I think I completely understand your concern.  We should avoid wasting
time reinventing what is already there.  I'm just afraid that it'll
take even more time to use virtio for this use case, while in the end
we don't really get much benefit out of it (e.g. most of the virtio
features are not used).

Yeah, let's not develop a thread for this topic - I will read more on
virtio before my next post to see whether there's any chance we can
share anything with the virtio ring.

> 
> 
> > +};
> > +
> > +While for each of the dirty entry it's defined as:
> > +
> > +struct kvm_dirty_gfn {
> 
> What does GFN stand for?

It's the guest frame number, iiuc.  I'm not the one who named this,
but that's what I understand.

> 
> > +        __u32 pad;
> > +        __u32 slot; /* as_id | slot_id */
> > +        __u64 offset;
> > +};
> 
> offset of what? a 4K page right? Seems like a waste e.g. for
> hugetlbfs... How about replacing pad with size instead?

As Paolo explained, it's the page frame number of the guest.  IIUC
even for hugetlbfs we track dirty bits at 4K granularity.

> 
> > +
> > +The fields in kvm_dirty_ring will be only internal to KVM itself,
> > +while the fields in kvm_dirty_ring_indexes will be exposed to
> > +userspace to be either read or written.
> 
> I'm not sure what you are trying to say here. kvm_dirty_gfn
> seems to be part of UAPI.

That sentence was talking about kvm_dirty_ring, which is kvm internal
and not exposed to uapi, while kvm_dirty_gfn is exposed to userspace.

> 
> > +
> > +The two indices in the ring buffer are free running counters.
> > +
> > +In pseudocode, processing the ring buffer looks like this:
> > +
> > +	idx = load-acquire(&ring->fetch_index);
> > +	while (idx != ring->avail_index) {
> > +		struct kvm_dirty_gfn *entry;
> > +		entry = &ring->dirty_gfns[idx & (size - 1)];
> > +		...
> > +
> > +		idx++;
> > +	}
> > +	ring->fetch_index = idx;
> > +
> > +Userspace calls KVM_ENABLE_CAP ioctl right after KVM_CREATE_VM ioctl
> > +to enable this capability for the new guest and set the size of the
> > +rings.  It is only allowed before creating any vCPU, and the size of
> > +the ring must be a power of two.
> 
> All these seem like arbitrary limitations to me.

The dependency on vcpu creation is partly because we need to create
the per-vcpu rings, so it's easier if we don't allow the size to
change after that.

> 
> Sizing the ring correctly might prove to be a challenge.
> 
> Thus I think there's value in resizing the rings
> without destroying VCPU.

Do you have an example of when we could use this feature?  My wild
guess is that even if we try hard to allow resizing (assuming that
won't bring more bugs, which I highly doubt...), people may not use it
at all.

The major scenario here is that kvm userspace will be collecting the
dirty bits quickly, so the ring should not really get full easily.
Then the ring size does not really matter much either, as long as it
is bigger than some specific value to avoid vmexits due to a full
ring.

How about we start with the simple approach where we don't allow it to
change?  We can add that when the requirement comes.

> 
> Also, power of two just saves a branch here and there,
> but wastes lots of memory. Just wrap the index around to
> 0 and then users can select any size?

Same as above - can we postpone this until we need it?

> 
> 
> 
> >  The larger the ring buffer, the less
> > +likely the ring is full and the VM is forced to exit to userspace. The
> > +optimal size depends on the workload, but it is recommended that it be
> > +at least 64 KiB (4096 entries).
> 
> OTOH larger buffers put lots of pressure on the system cache.
> 
> > +
> > +After the capability is enabled, userspace can mmap the global ring
> > +buffer (kvm_dirty_gfn[], offset KVM_DIRTY_LOG_PAGE_OFFSET) and the
> > +indexes (kvm_dirty_ring_indexes, offset 0) from the VM file
> > +descriptor.  The per-vcpu dirty ring instead is mmapped when the vcpu
> > +is created, similar to the kvm_run struct (kvm_dirty_ring_indexes
> > +locates inside kvm_run, while kvm_dirty_gfn[] at offset
> > +KVM_DIRTY_LOG_PAGE_OFFSET).
> > +
> > +Just like for dirty page bitmaps, the buffer tracks writes to
> > +all user memory regions for which the KVM_MEM_LOG_DIRTY_PAGES flag was
> > +set in KVM_SET_USER_MEMORY_REGION.  Once a memory region is registered
> > +with the flag set, userspace can start harvesting dirty pages from the
> > +ring buffer.
> > +
> > +To harvest the dirty pages, userspace accesses the mmaped ring buffer
> > +to read the dirty GFNs up to avail_index, and sets the fetch_index
> > +accordingly.  This can be done when the guest is running or paused,
> > +and dirty pages need not be collected all at once.  After processing
> > +one or more entries in the ring buffer, userspace calls the VM ioctl
> > +KVM_RESET_DIRTY_RINGS to notify the kernel that it has updated
> > +fetch_index and to mark those pages clean.  Therefore, the ioctl
> > +must be called *before* reading the content of the dirty pages.
> > +
> > +However, there is a major difference comparing to the
> > +KVM_GET_DIRTY_LOG interface in that when reading the dirty ring from
> > +userspace it's still possible that the kernel has not yet flushed the
> > +hardware dirty buffers into the kernel buffer.  To achieve that, one
> > +needs to kick the vcpu out for a hardware buffer flush (vmexit).
> > +
> > +If one of the ring buffers is full, the guest will exit to userspace
> > +with the exit reason set to KVM_EXIT_DIRTY_LOG_FULL, and the
> > +KVM_RUN ioctl will return -EINTR. Once that happens, userspace
> > +should pause all the vcpus, then harvest all the dirty pages and
> > +rearm the dirty traps. It can unpause the guest after that.
> 
> This last item means that the performance impact of the feature is
> really hard to predict. Can improve some workloads drastically. Or can
> slow some down.
> 
> 
> One solution could be to actually allow using this together with the
> existing bitmap. Userspace can then decide whether it wants to block
> VCPU on ring full, or just record ring full condition and recover by
> bitmap scanning.

That's true, but again allowing mixed use of the two might bring extra
complexity as well (especially after adding KVM_CLEAR_DIRTY_LOG).

My understanding is that normally we only want one of the two,
depending on the major workload and the configuration of the guest.
It's not trivial to try to provide a one-for-all solution.  So again I
would hope we can start simple, then extend when we have better ideas
on how to leverage the two interfaces, and then we can justify whether
it's worth working on that complexity.

> 
> 
> > diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile
> > index b19ef421084d..0acee817adfb 100644
> > --- a/arch/x86/kvm/Makefile
> > +++ b/arch/x86/kvm/Makefile
> > @@ -5,7 +5,8 @@ ccflags-y += -Iarch/x86/kvm
> >  KVM := ../../../virt/kvm
> >  
> >  kvm-y			+= $(KVM)/kvm_main.o $(KVM)/coalesced_mmio.o \
> > -				$(KVM)/eventfd.o $(KVM)/irqchip.o $(KVM)/vfio.o
> > +				$(KVM)/eventfd.o $(KVM)/irqchip.o $(KVM)/vfio.o \
> > +				$(KVM)/dirty_ring.o
> >  kvm-$(CONFIG_KVM_ASYNC_PF)	+= $(KVM)/async_pf.o
> >  
> >  kvm-y			+= x86.o emulate.o i8259.o irq.o lapic.o \
> > diff --git a/include/linux/kvm_dirty_ring.h b/include/linux/kvm_dirty_ring.h
> > new file mode 100644
> > index 000000000000..8335635b7ff7
> > --- /dev/null
> > +++ b/include/linux/kvm_dirty_ring.h
> > @@ -0,0 +1,67 @@
> > +#ifndef KVM_DIRTY_RING_H
> > +#define KVM_DIRTY_RING_H
> > +
> > +/*
> > + * struct kvm_dirty_ring is defined in include/uapi/linux/kvm.h.
> > + *
> > + * dirty_ring:  shared with userspace via mmap. It is the compact list
> > + *              that holds the dirty pages.
> > + * dirty_index: free running counter that points to the next slot in
> > + *              dirty_ring->dirty_gfns  where a new dirty page should go.
> > + * reset_index: free running counter that points to the next dirty page
> > + *              in dirty_ring->dirty_gfns for which dirty trap needs to
> > + *              be reenabled
> > + * size:        size of the compact list, dirty_ring->dirty_gfns
> > + * soft_limit:  when the number of dirty pages in the list reaches this
> > + *              limit, vcpu that owns this ring should exit to userspace
> > + *              to allow userspace to harvest all the dirty pages
> > + * lock:        protects dirty_ring, only in use if this is the global
> > + *              ring
> > + *
> > + * The number of dirty pages in the ring is calculated by,
> > + * dirty_index - reset_index
> > + *
> > + * kernel increments dirty_ring->indices.avail_index after dirty index
> > + * is incremented. When userspace harvests the dirty pages, it increments
> > + * dirty_ring->indices.fetch_index up to dirty_ring->indices.avail_index.
> > + * When kernel reenables dirty traps for the dirty pages, it increments
> > + * reset_index up to dirty_ring->indices.fetch_index.
> > + *
> > + */
> > +struct kvm_dirty_ring {
> > +	u32 dirty_index;
> > +	u32 reset_index;
> > +	u32 size;
> > +	u32 soft_limit;
> > +	spinlock_t lock;
> > +	struct kvm_dirty_gfn *dirty_gfns;
> > +};
> > +
> > +u32 kvm_dirty_ring_get_rsvd_entries(void);
> > +int kvm_dirty_ring_alloc(struct kvm *kvm, struct kvm_dirty_ring *ring);
> > +
> > +/*
> > + * called with kvm->slots_lock held, returns the number of
> > + * processed pages.
> > + */
> > +int kvm_dirty_ring_reset(struct kvm *kvm,
> > +			 struct kvm_dirty_ring *ring,
> > +			 struct kvm_dirty_ring_indexes *indexes);
> > +
> > +/*
> > + * returns 0: successfully pushed
> > + *         1: successfully pushed, soft limit reached,
> > + *            vcpu should exit to userspace
> > + *         -EBUSY: unable to push, dirty ring full.
> > + */
> > +int kvm_dirty_ring_push(struct kvm_dirty_ring *ring,
> > +			struct kvm_dirty_ring_indexes *indexes,
> > +			u32 slot, u64 offset, bool lock);
> > +
> > +/* for use in vm_operations_struct */
> > +struct page *kvm_dirty_ring_get_page(struct kvm_dirty_ring *ring, u32 i);
> > +
> > +void kvm_dirty_ring_free(struct kvm_dirty_ring *ring);
> > +bool kvm_dirty_ring_full(struct kvm_dirty_ring *ring);
> > +
> > +#endif
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index 498a39462ac1..7b747bc9ff3e 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -34,6 +34,7 @@
> >  #include <linux/kvm_types.h>
> >  
> >  #include <asm/kvm_host.h>
> > +#include <linux/kvm_dirty_ring.h>
> >  
> >  #ifndef KVM_MAX_VCPU_ID
> >  #define KVM_MAX_VCPU_ID KVM_MAX_VCPUS
> > @@ -146,6 +147,7 @@ static inline bool is_error_page(struct page *page)
> >  #define KVM_REQ_MMU_RELOAD        (1 | KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
> >  #define KVM_REQ_PENDING_TIMER     2
> >  #define KVM_REQ_UNHALT            3
> > +#define KVM_REQ_DIRTY_RING_FULL   4
> >  #define KVM_REQUEST_ARCH_BASE     8
> >  
> >  #define KVM_ARCH_REQ_FLAGS(nr, flags) ({ \
> > @@ -321,6 +323,7 @@ struct kvm_vcpu {
> >  	bool ready;
> >  	struct kvm_vcpu_arch arch;
> >  	struct dentry *debugfs_dentry;
> > +	struct kvm_dirty_ring dirty_ring;
> >  };
> >  
> >  static inline int kvm_vcpu_exiting_guest_mode(struct kvm_vcpu *vcpu)
> > @@ -501,6 +504,10 @@ struct kvm {
> >  	struct srcu_struct srcu;
> >  	struct srcu_struct irq_srcu;
> >  	pid_t userspace_pid;
> > +	/* Data structure to be exported by mmap(kvm->fd, 0) */
> > +	struct kvm_vm_run *vm_run;
> > +	u32 dirty_ring_size;
> > +	struct kvm_dirty_ring vm_dirty_ring;
> >  };
> >  
> >  #define kvm_err(fmt, ...) \
> > @@ -832,6 +839,8 @@ void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm *kvm,
> >  					gfn_t gfn_offset,
> >  					unsigned long mask);
> >  
> > +void kvm_reset_dirty_gfn(struct kvm *kvm, u32 slot, u64 offset, u64 mask);
> > +
> >  int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm,
> >  				struct kvm_dirty_log *log);
> >  int kvm_vm_ioctl_clear_dirty_log(struct kvm *kvm,
> > @@ -1411,4 +1420,28 @@ int kvm_vm_create_worker_thread(struct kvm *kvm, kvm_vm_thread_fn_t thread_fn,
> >  				uintptr_t data, const char *name,
> >  				struct task_struct **thread_ptr);
> >  
> > +/*
> > + * This defines how many reserved entries we want to keep before we
> > + * kick the vcpu to the userspace to avoid dirty ring full.  This
> > + * value can be tuned to higher if e.g. PML is enabled on the host.
> > + */
> > +#define  KVM_DIRTY_RING_RSVD_ENTRIES  64
> > +
> > +/* Max number of entries allowed for each kvm dirty ring */
> > +#define  KVM_DIRTY_RING_MAX_ENTRIES  65536
> > +
> > +/*
> > + * Arch needs to define these macro after implementing the dirty ring
> > + * feature.  KVM_DIRTY_LOG_PAGE_OFFSET should be defined as the
> > + * starting page offset of the dirty ring structures,
> 
> Confused. Offset where? You set a default for everyone - where does arch
> want to override it?

If the arch defines KVM_DIRTY_LOG_PAGE_OFFSET then the fallback below
becomes a no-op; please see the #ifndef at [1].

> 
> > while
> > + * KVM_DIRTY_RING_VERSION should be defined as >=1.  By default, this
> > + * feature is off on all archs.
> > + */
> > +#ifndef KVM_DIRTY_LOG_PAGE_OFFSET

[1]

> > +#define KVM_DIRTY_LOG_PAGE_OFFSET 0
> > +#endif
> > +#ifndef KVM_DIRTY_RING_VERSION
> > +#define KVM_DIRTY_RING_VERSION 0
> > +#endif
> 
> One way versioning, with no bits and negotiation
> will make it hard to change down the road.
> what's wrong with existing KVM capabilities that
> you feel there's a need for dedicated versioning for this?

Frankly speaking I don't even think it'll change in the near
future.. :)

Yeah, KVM capability versioning could work too.  Here we can also
return a zero just like most of the caps (or KVM_DIRTY_LOG_PAGE_OFFSET
as in the original patchset, but that doesn't really help either
because it's already defined in uapi), but I just don't see how that
helps...  So I returned a version number just in case we'd like to
change the layout some day and don't want to bother introducing
another cap bit for the same feature (like
KVM_CAP_MANUAL_DIRTY_LOG_PROTECT and
KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2).

> 
> > +
> >  #endif
> > diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h
> > index 1c88e69db3d9..d9d03eea145a 100644
> > --- a/include/linux/kvm_types.h
> > +++ b/include/linux/kvm_types.h
> > @@ -11,6 +11,7 @@ struct kvm_irq_routing_table;
> >  struct kvm_memory_slot;
> >  struct kvm_one_reg;
> >  struct kvm_run;
> > +struct kvm_vm_run;
> >  struct kvm_userspace_memory_region;
> >  struct kvm_vcpu;
> >  struct kvm_vcpu_init;
> > diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> > index e6f17c8e2dba..0b88d76d6215 100644
> > --- a/include/uapi/linux/kvm.h
> > +++ b/include/uapi/linux/kvm.h
> > @@ -236,6 +236,7 @@ struct kvm_hyperv_exit {
> >  #define KVM_EXIT_IOAPIC_EOI       26
> >  #define KVM_EXIT_HYPERV           27
> >  #define KVM_EXIT_ARM_NISV         28
> > +#define KVM_EXIT_DIRTY_RING_FULL  29
> >  
> >  /* For KVM_EXIT_INTERNAL_ERROR */
> >  /* Emulate instruction failed. */
> > @@ -247,6 +248,11 @@ struct kvm_hyperv_exit {
> >  /* Encounter unexpected vm-exit reason */
> >  #define KVM_INTERNAL_ERROR_UNEXPECTED_EXIT_REASON	4
> >  
> > +struct kvm_dirty_ring_indexes {
> > +	__u32 avail_index; /* set by kernel */
> > +	__u32 fetch_index; /* set by userspace */
> > +};
> > +
> >  /* for KVM_RUN, returned by mmap(vcpu_fd, offset=0) */
> >  struct kvm_run {
> >  	/* in */
> > @@ -421,6 +427,13 @@ struct kvm_run {
> >  		struct kvm_sync_regs regs;
> >  		char padding[SYNC_REGS_SIZE_BYTES];
> >  	} s;
> > +
> > +	struct kvm_dirty_ring_indexes vcpu_ring_indexes;
> > +};
> > +
> > +/* Returned by mmap(kvm->fd, offset=0) */
> > +struct kvm_vm_run {
> > +	struct kvm_dirty_ring_indexes vm_ring_indexes;
> >  };
> >  
> >  /* for KVM_REGISTER_COALESCED_MMIO / KVM_UNREGISTER_COALESCED_MMIO */
> > @@ -1009,6 +1022,7 @@ struct kvm_ppc_resize_hpt {
> >  #define KVM_CAP_PPC_GUEST_DEBUG_SSTEP 176
> >  #define KVM_CAP_ARM_NISV_TO_USER 177
> >  #define KVM_CAP_ARM_INJECT_EXT_DABT 178
> > +#define KVM_CAP_DIRTY_LOG_RING 179
> >  
> >  #ifdef KVM_CAP_IRQ_ROUTING
> >  
> > @@ -1472,6 +1486,9 @@ struct kvm_enc_region {
> >  /* Available with KVM_CAP_ARM_SVE */
> >  #define KVM_ARM_VCPU_FINALIZE	  _IOW(KVMIO,  0xc2, int)
> >  
> > +/* Available with KVM_CAP_DIRTY_LOG_RING */
> > +#define KVM_RESET_DIRTY_RINGS     _IO(KVMIO, 0xc3)
> > +
> >  /* Secure Encrypted Virtualization command */
> >  enum sev_cmd_id {
> >  	/* Guest initialization commands */
> > @@ -1622,4 +1639,23 @@ struct kvm_hyperv_eventfd {
> >  #define KVM_HYPERV_CONN_ID_MASK		0x00ffffff
> >  #define KVM_HYPERV_EVENTFD_DEASSIGN	(1 << 0)
> >  
> > +/*
> > + * The following are the requirements for supporting dirty log ring
> > + * (by enabling KVM_DIRTY_LOG_PAGE_OFFSET).
> > + *
> > + * 1. Memory accesses by KVM should call kvm_vcpu_write_* instead
> > + *    of kvm_write_* so that the global dirty ring is not filled up
> > + *    too quickly.
> > + * 2. kvm_arch_mmu_enable_log_dirty_pt_masked should be defined for
> > + *    enabling dirty logging.
> > + * 3. There should not be a separate step to synchronize hardware
> > + *    dirty bitmap with KVM's.
> > + */
> > +
> > +struct kvm_dirty_gfn {
> > +	__u32 pad;
> > +	__u32 slot;
> > +	__u64 offset;
> > +};
> > +
> >  #endif /* __LINUX_KVM_H */
> > diff --git a/virt/kvm/dirty_ring.c b/virt/kvm/dirty_ring.c
> > new file mode 100644
> > index 000000000000..9264891f3c32
> > --- /dev/null
> > +++ b/virt/kvm/dirty_ring.c
> > @@ -0,0 +1,156 @@
> > +#include <linux/kvm_host.h>
> > +#include <linux/kvm.h>
> > +#include <linux/vmalloc.h>
> > +#include <linux/kvm_dirty_ring.h>
> > +
> > +u32 kvm_dirty_ring_get_rsvd_entries(void)
> > +{
> > +	return KVM_DIRTY_RING_RSVD_ENTRIES + kvm_cpu_dirty_log_size();
> > +}
> > +
> > +int kvm_dirty_ring_alloc(struct kvm *kvm, struct kvm_dirty_ring *ring)
> > +{
> > +	u32 size = kvm->dirty_ring_size;
> > +
> > +	ring->dirty_gfns = vmalloc(size);
> 
> So 1/2 a megabyte of kernel memory per VM that userspace locks up.
> Do we really have to though? Why not get a userspace pointer,
> write it with copy to user, and sidestep all this?

I'd say it won't be a big issue to lock 1/2M of host mem for a VM...

Also note that if the dirty ring is enabled, I plan to drop the
dirty_bitmap in the next post.  The old kvm->dirty_bitmap takes
$GUEST_MEM/32K*2 bytes.  E.g., for a 64G guest that's 64G/32K*2=4M.
With dirty rings for 8 vcpus it could be 64K*8=0.5M, which is actually
even less memory.
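
To spell out the arithmetic (assuming 4K pages, 16-byte kvm_dirty_gfn
entries, and the recommended 4096-entry ring per vcpu):

        bitmap: 2 * (64G / 4K pages) / 8 bits-per-byte      = 4 MiB
        rings:  8 vcpus * 4096 entries * 16 bytes per entry = 512 KiB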

> 
> > +	if (!ring->dirty_gfns)
> > +		return -ENOMEM;
> > +	memset(ring->dirty_gfns, 0, size);
> > +
> > +	ring->size = size / sizeof(struct kvm_dirty_gfn);
> > +	ring->soft_limit =
> > +	    (kvm->dirty_ring_size / sizeof(struct kvm_dirty_gfn)) -
> > +	    kvm_dirty_ring_get_rsvd_entries();
> > +	ring->dirty_index = 0;
> > +	ring->reset_index = 0;
> > +	spin_lock_init(&ring->lock);
> > +
> > +	return 0;
> > +}
> > +
> > +int kvm_dirty_ring_reset(struct kvm *kvm,
> > +			 struct kvm_dirty_ring *ring,
> > +			 struct kvm_dirty_ring_indexes *indexes)
> > +{
> > +	u32 cur_slot, next_slot;
> > +	u64 cur_offset, next_offset;
> > +	unsigned long mask;
> > +	u32 fetch;
> > +	int count = 0;
> > +	struct kvm_dirty_gfn *entry;
> > +
> > +	fetch = READ_ONCE(indexes->fetch_index);
> > +	if (fetch == ring->reset_index)
> > +		return 0;
> > +
> > +	entry = &ring->dirty_gfns[ring->reset_index & (ring->size - 1)];
> > +	/*
> > +	 * The ring buffer is shared with userspace, which might mmap
> > +	 * it and concurrently modify slot and offset.  Userspace must
> > +	 * not be trusted!  READ_ONCE prevents the compiler from changing
> > +	 * the values after they've been range-checked (the checks are
> > +	 * in kvm_reset_dirty_gfn).
> 
> What it doesn't is prevent speculative attacks.  That's why things like
> copy from user have a speculation barrier.  Instead of worrying about
> that, unless it's really critical, I think you'd do well do just use
> copy to/from user.

IMHO I would really hope this data stays resident rather than being
swapped out of memory, just like what we did with kvm->dirty_bitmap...
it's on the hot path of mmu page faults, and we could even be holding
the mmu lock when copy_to_user() page faults.  But indeed I have no
experience with avoiding speculative attacks; suggestions would be
greatly welcomed on that.  In our case we do (index & (size - 1)), so
is it still susceptible to speculative attacks?
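
If the masking alone is not enough, I guess the usual pattern would be
to also clamp the untrusted values right after the range checks with
array_index_nospec() from <linux/nospec.h>, e.g. something like this
in kvm_reset_dirty_gfn() (untested, and I'm not sure whether it's
actually needed here):

        as_id = slot >> 16;
        id = (u16)slot;
        if (as_id >= KVM_ADDRESS_SPACE_NUM || id >= KVM_USER_MEM_SLOTS)
                return;
        /* Clamp the indices under speculation before using them */
        as_id = array_index_nospec(as_id, KVM_ADDRESS_SPACE_NUM);
        id = array_index_nospec(id, KVM_USER_MEM_SLOTS);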

> 
> > +	 */
> > +	smp_read_barrier_depends();
> 
> What depends on what here? Looks suspicious ...

Hmm, I think maybe it can be removed, because the entry pointer
dereference below should already act as an ordering constraint?

> 
> > +	cur_slot = READ_ONCE(entry->slot);
> > +	cur_offset = READ_ONCE(entry->offset);
> > +	mask = 1;
> > +	count++;
> > +	ring->reset_index++;
> > +	while (ring->reset_index != fetch) {
> > +		entry = &ring->dirty_gfns[ring->reset_index & (ring->size - 1)];
> > +		smp_read_barrier_depends();
> 
> same concerns here
> 
> > +		next_slot = READ_ONCE(entry->slot);
> > +		next_offset = READ_ONCE(entry->offset);
> > +		ring->reset_index++;
> > +		count++;
> > +		/*
> > +		 * Try to coalesce the reset operations when the guest is
> > +		 * scanning pages in the same slot.
> 
> what does guest scanning mean?

My wild guess is that it means the guest is accessing pages
contiguously, so the dirty gfns end up contiguous too.  Anyway I agree
it's not clear; I can try to rephrase.
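
Per my reading of the code, the point is simply that consecutive (or
nearly consecutive) offsets within the same slot get folded into one
reset call, e.g.:

        /* ring entries (same slot): offset 100, 101, 103           */
        /* -> single call: kvm_reset_dirty_gfn(kvm, slot, 100, 0xb) */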

> 
> > +		 */
> > +		if (next_slot == cur_slot) {
> > +			int delta = next_offset - cur_offset;
> > +
> > +			if (delta >= 0 && delta < BITS_PER_LONG) {
> > +				mask |= 1ull << delta;
> > +				continue;
> > +			}
> > +
> > +			/* Backwards visit, careful about overflows!  */
> > +			if (delta > -BITS_PER_LONG && delta < 0 &&
> > +			    (mask << -delta >> -delta) == mask) {
> > +				cur_offset = next_offset;
> > +				mask = (mask << -delta) | 1;
> > +				continue;
> > +			}
> > +		}
> > +		kvm_reset_dirty_gfn(kvm, cur_slot, cur_offset, mask);
> > +		cur_slot = next_slot;
> > +		cur_offset = next_offset;
> > +		mask = 1;
> > +	}
> > +	kvm_reset_dirty_gfn(kvm, cur_slot, cur_offset, mask);
> > +
> > +	return count;
> > +}
> > +
> > +static inline u32 kvm_dirty_ring_used(struct kvm_dirty_ring *ring)
> > +{
> > +	return ring->dirty_index - ring->reset_index;
> > +}
> > +
> > +bool kvm_dirty_ring_full(struct kvm_dirty_ring *ring)
> > +{
> > +	return kvm_dirty_ring_used(ring) >= ring->size;
> > +}
> > +
> > +/*
> > + * Returns:
> > + *   >0 if we should kick the vcpu out,
> > + *   =0 if the gfn pushed successfully, or,
> > + *   <0 if error (e.g. ring full)
> > + */
> > +int kvm_dirty_ring_push(struct kvm_dirty_ring *ring,
> > +			struct kvm_dirty_ring_indexes *indexes,
> > +			u32 slot, u64 offset, bool lock)
> > +{
> > +	int ret;
> > +	struct kvm_dirty_gfn *entry;
> > +
> > +	if (lock)
> > +		spin_lock(&ring->lock);
> 
> what's the story around locking here? Why is it safe
> not to take the lock sometimes?

kvm_dirty_ring_push() is called with lock==true only when the per-vm
ring is used.  For the per-vcpu ring, pushes can only happen from the
vcpu context itself, so we don't need the lock there (and
kvm_dirty_ring_push() is called with lock==false).

> 
> > +
> > +	if (kvm_dirty_ring_full(ring)) {
> > +		ret = -EBUSY;
> > +		goto out;
> > +	}
> > +
> > +	entry = &ring->dirty_gfns[ring->dirty_index & (ring->size - 1)];
> > +	entry->slot = slot;
> > +	entry->offset = offset;
> > +	smp_wmb();
> > +	ring->dirty_index++;
> > +	WRITE_ONCE(indexes->avail_index, ring->dirty_index);
> > +	ret = kvm_dirty_ring_used(ring) >= ring->soft_limit;
> > +	pr_info("%s: slot %u offset %llu used %u\n",
> > +		__func__, slot, offset, kvm_dirty_ring_used(ring));
> > +
> > +out:
> > +	if (lock)
> > +		spin_unlock(&ring->lock);
> > +
> > +	return ret;
> > +}
> > +
> > +struct page *kvm_dirty_ring_get_page(struct kvm_dirty_ring *ring, u32 i)
> > +{
> > +	return vmalloc_to_page((void *)ring->dirty_gfns + i * PAGE_SIZE);
> > +}
> > +
> > +void kvm_dirty_ring_free(struct kvm_dirty_ring *ring)
> > +{
> > +	if (ring->dirty_gfns) {
> > +		vfree(ring->dirty_gfns);
> > +		ring->dirty_gfns = NULL;
> > +	}
> > +}
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index 681452d288cd..8642c977629b 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -64,6 +64,8 @@
> >  #define CREATE_TRACE_POINTS
> >  #include <trace/events/kvm.h>
> >  
> > +#include <linux/kvm_dirty_ring.h>
> > +
> >  /* Worst case buffer size needed for holding an integer. */
> >  #define ITOA_MAX_LEN 12
> >  
> > @@ -149,6 +151,10 @@ static void mark_page_dirty_in_slot(struct kvm *kvm,
> >  				    struct kvm_vcpu *vcpu,
> >  				    struct kvm_memory_slot *memslot,
> >  				    gfn_t gfn);
> > +static void mark_page_dirty_in_ring(struct kvm *kvm,
> > +				    struct kvm_vcpu *vcpu,
> > +				    struct kvm_memory_slot *slot,
> > +				    gfn_t gfn);
> >  
> >  __visible bool kvm_rebooting;
> >  EXPORT_SYMBOL_GPL(kvm_rebooting);
> > @@ -359,11 +365,22 @@ int kvm_vcpu_init(struct kvm_vcpu *vcpu, struct kvm *kvm, unsigned id)
> >  	vcpu->preempted = false;
> >  	vcpu->ready = false;
> >  
> > +	if (kvm->dirty_ring_size) {
> > +		r = kvm_dirty_ring_alloc(vcpu->kvm, &vcpu->dirty_ring);
> > +		if (r) {
> > +			kvm->dirty_ring_size = 0;
> > +			goto fail_free_run;
> > +		}
> > +	}
> > +
> >  	r = kvm_arch_vcpu_init(vcpu);
> >  	if (r < 0)
> > -		goto fail_free_run;
> > +		goto fail_free_ring;
> >  	return 0;
> >  
> > +fail_free_ring:
> > +	if (kvm->dirty_ring_size)
> > +		kvm_dirty_ring_free(&vcpu->dirty_ring);
> >  fail_free_run:
> >  	free_page((unsigned long)vcpu->run);
> >  fail:
> > @@ -381,6 +398,8 @@ void kvm_vcpu_uninit(struct kvm_vcpu *vcpu)
> >  	put_pid(rcu_dereference_protected(vcpu->pid, 1));
> >  	kvm_arch_vcpu_uninit(vcpu);
> >  	free_page((unsigned long)vcpu->run);
> > +	if (vcpu->kvm->dirty_ring_size)
> > +		kvm_dirty_ring_free(&vcpu->dirty_ring);
> >  }
> >  EXPORT_SYMBOL_GPL(kvm_vcpu_uninit);
> >  
> > @@ -690,6 +709,7 @@ static struct kvm *kvm_create_vm(unsigned long type)
> >  	struct kvm *kvm = kvm_arch_alloc_vm();
> >  	int r = -ENOMEM;
> >  	int i;
> > +	struct page *page;
> >  
> >  	if (!kvm)
> >  		return ERR_PTR(-ENOMEM);
> > @@ -705,6 +725,14 @@ static struct kvm *kvm_create_vm(unsigned long type)
> >  
> >  	BUILD_BUG_ON(KVM_MEM_SLOTS_NUM > SHRT_MAX);
> >  
> > +	page = alloc_page(GFP_KERNEL | __GFP_ZERO);
> > +	if (!page) {
> > +		r = -ENOMEM;
> > +		goto out_err_alloc_page;
> > +	}
> > +	kvm->vm_run = page_address(page);
> 
> So 4K with just 8 bytes used. Not as bad as 1/2Mbyte for the ring but
> still. What is wrong with just a pointer and calling put_user?

I want to make it the starting point for sharing per-vm fields
between userspace and the kernel, just like kvm_run does per-vcpu.

IMHO it would be awkward if we always had to introduce a new interface
just to take a pointer to a userspace buffer and cache it...  I'd say
so far I like the kvm_run-style design because it's efficient, easy to
use, and easy to extend.
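
E.g. from the userspace side it would simply be (a sketch, following
the documentation above; vm_fd, page_size and ring_size are whatever
the VMM already has at hand):

        struct kvm_vm_run *vm_run;
        struct kvm_dirty_gfn *vm_ring;

        /* Per-vm shared fields, including vm_ring_indexes */
        vm_run = mmap(NULL, page_size, PROT_READ | PROT_WRITE,
                      MAP_SHARED, vm_fd, 0);
        /* The global (per-vm) dirty ring itself */
        vm_ring = mmap(NULL, ring_size, PROT_READ | PROT_WRITE,
                       MAP_SHARED, vm_fd,
                       KVM_DIRTY_LOG_PAGE_OFFSET * page_size);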

> 
> > +	BUILD_BUG_ON(sizeof(struct kvm_vm_run) > PAGE_SIZE);
> > +
> >  	if (init_srcu_struct(&kvm->srcu))
> >  		goto out_err_no_srcu;
> >  	if (init_srcu_struct(&kvm->irq_srcu))
> > @@ -775,6 +803,9 @@ static struct kvm *kvm_create_vm(unsigned long type)
> >  out_err_no_irq_srcu:
> >  	cleanup_srcu_struct(&kvm->srcu);
> >  out_err_no_srcu:
> > +	free_page((unsigned long)page);
> > +	kvm->vm_run = NULL;
> > +out_err_alloc_page:
> >  	kvm_arch_free_vm(kvm);
> >  	mmdrop(current->mm);
> >  	return ERR_PTR(r);
> > @@ -800,6 +831,15 @@ static void kvm_destroy_vm(struct kvm *kvm)
> >  	int i;
> >  	struct mm_struct *mm = kvm->mm;
> >  
> > +	if (kvm->dirty_ring_size) {
> > +		kvm_dirty_ring_free(&kvm->vm_dirty_ring);
> > +	}
> > +
> > +	if (kvm->vm_run) {
> > +		free_page((unsigned long)kvm->vm_run);
> > +		kvm->vm_run = NULL;
> > +	}
> > +
> >  	kvm_uevent_notify_change(KVM_EVENT_DESTROY_VM, kvm);
> >  	kvm_destroy_vm_debugfs(kvm);
> >  	kvm_arch_sync_events(kvm);
> > @@ -2301,7 +2341,7 @@ static void mark_page_dirty_in_slot(struct kvm *kvm,
> >  {
> >  	if (memslot && memslot->dirty_bitmap) {
> >  		unsigned long rel_gfn = gfn - memslot->base_gfn;
> > -
> > +		mark_page_dirty_in_ring(kvm, vcpu, memslot, gfn);
> >  		set_bit_le(rel_gfn, memslot->dirty_bitmap);
> >  	}
> >  }
> > @@ -2649,6 +2689,13 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me, bool yield_to_kernel_mode)
> >  }
> >  EXPORT_SYMBOL_GPL(kvm_vcpu_on_spin);
> >  
> > +static bool kvm_fault_in_dirty_ring(struct kvm *kvm, struct vm_fault *vmf)
> > +{
> > +	return (vmf->pgoff >= KVM_DIRTY_LOG_PAGE_OFFSET) &&
> > +	    (vmf->pgoff < KVM_DIRTY_LOG_PAGE_OFFSET +
> > +	     kvm->dirty_ring_size / PAGE_SIZE);
> > +}
> > +
> >  static vm_fault_t kvm_vcpu_fault(struct vm_fault *vmf)
> >  {
> >  	struct kvm_vcpu *vcpu = vmf->vma->vm_file->private_data;
> > @@ -2664,6 +2711,10 @@ static vm_fault_t kvm_vcpu_fault(struct vm_fault *vmf)
> >  	else if (vmf->pgoff == KVM_COALESCED_MMIO_PAGE_OFFSET)
> >  		page = virt_to_page(vcpu->kvm->coalesced_mmio_ring);
> >  #endif
> > +	else if (kvm_fault_in_dirty_ring(vcpu->kvm, vmf))
> > +		page = kvm_dirty_ring_get_page(
> > +		    &vcpu->dirty_ring,
> > +		    vmf->pgoff - KVM_DIRTY_LOG_PAGE_OFFSET);
> >  	else
> >  		return kvm_arch_vcpu_fault(vcpu, vmf);
> >  	get_page(page);
> > @@ -3259,12 +3310,162 @@ static long kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
> >  #endif
> >  	case KVM_CAP_NR_MEMSLOTS:
> >  		return KVM_USER_MEM_SLOTS;
> > +	case KVM_CAP_DIRTY_LOG_RING:
> > +		/* Version will be zero if arch didn't implement it */
> > +		return KVM_DIRTY_RING_VERSION;
> >  	default:
> >  		break;
> >  	}
> >  	return kvm_vm_ioctl_check_extension(kvm, arg);
> >  }
> >  
> > +static void mark_page_dirty_in_ring(struct kvm *kvm,
> > +				    struct kvm_vcpu *vcpu,
> > +				    struct kvm_memory_slot *slot,
> > +				    gfn_t gfn)
> > +{
> > +	u32 as_id = 0;
> > +	u64 offset;
> > +	int ret;
> > +	struct kvm_dirty_ring *ring;
> > +	struct kvm_dirty_ring_indexes *indexes;
> > +	bool is_vm_ring;
> > +
> > +	if (!kvm->dirty_ring_size)
> > +		return;
> > +
> > +	offset = gfn - slot->base_gfn;
> > +
> > +	if (vcpu) {
> > +		as_id = kvm_arch_vcpu_memslots_id(vcpu);
> > +	} else {
> > +		as_id = 0;
> > +		vcpu = kvm_get_running_vcpu();
> > +	}
> > +
> > +	if (vcpu) {
> > +		ring = &vcpu->dirty_ring;
> > +		indexes = &vcpu->run->vcpu_ring_indexes;
> > +		is_vm_ring = false;
> > +	} else {
> > +		/*
> > +		 * Put onto per vm ring because no vcpu context.  Kick
> > +		 * vcpu0 if ring is full.
> 
> What about tasks on vcpu 0? Do guests realize it's a bad idea to put
> critical tasks there, they will be penalized disproportionally?

Reasonable question.  So far we can't avoid it, because a vcpu exit
is the event mechanism we have to say "hey, please collect the dirty
bits".  Maybe there is a better way, but I'll need to rethink all of
this...

> 
> > +		 */
> > +		vcpu = kvm->vcpus[0];
> > +		ring = &kvm->vm_dirty_ring;
> > +		indexes = &kvm->vm_run->vm_ring_indexes;
> > +		is_vm_ring = true;
> > +	}
> > +
> > +	ret = kvm_dirty_ring_push(ring, indexes,
> > +				  (as_id << 16)|slot->id, offset,
> > +				  is_vm_ring);
> > +	if (ret < 0) {
> > +		if (is_vm_ring)
> > +			pr_warn_once("vcpu %d dirty log overflow\n",
> > +				     vcpu->vcpu_id);
> > +		else
> > +			pr_warn_once("per-vm dirty log overflow\n");
> > +		return;
> > +	}
> > +
> > +	if (ret)
> > +		kvm_make_request(KVM_REQ_DIRTY_RING_FULL, vcpu);
> > +}
> > +
> > +void kvm_reset_dirty_gfn(struct kvm *kvm, u32 slot, u64 offset, u64 mask)
> > +{
> > +	struct kvm_memory_slot *memslot;
> > +	int as_id, id;
> > +
> > +	as_id = slot >> 16;
> > +	id = (u16)slot;
> > +	if (as_id >= KVM_ADDRESS_SPACE_NUM || id >= KVM_USER_MEM_SLOTS)
> > +		return;
> > +
> > +	memslot = id_to_memslot(__kvm_memslots(kvm, as_id), id);
> > +	if (offset >= memslot->npages)
> > +		return;
> > +
> > +	spin_lock(&kvm->mmu_lock);
> > +	/* FIXME: we should use a single AND operation, but there is no
> > +	 * applicable atomic API.
> > +	 */
> > +	while (mask) {
> > +		clear_bit_le(offset + __ffs(mask), memslot->dirty_bitmap);
> > +		mask &= mask - 1;
> > +	}
> > +
> > +	kvm_arch_mmu_enable_log_dirty_pt_masked(kvm, memslot, offset, mask);
> > +	spin_unlock(&kvm->mmu_lock);
> > +}
> > +
> > +static int kvm_vm_ioctl_enable_dirty_log_ring(struct kvm *kvm, u32 size)
> > +{
> > +	int r;
> > +
> > +	/* the size should be power of 2 */
> > +	if (!size || (size & (size - 1)))
> > +		return -EINVAL;
> > +
> > +	/* Should be bigger to keep the reserved entries, or a page */
> > +	if (size < kvm_dirty_ring_get_rsvd_entries() *
> > +	    sizeof(struct kvm_dirty_gfn) || size < PAGE_SIZE)
> > +		return -EINVAL;
> > +
> > +	if (size > KVM_DIRTY_RING_MAX_ENTRIES *
> > +	    sizeof(struct kvm_dirty_gfn))
> > +		return -E2BIG;
> 
> KVM_DIRTY_RING_MAX_ENTRIES is not part of UAPI.
> So how does userspace know what's legal?
> Do you expect it to just try?

Yep that's what I thought. :)

Please grep E2BIG in QEMU repo target/i386/kvm.c...  won't be hard to
do imho..
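
E.g. something along these lines on the userspace side (hypothetical,
untested):

        uint64_t size = 1ULL << 22;     /* start with some upper guess */

        for (;;) {
                struct kvm_enable_cap cap = {
                        .cap = KVM_CAP_DIRTY_LOG_RING,
                        .args[0] = size,
                };

                if (!ioctl(vm_fd, KVM_ENABLE_CAP, &cap))
                        break;                  /* enabled */
                if (errno != E2BIG || size <= 4096)
                        err(1, "KVM_ENABLE_CAP(KVM_CAP_DIRTY_LOG_RING)");
                size >>= 1;                     /* too big, retry smaller */
        }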

> More likely it will just copy the number from kernel and can
> never ever make it smaller.

Not sure, but I can probably move KVM_DIRTY_RING_MAX_ENTRIES to uapi
too.

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH RFC 04/15] KVM: Implement ring-based dirty memory tracking
  2019-12-11 20:59     ` Peter Xu
@ 2019-12-11 22:57       ` Michael S. Tsirkin
  2019-12-12  0:08         ` Paolo Bonzini
  0 siblings, 1 reply; 123+ messages in thread
From: Michael S. Tsirkin @ 2019-12-11 22:57 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-kernel, kvm, Sean Christopherson, Paolo Bonzini,
	Dr . David Alan Gilbert, Vitaly Kuznetsov

On Wed, Dec 11, 2019 at 03:59:52PM -0500, Peter Xu wrote:
> On Wed, Dec 11, 2019 at 07:53:48AM -0500, Michael S. Tsirkin wrote:
> > On Fri, Nov 29, 2019 at 04:34:54PM -0500, Peter Xu wrote:
> > > This patch is heavily based on previous work from Lei Cao
> > > <lei.cao@stratus.com> and Paolo Bonzini <pbonzini@redhat.com>. [1]
> > > 
> > > KVM currently uses large bitmaps to track dirty memory.  These bitmaps
> > > are copied to userspace when userspace queries KVM for its dirty page
> > > information.  The use of bitmaps is mostly sufficient for live
> > > migration, as large parts of memory are be dirtied from one log-dirty
> > > pass to another.  However, in a checkpointing system, the number of
> > > dirty pages is small and in fact it is often bounded---the VM is
> > > paused when it has dirtied a pre-defined number of pages. Traversing a
> > > large, sparsely populated bitmap to find set bits is time-consuming,
> > > as is copying the bitmap to user-space.
> > > 
> > > A similar issue will be there for live migration when the guest memory
> > > is huge while the page dirty procedure is trivial.  In that case for
> > > each dirty sync we need to pull the whole dirty bitmap to userspace
> > > and analyse every bit even if it's mostly zeros.
> > > 
> > > The preferred data structure for above scenarios is a dense list of
> > > guest frame numbers (GFN).  This patch series stores the dirty list in
> > > kernel memory that can be memory mapped into userspace to allow speedy
> > > harvesting.
> > > 
> > > We defined two new data structures:
> > > 
> > >   struct kvm_dirty_ring;
> > >   struct kvm_dirty_ring_indexes;
> > > 
> > > Firstly, kvm_dirty_ring is defined to represent a ring of dirty
> > > pages.  When dirty tracking is enabled, we can push dirty gfn onto the
> > > ring.
> > > 
> > > Secondly, kvm_dirty_ring_indexes is defined to represent the
> > > user/kernel interface of each ring.  Currently it contains two
> > > indexes: (1) avail_index represents where we should push our next
> > > PFN (written by kernel), while (2) fetch_index represents where the
> > > userspace should fetch the next dirty PFN (written by userspace).
> > > 
> > > One complete ring is composed by one kvm_dirty_ring plus its
> > > corresponding kvm_dirty_ring_indexes.
> > > 
> > > Currently, we have N+1 rings for each VM of N vcpus:
> > > 
> > >   - for each vcpu, we have 1 per-vcpu dirty ring,
> > >   - for each vm, we have 1 per-vm dirty ring
> > > 
> > > Please refer to the documentation update in this patch for more
> > > details.
> > > 
> > > Note that this patch implements the core logic of dirty ring buffer.
> > > It's still disabled for all archs for now.  Also, we'll address some
> > > of the other issues in follow up patches before it's firstly enabled
> > > on x86.
> > > 
> > > [1] https://patchwork.kernel.org/patch/10471409/
> > > 
> > > Signed-off-by: Lei Cao <lei.cao@stratus.com>
> > > Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> > > Signed-off-by: Peter Xu <peterx@redhat.com>
> > 
> > 
> > Thanks, that's interesting.
> 
> Hi, Michael,
> 
> Thanks for reading the series.
> 
> > 
> > > ---
> > >  Documentation/virt/kvm/api.txt | 109 +++++++++++++++
> > >  arch/x86/kvm/Makefile          |   3 +-
> > >  include/linux/kvm_dirty_ring.h |  67 +++++++++
> > >  include/linux/kvm_host.h       |  33 +++++
> > >  include/linux/kvm_types.h      |   1 +
> > >  include/uapi/linux/kvm.h       |  36 +++++
> > >  virt/kvm/dirty_ring.c          | 156 +++++++++++++++++++++
> > >  virt/kvm/kvm_main.c            | 240 ++++++++++++++++++++++++++++++++-
> > >  8 files changed, 642 insertions(+), 3 deletions(-)
> > >  create mode 100644 include/linux/kvm_dirty_ring.h
> > >  create mode 100644 virt/kvm/dirty_ring.c
> > > 
> > > diff --git a/Documentation/virt/kvm/api.txt b/Documentation/virt/kvm/api.txt
> > > index 49183add44e7..fa622c9a2eb8 100644
> > > --- a/Documentation/virt/kvm/api.txt
> > > +++ b/Documentation/virt/kvm/api.txt
> > > @@ -231,6 +231,7 @@ Based on their initialization different VMs may have different capabilities.
> > >  It is thus encouraged to use the vm ioctl to query for capabilities (available
> > >  with KVM_CAP_CHECK_EXTENSION_VM on the vm fd)
> > >  
> > > +
> > >  4.5 KVM_GET_VCPU_MMAP_SIZE
> > >  
> > >  Capability: basic
> > > @@ -243,6 +244,18 @@ The KVM_RUN ioctl (cf.) communicates with userspace via a shared
> > >  memory region.  This ioctl returns the size of that region.  See the
> > >  KVM_RUN documentation for details.
> > >  
> > > +Besides the size of the KVM_RUN communication region, other areas of
> > > +the VCPU file descriptor can be mmap-ed, including:
> > > +
> > > +- if KVM_CAP_COALESCED_MMIO is available, a page at
> > > +  KVM_COALESCED_MMIO_PAGE_OFFSET * PAGE_SIZE; for historical reasons,
> > > +  this page is included in the result of KVM_GET_VCPU_MMAP_SIZE.
> > > +  KVM_CAP_COALESCED_MMIO is not documented yet.
> > > +
> > > +- if KVM_CAP_DIRTY_LOG_RING is available, a number of pages at
> > > +  KVM_DIRTY_LOG_PAGE_OFFSET * PAGE_SIZE.  For more information on
> > > +  KVM_CAP_DIRTY_LOG_RING, see section 8.3.
> > > +
> > >  
> > >  4.6 KVM_SET_MEMORY_REGION
> > >  
> > 
> > PAGE_SIZE being which value? It's not always trivial for
> > userspace to know what's the PAGE_SIZE for the kernel ...
> 
> I thought it can be easily fetched from getpagesize() or
> sysconf(PAGE_SIZE)?  Especially considering that the document should
> be for kvm userspace, I'd say it should be common that a hypervisor
> process will need to know this probably in other tons of places.. no?
> 
> > 
> > 
> > > @@ -5358,6 +5371,7 @@ CPU when the exception is taken. If this virtual SError is taken to EL1 using
> > >  AArch64, this value will be reported in the ISS field of ESR_ELx.
> > >  
> > >  See KVM_CAP_VCPU_EVENTS for more details.
> > > +
> > >  8.20 KVM_CAP_HYPERV_SEND_IPI
> > >  
> > >  Architectures: x86
> > > @@ -5365,6 +5379,7 @@ Architectures: x86
> > >  This capability indicates that KVM supports paravirtualized Hyper-V IPI send
> > >  hypercalls:
> > >  HvCallSendSyntheticClusterIpi, HvCallSendSyntheticClusterIpiEx.
> > > +
> > >  8.21 KVM_CAP_HYPERV_DIRECT_TLBFLUSH
> > >  
> > >  Architecture: x86
> > > @@ -5378,3 +5393,97 @@ handling by KVM (as some KVM hypercall may be mistakenly treated as TLB
> > >  flush hypercalls by Hyper-V) so userspace should disable KVM identification
> > >  in CPUID and only exposes Hyper-V identification. In this case, guest
> > >  thinks it's running on Hyper-V and only use Hyper-V hypercalls.
> > > +
> > > +8.22 KVM_CAP_DIRTY_LOG_RING
> > > +
> > > +Architectures: x86
> > > +Parameters: args[0] - size of the dirty log ring
> > > +
> > > +KVM is capable of tracking dirty memory using ring buffers that are
> > > +mmaped into userspace; there is one dirty ring per vcpu and one global
> > > +ring per vm.
> > > +
> > > +One dirty ring has the following two major structures:
> > > +
> > > +struct kvm_dirty_ring {
> > > +	u16 dirty_index;
> > > +	u16 reset_index;
> > > +	u32 size;
> > > +	u32 soft_limit;
> > > +	spinlock_t lock;
> > > +	struct kvm_dirty_gfn *dirty_gfns;
> > > +};
> > > +
> > > +struct kvm_dirty_ring_indexes {
> > > +	__u32 avail_index; /* set by kernel */
> > > +	__u32 fetch_index; /* set by userspace */
> > 
> > Sticking these next to each other seems to guarantee cache conflicts.
> > 
> > Avail/Fetch seems to mimic Virtio's avail/used exactly.  I am not saying
> > you must reuse the code really, but I think you should take a hard look
> > at e.g. the virtio packed ring structure. We spent a bunch of time
> > optimizing it for cache utilization. It seems kernel is the driver,
> > making entries available, and userspace the device, using them.
> > Again let's not develop a thread about this, but I think
> > this is something to consider and discuss in future versions
> > of the patches.
> 
> I think I completely understand your concern.  We should avoid wasting
> time on those are already there.  I'm just afraid that it'll took even
> more time to use virtio for this use case while at last we don't
> really get much benefit out of it (e.g. most of the virtio features
> are not used).
> 
> Yeh let's not develop a thread for this topic - I will read more on
> virtio before my next post to see whether there's any chance we can
> share anything with virtio ring.
> 
> > 
> > 
> > > +};
> > > +
> > > +While for each of the dirty entry it's defined as:
> > > +
> > > +struct kvm_dirty_gfn {
> > 
> > What does GFN stand for?
> 
> It's guest frame number, iiuc.  I'm not the one who named this, but
> that's what I understand..
> 
> > 
> > > +        __u32 pad;
> > > +        __u32 slot; /* as_id | slot_id */
> > > +        __u64 offset;
> > > +};
> > 
> > offset of what? a 4K page right? Seems like a waste e.g. for
> > hugetlbfs... How about replacing pad with size instead?
> 
> As Paolo explained, it's the page frame number of the guest.  IIUC
> even for hugetlbfs we track dirty bits in 4k size.
> 
> > 
> > > +
> > > +The fields in kvm_dirty_ring will be only internal to KVM itself,
> > > +while the fields in kvm_dirty_ring_indexes will be exposed to
> > > +userspace to be either read or written.
> > 
> > I'm not sure what you are trying to say here. kvm_dirty_gfn
> > seems to be part of UAPI.
> 
> It was talking about kvm_dirty_ring, which is kvm internal and not
> exposed to uapi.  While kvm_dirty_gfn is exposed to the users.
> 
> > 
> > > +
> > > +The two indices in the ring buffer are free running counters.
> > > +
> > > +In pseudocode, processing the ring buffer looks like this:
> > > +
> > > +	idx = load-acquire(&ring->fetch_index);
> > > +	while (idx != ring->avail_index) {
> > > +		struct kvm_dirty_gfn *entry;
> > > +		entry = &ring->dirty_gfns[idx & (size - 1)];
> > > +		...
> > > +
> > > +		idx++;
> > > +	}
> > > +	ring->fetch_index = idx;
> > > +
> > > +Userspace calls KVM_ENABLE_CAP ioctl right after KVM_CREATE_VM ioctl
> > > +to enable this capability for the new guest and set the size of the
> > > +rings.  It is only allowed before creating any vCPU, and the size of
> > > +the ring must be a power of two.
> > 
> > All these seem like arbitrary limitations to me.
> 
> The dependency of vcpu is partly because we need to create per-vcpu
> ring, so it's easier that we don't allow it to change after that.
> 
> > 
> > Sizing the ring correctly might prove to be a challenge.
> > 
> > Thus I think there's value in resizing the rings
> > without destroying VCPU.
> 
> Do you have an example on when we could use this feature?

So e.g. start with a small ring, and if you see stalls too often,
increase it?  Otherwise I don't see how one decides on the ring size.

>  My wild
> guess is that even if we try hard to allow resizing (assuming that
> won't bring more bugs, but I hightly doubt...), people may not use it
> at all.
> 
> The major scenario here is that kvm userspace will be collecting the
> dirty bits quickly, so the ring should not really get full easily.
> Then the ring size does not really matter much either, as long as it
> is bigger than some specific value to avoid vmexits due to full.

Exactly, but I don't see how you are going to find that value unless
it's auto-tuned dynamically.

> How about we start with the simple that we don't allow it to change?
> We can do that when the requirement comes.
> 
> > 
> > Also, power of two just saves a branch here and there,
> > but wastes lots of memory. Just wrap the index around to
> > 0 and then users can select any size?
> 
> Same as above to postpone until we need it?

It's to save memory; don't we always need to do that?

> > 
> > 
> > 
> > >  The larger the ring buffer, the less
> > > +likely the ring is full and the VM is forced to exit to userspace. The
> > > +optimal size depends on the workload, but it is recommended that it be
> > > +at least 64 KiB (4096 entries).
> > 
> > OTOH larger buffers put lots of pressure on the system cache.
> > 
> > > +
> > > +After the capability is enabled, userspace can mmap the global ring
> > > +buffer (kvm_dirty_gfn[], offset KVM_DIRTY_LOG_PAGE_OFFSET) and the
> > > +indexes (kvm_dirty_ring_indexes, offset 0) from the VM file
> > > +descriptor.  The per-vcpu dirty ring instead is mmapped when the vcpu
> > > +is created, similar to the kvm_run struct (kvm_dirty_ring_indexes
> > > +locates inside kvm_run, while kvm_dirty_gfn[] at offset
> > > +KVM_DIRTY_LOG_PAGE_OFFSET).
> > > +
> > > +Just like for dirty page bitmaps, the buffer tracks writes to
> > > +all user memory regions for which the KVM_MEM_LOG_DIRTY_PAGES flag was
> > > +set in KVM_SET_USER_MEMORY_REGION.  Once a memory region is registered
> > > +with the flag set, userspace can start harvesting dirty pages from the
> > > +ring buffer.
> > > +
> > > +To harvest the dirty pages, userspace accesses the mmaped ring buffer
> > > +to read the dirty GFNs up to avail_index, and sets the fetch_index
> > > +accordingly.  This can be done when the guest is running or paused,
> > > +and dirty pages need not be collected all at once.  After processing
> > > +one or more entries in the ring buffer, userspace calls the VM ioctl
> > > +KVM_RESET_DIRTY_RINGS to notify the kernel that it has updated
> > > +fetch_index and to mark those pages clean.  Therefore, the ioctl
> > > +must be called *before* reading the content of the dirty pages.
> > > +
> > > +However, there is a major difference comparing to the
> > > +KVM_GET_DIRTY_LOG interface in that when reading the dirty ring from
> > > +userspace it's still possible that the kernel has not yet flushed the
> > > +hardware dirty buffers into the kernel buffer.  To achieve that, one
> > > +needs to kick the vcpu out for a hardware buffer flush (vmexit).
> > > +
> > > +If one of the ring buffers is full, the guest will exit to userspace
> > > +with the exit reason set to KVM_EXIT_DIRTY_LOG_FULL, and the
> > > +KVM_RUN ioctl will return -EINTR. Once that happens, userspace
> > > +should pause all the vcpus, then harvest all the dirty pages and
> > > +rearm the dirty traps. It can unpause the guest after that.
> > 
> > This last item means that the performance impact of the feature is
> > really hard to predict. Can improve some workloads drastically. Or can
> > slow some down.
> > 
> > 
> > One solution could be to actually allow using this together with the
> > existing bitmap. Userspace can then decide whether it wants to block
> > VCPU on ring full, or just record ring full condition and recover by
> > bitmap scanning.
> 
> That's true, but again allowing mixture use of the two might bring
> extra complexity as well (especially when after adding
> KVM_CLEAR_DIRTY_LOG).
> 
> My understanding of this is that normally we do only want either one
> of them depending on the major workload and the configuration of the
> guest.

And again, how does one know which to enable?  No one has the time to
fine-tune a gazillion parameters.

>  It's not trivial to try to provide a one-for-all solution.  So
> again I would hope we can start from easy, then we extend when we have
> better ideas on how to leverage the two interfaces when the ideas
> really come, and then we can justify whether it's worth it to work on
> that complexity.

It's less *coding* work to build a simple thing, but it needs much
more *testing*.

IMHO a huge amount of benchmarking has to happen if you want to set
this loose on users as a default with these kinds of limitations.  We
need to be sure that even though in theory it can be very bad, in
practice it's actually good.  If it's auto-tuned then it's a much
easier sell upstream, even if there's a chance of some regressions.

> > 
> > 
> > > diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile
> > > index b19ef421084d..0acee817adfb 100644
> > > --- a/arch/x86/kvm/Makefile
> > > +++ b/arch/x86/kvm/Makefile
> > > @@ -5,7 +5,8 @@ ccflags-y += -Iarch/x86/kvm
> > >  KVM := ../../../virt/kvm
> > >  
> > >  kvm-y			+= $(KVM)/kvm_main.o $(KVM)/coalesced_mmio.o \
> > > -				$(KVM)/eventfd.o $(KVM)/irqchip.o $(KVM)/vfio.o
> > > +				$(KVM)/eventfd.o $(KVM)/irqchip.o $(KVM)/vfio.o \
> > > +				$(KVM)/dirty_ring.o
> > >  kvm-$(CONFIG_KVM_ASYNC_PF)	+= $(KVM)/async_pf.o
> > >  
> > >  kvm-y			+= x86.o emulate.o i8259.o irq.o lapic.o \
> > > diff --git a/include/linux/kvm_dirty_ring.h b/include/linux/kvm_dirty_ring.h
> > > new file mode 100644
> > > index 000000000000..8335635b7ff7
> > > --- /dev/null
> > > +++ b/include/linux/kvm_dirty_ring.h
> > > @@ -0,0 +1,67 @@
> > > +#ifndef KVM_DIRTY_RING_H
> > > +#define KVM_DIRTY_RING_H
> > > +
> > > +/*
> > > + * struct kvm_dirty_ring is defined in include/uapi/linux/kvm.h.
> > > + *
> > > + * dirty_ring:  shared with userspace via mmap. It is the compact list
> > > + *              that holds the dirty pages.
> > > + * dirty_index: free running counter that points to the next slot in
> > > + *              dirty_ring->dirty_gfns  where a new dirty page should go.
> > > + * reset_index: free running counter that points to the next dirty page
> > > + *              in dirty_ring->dirty_gfns for which dirty trap needs to
> > > + *              be reenabled
> > > + * size:        size of the compact list, dirty_ring->dirty_gfns
> > > + * soft_limit:  when the number of dirty pages in the list reaches this
> > > + *              limit, vcpu that owns this ring should exit to userspace
> > > + *              to allow userspace to harvest all the dirty pages
> > > + * lock:        protects dirty_ring, only in use if this is the global
> > > + *              ring
> > > + *
> > > + * The number of dirty pages in the ring is calculated by,
> > > + * dirty_index - reset_index
> > > + *
> > > + * kernel increments dirty_ring->indices.avail_index after dirty index
> > > + * is incremented. When userspace harvests the dirty pages, it increments
> > > + * dirty_ring->indices.fetch_index up to dirty_ring->indices.avail_index.
> > > + * When kernel reenables dirty traps for the dirty pages, it increments
> > > + * reset_index up to dirty_ring->indices.fetch_index.
> > > + *
> > > + */
> > > +struct kvm_dirty_ring {
> > > +	u32 dirty_index;
> > > +	u32 reset_index;
> > > +	u32 size;
> > > +	u32 soft_limit;
> > > +	spinlock_t lock;
> > > +	struct kvm_dirty_gfn *dirty_gfns;
> > > +};
> > > +
> > > +u32 kvm_dirty_ring_get_rsvd_entries(void);
> > > +int kvm_dirty_ring_alloc(struct kvm *kvm, struct kvm_dirty_ring *ring);
> > > +
> > > +/*
> > > + * called with kvm->slots_lock held, returns the number of
> > > + * processed pages.
> > > + */
> > > +int kvm_dirty_ring_reset(struct kvm *kvm,
> > > +			 struct kvm_dirty_ring *ring,
> > > +			 struct kvm_dirty_ring_indexes *indexes);
> > > +
> > > +/*
> > > + * returns 0: successfully pushed
> > > + *         1: successfully pushed, soft limit reached,
> > > + *            vcpu should exit to userspace
> > > + *         -EBUSY: unable to push, dirty ring full.
> > > + */
> > > +int kvm_dirty_ring_push(struct kvm_dirty_ring *ring,
> > > +			struct kvm_dirty_ring_indexes *indexes,
> > > +			u32 slot, u64 offset, bool lock);
> > > +
> > > +/* for use in vm_operations_struct */
> > > +struct page *kvm_dirty_ring_get_page(struct kvm_dirty_ring *ring, u32 i);
> > > +
> > > +void kvm_dirty_ring_free(struct kvm_dirty_ring *ring);
> > > +bool kvm_dirty_ring_full(struct kvm_dirty_ring *ring);
> > > +
> > > +#endif
> > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > > index 498a39462ac1..7b747bc9ff3e 100644
> > > --- a/include/linux/kvm_host.h
> > > +++ b/include/linux/kvm_host.h
> > > @@ -34,6 +34,7 @@
> > >  #include <linux/kvm_types.h>
> > >  
> > >  #include <asm/kvm_host.h>
> > > +#include <linux/kvm_dirty_ring.h>
> > >  
> > >  #ifndef KVM_MAX_VCPU_ID
> > >  #define KVM_MAX_VCPU_ID KVM_MAX_VCPUS
> > > @@ -146,6 +147,7 @@ static inline bool is_error_page(struct page *page)
> > >  #define KVM_REQ_MMU_RELOAD        (1 | KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
> > >  #define KVM_REQ_PENDING_TIMER     2
> > >  #define KVM_REQ_UNHALT            3
> > > +#define KVM_REQ_DIRTY_RING_FULL   4
> > >  #define KVM_REQUEST_ARCH_BASE     8
> > >  
> > >  #define KVM_ARCH_REQ_FLAGS(nr, flags) ({ \
> > > @@ -321,6 +323,7 @@ struct kvm_vcpu {
> > >  	bool ready;
> > >  	struct kvm_vcpu_arch arch;
> > >  	struct dentry *debugfs_dentry;
> > > +	struct kvm_dirty_ring dirty_ring;
> > >  };
> > >  
> > >  static inline int kvm_vcpu_exiting_guest_mode(struct kvm_vcpu *vcpu)
> > > @@ -501,6 +504,10 @@ struct kvm {
> > >  	struct srcu_struct srcu;
> > >  	struct srcu_struct irq_srcu;
> > >  	pid_t userspace_pid;
> > > +	/* Data structure to be exported by mmap(kvm->fd, 0) */
> > > +	struct kvm_vm_run *vm_run;
> > > +	u32 dirty_ring_size;
> > > +	struct kvm_dirty_ring vm_dirty_ring;
> > >  };
> > >  
> > >  #define kvm_err(fmt, ...) \
> > > @@ -832,6 +839,8 @@ void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm *kvm,
> > >  					gfn_t gfn_offset,
> > >  					unsigned long mask);
> > >  
> > > +void kvm_reset_dirty_gfn(struct kvm *kvm, u32 slot, u64 offset, u64 mask);
> > > +
> > >  int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm,
> > >  				struct kvm_dirty_log *log);
> > >  int kvm_vm_ioctl_clear_dirty_log(struct kvm *kvm,
> > > @@ -1411,4 +1420,28 @@ int kvm_vm_create_worker_thread(struct kvm *kvm, kvm_vm_thread_fn_t thread_fn,
> > >  				uintptr_t data, const char *name,
> > >  				struct task_struct **thread_ptr);
> > >  
> > > +/*
> > > + * This defines how many reserved entries we want to keep before we
> > > + * kick the vcpu to the userspace to avoid dirty ring full.  This
> > > + * value can be tuned to higher if e.g. PML is enabled on the host.
> > > + */
> > > +#define  KVM_DIRTY_RING_RSVD_ENTRIES  64
> > > +
> > > +/* Max number of entries allowed for each kvm dirty ring */
> > > +#define  KVM_DIRTY_RING_MAX_ENTRIES  65536
> > > +
> > > +/*
> > > + * Arch needs to define these macro after implementing the dirty ring
> > > + * feature.  KVM_DIRTY_LOG_PAGE_OFFSET should be defined as the
> > > + * starting page offset of the dirty ring structures,
> > 
> > Confused. Offset where? You set a default for everyone - where does arch
> > want to override it?
> 
> If arch defines KVM_DIRTY_LOG_PAGE_OFFSET then below will be a no-op,
> please see [1] on #ifndef.

So which arches need to override it? Why do you say they should?

> > 
> > > while
> > > + * KVM_DIRTY_RING_VERSION should be defined as >=1.  By default, this
> > > + * feature is off on all archs.
> > > + */
> > > +#ifndef KVM_DIRTY_LOG_PAGE_OFFSET
> 
> [1]
> 
> > > +#define KVM_DIRTY_LOG_PAGE_OFFSET 0
> > > +#endif
> > > +#ifndef KVM_DIRTY_RING_VERSION
> > > +#define KVM_DIRTY_RING_VERSION 0
> > > +#endif
> > 
> > One way versioning, with no bits and negotiation
> > will make it hard to change down the road.
> > what's wrong with existing KVM capabilities that
> > you feel there's a need for dedicated versioning for this?
> 
> Frankly speaking I don't even think it'll change in the near
> future.. :)
> 
> Yeh kvm versioning could work too.  Here we can also return a zero
> just like the most of the caps (or KVM_DIRTY_LOG_PAGE_OFFSET as in the
> original patchset, but it's really helpless either because it's
> defined in uapi), but I just don't see how it helps...  So I returned
> a version number just in case we'd like to change the layout some day
> and when we don't want to bother introducing another cap bit for the
> same feature (like KVM_CAP_MANUAL_DIRTY_LOG_PROTECT and
> KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2).

I guess it's up to Paolo but really I don't see the point.
You can add a version later when it means something ...

> > 
> > > +
> > >  #endif
> > > diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h
> > > index 1c88e69db3d9..d9d03eea145a 100644
> > > --- a/include/linux/kvm_types.h
> > > +++ b/include/linux/kvm_types.h
> > > @@ -11,6 +11,7 @@ struct kvm_irq_routing_table;
> > >  struct kvm_memory_slot;
> > >  struct kvm_one_reg;
> > >  struct kvm_run;
> > > +struct kvm_vm_run;
> > >  struct kvm_userspace_memory_region;
> > >  struct kvm_vcpu;
> > >  struct kvm_vcpu_init;
> > > diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> > > index e6f17c8e2dba..0b88d76d6215 100644
> > > --- a/include/uapi/linux/kvm.h
> > > +++ b/include/uapi/linux/kvm.h
> > > @@ -236,6 +236,7 @@ struct kvm_hyperv_exit {
> > >  #define KVM_EXIT_IOAPIC_EOI       26
> > >  #define KVM_EXIT_HYPERV           27
> > >  #define KVM_EXIT_ARM_NISV         28
> > > +#define KVM_EXIT_DIRTY_RING_FULL  29
> > >  
> > >  /* For KVM_EXIT_INTERNAL_ERROR */
> > >  /* Emulate instruction failed. */
> > > @@ -247,6 +248,11 @@ struct kvm_hyperv_exit {
> > >  /* Encounter unexpected vm-exit reason */
> > >  #define KVM_INTERNAL_ERROR_UNEXPECTED_EXIT_REASON	4
> > >  
> > > +struct kvm_dirty_ring_indexes {
> > > +	__u32 avail_index; /* set by kernel */
> > > +	__u32 fetch_index; /* set by userspace */
> > > +};
> > > +
> > >  /* for KVM_RUN, returned by mmap(vcpu_fd, offset=0) */
> > >  struct kvm_run {
> > >  	/* in */
> > > @@ -421,6 +427,13 @@ struct kvm_run {
> > >  		struct kvm_sync_regs regs;
> > >  		char padding[SYNC_REGS_SIZE_BYTES];
> > >  	} s;
> > > +
> > > +	struct kvm_dirty_ring_indexes vcpu_ring_indexes;
> > > +};
> > > +
> > > +/* Returned by mmap(kvm->fd, offset=0) */
> > > +struct kvm_vm_run {
> > > +	struct kvm_dirty_ring_indexes vm_ring_indexes;
> > >  };
> > >  
> > >  /* for KVM_REGISTER_COALESCED_MMIO / KVM_UNREGISTER_COALESCED_MMIO */
> > > @@ -1009,6 +1022,7 @@ struct kvm_ppc_resize_hpt {
> > >  #define KVM_CAP_PPC_GUEST_DEBUG_SSTEP 176
> > >  #define KVM_CAP_ARM_NISV_TO_USER 177
> > >  #define KVM_CAP_ARM_INJECT_EXT_DABT 178
> > > +#define KVM_CAP_DIRTY_LOG_RING 179
> > >  
> > >  #ifdef KVM_CAP_IRQ_ROUTING
> > >  
> > > @@ -1472,6 +1486,9 @@ struct kvm_enc_region {
> > >  /* Available with KVM_CAP_ARM_SVE */
> > >  #define KVM_ARM_VCPU_FINALIZE	  _IOW(KVMIO,  0xc2, int)
> > >  
> > > +/* Available with KVM_CAP_DIRTY_LOG_RING */
> > > +#define KVM_RESET_DIRTY_RINGS     _IO(KVMIO, 0xc3)
> > > +
> > >  /* Secure Encrypted Virtualization command */
> > >  enum sev_cmd_id {
> > >  	/* Guest initialization commands */
> > > @@ -1622,4 +1639,23 @@ struct kvm_hyperv_eventfd {
> > >  #define KVM_HYPERV_CONN_ID_MASK		0x00ffffff
> > >  #define KVM_HYPERV_EVENTFD_DEASSIGN	(1 << 0)
> > >  
> > > +/*
> > > + * The following are the requirements for supporting dirty log ring
> > > + * (by enabling KVM_DIRTY_LOG_PAGE_OFFSET).
> > > + *
> > > + * 1. Memory accesses by KVM should call kvm_vcpu_write_* instead
> > > + *    of kvm_write_* so that the global dirty ring is not filled up
> > > + *    too quickly.
> > > + * 2. kvm_arch_mmu_enable_log_dirty_pt_masked should be defined for
> > > + *    enabling dirty logging.
> > > + * 3. There should not be a separate step to synchronize hardware
> > > + *    dirty bitmap with KVM's.
> > > + */
> > > +
> > > +struct kvm_dirty_gfn {
> > > +	__u32 pad;
> > > +	__u32 slot;
> > > +	__u64 offset;
> > > +};
> > > +
> > >  #endif /* __LINUX_KVM_H */
> > > diff --git a/virt/kvm/dirty_ring.c b/virt/kvm/dirty_ring.c
> > > new file mode 100644
> > > index 000000000000..9264891f3c32
> > > --- /dev/null
> > > +++ b/virt/kvm/dirty_ring.c
> > > @@ -0,0 +1,156 @@
> > > +#include <linux/kvm_host.h>
> > > +#include <linux/kvm.h>
> > > +#include <linux/vmalloc.h>
> > > +#include <linux/kvm_dirty_ring.h>
> > > +
> > > +u32 kvm_dirty_ring_get_rsvd_entries(void)
> > > +{
> > > +	return KVM_DIRTY_RING_RSVD_ENTRIES + kvm_cpu_dirty_log_size();
> > > +}
> > > +
> > > +int kvm_dirty_ring_alloc(struct kvm *kvm, struct kvm_dirty_ring *ring)
> > > +{
> > > +	u32 size = kvm->dirty_ring_size;
> > > +
> > > +	ring->dirty_gfns = vmalloc(size);
> > 
> > So 1/2 a megabyte of kernel memory per VM that userspace locks up.
> > Do we really have to though? Why not get a userspace pointer,
> > write it with copy to user, and sidestep all this?
> 
> I'd say it won't be a big issue on locking 1/2M of host mem for a
> vm...
> Also note that if dirty ring is enabled, I plan to evaporate the
> dirty_bitmap in the next post. The old kvm->dirty_bitmap takes
> $GUEST_MEM/32K*2 mem.  E.g., for 64G guest it's 64G/32K*2=4M.  If with
> dirty ring of 8 vcpus, that could be 64K*8=0.5M, which could be even
> less memory used.

Right - I think Avi described the bitmap in kernel memory as one of
the design mistakes.  Why repeat that with the new design?

> > 
> > > +	if (!ring->dirty_gfns)
> > > +		return -ENOMEM;
> > > +	memset(ring->dirty_gfns, 0, size);
> > > +
> > > +	ring->size = size / sizeof(struct kvm_dirty_gfn);
> > > +	ring->soft_limit =
> > > +	    (kvm->dirty_ring_size / sizeof(struct kvm_dirty_gfn)) -
> > > +	    kvm_dirty_ring_get_rsvd_entries();
> > > +	ring->dirty_index = 0;
> > > +	ring->reset_index = 0;
> > > +	spin_lock_init(&ring->lock);
> > > +
> > > +	return 0;
> > > +}
> > > +
> > > +int kvm_dirty_ring_reset(struct kvm *kvm,
> > > +			 struct kvm_dirty_ring *ring,
> > > +			 struct kvm_dirty_ring_indexes *indexes)
> > > +{
> > > +	u32 cur_slot, next_slot;
> > > +	u64 cur_offset, next_offset;
> > > +	unsigned long mask;
> > > +	u32 fetch;
> > > +	int count = 0;
> > > +	struct kvm_dirty_gfn *entry;
> > > +
> > > +	fetch = READ_ONCE(indexes->fetch_index);
> > > +	if (fetch == ring->reset_index)
> > > +		return 0;
> > > +
> > > +	entry = &ring->dirty_gfns[ring->reset_index & (ring->size - 1)];
> > > +	/*
> > > +	 * The ring buffer is shared with userspace, which might mmap
> > > +	 * it and concurrently modify slot and offset.  Userspace must
> > > +	 * not be trusted!  READ_ONCE prevents the compiler from changing
> > > +	 * the values after they've been range-checked (the checks are
> > > +	 * in kvm_reset_dirty_gfn).
> > 
> > What it doesn't do is prevent speculative attacks.  That's why things
> > like copy from user have a speculation barrier.  Instead of worrying
> > about that, unless it's really critical, I think you'd do well to just
> > use copy to/from user.
> 
> IMHO I would really hope these data stay resident in memory without
> being swapped out, just like what we did with kvm->dirty_bitmap... it's
> on the hot path of mmu page faults, and we could even be holding the
> mmu lock when copy_to_user() page faults.  But indeed I've no
> experience with avoiding speculative attacks, so suggestions would be
> greatly welcomed on that.  In our case we do (index & (size - 1)), so
> is it still susceptible to speculative attacks?

I'm not saying I understand everything in depth.
Just reacting to this:
	READ_ONCE prevents the compiler from changing
	the values after they've been range-checked (the checks are
	in kvm_reset_dirty_gfn)

so any range checks you do can be attacked.

And the safest way to avoid the attacks is to do what most of the
kernel does and use copy from/to user when you talk to userspace.
That also avoids annoying things like bypassing SMAP.
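
To be concrete, what I have in mind is roughly the below (just a
sketch; the helper name and the __user pointer are made up, nothing
like this exists in the series):

/* Sketch only: one fetch from a ring kept in userspace memory. */
static int dirty_ring_fetch_user(struct kvm_dirty_gfn __user *gfns,
				 u32 size, u32 index,
				 struct kvm_dirty_gfn *out)
{
	/*
	 * copy_from_user() already includes the speculation barrier and
	 * plays nicely with SMAP, so no READ_ONCE tricks are needed.
	 */
	if (copy_from_user(out, &gfns[index & (size - 1)], sizeof(*out)))
		return -EFAULT;

	return 0;
}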


> > 
> > > +	 */
> > > +	smp_read_barrier_depends();
> > 
> > What depends on what here? Looks suspicious ...
> 
> Hmm, I think maybe it can be removed because the entry pointer
> reference below should be an ordering constraint already?
> 
> > 
> > > +	cur_slot = READ_ONCE(entry->slot);
> > > +	cur_offset = READ_ONCE(entry->offset);
> > > +	mask = 1;
> > > +	count++;
> > > +	ring->reset_index++;
> > > +	while (ring->reset_index != fetch) {
> > > +		entry = &ring->dirty_gfns[ring->reset_index & (ring->size - 1)];
> > > +		smp_read_barrier_depends();
> > 
> > same concerns here
> > 
> > > +		next_slot = READ_ONCE(entry->slot);
> > > +		next_offset = READ_ONCE(entry->offset);
> > > +		ring->reset_index++;
> > > +		count++;
> > > +		/*
> > > +		 * Try to coalesce the reset operations when the guest is
> > > +		 * scanning pages in the same slot.
> > 
> > what does guest scanning mean?
> 
> My wild guess is that it means the guest is accessing the pages
> consecutively, so the dirty gfns are consecutive too.  Anyway I agree
> it's not clear, and I can try to rephrase it.
> 
> > 
> > > +		 */
> > > +		if (next_slot == cur_slot) {
> > > +			int delta = next_offset - cur_offset;
> > > +
> > > +			if (delta >= 0 && delta < BITS_PER_LONG) {
> > > +				mask |= 1ull << delta;
> > > +				continue;
> > > +			}
> > > +
> > > +			/* Backwards visit, careful about overflows!  */
> > > +			if (delta > -BITS_PER_LONG && delta < 0 &&
> > > +			    (mask << -delta >> -delta) == mask) {
> > > +				cur_offset = next_offset;
> > > +				mask = (mask << -delta) | 1;
> > > +				continue;
> > > +			}
> > > +		}
> > > +		kvm_reset_dirty_gfn(kvm, cur_slot, cur_offset, mask);
> > > +		cur_slot = next_slot;
> > > +		cur_offset = next_offset;
> > > +		mask = 1;
> > > +	}
> > > +	kvm_reset_dirty_gfn(kvm, cur_slot, cur_offset, mask);
> > > +
> > > +	return count;
> > > +}
> > > +
> > > +static inline u32 kvm_dirty_ring_used(struct kvm_dirty_ring *ring)
> > > +{
> > > +	return ring->dirty_index - ring->reset_index;
> > > +}
> > > +
> > > +bool kvm_dirty_ring_full(struct kvm_dirty_ring *ring)
> > > +{
> > > +	return kvm_dirty_ring_used(ring) >= ring->size;
> > > +}
> > > +
> > > +/*
> > > + * Returns:
> > > + *   >0 if we should kick the vcpu out,
> > > + *   =0 if the gfn pushed successfully, or,
> > > + *   <0 if error (e.g. ring full)
> > > + */
> > > +int kvm_dirty_ring_push(struct kvm_dirty_ring *ring,
> > > +			struct kvm_dirty_ring_indexes *indexes,
> > > +			u32 slot, u64 offset, bool lock)
> > > +{
> > > +	int ret;
> > > +	struct kvm_dirty_gfn *entry;
> > > +
> > > +	if (lock)
> > > +		spin_lock(&ring->lock);
> > 
> > what's the story around locking here? Why is it safe
> > not to take the lock sometimes?
> 
> kvm_dirty_ring_push() is called with lock==true only when the per-vm
> ring is used.  For the per-vcpu ring, pushes only happen from that
> vcpu's own context, so no lock is needed (kvm_dirty_ring_push() is
> called with lock==false).
> 
> > 
> > > +
> > > +	if (kvm_dirty_ring_full(ring)) {
> > > +		ret = -EBUSY;
> > > +		goto out;
> > > +	}
> > > +
> > > +	entry = &ring->dirty_gfns[ring->dirty_index & (ring->size - 1)];
> > > +	entry->slot = slot;
> > > +	entry->offset = offset;
> > > +	smp_wmb();
> > > +	ring->dirty_index++;
> > > +	WRITE_ONCE(indexes->avail_index, ring->dirty_index);
> > > +	ret = kvm_dirty_ring_used(ring) >= ring->soft_limit;
> > > +	pr_info("%s: slot %u offset %llu used %u\n",
> > > +		__func__, slot, offset, kvm_dirty_ring_used(ring));
> > > +
> > > +out:
> > > +	if (lock)
> > > +		spin_unlock(&ring->lock);
> > > +
> > > +	return ret;
> > > +}
> > > +
> > > +struct page *kvm_dirty_ring_get_page(struct kvm_dirty_ring *ring, u32 i)
> > > +{
> > > +	return vmalloc_to_page((void *)ring->dirty_gfns + i * PAGE_SIZE);
> > > +}
> > > +
> > > +void kvm_dirty_ring_free(struct kvm_dirty_ring *ring)
> > > +{
> > > +	if (ring->dirty_gfns) {
> > > +		vfree(ring->dirty_gfns);
> > > +		ring->dirty_gfns = NULL;
> > > +	}
> > > +}
> > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > > index 681452d288cd..8642c977629b 100644
> > > --- a/virt/kvm/kvm_main.c
> > > +++ b/virt/kvm/kvm_main.c
> > > @@ -64,6 +64,8 @@
> > >  #define CREATE_TRACE_POINTS
> > >  #include <trace/events/kvm.h>
> > >  
> > > +#include <linux/kvm_dirty_ring.h>
> > > +
> > >  /* Worst case buffer size needed for holding an integer. */
> > >  #define ITOA_MAX_LEN 12
> > >  
> > > @@ -149,6 +151,10 @@ static void mark_page_dirty_in_slot(struct kvm *kvm,
> > >  				    struct kvm_vcpu *vcpu,
> > >  				    struct kvm_memory_slot *memslot,
> > >  				    gfn_t gfn);
> > > +static void mark_page_dirty_in_ring(struct kvm *kvm,
> > > +				    struct kvm_vcpu *vcpu,
> > > +				    struct kvm_memory_slot *slot,
> > > +				    gfn_t gfn);
> > >  
> > >  __visible bool kvm_rebooting;
> > >  EXPORT_SYMBOL_GPL(kvm_rebooting);
> > > @@ -359,11 +365,22 @@ int kvm_vcpu_init(struct kvm_vcpu *vcpu, struct kvm *kvm, unsigned id)
> > >  	vcpu->preempted = false;
> > >  	vcpu->ready = false;
> > >  
> > > +	if (kvm->dirty_ring_size) {
> > > +		r = kvm_dirty_ring_alloc(vcpu->kvm, &vcpu->dirty_ring);
> > > +		if (r) {
> > > +			kvm->dirty_ring_size = 0;
> > > +			goto fail_free_run;
> > > +		}
> > > +	}
> > > +
> > >  	r = kvm_arch_vcpu_init(vcpu);
> > >  	if (r < 0)
> > > -		goto fail_free_run;
> > > +		goto fail_free_ring;
> > >  	return 0;
> > >  
> > > +fail_free_ring:
> > > +	if (kvm->dirty_ring_size)
> > > +		kvm_dirty_ring_free(&vcpu->dirty_ring);
> > >  fail_free_run:
> > >  	free_page((unsigned long)vcpu->run);
> > >  fail:
> > > @@ -381,6 +398,8 @@ void kvm_vcpu_uninit(struct kvm_vcpu *vcpu)
> > >  	put_pid(rcu_dereference_protected(vcpu->pid, 1));
> > >  	kvm_arch_vcpu_uninit(vcpu);
> > >  	free_page((unsigned long)vcpu->run);
> > > +	if (vcpu->kvm->dirty_ring_size)
> > > +		kvm_dirty_ring_free(&vcpu->dirty_ring);
> > >  }
> > >  EXPORT_SYMBOL_GPL(kvm_vcpu_uninit);
> > >  
> > > @@ -690,6 +709,7 @@ static struct kvm *kvm_create_vm(unsigned long type)
> > >  	struct kvm *kvm = kvm_arch_alloc_vm();
> > >  	int r = -ENOMEM;
> > >  	int i;
> > > +	struct page *page;
> > >  
> > >  	if (!kvm)
> > >  		return ERR_PTR(-ENOMEM);
> > > @@ -705,6 +725,14 @@ static struct kvm *kvm_create_vm(unsigned long type)
> > >  
> > >  	BUILD_BUG_ON(KVM_MEM_SLOTS_NUM > SHRT_MAX);
> > >  
> > > +	page = alloc_page(GFP_KERNEL | __GFP_ZERO);
> > > +	if (!page) {
> > > +		r = -ENOMEM;
> > > +		goto out_err_alloc_page;
> > > +	}
> > > +	kvm->vm_run = page_address(page);
> > 
> > So 4K with just 8 bytes used. Not as bad as 1/2Mbyte for the ring but
> > still. What is wrong with just a pointer and calling put_user?
> 
> I want to make it the start point for sharing fields between
> user/kernel per-vm.  Just like kvm_run for per-vcpu.

And why is doing that without get/put user a good idea?
If nothing else this bypasses SMAP, exploits can pass
data from userspace to kernel through that.

> IMHO it'll be awkward if we always introduce a new interface just to
> take a pointer to the userspace buffer and cache it...  I'd say so far
> I like the design of kvm_run and the like because it's efficient, easy
> to use, and easy to extend.


Well kvm_run at least isn't accessed while the kernel is processing it.
And the structure there is dead simple, not a tricky lockless ring
with indices and things.

Again I might be wrong, eventually it's up to the kvm maintainers.  But
really there's a standard thing all drivers do to talk to userspace, and
if there's no special reason to do otherwise I would do exactly that.
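
For the indices it could be as small as this (again a rough sketch;
'uindex' would be a __user pointer that userspace registers through
some new ioctl that doesn't exist today):

/* Sketch only: publish the producer index straight into user memory. */
static int dirty_ring_publish_avail(u32 __user *uindex, u32 avail)
{
	return put_user(avail, uindex);
}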

> > 
> > > +	BUILD_BUG_ON(sizeof(struct kvm_vm_run) > PAGE_SIZE);
> > > +
> > >  	if (init_srcu_struct(&kvm->srcu))
> > >  		goto out_err_no_srcu;
> > >  	if (init_srcu_struct(&kvm->irq_srcu))
> > > @@ -775,6 +803,9 @@ static struct kvm *kvm_create_vm(unsigned long type)
> > >  out_err_no_irq_srcu:
> > >  	cleanup_srcu_struct(&kvm->srcu);
> > >  out_err_no_srcu:
> > > +	free_page((unsigned long)page);
> > > +	kvm->vm_run = NULL;
> > > +out_err_alloc_page:
> > >  	kvm_arch_free_vm(kvm);
> > >  	mmdrop(current->mm);
> > >  	return ERR_PTR(r);
> > > @@ -800,6 +831,15 @@ static void kvm_destroy_vm(struct kvm *kvm)
> > >  	int i;
> > >  	struct mm_struct *mm = kvm->mm;
> > >  
> > > +	if (kvm->dirty_ring_size) {
> > > +		kvm_dirty_ring_free(&kvm->vm_dirty_ring);
> > > +	}
> > > +
> > > +	if (kvm->vm_run) {
> > > +		free_page((unsigned long)kvm->vm_run);
> > > +		kvm->vm_run = NULL;
> > > +	}
> > > +
> > >  	kvm_uevent_notify_change(KVM_EVENT_DESTROY_VM, kvm);
> > >  	kvm_destroy_vm_debugfs(kvm);
> > >  	kvm_arch_sync_events(kvm);
> > > @@ -2301,7 +2341,7 @@ static void mark_page_dirty_in_slot(struct kvm *kvm,
> > >  {
> > >  	if (memslot && memslot->dirty_bitmap) {
> > >  		unsigned long rel_gfn = gfn - memslot->base_gfn;
> > > -
> > > +		mark_page_dirty_in_ring(kvm, vcpu, memslot, gfn);
> > >  		set_bit_le(rel_gfn, memslot->dirty_bitmap);
> > >  	}
> > >  }
> > > @@ -2649,6 +2689,13 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me, bool yield_to_kernel_mode)
> > >  }
> > >  EXPORT_SYMBOL_GPL(kvm_vcpu_on_spin);
> > >  
> > > +static bool kvm_fault_in_dirty_ring(struct kvm *kvm, struct vm_fault *vmf)
> > > +{
> > > +	return (vmf->pgoff >= KVM_DIRTY_LOG_PAGE_OFFSET) &&
> > > +	    (vmf->pgoff < KVM_DIRTY_LOG_PAGE_OFFSET +
> > > +	     kvm->dirty_ring_size / PAGE_SIZE);
> > > +}
> > > +
> > >  static vm_fault_t kvm_vcpu_fault(struct vm_fault *vmf)
> > >  {
> > >  	struct kvm_vcpu *vcpu = vmf->vma->vm_file->private_data;
> > > @@ -2664,6 +2711,10 @@ static vm_fault_t kvm_vcpu_fault(struct vm_fault *vmf)
> > >  	else if (vmf->pgoff == KVM_COALESCED_MMIO_PAGE_OFFSET)
> > >  		page = virt_to_page(vcpu->kvm->coalesced_mmio_ring);
> > >  #endif
> > > +	else if (kvm_fault_in_dirty_ring(vcpu->kvm, vmf))
> > > +		page = kvm_dirty_ring_get_page(
> > > +		    &vcpu->dirty_ring,
> > > +		    vmf->pgoff - KVM_DIRTY_LOG_PAGE_OFFSET);
> > >  	else
> > >  		return kvm_arch_vcpu_fault(vcpu, vmf);
> > >  	get_page(page);
> > > @@ -3259,12 +3310,162 @@ static long kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
> > >  #endif
> > >  	case KVM_CAP_NR_MEMSLOTS:
> > >  		return KVM_USER_MEM_SLOTS;
> > > +	case KVM_CAP_DIRTY_LOG_RING:
> > > +		/* Version will be zero if arch didn't implement it */
> > > +		return KVM_DIRTY_RING_VERSION;
> > >  	default:
> > >  		break;
> > >  	}
> > >  	return kvm_vm_ioctl_check_extension(kvm, arg);
> > >  }
> > >  
> > > +static void mark_page_dirty_in_ring(struct kvm *kvm,
> > > +				    struct kvm_vcpu *vcpu,
> > > +				    struct kvm_memory_slot *slot,
> > > +				    gfn_t gfn)
> > > +{
> > > +	u32 as_id = 0;
> > > +	u64 offset;
> > > +	int ret;
> > > +	struct kvm_dirty_ring *ring;
> > > +	struct kvm_dirty_ring_indexes *indexes;
> > > +	bool is_vm_ring;
> > > +
> > > +	if (!kvm->dirty_ring_size)
> > > +		return;
> > > +
> > > +	offset = gfn - slot->base_gfn;
> > > +
> > > +	if (vcpu) {
> > > +		as_id = kvm_arch_vcpu_memslots_id(vcpu);
> > > +	} else {
> > > +		as_id = 0;
> > > +		vcpu = kvm_get_running_vcpu();
> > > +	}
> > > +
> > > +	if (vcpu) {
> > > +		ring = &vcpu->dirty_ring;
> > > +		indexes = &vcpu->run->vcpu_ring_indexes;
> > > +		is_vm_ring = false;
> > > +	} else {
> > > +		/*
> > > +		 * Put onto per vm ring because no vcpu context.  Kick
> > > +		 * vcpu0 if ring is full.
> > 
> > What about tasks on vcpu 0? Do guests realize it's a bad idea to put
> > critical tasks there, they will be penalized disproportionally?
> 
> Reasonable question.  So far we can't avoid it because vcpu exit is
> the event mechanism to say "hey please collect dirty bits".  Maybe
> someway is better than this, but I'll need to rethink all these
> over...

Maybe signal an eventfd, and let userspace worry about deciding what to
do.
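
Just as a sketch of what I mean (the eventfd context field is made up;
userspace would have to register it first through some new interface):

#include <linux/eventfd.h>

/* Sketch only: notify userspace instead of kicking vcpu0. */
static void dirty_ring_notify_full(struct kvm *kvm)
{
	/* kvm->dirty_ring_efd would be a new, hypothetical field */
	if (kvm->dirty_ring_efd)
		eventfd_signal(kvm->dirty_ring_efd, 1);
}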

> > 
> > > +		 */
> > > +		vcpu = kvm->vcpus[0];
> > > +		ring = &kvm->vm_dirty_ring;
> > > +		indexes = &kvm->vm_run->vm_ring_indexes;
> > > +		is_vm_ring = true;
> > > +	}
> > > +
> > > +	ret = kvm_dirty_ring_push(ring, indexes,
> > > +				  (as_id << 16)|slot->id, offset,
> > > +				  is_vm_ring);
> > > +	if (ret < 0) {
> > > +		if (is_vm_ring)
> > > +			pr_warn_once("per-vm dirty log overflow\n");
> > > +		else
> > > +			pr_warn_once("vcpu %d dirty log overflow\n",
> > > +				     vcpu->vcpu_id);
> > > +		return;
> > > +	}
> > > +
> > > +	if (ret)
> > > +		kvm_make_request(KVM_REQ_DIRTY_RING_FULL, vcpu);
> > > +}
> > > +
> > > +void kvm_reset_dirty_gfn(struct kvm *kvm, u32 slot, u64 offset, u64 mask)
> > > +{
> > > +	struct kvm_memory_slot *memslot;
> > > +	int as_id, id;
> > > +
> > > +	as_id = slot >> 16;
> > > +	id = (u16)slot;
> > > +	if (as_id >= KVM_ADDRESS_SPACE_NUM || id >= KVM_USER_MEM_SLOTS)
> > > +		return;
> > > +
> > > +	memslot = id_to_memslot(__kvm_memslots(kvm, as_id), id);
> > > +	if (offset >= memslot->npages)
> > > +		return;
> > > +
> > > +	spin_lock(&kvm->mmu_lock);
> > > +	/* FIXME: we should use a single AND operation, but there is no
> > > +	 * applicable atomic API.
> > > +	 */
> > > +	while (mask) {
> > > +		clear_bit_le(offset + __ffs(mask), memslot->dirty_bitmap);
> > > +		mask &= mask - 1;
> > > +	}
> > > +
> > > +	kvm_arch_mmu_enable_log_dirty_pt_masked(kvm, memslot, offset, mask);
> > > +	spin_unlock(&kvm->mmu_lock);
> > > +}
> > > +
> > > +static int kvm_vm_ioctl_enable_dirty_log_ring(struct kvm *kvm, u32 size)
> > > +{
> > > +	int r;
> > > +
> > > +	/* the size should be power of 2 */
> > > +	if (!size || (size & (size - 1)))
> > > +		return -EINVAL;
> > > +
> > > +	/* Should be bigger to keep the reserved entries, or a page */
> > > +	if (size < kvm_dirty_ring_get_rsvd_entries() *
> > > +	    sizeof(struct kvm_dirty_gfn) || size < PAGE_SIZE)
> > > +		return -EINVAL;
> > > +
> > > +	if (size > KVM_DIRTY_RING_MAX_ENTRIES *
> > > +	    sizeof(struct kvm_dirty_gfn))
> > > +		return -E2BIG;
> > 
> > KVM_DIRTY_RING_MAX_ENTRIES is not part of UAPI.
> > So how does userspace know what's legal?
> > Do you expect it to just try?
> 
> Yep that's what I thought. :)
> 
> Please grep E2BIG in QEMU repo target/i386/kvm.c...  won't be hard to
> do imho..

I don't see anything except just failing. Do we really have something
trying to find a working value? What would even be a reasonable range?
Start from UINT_MAX and work down? In which increments?
This is just a ton of overhead for what could have been a
simple query.
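
I.e. userspace ends up with a probing loop along these lines (sketch
only, the helper is hypothetical), which is a lot of ceremony compared
to a simple query:

#include <errno.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Sketch only: halve the request until KVM stops saying -E2BIG. */
static int enable_dirty_ring_probe(int vm_fd, uint32_t bytes)
{
	struct kvm_enable_cap cap = { .cap = KVM_CAP_DIRTY_LOG_RING };

	for (; bytes >= 4096; bytes /= 2) {
		cap.args[0] = bytes;
		if (ioctl(vm_fd, KVM_ENABLE_CAP, &cap) == 0)
			return bytes;
		if (errno != E2BIG)
			return -errno;
	}

	return -E2BIG;
}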

> > More likely it will just copy the number from kernel and can
> > never ever make it smaller.
> 
> Not sure, but for sure I can probably move KVM_DIRTY_RING_MAX_ENTRIES
> to uapi too.
> 
> Thanks,

Won't help, as you can't ever change it then.
You need it to be runtime discoverable.
Or again, keep it in userspace memory and then you don't
really care what size it is.


> -- 
> Peter Xu


^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH RFC 04/15] KVM: Implement ring-based dirty memory tracking
  2019-12-11 22:57       ` Michael S. Tsirkin
@ 2019-12-12  0:08         ` Paolo Bonzini
  2019-12-12  7:36           ` Michael S. Tsirkin
  2019-12-15 17:33           ` Peter Xu
  0 siblings, 2 replies; 123+ messages in thread
From: Paolo Bonzini @ 2019-12-12  0:08 UTC (permalink / raw)
  To: Michael S. Tsirkin, Peter Xu
  Cc: linux-kernel, kvm, Sean Christopherson, Dr . David Alan Gilbert,
	Vitaly Kuznetsov

On 11/12/19 23:57, Michael S. Tsirkin wrote:
>>> All these seem like arbitrary limitations to me.
>>>
>>> Sizing the ring correctly might prove to be a challenge.
>>>
>>> Thus I think there's value in resizing the rings
>>> without destroying VCPU.
>>
>> Do you have an example on when we could use this feature?
> 
> So e.g. start with a small ring, and if you see stalls too often
> increase it? Otherwise I don't see how one decides
> on ring size.

If you see stalls often, it means the guest is dirtying memory very
fast.  Harvesting the ring puts back pressure on the guest, you may
prefer a smaller ring size to avoid a bufferbloat-like situation.

Note that having a larger ring is better, even though it does incur a
memory cost, because it means the migration thread will be able to reap
the ring buffer asynchronously with no vmexits.

With smaller ring sizes the cost of flushing the TLB when resetting the
rings goes up, but the initial bulk copy phase _will_ have vmexits and
then having to reap more dirty rings becomes more expensive and
introduces some jitter.  So it will require some experimentation to find
an optimal value.

Anyway if in the future we go for resizable rings, KVM_ENABLE_CAP can be
passed the largest desired size and then another ioctl can be introduced
to set the mask for indices.
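
Roughly like this (sketch only; the index_mask field does not exist in
this series):

/* Sketch only: shrink the usable part of a max-sized ring at runtime. */
static int kvm_dirty_ring_set_entries(struct kvm_dirty_ring *ring, u32 entries)
{
	/* must stay a power of two and fit in the allocation */
	if (!entries || (entries & (entries - 1)) || entries > ring->size)
		return -EINVAL;

	ring->index_mask = entries - 1;   /* hypothetical field */
	return 0;
}

The push and reset paths would then wrap with index & ring->index_mask
instead of size - 1.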

>>> Also, power of two just saves a branch here and there,
>>> but wastes lots of memory. Just wrap the index around to
>>> 0 and then users can select any size?
>>
>> Same as above to postpone until we need it?
> 
> It's to save memory, don't we always need to do that?

Does it really save that much memory?  Would it really be so beneficial
to choose 12K entries rather than 8K or 16K in the ring?

>> My understanding of this is that normally we do only want either one
>> of them depending on the major workload and the configuration of the
>> guest.
> 
> And again how does one know which to enable? No one has the
> time to fine-tune gazillion parameters.

Hopefully we can always use just the ring buffer.

> IMHO a huge amount of benchmarking has to happen if you just want to
> set this loose on users as default with these kind of
> limitations. We need to be sure that even though in theory
> it can be very bad, in practice it's actually good.
> If it's auto-tuning then it's a much easier sell to upstream
> even if there's a chance of some regressions.

Auto-tuning is not a silver bullet, it requires just as much
benchmarking to make sure that it doesn't oscillate crazily and that it
actually outperforms a simple fixed size.

>> Yeah, kvm versioning could work too.  Here we can also return zero
>> just like most of the caps (or KVM_DIRTY_LOG_PAGE_OFFSET as in the
>> original patchset, though that doesn't really help either because it's
>> defined in uapi), but I just don't see how it helps...  So I returned
>> a version number just in case we'd like to change the layout some day
>> and we don't want to bother introducing another cap bit for the
>> same feature (like KVM_CAP_MANUAL_DIRTY_LOG_PROTECT and
>> KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2).
> 
> I guess it's up to Paolo but really I don't see the point.
> You can add a version later when it means something ...

Yeah, we can return the maximum size of the ring buffer, too.

>> I'd say it won't be a big issue on locking 1/2M of host mem for a
>> vm...
>> Also note that if dirty ring is enabled, I plan to evaporate the
>> dirty_bitmap in the next post. The old kvm->dirty_bitmap takes
>> $GUEST_MEM/32K*2 mem.  E.g., for 64G guest it's 64G/32K*2=4M.  If with
>> dirty ring of 8 vcpus, that could be 64K*8=0.5M, which could be even
>> less memory used.
> 
> Right - I think Avi described the bitmap in kernel memory as one of
> the design mistakes. Why repeat that with the new design?

Do you have a source for that?  At least the dirty bitmap has to be
accessed from atomic context so it seems unlikely that it can be moved
to user memory.

The dirty ring could use user memory indeed, but it would be much harder
to set up (multiple ioctls for each ring?  what to do if userspace
forgets one? etc.).  The mmap API is easier to use.
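
With the layout in this RFC the userspace side is basically just the
following (sketch, error handling omitted, and assuming the arch
exports KVM_DIRTY_LOG_PAGE_OFFSET to userspace):

#include <stdint.h>
#include <sys/mman.h>
#include <linux/kvm.h>

/* Sketch only: map the per-vcpu gfn array that this RFC exposes at
 * KVM_DIRTY_LOG_PAGE_OFFSET pages into the vcpu fd. */
static struct kvm_dirty_gfn *map_vcpu_dirty_ring(int vcpu_fd, size_t bytes,
						 long page_size)
{
	void *p = mmap(NULL, bytes, PROT_READ | PROT_WRITE, MAP_SHARED,
		       vcpu_fd, (off_t)KVM_DIRTY_LOG_PAGE_OFFSET * page_size);

	return p == MAP_FAILED ? NULL : p;
}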

>>>> +	entry = &ring->dirty_gfns[ring->reset_index & (ring->size - 1)];
>>>> +	/*
>>>> +	 * The ring buffer is shared with userspace, which might mmap
>>>> +	 * it and concurrently modify slot and offset.  Userspace must
>>>> +	 * not be trusted!  READ_ONCE prevents the compiler from changing
>>>> +	 * the values after they've been range-checked (the checks are
>>>> +	 * in kvm_reset_dirty_gfn).
>>>
>>> What it doesn't do is prevent speculative attacks.  That's why things
>>> like copy from user have a speculation barrier.  Instead of worrying
>>> about that, unless it's really critical, I think you'd do well to just
>>> use copy to/from user.

An unconditional speculation barrier (lfence) is also expensive.  We
already have macros to add speculation checks with array_index_nospec at
the right places, for example __kvm_memslots.  We should add an
array_index_nospec to id_to_memslot as well.  I'll send a patch for that.
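
Something along these lines (just a sketch here, not the actual patch):

#include <linux/nospec.h>

/* Sketch only: clamp the untrusted slot id under speculation before
 * using it as an array index. */
static inline struct kvm_memory_slot *
id_to_memslot_nospec(struct kvm_memslots *slots, int id)
{
	int index;

	id = array_index_nospec(id, KVM_MEM_SLOTS_NUM);
	index = slots->id_to_index[id];

	return &slots->memslots[index];
}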

>>> What depends on what here? Looks suspicious ...
>>
>> Hmm, I think maybe it can be removed because the entry pointer
>> reference below should be an ordering constraint already?

entry->xxx depends on ring->reset_index.

>>> what's the story around locking here? Why is it safe
>>> not to take the lock sometimes?
>>
>> kvm_dirty_ring_push() is called with lock==true only when the per-vm
>> ring is used.  For the per-vcpu ring, pushes only happen from that
>> vcpu's own context, so no lock is needed (kvm_dirty_ring_push() is
>> called with lock==false).

FWIW this will be done much more nicely in v2.

>>>> +	page = alloc_page(GFP_KERNEL | __GFP_ZERO);
>>>> +	if (!page) {
>>>> +		r = -ENOMEM;
>>>> +		goto out_err_alloc_page;
>>>> +	}
>>>> +	kvm->vm_run = page_address(page);
>>>
>>> So 4K with just 8 bytes used. Not as bad as 1/2Mbyte for the ring but
>>> still. What is wrong with just a pointer and calling put_user?
>>
>> I want to make it the start point for sharing fields between
>> user/kernel per-vm.  Just like kvm_run for per-vcpu.

This page is actually not needed at all.  Userspace can just map at
KVM_DIRTY_LOG_PAGE_OFFSET, the indices reside there.  You can drop
kvm_vm_run completely.

>>>> +	} else {
>>>> +		/*
>>>> +		 * Put onto per vm ring because no vcpu context.  Kick
>>>> +		 * vcpu0 if ring is full.
>>>
>>> What about tasks on vcpu 0? Do guests realize it's a bad idea to put
>>> critical tasks there, they will be penalized disproportionally?
>>
>> Reasonable question.  So far we can't avoid it because vcpu exit is
>> the event mechanism to say "hey please collect dirty bits".  Maybe
>> someway is better than this, but I'll need to rethink all these
>> over...
> 
> Maybe signal an eventfd, and let userspace worry about deciding what to
> do.

This has to be done synchronously.  But the vm ring should be used very
rarely (it's for things like kvmclock updates that write to guest memory
outside a vCPU), possibly a handful of times in the whole run of the VM.

>>> KVM_DIRTY_RING_MAX_ENTRIES is not part of UAPI.
>>> So how does userspace know what's legal?
>>> Do you expect it to just try?
>>
>> Yep that's what I thought. :)

We should return it for KVM_CHECK_EXTENSION.
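
Then userspace becomes a simple query plus enable, something like this
(sketch; note that in this RFC the extension still returns a version
number rather than the maximum size, so this is the proposed behaviour):

#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Sketch only: ask KVM for the maximum ring size, then enable the cap. */
static int setup_dirty_ring(int vm_fd, uint32_t wanted_bytes)
{
	struct kvm_enable_cap cap = { .cap = KVM_CAP_DIRTY_LOG_RING };
	int max = ioctl(vm_fd, KVM_CHECK_EXTENSION, KVM_CAP_DIRTY_LOG_RING);

	if (max <= 0)
		return -1;		/* not supported */

	cap.args[0] = wanted_bytes < (uint32_t)max ? wanted_bytes : (uint32_t)max;
	return ioctl(vm_fd, KVM_ENABLE_CAP, &cap);
}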

Paolo


^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH RFC 04/15] KVM: Implement ring-based dirty memory tracking
  2019-12-12  0:08         ` Paolo Bonzini
@ 2019-12-12  7:36           ` Michael S. Tsirkin
  2019-12-12  8:12             ` Paolo Bonzini
  2019-12-15 17:33           ` Peter Xu
  1 sibling, 1 reply; 123+ messages in thread
From: Michael S. Tsirkin @ 2019-12-12  7:36 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Peter Xu, linux-kernel, kvm, Sean Christopherson,
	Dr . David Alan Gilbert, Vitaly Kuznetsov

On Thu, Dec 12, 2019 at 01:08:14AM +0100, Paolo Bonzini wrote:
> >> I'd say it won't be a big issue on locking 1/2M of host mem for a
> >> vm...
> >> Also note that if dirty ring is enabled, I plan to evaporate the
> >> dirty_bitmap in the next post. The old kvm->dirty_bitmap takes
> >> $GUEST_MEM/32K*2 mem.  E.g., for 64G guest it's 64G/32K*2=4M.  If with
> >> dirty ring of 8 vcpus, that could be 64K*8=0.5M, which could be even
> >> less memory used.
> > 
> > Right - I think Avi described the bitmap in kernel memory as one of
> > the design mistakes. Why repeat that with the new design?
> 
> Do you have a source for that?

Nope, it was a private talk.

> At least the dirty bitmap has to be
> accessed from atomic context so it seems unlikely that it can be moved
> to user memory.

Why is that? We could surely do it from VCPU context?

> The dirty ring could use user memory indeed, but it would be much harder
> to set up (multiple ioctls for each ring?  what to do if userspace
> forgets one? etc.).

Why multiple ioctls? If you do like virtio packed ring you just need the
base and the size.

-- 
MST


^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH RFC 04/15] KVM: Implement ring-based dirty memory tracking
  2019-12-12  7:36           ` Michael S. Tsirkin
@ 2019-12-12  8:12             ` Paolo Bonzini
  2019-12-12 10:38               ` Michael S. Tsirkin
  0 siblings, 1 reply; 123+ messages in thread
From: Paolo Bonzini @ 2019-12-12  8:12 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Peter Xu, linux-kernel, kvm, Sean Christopherson,
	Dr . David Alan Gilbert, Vitaly Kuznetsov

On 12/12/19 08:36, Michael S. Tsirkin wrote:
> On Thu, Dec 12, 2019 at 01:08:14AM +0100, Paolo Bonzini wrote:
>>>> I'd say it won't be a big issue on locking 1/2M of host mem for a
>>>> vm...
>>>> Also note that if dirty ring is enabled, I plan to evaporate the
>>>> dirty_bitmap in the next post. The old kvm->dirty_bitmap takes
>>>> $GUEST_MEM/32K*2 mem.  E.g., for 64G guest it's 64G/32K*2=4M.  If with
>>>> dirty ring of 8 vcpus, that could be 64K*8=0.5M, which could be even
>>>> less memory used.
>>>
>>> Right - I think Avi described the bitmap in kernel memory as one of
>>> the design mistakes. Why repeat that with the new design?
>>
>> Do you have a source for that?
> 
> Nope, it was a private talk.
> 
>> At least the dirty bitmap has to be
>> accessed from atomic context so it seems unlikely that it can be moved
>> to user memory.
> 
> Why is that? We could surely do it from VCPU context?

Spinlock is taken.

>> The dirty ring could use user memory indeed, but it would be much harder
>> to set up (multiple ioctls for each ring?  what to do if userspace
>> forgets one? etc.).
> 
> Why multiple ioctls? If you do like virtio packed ring you just need the
> base and the size.

You have multiple rings, so multiple invocations of one ioctl.

Paolo


^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH RFC 04/15] KVM: Implement ring-based dirty memory tracking
  2019-12-12  8:12             ` Paolo Bonzini
@ 2019-12-12 10:38               ` Michael S. Tsirkin
  0 siblings, 0 replies; 123+ messages in thread
From: Michael S. Tsirkin @ 2019-12-12 10:38 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Peter Xu, linux-kernel, kvm, Sean Christopherson,
	Dr . David Alan Gilbert, Vitaly Kuznetsov

On Thu, Dec 12, 2019 at 09:12:04AM +0100, Paolo Bonzini wrote:
> On 12/12/19 08:36, Michael S. Tsirkin wrote:
> > On Thu, Dec 12, 2019 at 01:08:14AM +0100, Paolo Bonzini wrote:
> >>>> I'd say it won't be a big issue on locking 1/2M of host mem for a
> >>>> vm...
> >>>> Also note that if dirty ring is enabled, I plan to evaporate the
> >>>> dirty_bitmap in the next post. The old kvm->dirty_bitmap takes
> >>>> $GUEST_MEM/32K*2 mem.  E.g., for 64G guest it's 64G/32K*2=4M.  If with
> >>>> dirty ring of 8 vcpus, that could be 64K*8=0.5M, which could be even
> >>>> less memory used.
> >>>
> >>> Right - I think Avi described the bitmap in kernel memory as one of
> >>> the design mistakes. Why repeat that with the new design?
> >>
> >> Do you have a source for that?
> > 
> > Nope, it was a private talk.
> > 
> >> At least the dirty bitmap has to be
> >> accessed from atomic context so it seems unlikely that it can be moved
> >> to user memory.
> > 
> > Why is that? We could surely do it from VCPU context?
> 
> Spinlock is taken.

Right, that's an implementation detail though isn't it?

> >> The dirty ring could use user memory indeed, but it would be much harder
> >> to set up (multiple ioctls for each ring?  what to do if userspace
> >> forgets one? etc.).
> > 
> > Why multiple ioctls? If you do like virtio packed ring you just need the
> > base and the size.
> 
> You have multiple rings, so multiple invocations of one ioctl.
> 
> Paolo

Oh. So when you said "multiple ioctls for each ring" - I guess you
meant: "multiple ioctls - one for each ring"?

And it's true, but then it allows supporting things like resize in a
clean way without any effort in the kernel. You get a new ring address -
you switch to that one.

-- 
MST


^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH RFC 04/15] KVM: Implement ring-based dirty memory tracking
  2019-12-11 17:24   ` Christophe de Dinechin
@ 2019-12-13 20:23     ` Peter Xu
  2019-12-14  7:57       ` Paolo Bonzini
  2019-12-20 18:19       ` Peter Xu
  0 siblings, 2 replies; 123+ messages in thread
From: Peter Xu @ 2019-12-13 20:23 UTC (permalink / raw)
  To: Christophe de Dinechin
  Cc: linux-kernel, kvm, Sean Christopherson, Paolo Bonzini,
	Dr . David Alan Gilbert, Vitaly Kuznetsov

On Wed, Dec 11, 2019 at 06:24:00PM +0100, Christophe de Dinechin wrote:
> Peter Xu writes:
> 
> > This patch is heavily based on previous work from Lei Cao
> > <lei.cao@stratus.com> and Paolo Bonzini <pbonzini@redhat.com>. [1]
> >
> > KVM currently uses large bitmaps to track dirty memory.  These bitmaps
> > are copied to userspace when userspace queries KVM for its dirty page
> > information.  The use of bitmaps is mostly sufficient for live
> > migration, as large parts of memory are dirtied from one log-dirty
> > pass to another.
> 
> That statement sort of concerns me. If large parts of memory are
> dirtied, won't this cause the rings to fill up quickly enough to cause a
> lot of churn between user-space and kernel?

We have cpu-throttle in QEMU to explicitly provide some "churn" just
to slow the vcpus down.  If dirtying is heavy during migration then we
might actually prefer some churn..  Also, this should not replace the
old dirty_bitmap; it should be a new interface only.  Even if we want
to make it the default we'll definitely still keep the old interface
for the scenarios where the user wants it.

> 
> See a possible suggestion to address that below.
> 
> > However, in a checkpointing system, the number of
> > dirty pages is small and in fact it is often bounded---the VM is
> > paused when it has dirtied a pre-defined number of pages. Traversing a
> > large, sparsely populated bitmap to find set bits is time-consuming,
> > as is copying the bitmap to user-space.
> >
> > A similar issue will be there for live migration when the guest memory
> > is huge while the page dirty procedure is trivial.  In that case for
> > each dirty sync we need to pull the whole dirty bitmap to userspace
> > and analyse every bit even if it's mostly zeros.
> >
> > The preferred data structure for above scenarios is a dense list of
> > guest frame numbers (GFN).  This patch series stores the dirty list in
> > kernel memory that can be memory mapped into userspace to allow speedy
> > harvesting.
> >
> > We defined two new data structures:
> >
> >   struct kvm_dirty_ring;
> >   struct kvm_dirty_ring_indexes;
> >
> > Firstly, kvm_dirty_ring is defined to represent a ring of dirty
> > pages.  When dirty tracking is enabled, we can push dirty gfn onto the
> > ring.
> >
> > Secondly, kvm_dirty_ring_indexes is defined to represent the
> > user/kernel interface of each ring.  Currently it contains two
> > indexes: (1) avail_index represents where we should push our next
> > PFN (written by kernel), while (2) fetch_index represents where the
> > userspace should fetch the next dirty PFN (written by userspace).
> >
> > One complete ring is composed by one kvm_dirty_ring plus its
> > corresponding kvm_dirty_ring_indexes.
> >
> > Currently, we have N+1 rings for each VM of N vcpus:
> >
> >   - for each vcpu, we have 1 per-vcpu dirty ring,
> >   - for each vm, we have 1 per-vm dirty ring
> >
> > Please refer to the documentation update in this patch for more
> > details.
> >
> > Note that this patch implements the core logic of dirty ring buffer.
> > It's still disabled for all archs for now.  Also, we'll address some
> > of the other issues in follow up patches before it's firstly enabled
> > on x86.
> >
> > [1] https://patchwork.kernel.org/patch/10471409/
> >
> > Signed-off-by: Lei Cao <lei.cao@stratus.com>
> > Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> >  Documentation/virt/kvm/api.txt | 109 +++++++++++++++
> >  arch/x86/kvm/Makefile          |   3 +-
> >  include/linux/kvm_dirty_ring.h |  67 +++++++++
> >  include/linux/kvm_host.h       |  33 +++++
> >  include/linux/kvm_types.h      |   1 +
> >  include/uapi/linux/kvm.h       |  36 +++++
> >  virt/kvm/dirty_ring.c          | 156 +++++++++++++++++++++
> >  virt/kvm/kvm_main.c            | 240 ++++++++++++++++++++++++++++++++-
> >  8 files changed, 642 insertions(+), 3 deletions(-)
> >  create mode 100644 include/linux/kvm_dirty_ring.h
> >  create mode 100644 virt/kvm/dirty_ring.c
> >
> > diff --git a/Documentation/virt/kvm/api.txt b/Documentation/virt/kvm/api.txt
> > index 49183add44e7..fa622c9a2eb8 100644
> > --- a/Documentation/virt/kvm/api.txt
> > +++ b/Documentation/virt/kvm/api.txt
> > @@ -231,6 +231,7 @@ Based on their initialization different VMs may have different capabilities.
> >  It is thus encouraged to use the vm ioctl to query for capabilities (available
> >  with KVM_CAP_CHECK_EXTENSION_VM on the vm fd)
> >
> > +
> >  4.5 KVM_GET_VCPU_MMAP_SIZE
> >
> >  Capability: basic
> > @@ -243,6 +244,18 @@ The KVM_RUN ioctl (cf.) communicates with userspace via a shared
> >  memory region.  This ioctl returns the size of that region.  See the
> >  KVM_RUN documentation for details.
> >
> > +Besides the size of the KVM_RUN communication region, other areas of
> > +the VCPU file descriptor can be mmap-ed, including:
> > +
> > +- if KVM_CAP_COALESCED_MMIO is available, a page at
> > +  KVM_COALESCED_MMIO_PAGE_OFFSET * PAGE_SIZE; for historical reasons,
> > +  this page is included in the result of KVM_GET_VCPU_MMAP_SIZE.
> > +  KVM_CAP_COALESCED_MMIO is not documented yet.
> 
> Does the above really belong to this patch?

Probably not..  But sure I can move that out in my next post.

> 
> > +
> > +- if KVM_CAP_DIRTY_LOG_RING is available, a number of pages at
> > +  KVM_DIRTY_LOG_PAGE_OFFSET * PAGE_SIZE.  For more information on
> > +  KVM_CAP_DIRTY_LOG_RING, see section 8.3.
> > +
> >
> >  4.6 KVM_SET_MEMORY_REGION
> >
> > @@ -5358,6 +5371,7 @@ CPU when the exception is taken. If this virtual SError is taken to EL1 using
> >  AArch64, this value will be reported in the ISS field of ESR_ELx.
> >
> >  See KVM_CAP_VCPU_EVENTS for more details.
> > +
> >  8.20 KVM_CAP_HYPERV_SEND_IPI
> >
> >  Architectures: x86
> > @@ -5365,6 +5379,7 @@ Architectures: x86
> >  This capability indicates that KVM supports paravirtualized Hyper-V IPI send
> >  hypercalls:
> >  HvCallSendSyntheticClusterIpi, HvCallSendSyntheticClusterIpiEx.
> > +
> >  8.21 KVM_CAP_HYPERV_DIRECT_TLBFLUSH
> >
> >  Architecture: x86
> > @@ -5378,3 +5393,97 @@ handling by KVM (as some KVM hypercall may be mistakenly treated as TLB
> >  flush hypercalls by Hyper-V) so userspace should disable KVM identification
> >  in CPUID and only exposes Hyper-V identification. In this case, guest
> >  thinks it's running on Hyper-V and only use Hyper-V hypercalls.
> > +
> > +8.22 KVM_CAP_DIRTY_LOG_RING
> > +
> > +Architectures: x86
> > +Parameters: args[0] - size of the dirty log ring
> > +
> > +KVM is capable of tracking dirty memory using ring buffers that are
> > +mmaped into userspace; there is one dirty ring per vcpu and one global
> > +ring per vm.
> > +
> > +One dirty ring has the following two major structures:
> > +
> > +struct kvm_dirty_ring {
> > +	u16 dirty_index;
> > +	u16 reset_index;
> 
> What is the benefit of using u16 for that? That means with 4K pages, you
> can share at most 256M of dirty memory each time? That seems low to me,
> especially since it's sufficient to touch one byte in a page to dirty it.
> 
> Actually, this is not consistent with the definition in the code ;-)
> So I'll assume it's actually u32.

Yes it's u32 now.  Actually I believe at least Paolo would prefer u16
more. :)

I think even u16 would be mostly enough (note that the maximum allowed
value is currently only 64K entries, which is not big).  Again, the
point is that userspace should be collecting the dirty bits as we go,
so the ring shouldn't easily become full.  Even if it does, we should
probably let the vcpu stop for a while as explained above.  It'll only
be inefficient if we set the size to a value that is too small, imho.
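
Note that because the indices are free running and the size is a power
of two, the arithmetic keeps working even across a u32 wrap-around,
e.g. (sketch):

/* Sketch only: free-running indices with a power-of-two ring size. */
static inline u32 ring_used(u32 dirty_index, u32 reset_index)
{
	return dirty_index - reset_index;	/* well defined modulo 2^32 */
}

static inline u32 ring_slot(u32 index, u32 size)
{
	return index & (size - 1);		/* size is a power of two */
}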

> 
> > +	u32 size;
> > +	u32 soft_limit;
> > +	spinlock_t lock;
> > +	struct kvm_dirty_gfn *dirty_gfns;
> > +};
> > +
> > +struct kvm_dirty_ring_indexes {
> > +	__u32 avail_index; /* set by kernel */
> > +	__u32 fetch_index; /* set by userspace */
> > +};
> > +
> > +While for each of the dirty entry it's defined as:
> > +
> > +struct kvm_dirty_gfn {
> > +        __u32 pad;
> > +        __u32 slot; /* as_id | slot_id */
> > +        __u64 offset;
> > +};
> 
> Like other have suggested, I think we might used "pad" to store size
> information to be able to dirty large pages more efficiently.

As explained in the other thread, KVM should only trap dirty bits at
4K granularity, never at huge page granularity.

> 
> > +
> > +The fields in kvm_dirty_ring will be only internal to KVM itself,
> > +while the fields in kvm_dirty_ring_indexes will be exposed to
> > +userspace to be either read or written.
> 
> The sentence above is confusing when contrasted with the "set by kernel"
> comment above.

Maybe "kvm_dirty_ring_indexes will be exposed to both KVM and
userspace" to be clearer?

"set by kernel" means kernel will write to it, then the userspace will
still need to read from it.

> 
> > +
> > +The two indices in the ring buffer are free running counters.
> 
> Nit: this patch uses both "indices" and "indexes".
> Both are correct, but it would be nice to be consistent.

I'll follow the original patch and change everything to "indices".

> 
> > +
> > +In pseudocode, processing the ring buffer looks like this:
> > +
> > +	idx = load-acquire(&ring->fetch_index);
> > +	while (idx != ring->avail_index) {
> > +		struct kvm_dirty_gfn *entry;
> > +		entry = &ring->dirty_gfns[idx & (size - 1)];
> > +		...
> > +
> > +		idx++;
> > +	}
> > +	ring->fetch_index = idx;
> > +
> > +Userspace calls KVM_ENABLE_CAP ioctl right after KVM_CREATE_VM ioctl
> > +to enable this capability for the new guest and set the size of the
> > +rings.  It is only allowed before creating any vCPU, and the size of
> > +the ring must be a power of two.  The larger the ring buffer, the less
> > +likely the ring is full and the VM is forced to exit to userspace. The
> > +optimal size depends on the workload, but it is recommended that it be
> > +at least 64 KiB (4096 entries).
> 
> Is there anything in the design that would preclude resizing the ring
> buffer at a later time? Presumably, you'd want a large ring while you
> are doing things like migrations, but it's mostly useless when you are
> not monitoring memory. So it would be nice to be able to call
> KVM_ENABLE_CAP at any time to adjust the size.

It would be scary to me to have it adjusted at any time...  Even while
we're pushing dirty gfns onto the ring?  We'd need to handle all those
complexities...

IMHO such a feature really does not help that much, so I'd prefer that
we start simple.

> 
> As I read the current code, one of the issue would be the mapping of the
> rings in case of a later extension where we added something beyond the
> rings. But I'm not sure that's a big deal at the moment.

I think we must define something to make sure the number of mapped
ring pages is bounded, so that we can still extend things later.  IMHO
that's why I introduced the maximum allowed ring size; it bounds this.

> 
> > +
> > +After the capability is enabled, userspace can mmap the global ring
> > +buffer (kvm_dirty_gfn[], offset KVM_DIRTY_LOG_PAGE_OFFSET) and the
> > +indexes (kvm_dirty_ring_indexes, offset 0) from the VM file
> > +descriptor.  The per-vcpu dirty ring instead is mmapped when the vcpu
> > +is created, similar to the kvm_run struct (kvm_dirty_ring_indexes
> > +locates inside kvm_run, while kvm_dirty_gfn[] at offset
> > +KVM_DIRTY_LOG_PAGE_OFFSET).
> > +
> > +Just like for dirty page bitmaps, the buffer tracks writes to
> > +all user memory regions for which the KVM_MEM_LOG_DIRTY_PAGES flag was
> > +set in KVM_SET_USER_MEMORY_REGION.  Once a memory region is registered
> > +with the flag set, userspace can start harvesting dirty pages from the
> > +ring buffer.
> > +
> > +To harvest the dirty pages, userspace accesses the mmaped ring buffer
> > +to read the dirty GFNs up to avail_index, and sets the fetch_index
> > +accordingly.  This can be done when the guest is running or paused,
> > +and dirty pages need not be collected all at once.  After processing
> > +one or more entries in the ring buffer, userspace calls the VM ioctl
> > +KVM_RESET_DIRTY_RINGS to notify the kernel that it has updated
> > +fetch_index and to mark those pages clean.  Therefore, the ioctl
> > +must be called *before* reading the content of the dirty pages.
> 
> > +
> > +However, there is a major difference comparing to the
> > +KVM_GET_DIRTY_LOG interface in that when reading the dirty ring from
> > +userspace it's still possible that the kernel has not yet flushed the
> > +hardware dirty buffers into the kernel buffer.  To achieve that, one
> > +needs to kick the vcpu out for a hardware buffer flush (vmexit).
> 
> When you refer to "buffers", are you referring to the cache lines that
> contain the ring buffers, or to something else?
> 
> I'm a bit confused by this sentence. I think that you mean that a VCPU
> may still be running while you read its ring buffer, in which case the
> values in the ring buffer are not necessarily in memory yet, so not
> visible to a different CPU. But I wonder if you can't make this
> requirement to cause a vmexit unnecessary by carefully ordering the
> writes, to make sure that the fetch_index is updated only after the
> corresponding ring entries have been written to memory,
> 
> In other words, as seen by user-space, you would not care that the ring
> entries have not been flushed as long as the fetch_index itself is
> guaranteed to still be behind the not-flushed-yet entries.
> 
> (I would know how to do that on a different architecture, not sure for x86)

Sorry for not being clear, but.. Do you mean the "hardware dirty
buffers"?  For Intel, it could be PML.  Vmexits guarantee that even
PML buffers will be flushed to the dirty rings.  Nothing about cache
lines.

I used "hardware dirty buffer" only because this document is for KVM
in general, while PML is only one way to do such buffering.  I can add
"(for example, PML)" to make it clearer if you like.

> 
> > +
> > +If one of the ring buffers is full, the guest will exit to userspace
> > +with the exit reason set to KVM_EXIT_DIRTY_LOG_FULL, and the
> > +KVM_RUN ioctl will return -EINTR. Once that happens, userspace
> > +should pause all the vcpus, then harvest all the dirty pages and
> > +rearm the dirty traps. It can unpause the guest after that.
> 
> Except for the condition above, why is it necessary to pause other VCPUs
> than the one being harvested?

This is a good question.  Paolo could correct me if I'm wrong.

Firstly I think this should rarely happen if userspace is collecting
the dirty bits from time to time.  If it does happen, we'll need to
call KVM_RESET_DIRTY_RINGS to reset all the rings.  Then the question
actually becomes: would we like to have a per-vcpu
KVM_RESET_DIRTY_RINGS?

The answer is that it could be overkill to do so.  The important thing
here is that, no matter what, KVM_RESET_DIRTY_RINGS needs to change
the page tables and kick all VCPUs for TLB flushes.  If we must do
that, we'd better do it as rarely as possible.  With per-vcpu ring
resets, we'd do N*N vcpu kicks in the bad case (N kicks per vcpu ring
reset, and we've probably got N vcpus).  Whereas if we stick to the
simple per-vm reset, it kicks all vcpus for the TLB flush anyway, so
it may be easier to collect all the rings together and reset them all
at once.

> 
> 
> > diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile
> > index b19ef421084d..0acee817adfb 100644
> > --- a/arch/x86/kvm/Makefile
> > +++ b/arch/x86/kvm/Makefile
> > @@ -5,7 +5,8 @@ ccflags-y += -Iarch/x86/kvm
> >  KVM := ../../../virt/kvm
> >
> >  kvm-y			+= $(KVM)/kvm_main.o $(KVM)/coalesced_mmio.o \
> > -				$(KVM)/eventfd.o $(KVM)/irqchip.o $(KVM)/vfio.o
> > +				$(KVM)/eventfd.o $(KVM)/irqchip.o $(KVM)/vfio.o \
> > +				$(KVM)/dirty_ring.o
> >  kvm-$(CONFIG_KVM_ASYNC_PF)	+= $(KVM)/async_pf.o
> >
> >  kvm-y			+= x86.o emulate.o i8259.o irq.o lapic.o \
> > diff --git a/include/linux/kvm_dirty_ring.h b/include/linux/kvm_dirty_ring.h
> > new file mode 100644
> > index 000000000000..8335635b7ff7
> > --- /dev/null
> > +++ b/include/linux/kvm_dirty_ring.h
> > @@ -0,0 +1,67 @@
> > +#ifndef KVM_DIRTY_RING_H
> > +#define KVM_DIRTY_RING_H
> > +
> > +/*
> > + * struct kvm_dirty_ring is defined in include/uapi/linux/kvm.h.
> > + *
> > + * dirty_ring:  shared with userspace via mmap. It is the compact list
> > + *              that holds the dirty pages.
> > + * dirty_index: free running counter that points to the next slot in
> > + *              dirty_ring->dirty_gfns  where a new dirty page should go.
> > + * reset_index: free running counter that points to the next dirty page
> > + *              in dirty_ring->dirty_gfns for which dirty trap needs to
> > + *              be reenabled
> > + * size:        size of the compact list, dirty_ring->dirty_gfns
> > + * soft_limit:  when the number of dirty pages in the list reaches this
> > + *              limit, vcpu that owns this ring should exit to userspace
> > + *              to allow userspace to harvest all the dirty pages
> > + * lock:        protects dirty_ring, only in use if this is the global
> > + *              ring
> 
> If that's not used for vcpu rings, maybe move it out of kvm_dirty_ring?

Yeah we can.

> 
> > + *
> > + * The number of dirty pages in the ring is calculated by,
> > + * dirty_index - reset_index
> 
> Nit: the code calls it "used" (in kvm_dirty_ring_used). Maybe find an
> unambiguous terminology. What about "posted", as in
> 
> The number of posted dirty pages, i.e. the number of dirty pages in the
> ring, is calculated as dirty_index - reset_index by function
> kvm_dirty_ring_posted
> 
> (Replace "posted" by any adjective of your liking)

Sure.

(Or maybe I'll just try to remove these lines to avoid introducing any
 terminology as long as it's not very necessary... and after all
 similar things will be mentioned in the documents, and the code itself)

> 
> > + *
> > + * kernel increments dirty_ring->indices.avail_index after dirty index
> > + * is incremented. When userspace harvests the dirty pages, it increments
> > + * dirty_ring->indices.fetch_index up to dirty_ring->indices.avail_index.
> > + * When kernel reenables dirty traps for the dirty pages, it increments
> > + * reset_index up to dirty_ring->indices.fetch_index.
> 
> Userspace should not be trusted to be doing this, see below.
> 
> 
> > + *
> > + */
> > +struct kvm_dirty_ring {
> > +	u32 dirty_index;
> > +	u32 reset_index;
> > +	u32 size;
> > +	u32 soft_limit;
> > +	spinlock_t lock;
> > +	struct kvm_dirty_gfn *dirty_gfns;
> > +};
> > +
> > +u32 kvm_dirty_ring_get_rsvd_entries(void);
> > +int kvm_dirty_ring_alloc(struct kvm *kvm, struct kvm_dirty_ring *ring);
> > +
> > +/*
> > + * called with kvm->slots_lock held, returns the number of
> > + * processed pages.
> > + */
> > +int kvm_dirty_ring_reset(struct kvm *kvm,
> > +			 struct kvm_dirty_ring *ring,
> > +			 struct kvm_dirty_ring_indexes *indexes);
> > +
> > +/*
> > + * returns 0: successfully pushed
> > + *         1: successfully pushed, soft limit reached,
> > + *            vcpu should exit to userspace
> > + *         -EBUSY: unable to push, dirty ring full.
> > + */
> > +int kvm_dirty_ring_push(struct kvm_dirty_ring *ring,
> > +			struct kvm_dirty_ring_indexes *indexes,
> > +			u32 slot, u64 offset, bool lock);
> > +
> > +/* for use in vm_operations_struct */
> > +struct page *kvm_dirty_ring_get_page(struct kvm_dirty_ring *ring, u32 i);
> 
> Not very clear what 'i' means, seems to be a page offset based on call sites?

I'll rename it to "offset".

> 
> > +
> > +void kvm_dirty_ring_free(struct kvm_dirty_ring *ring);
> > +bool kvm_dirty_ring_full(struct kvm_dirty_ring *ring);
> > +
> > +#endif
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index 498a39462ac1..7b747bc9ff3e 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -34,6 +34,7 @@
> >  #include <linux/kvm_types.h>
> >
> >  #include <asm/kvm_host.h>
> > +#include <linux/kvm_dirty_ring.h>
> >
> >  #ifndef KVM_MAX_VCPU_ID
> >  #define KVM_MAX_VCPU_ID KVM_MAX_VCPUS
> > @@ -146,6 +147,7 @@ static inline bool is_error_page(struct page *page)
> >  #define KVM_REQ_MMU_RELOAD        (1 | KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
> >  #define KVM_REQ_PENDING_TIMER     2
> >  #define KVM_REQ_UNHALT            3
> > +#define KVM_REQ_DIRTY_RING_FULL   4
> >  #define KVM_REQUEST_ARCH_BASE     8
> >
> >  #define KVM_ARCH_REQ_FLAGS(nr, flags) ({ \
> > @@ -321,6 +323,7 @@ struct kvm_vcpu {
> >  	bool ready;
> >  	struct kvm_vcpu_arch arch;
> >  	struct dentry *debugfs_dentry;
> > +	struct kvm_dirty_ring dirty_ring;
> >  };
> >
> >  static inline int kvm_vcpu_exiting_guest_mode(struct kvm_vcpu *vcpu)
> > @@ -501,6 +504,10 @@ struct kvm {
> >  	struct srcu_struct srcu;
> >  	struct srcu_struct irq_srcu;
> >  	pid_t userspace_pid;
> > +	/* Data structure to be exported by mmap(kvm->fd, 0) */
> > +	struct kvm_vm_run *vm_run;
> > +	u32 dirty_ring_size;
> > +	struct kvm_dirty_ring vm_dirty_ring;
> 
> If you remove the lock from struct kvm_dirty_ring, you could just put it there.

Ok.

> 
> >  };
> >
> >  #define kvm_err(fmt, ...) \
> > @@ -832,6 +839,8 @@ void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm *kvm,
> >  					gfn_t gfn_offset,
> >  					unsigned long mask);
> >
> > +void kvm_reset_dirty_gfn(struct kvm *kvm, u32 slot, u64 offset, u64 mask);
> > +
> >  int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm,
> >  				struct kvm_dirty_log *log);
> >  int kvm_vm_ioctl_clear_dirty_log(struct kvm *kvm,
> > @@ -1411,4 +1420,28 @@ int kvm_vm_create_worker_thread(struct kvm *kvm, kvm_vm_thread_fn_t thread_fn,
> >  				uintptr_t data, const char *name,
> >  				struct task_struct **thread_ptr);
> >
> > +/*
> > + * This defines how many reserved entries we want to keep before we
> > + * kick the vcpu to the userspace to avoid dirty ring full.  This
> > + * value can be tuned to higher if e.g. PML is enabled on the host.
> > + */
> > +#define  KVM_DIRTY_RING_RSVD_ENTRIES  64
> > +
> > +/* Max number of entries allowed for each kvm dirty ring */
> > +#define  KVM_DIRTY_RING_MAX_ENTRIES  65536
> > +
> > +/*
> > + * Arch needs to define these macro after implementing the dirty ring
> > + * feature.  KVM_DIRTY_LOG_PAGE_OFFSET should be defined as the
> > + * starting page offset of the dirty ring structures, while
> > + * KVM_DIRTY_RING_VERSION should be defined as >=1.  By default, this
> > + * feature is off on all archs.
> > + */
> > +#ifndef KVM_DIRTY_LOG_PAGE_OFFSET
> > +#define KVM_DIRTY_LOG_PAGE_OFFSET 0
> > +#endif
> > +#ifndef KVM_DIRTY_RING_VERSION
> > +#define KVM_DIRTY_RING_VERSION 0
> > +#endif
> > +
> >  #endif
> > diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h
> > index 1c88e69db3d9..d9d03eea145a 100644
> > --- a/include/linux/kvm_types.h
> > +++ b/include/linux/kvm_types.h
> > @@ -11,6 +11,7 @@ struct kvm_irq_routing_table;
> >  struct kvm_memory_slot;
> >  struct kvm_one_reg;
> >  struct kvm_run;
> > +struct kvm_vm_run;
> >  struct kvm_userspace_memory_region;
> >  struct kvm_vcpu;
> >  struct kvm_vcpu_init;
> > diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> > index e6f17c8e2dba..0b88d76d6215 100644
> > --- a/include/uapi/linux/kvm.h
> > +++ b/include/uapi/linux/kvm.h
> > @@ -236,6 +236,7 @@ struct kvm_hyperv_exit {
> >  #define KVM_EXIT_IOAPIC_EOI       26
> >  #define KVM_EXIT_HYPERV           27
> >  #define KVM_EXIT_ARM_NISV         28
> > +#define KVM_EXIT_DIRTY_RING_FULL  29
> >
> >  /* For KVM_EXIT_INTERNAL_ERROR */
> >  /* Emulate instruction failed. */
> > @@ -247,6 +248,11 @@ struct kvm_hyperv_exit {
> >  /* Encounter unexpected vm-exit reason */
> >  #define KVM_INTERNAL_ERROR_UNEXPECTED_EXIT_REASON	4
> >
> > +struct kvm_dirty_ring_indexes {
> > +	__u32 avail_index; /* set by kernel */
> > +	__u32 fetch_index; /* set by userspace */
> > +};
> > +
> >  /* for KVM_RUN, returned by mmap(vcpu_fd, offset=0) */
> >  struct kvm_run {
> >  	/* in */
> > @@ -421,6 +427,13 @@ struct kvm_run {
> >  		struct kvm_sync_regs regs;
> >  		char padding[SYNC_REGS_SIZE_BYTES];
> >  	} s;
> > +
> > +	struct kvm_dirty_ring_indexes vcpu_ring_indexes;
> > +};
> > +
> > +/* Returned by mmap(kvm->fd, offset=0) */
> > +struct kvm_vm_run {
> > +	struct kvm_dirty_ring_indexes vm_ring_indexes;
> >  };
> >
> >  /* for KVM_REGISTER_COALESCED_MMIO / KVM_UNREGISTER_COALESCED_MMIO */
> > @@ -1009,6 +1022,7 @@ struct kvm_ppc_resize_hpt {
> >  #define KVM_CAP_PPC_GUEST_DEBUG_SSTEP 176
> >  #define KVM_CAP_ARM_NISV_TO_USER 177
> >  #define KVM_CAP_ARM_INJECT_EXT_DABT 178
> > +#define KVM_CAP_DIRTY_LOG_RING 179
> >
> >  #ifdef KVM_CAP_IRQ_ROUTING
> >
> > @@ -1472,6 +1486,9 @@ struct kvm_enc_region {
> >  /* Available with KVM_CAP_ARM_SVE */
> >  #define KVM_ARM_VCPU_FINALIZE	  _IOW(KVMIO,  0xc2, int)
> >
> > +/* Available with KVM_CAP_DIRTY_LOG_RING */
> > +#define KVM_RESET_DIRTY_RINGS     _IO(KVMIO, 0xc3)
> > +
> >  /* Secure Encrypted Virtualization command */
> >  enum sev_cmd_id {
> >  	/* Guest initialization commands */
> > @@ -1622,4 +1639,23 @@ struct kvm_hyperv_eventfd {
> >  #define KVM_HYPERV_CONN_ID_MASK		0x00ffffff
> >  #define KVM_HYPERV_EVENTFD_DEASSIGN	(1 << 0)
> >
> > +/*
> > + * The following are the requirements for supporting dirty log ring
> > + * (by enabling KVM_DIRTY_LOG_PAGE_OFFSET).
> > + *
> > + * 1. Memory accesses by KVM should call kvm_vcpu_write_* instead
> > + *    of kvm_write_* so that the global dirty ring is not filled up
> > + *    too quickly.
> > + * 2. kvm_arch_mmu_enable_log_dirty_pt_masked should be defined for
> > + *    enabling dirty logging.
> > + * 3. There should not be a separate step to synchronize hardware
> > + *    dirty bitmap with KVM's.
> > + */
> > +
> > +struct kvm_dirty_gfn {
> > +	__u32 pad;
> > +	__u32 slot;
> > +	__u64 offset;
> > +};
> > +
> >  #endif /* __LINUX_KVM_H */
> > diff --git a/virt/kvm/dirty_ring.c b/virt/kvm/dirty_ring.c
> > new file mode 100644
> > index 000000000000..9264891f3c32
> > --- /dev/null
> > +++ b/virt/kvm/dirty_ring.c
> > @@ -0,0 +1,156 @@
> > +#include <linux/kvm_host.h>
> > +#include <linux/kvm.h>
> > +#include <linux/vmalloc.h>
> > +#include <linux/kvm_dirty_ring.h>
> > +
> > +u32 kvm_dirty_ring_get_rsvd_entries(void)
> > +{
> > +	return KVM_DIRTY_RING_RSVD_ENTRIES + kvm_cpu_dirty_log_size();
> > +}
> > +
> > +int kvm_dirty_ring_alloc(struct kvm *kvm, struct kvm_dirty_ring *ring)
> > +{
> > +	u32 size = kvm->dirty_ring_size;
> > +
> > +	ring->dirty_gfns = vmalloc(size);
> > +	if (!ring->dirty_gfns)
> > +		return -ENOMEM;
> > +	memset(ring->dirty_gfns, 0, size);
> > +
> > +	ring->size = size / sizeof(struct kvm_dirty_gfn);
> > +	ring->soft_limit =
> > +	    (kvm->dirty_ring_size / sizeof(struct kvm_dirty_gfn)) -
> > +	    kvm_dirty_ring_get_rsvd_entries();
> 
> Minor, but what about
> 
>        ring->soft_limit = ring->size - kvm_dirty_ring_get_rsvd_entries();

Yeah it's better.

> 
> 
> > +	ring->dirty_index = 0;
> > +	ring->reset_index = 0;
> > +	spin_lock_init(&ring->lock);
> > +
> > +	return 0;
> > +}
> > +
> > +int kvm_dirty_ring_reset(struct kvm *kvm,
> > +			 struct kvm_dirty_ring *ring,
> > +			 struct kvm_dirty_ring_indexes *indexes)
> > +{
> > +	u32 cur_slot, next_slot;
> > +	u64 cur_offset, next_offset;
> > +	unsigned long mask;
> > +	u32 fetch;
> > +	int count = 0;
> > +	struct kvm_dirty_gfn *entry;
> > +
> > +	fetch = READ_ONCE(indexes->fetch_index);
> 
> If I understand correctly, if a malicious user-space writes
> ring->reset_index + 1 into fetch_index, the loop below will execute 4
> billion times.
> 
> 
> > +	if (fetch == ring->reset_index)
> > +		return 0;
> 
> To protect against scenario above, I would have something like:
> 
> 	if (fetch - ring->reset_index >= ring->size)
> 		return -EINVAL;

Good point...  Actually I've got this in my latest branch already, but
still thanks for noticing this!
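
(For the record, the check also behaves sanely across index wraparound:
 both indices are u32, so the subtraction is modulo 2^32 and the loop is
 bounded by ring->size iterations no matter what userspace writes.  For
 example, reset_index == 0xffffff00 with fetch == 0x10 gives a distance
 of 0x110, which is accepted on any ring of at least 512 entries, while
 a bogus fetch of reset_index - 1 gives 0xffffffff and is rejected.)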

> 
> > +
> > +	entry = &ring->dirty_gfns[ring->reset_index & (ring->size - 1)];
> > +	/*
> > +	 * The ring buffer is shared with userspace, which might mmap
> > +	 * it and concurrently modify slot and offset.  Userspace must
> > +	 * not be trusted!  READ_ONCE prevents the compiler from changing
> > +	 * the values after they've been range-checked (the checks are
> > +	 * in kvm_reset_dirty_gfn).
> > +	 */
> > +	smp_read_barrier_depends();
> > +	cur_slot = READ_ONCE(entry->slot);
> > +	cur_offset = READ_ONCE(entry->offset);
> > +	mask = 1;
> > +	count++;
> > +	ring->reset_index++;

[1]

> > +	while (ring->reset_index != fetch) {
> > +		entry = &ring->dirty_gfns[ring->reset_index & (ring->size - 1)];
> > +		smp_read_barrier_depends();
> > +		next_slot = READ_ONCE(entry->slot);
> > +		next_offset = READ_ONCE(entry->offset);
> > +		ring->reset_index++;
> > +		count++;
> > +		/*
> > +		 * Try to coalesce the reset operations when the guest is
> > +		 * scanning pages in the same slot.
> > +		 */
> > +		if (next_slot == cur_slot) {
> > +			int delta = next_offset - cur_offset;
> 
> Since you diff two u64, shouldn't that be an i64 rather than int?

I found there's no i64, so I'm using "long long".

> 
> > +
> > +			if (delta >= 0 && delta < BITS_PER_LONG) {
> > +				mask |= 1ull << delta;
> > +				continue;
> > +			}
> > +
> > +			/* Backwards visit, careful about overflows!  */
> > +			if (delta > -BITS_PER_LONG && delta < 0 &&
> > +			    (mask << -delta >> -delta) == mask) {
> > +				cur_offset = next_offset;
> > +				mask = (mask << -delta) | 1;
> > +				continue;
> > +			}
> > +		}
> > +		kvm_reset_dirty_gfn(kvm, cur_slot, cur_offset, mask);
> > +		cur_slot = next_slot;
> > +		cur_offset = next_offset;
> > +		mask = 1;
> > +	}
> > +	kvm_reset_dirty_gfn(kvm, cur_slot, cur_offset, mask);
> 
> So if you did not coalesce the last one, you call kvm_reset_dirty_gfn
> twice? Something smells weird about this loop ;-) I have a gut feeling
> that it could be done in a single while loop combined with the entry
> test, but I may be wrong.

It should be easy to save a few lines at [1] by introducing a boolean
"first_round".  I don't see an easy way to avoid the final
kvm_reset_dirty_gfn() call, though...
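
For illustration, a rough sketch of that shape (untested, with the
coalescing checks elided as a comment; the trailing flush is still
needed for the last batch):

	bool first_round = true;

	while (ring->reset_index != fetch) {
		entry = &ring->dirty_gfns[ring->reset_index & (ring->size - 1)];
		smp_read_barrier_depends();
		next_slot = READ_ONCE(entry->slot);
		next_offset = READ_ONCE(entry->offset);
		ring->reset_index++;
		count++;

		if (!first_round && next_slot == cur_slot) {
			/*
			 * ... the same coalescing checks as above, with
			 * "continue" when next_offset can be folded into
			 * mask ...
			 */
		}

		/* Start a new batch, flushing the previous one if any */
		if (!first_round)
			kvm_reset_dirty_gfn(kvm, cur_slot, cur_offset, mask);
		first_round = false;
		cur_slot = next_slot;
		cur_offset = next_offset;
		mask = 1;
	}

	/* Flush the final (or only) batch */
	if (!first_round)
		kvm_reset_dirty_gfn(kvm, cur_slot, cur_offset, mask);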

> 
> 
> > +
> > +	return count;
> > +}
> > +
> > +static inline u32 kvm_dirty_ring_used(struct kvm_dirty_ring *ring)
> > +{
> > +	return ring->dirty_index - ring->reset_index;
> > +}
> > +
> > +bool kvm_dirty_ring_full(struct kvm_dirty_ring *ring)
> > +{
> > +	return kvm_dirty_ring_used(ring) >= ring->size;
> > +}
> > +
> > +/*
> > + * Returns:
> > + *   >0 if we should kick the vcpu out,
> > + *   =0 if the gfn pushed successfully, or,
> > + *   <0 if error (e.g. ring full)
> > + */
> > +int kvm_dirty_ring_push(struct kvm_dirty_ring *ring,
> > +			struct kvm_dirty_ring_indexes *indexes,
> > +			u32 slot, u64 offset, bool lock)
> 
> Obviously, if you go with the suggestion to have a "lock" only in struct
> kvm, then you'd have to pass a lock ptr instead of a bool.

Paolo got a better solution on that.  That "lock" will be dropped.

> 
> > +{
> > +	int ret;
> > +	struct kvm_dirty_gfn *entry;
> > +
> > +	if (lock)
> > +		spin_lock(&ring->lock);
> > +
> > +	if (kvm_dirty_ring_full(ring)) {
> > +		ret = -EBUSY;
> > +		goto out;
> > +	}
> > +
> > +	entry = &ring->dirty_gfns[ring->dirty_index & (ring->size - 1)];
> > +	entry->slot = slot;
> > +	entry->offset = offset;
> > +	smp_wmb();
> > +	ring->dirty_index++;
> > +	WRITE_ONCE(indexes->avail_index, ring->dirty_index);
> 
> Following up on comment about having to vmexit other VCPUs above:
> If you have a write barrier for the entry, and then a write once for the
> index, isn't that sufficient to ensure that another CPU will pick up the
> right values in the right order?

I think so.  I've replied above on the RESET issue.
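
For reference, a sketch of the intended pairing with a userspace
consumer (illustrative only -- the names on the userspace side are made
up, and userspace would use the equivalent barriers for its platform):

	/* kernel producer, as in the patch */
	entry->slot = slot;
	entry->offset = offset;
	smp_wmb();                                /* publish entry before index */
	ring->dirty_index++;
	WRITE_ONCE(indexes->avail_index, ring->dirty_index);

	/* userspace consumer, illustrative */
	avail = READ_ONCE(indexes->avail_index);
	smp_rmb();                                /* read index before entries */
	while (fetch != avail) {
		struct kvm_dirty_gfn *gfn = &dirty_gfns[fetch & (size - 1)];

		collect_dirty_gfn(gfn->slot, gfn->offset);
		fetch++;
	}
	WRITE_ONCE(indexes->fetch_index, fetch);  /* hand back for RESET */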

> 
> 
> > +	ret = kvm_dirty_ring_used(ring) >= ring->soft_limit;
> > +	pr_info("%s: slot %u offset %llu used %u\n",
> > +		__func__, slot, offset, kvm_dirty_ring_used(ring));
> > +
> > +out:
> > +	if (lock)
> > +		spin_unlock(&ring->lock);
> > +
> > +	return ret;
> > +}
> > +
> > +struct page *kvm_dirty_ring_get_page(struct kvm_dirty_ring *ring, u32 i)
> 
> Still don't like 'i' :-)
> 
> 
> (Stopped my review here for lack of time, decided to share what I had so far)

Thanks for your comments!

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH RFC 04/15] KVM: Implement ring-based dirty memory tracking
  2019-12-13 20:23     ` Peter Xu
@ 2019-12-14  7:57       ` Paolo Bonzini
  2019-12-14 16:26         ` Peter Xu
  2019-12-17 12:16         ` Christophe de Dinechin
  2019-12-20 18:19       ` Peter Xu
  1 sibling, 2 replies; 123+ messages in thread
From: Paolo Bonzini @ 2019-12-14  7:57 UTC (permalink / raw)
  To: Peter Xu, Christophe de Dinechin
  Cc: linux-kernel, kvm, Sean Christopherson, Dr . David Alan Gilbert,
	Vitaly Kuznetsov

On 13/12/19 21:23, Peter Xu wrote:
>> What is the benefit of using u16 for that? That means with 4K pages, you
>> can share at most 256M of dirty memory each time? That seems low to me,
>> especially since it's sufficient to touch one byte in a page to dirty it.
>>
>> Actually, this is not consistent with the definition in the code ;-)
>> So I'll assume it's actually u32.
> Yes it's u32 now.  Actually I believe at least Paolo would prefer u16
> more. :)

It has to be u16, because it overlaps the padding of the first entry.

Paolo

> I think even u16 would be mostly enough (if you see, the maximum
> allowed value currently is 64K entries only, not a big one).  Again,
> the thing is that the userspace should be collecting the dirty bits,
> so the ring shouldn't reach full easily.  Even if it does, we should
> probably let it stop for a while as explained above.  It'll be
> inefficient only if we set it to a too-small value, imho.
> 


^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH RFC 04/15] KVM: Implement ring-based dirty memory tracking
  2019-12-14  7:57       ` Paolo Bonzini
@ 2019-12-14 16:26         ` Peter Xu
  2019-12-16  9:29           ` Paolo Bonzini
  2019-12-17 12:16         ` Christophe de Dinechin
  1 sibling, 1 reply; 123+ messages in thread
From: Peter Xu @ 2019-12-14 16:26 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Christophe de Dinechin, linux-kernel, kvm, Sean Christopherson,
	Dr . David Alan Gilbert, Vitaly Kuznetsov

On Sat, Dec 14, 2019 at 08:57:26AM +0100, Paolo Bonzini wrote:
> On 13/12/19 21:23, Peter Xu wrote:
> >> What is the benefit of using u16 for that? That means with 4K pages, you
> >> can share at most 256M of dirty memory each time? That seems low to me,
> >> especially since it's sufficient to touch one byte in a page to dirty it.
> >>
> >> Actually, this is not consistent with the definition in the code ;-)
> >> So I'll assume it's actually u32.
> > Yes it's u32 now.  Actually I believe at least Paolo would prefer u16
> > more. :)
> 
> It has to be u16, because it overlaps the padding of the first entry.

Hmm, could you explain?

Note that here what Christophe commented is on dirty_index,
reset_index of "struct kvm_dirty_ring", so imho it could really be
anything we want as long as it can store a u32 (which is the size of
the elements in kvm_dirty_ring_indexes).

If you were instead talking about the previous union definition of
"struct kvm_dirty_gfns" rather than "struct kvm_dirty_ring", iiuc I've
moved those indices out of it and defined kvm_dirty_ring_indexes which
we expose via kvm_run, so we don't have that limitation as well any
more?

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH RFC 04/15] KVM: Implement ring-based dirty memory tracking
  2019-12-10 17:09                     ` Paolo Bonzini
@ 2019-12-15 17:21                       ` Peter Xu
  2019-12-16 10:08                         ` Paolo Bonzini
  0 siblings, 1 reply; 123+ messages in thread
From: Peter Xu @ 2019-12-15 17:21 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sean Christopherson, linux-kernel, kvm, Dr . David Alan Gilbert,
	Vitaly Kuznetsov

On Tue, Dec 10, 2019 at 06:09:02PM +0100, Paolo Bonzini wrote:
> On 10/12/19 16:52, Peter Xu wrote:
> > On Tue, Dec 10, 2019 at 11:07:31AM +0100, Paolo Bonzini wrote:
> >>> I'm thinking whether I can start
> >>> to use this information in the next post on solving an issue I
> >>> encountered with the waitqueue.
> >>>
> >>> Current waitqueue is still problematic in that it could wait even with
> >>> the mmu lock held when with vcpu context.
> >>
> >> I think the idea of the soft limit is that the waiting just cannot
> >> happen.  That is, the number of dirtied pages _outside_ the guest (guest
> >> accesses are taken care of by PML, and are subtracted from the soft
> >> limit) cannot exceed hard_limit - (soft_limit + pml_size).
> > 
> > So the question go backs to, whether this is guaranteed somehow?  Or
> > do you prefer us to keep the warn_on_once until it triggers then we
> > can analyze (which I doubt..)?
> 
> Yes, I would like to keep the WARN_ON_ONCE just because you never know.
> 
> Of course it would be much better to audit the calls to kvm_write_guest
> and figure out how many could trigger (e.g. two from the operands of an
> emulated instruction, 5 from a nested EPT walk, 1 from a page walk, etc.).

I would say we'd better either audit all the caller sites to prove the
ring can never overflow, or we'll need the waitqueue at least.  The
problem is that if we release a kernel with only the WARN_ON_ONCE and
later find that it can be triggered and a ring-full condition can't be
avoided, then the interface and design are broken, and it could even be
too late to fix after the interface is published.

(Actually I was not certain about the previous clear-dirty-log interface
 either, where we introduced a new capability for it.  I'm not sure
 whether that could have been avoided, because the initial version was
 not working at all and we fixed it up without changing the interface.
 However, for this one, if we later prove the design wrong, then IMHO we
 must introduce another capability for it, and the interface is prone to
 change too.)

So, with the hope that we could avoid the waitqueue, I checked all the
callers of mark_page_dirty_in_slot().  Since this initial work is only
for x86, I didn't look more into other archs, assuming that can be
done later when it is implemented for other archs (and this will for
sure also cover the common code):

    mark_page_dirty_in_slot calls, per-vm (x86 only)
        __kvm_write_guest_page
            kvm_write_guest_page
                init_rmode_tss
                    vmx_set_tss_addr
                        kvm_vm_ioctl_set_tss_addr [*]
                init_rmode_identity_map
                    vmx_create_vcpu [*]
                vmx_write_pml_buffer
                    kvm_arch_write_log_dirty [&]
                kvm_write_guest
                    kvm_hv_setup_tsc_page
                        kvm_guest_time_update [&]
                    nested_flush_cached_shadow_vmcs12 [&]
                    kvm_write_wall_clock [&]
                    kvm_pv_clock_pairing [&]
                    kvmgt_rw_gpa [?]
                    kvm_write_guest_offset_cached
                        kvm_steal_time_set_preempted [&]
                        kvm_write_guest_cached
                            pv_eoi_put_user [&]
                            kvm_lapic_sync_to_vapic [&]
                            kvm_setup_pvclock_page [&]
                            record_steal_time [&]
                            apf_put_user [&]
                kvm_clear_guest_page
                    init_rmode_tss [*] (see above)
                    init_rmode_identity_map [*] (see above)
                    kvm_clear_guest
                        synic_set_msr
                            kvm_hv_set_msr [&]
        kvm_write_guest_offset_cached [&] (see above)
        mark_page_dirty
            kvm_hv_set_msr_pw [&]

We should only need to look at the leaves of the traces because
they're where the dirty request starts.  I'm marking all the leaves
with the criteria below so it's easier to focus:

Cases with [*]: should not matter much
           [&]: actually with a per-vcpu context in the upper layer
           [?]: uncertain...

I'm a bit amazed after taking these notes, since I found that, besides
those that can probably be ignored (marked as [*]), most of the
remaining per-vm dirty requests actually come with a vcpu context.

Although all the [&] cases should now be fine without changing anything
because we have kvm_get_running_vcpu(), I tend to add another patch in
the next post to explicitly convert all the [&] cases to pass a vcpu
pointer instead of a kvm pointer, to make this clear (if no one
disagrees), and then verify that against kvm_get_running_vcpu().

So the only uncertainty now is kvmgt_rw_gpa() which is marked as [?].
Could this happen frequently?  I would guess the answer is we don't
know (which means it can).

> 
> > One thing to mention is that for with-vcpu cases, we probably can even
> > stop KVM_RUN immediately as long as either the per-vm or per-vcpu ring
> > reaches the softlimit, then for vcpu case it should be easier to
> > guarantee that.  What I want to know is the rest of cases like ioctls
> > or even something not from the userspace (which I think I should read
> > more later..).
> 
> Which ioctls?  Most ioctls shouldn't dirty memory at all.

init_rmode_tss or init_rmode_identity_map.  But I've marked them as
unimportant because they should only happen once at boot.

> 
> >>> And if we see if the mark_page_dirty_in_slot() is not with a vcpu
> >>> context (e.g. kvm_mmu_page_fault) but with an ioctl context (those
> >>> cases we'll use per-vm dirty ring) then it's probably fine.
> >>>
> >>> My planned solution:
> >>>
> >>> - When kvm_get_running_vcpu() != NULL, we postpone the waitqueue waits
> >>>   until we finished handling this page fault, probably in somewhere
> >>>   around vcpu_enter_guest, so that we can do wait_event() after the
> >>>   mmu lock released
> >>
> >> I think this can cause a race:
> >>
> >> 	vCPU 1			vCPU 2		host
> >> 	---------------------------------------------------------------
> >> 	mark page dirty
> >> 				write to page
> >> 						treat page as not dirty
> >> 	add page to ring
> >>
> >> where vCPU 2 skips the clean-page slow path entirely.
> > 
> > If we're still with the rule in userspace that we first do RESET then
> > collect and send the pages (just like what we've discussed before),
> > then IMHO it's fine to have vcpu2 to skip the slow path?  Because
> > RESET happens at "treat page as not dirty", then if we are sure that
> > we only collect and send pages after that point, then the latest
> > "write to page" data from vcpu2 won't be lost even if vcpu2 is not
> > blocked by vcpu1's ring full?
> 
> Good point, the race would become
> 
>  	vCPU 1			vCPU 2		host
>  	---------------------------------------------------------------
>  	mark page dirty
>  				write to page
> 						reset rings
> 						  wait for mmu lock
>  	add page to ring
> 	release mmu lock
> 						  ...do reset...
> 						  release mmu lock
> 						page is now dirty

Hmm, the page will be dirty after the reset, but is that an issue?

Or, could you help me to identify what I've missed?

> 
> > Maybe we can also consider to let mark_page_dirty_in_slot() return a
> > value, then the upper layer could have a chance to skip the spte
> > update if mark_page_dirty_in_slot() fails to mark the dirty bit, so it
> > can return directly with RET_PF_RETRY.
> 
> I don't think that's possible, most writes won't come from a page fault
> path and cannot retry.

Yep, maybe I should say it the other way round: we only wait if
kvm_get_running_vcpu() == NULL.  Then, somewhere near
vcpu_enter_guest(), we add a check to wait if the per-vcpu ring is
full.  Would that work?
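
As a very rough sketch of that idea (illustrative only: the field names
and the waitqueue are assumptions, and whether the vcpu path waits or
simply exits to userspace with KVM_EXIT_DIRTY_RING_FULL is still an
open choice):

	/*
	 * Somewhere near vcpu_enter_guest(), with a vcpu context and no
	 * mmu lock held: force an exit so userspace can harvest the ring.
	 */
	if (kvm_dirty_ring_full(&vcpu->dirty_ring)) {
		vcpu->run->exit_reason = KVM_EXIT_DIRTY_RING_FULL;
		return 0;
	}

	/*
	 * In mark_page_dirty_in_slot(), without a vcpu context: it is
	 * safe to sleep, so wait for userspace to reset the vm ring.
	 */
	if (!kvm_get_running_vcpu())
		wait_event_killable(kvm->dirty_ring_waitq,
				    !kvm_dirty_ring_full(&kvm->vm_dirty_ring));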

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH RFC 04/15] KVM: Implement ring-based dirty memory tracking
  2019-12-12  0:08         ` Paolo Bonzini
  2019-12-12  7:36           ` Michael S. Tsirkin
@ 2019-12-15 17:33           ` Peter Xu
  2019-12-16  9:47             ` Michael S. Tsirkin
  1 sibling, 1 reply; 123+ messages in thread
From: Peter Xu @ 2019-12-15 17:33 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Michael S. Tsirkin, linux-kernel, kvm, Sean Christopherson,
	Dr . David Alan Gilbert, Vitaly Kuznetsov

On Thu, Dec 12, 2019 at 01:08:14AM +0100, Paolo Bonzini wrote:
> >>> What depends on what here? Looks suspicious ...
> >>
> >> Hmm, I think maybe it can be removed because the entry pointer
> >> reference below should be an ordering constraint already?
> 
> entry->xxx depends on ring->reset_index.

Yes that's true, but...

        entry = &ring->dirty_gfns[ring->reset_index & (ring->size - 1)];
        /* barrier? */
        next_slot = READ_ONCE(entry->slot);
        next_offset = READ_ONCE(entry->offset);

... I think entry->xxx depends on entry first, then entry depends on
reset_index.  So it seems fine because all things have a dependency?

> 
> >>> what's the story around locking here? Why is it safe
> >>> not to take the lock sometimes?
> >>
> >> kvm_dirty_ring_push() will be with lock==true only when the per-vm
> >> ring is used.  For per-vcpu ring, because that will only happen with
> >> the vcpu context, then we don't need locks (so kvm_dirty_ring_push()
> >> is called with lock==false).
> 
> FWIW this will be done much more nicely in v2.
> 
> >>>> +	page = alloc_page(GFP_KERNEL | __GFP_ZERO);
> >>>> +	if (!page) {
> >>>> +		r = -ENOMEM;
> >>>> +		goto out_err_alloc_page;
> >>>> +	}
> >>>> +	kvm->vm_run = page_address(page);
> >>>
> >>> So 4K with just 8 bytes used. Not as bad as 1/2Mbyte for the ring but
> >>> still. What is wrong with just a pointer and calling put_user?
> >>
> >> I want to make it the start point for sharing fields between
> >> user/kernel per-vm.  Just like kvm_run for per-vcpu.
> 
> This page is actually not needed at all.  Userspace can just map at
> KVM_DIRTY_LOG_PAGE_OFFSET, the indices reside there.  You can drop
> kvm_vm_run completely.

I changed it because otherwise we use the padding of one entry, and all
the rest of the padding is wasted memory, since we can never really turn
the padding into new fields when only the first entry overlaps with the
indices.  IMHO that could even waste more than 4K.

(For now we only "waste" 4K for the per-vm case; kvm_run is already
 mapped so there's no waste there.  Besides, I still think we could
 potentially use kvm_vm_run for more things in the future.)

> 
> >>>> +	} else {
> >>>> +		/*
> >>>> +		 * Put onto per vm ring because no vcpu context.  Kick
> >>>> +		 * vcpu0 if ring is full.
> >>>
> >>> What about tasks on vcpu 0? Do guests realize it's a bad idea to put
> >>> critical tasks there, they will be penalized disproportionally?
> >>
> >> Reasonable question.  So far we can't avoid it because vcpu exit is
> >> the event mechanism to say "hey please collect dirty bits".  Maybe
> >> someway is better than this, but I'll need to rethink all these
> >> over...
> > 
> > Maybe signal an eventfd, and let userspace worry about deciding what to
> > do.
> 
> This has to be done synchronously.  But the vm ring should be used very
> rarely (it's for things like kvmclock updates that write to guest memory
> outside a vCPU), possibly a handful of times in the whole run of the VM.

I've summarized a list of callers that might dirty guest memory in the
other thread; it seems to me that even the kvmclock updates happen with
per-vcpu contexts.

> 
> >>> KVM_DIRTY_RING_MAX_ENTRIES is not part of UAPI.
> >>> So how does userspace know what's legal?
> >>> Do you expect it to just try?
> >>
> >> Yep that's what I thought. :)
> 
> We should return it for KVM_CHECK_EXTENSION.

OK.  I'll drop the versioning.

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH RFC 04/15] KVM: Implement ring-based dirty memory tracking
  2019-12-14 16:26         ` Peter Xu
@ 2019-12-16  9:29           ` Paolo Bonzini
  2019-12-16 15:26             ` Peter Xu
  0 siblings, 1 reply; 123+ messages in thread
From: Paolo Bonzini @ 2019-12-16  9:29 UTC (permalink / raw)
  To: Peter Xu
  Cc: Christophe de Dinechin, linux-kernel, kvm, Sean Christopherson,
	Dr . David Alan Gilbert, Vitaly Kuznetsov

On 14/12/19 17:26, Peter Xu wrote:
> On Sat, Dec 14, 2019 at 08:57:26AM +0100, Paolo Bonzini wrote:
>> On 13/12/19 21:23, Peter Xu wrote:
>>>> What is the benefit of using u16 for that? That means with 4K pages, you
>>>> can share at most 256M of dirty memory each time? That seems low to me,
>>>> especially since it's sufficient to touch one byte in a page to dirty it.
>>>>
>>>> Actually, this is not consistent with the definition in the code ;-)
>>>> So I'll assume it's actually u32.
>>> Yes it's u32 now.  Actually I believe at least Paolo would prefer u16
>>> more. :)
>>
>> It has to be u16, because it overlaps the padding of the first entry.
> 
> Hmm, could you explain?
> 
> Note that here what Christophe commented is on dirty_index,
> reset_index of "struct kvm_dirty_ring", so imho it could really be
> anything we want as long as it can store a u32 (which is the size of
> the elements in kvm_dirty_ring_indexes).
> 
> If you were instead talking about the previous union definition of
> "struct kvm_dirty_gfns" rather than "struct kvm_dirty_ring", iiuc I've
> moved those indices out of it and defined kvm_dirty_ring_indexes which
> we expose via kvm_run, so we don't have that limitation as well any
> more?

Yeah, I meant that since the size has (had) to be u16 in the union, it
need not be bigger in kvm_dirty_ring.

I don't think having more than 2^16 entries in the *per-CPU* ring buffer
makes sense; lagging in recording dirty memory by more than 256 MiB per
CPU would mean a large pause later on resetting the ring buffers (your
KVM_CLEAR_DIRTY_LOG patches found the sweet spot to be around 1 GiB for
the whole system).

So I liked the union, but if you removed it you might as well align the
producer and consumer indices to 64 bytes so that they are in separate
cache lines.
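
Something along these lines, for instance (a sketch only, not the
layout that was eventually adopted):

	/*
	 * Keep the kernel-written and userspace-written indices in
	 * separate 64-byte cache lines to avoid false sharing.
	 */
	struct kvm_dirty_ring_indexes {
		__u32 avail_index;         /* set by kernel */
		__u32 padding1[15];
		__u32 fetch_index;         /* set by userspace */
		__u32 padding2[15];
	};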

Paolo


^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH RFC 04/15] KVM: Implement ring-based dirty memory tracking
  2019-12-15 17:33           ` Peter Xu
@ 2019-12-16  9:47             ` Michael S. Tsirkin
  2019-12-16 15:07               ` Peter Xu
  0 siblings, 1 reply; 123+ messages in thread
From: Michael S. Tsirkin @ 2019-12-16  9:47 UTC (permalink / raw)
  To: Peter Xu
  Cc: Paolo Bonzini, linux-kernel, kvm, Sean Christopherson,
	Dr . David Alan Gilbert, Vitaly Kuznetsov

On Sun, Dec 15, 2019 at 12:33:02PM -0500, Peter Xu wrote:
> On Thu, Dec 12, 2019 at 01:08:14AM +0100, Paolo Bonzini wrote:
> > >>> What depends on what here? Looks suspicious ...
> > >>
> > >> Hmm, I think maybe it can be removed because the entry pointer
> > >> reference below should be an ordering constraint already?
> > 
> > entry->xxx depends on ring->reset_index.
> 
> Yes that's true, but...
> 
>         entry = &ring->dirty_gfns[ring->reset_index & (ring->size - 1)];
>         /* barrier? */
>         next_slot = READ_ONCE(entry->slot);
>         next_offset = READ_ONCE(entry->offset);
> 
> ... I think entry->xxx depends on entry first, then entry depends on
> reset_index.  So it seems fine because all things have a dependency?

Is reset_index changed from another thread then?
If yes then you want to read reset_index with READ_ONCE.
That includes a dependency barrier.

> > 
> > >>> what's the story around locking here? Why is it safe
> > >>> not to take the lock sometimes?
> > >>
> > >> kvm_dirty_ring_push() will be with lock==true only when the per-vm
> > >> ring is used.  For per-vcpu ring, because that will only happen with
> > >> the vcpu context, then we don't need locks (so kvm_dirty_ring_push()
> > >> is called with lock==false).
> > 
> > FWIW this will be done much more nicely in v2.
> > 
> > >>>> +	page = alloc_page(GFP_KERNEL | __GFP_ZERO);
> > >>>> +	if (!page) {
> > >>>> +		r = -ENOMEM;
> > >>>> +		goto out_err_alloc_page;
> > >>>> +	}
> > >>>> +	kvm->vm_run = page_address(page);
> > >>>
> > >>> So 4K with just 8 bytes used. Not as bad as 1/2Mbyte for the ring but
> > >>> still. What is wrong with just a pointer and calling put_user?
> > >>
> > >> I want to make it the start point for sharing fields between
> > >> user/kernel per-vm.  Just like kvm_run for per-vcpu.
> > 
> > This page is actually not needed at all.  Userspace can just map at
> > KVM_DIRTY_LOG_PAGE_OFFSET, the indices reside there.  You can drop
> > kvm_vm_run completely.
> 
> I changed it because otherwise we use one entry of the padding, and
> all the rest of paddings are a waste of memory because we can never
> really use the padding as new fields only for the 1st entry which
> overlaps with the indices.  IMHO that could even waste more than 4k.
> 
> (for now we only "waste" 4K for per-vm, kvm_run is already mapped so
>  no waste there, not to say potentially I still think we can use the
>  kvm_vm_run in the future)
> 
> > 
> > >>>> +	} else {
> > >>>> +		/*
> > >>>> +		 * Put onto per vm ring because no vcpu context.  Kick
> > >>>> +		 * vcpu0 if ring is full.
> > >>>
> > >>> What about tasks on vcpu 0? Do guests realize it's a bad idea to put
> > >>> critical tasks there, they will be penalized disproportionally?
> > >>
> > >> Reasonable question.  So far we can't avoid it because vcpu exit is
> > >> the event mechanism to say "hey please collect dirty bits".  Maybe
> > >> someway is better than this, but I'll need to rethink all these
> > >> over...
> > > 
> > > Maybe signal an eventfd, and let userspace worry about deciding what to
> > > do.
> > 
> > This has to be done synchronously.  But the vm ring should be used very
> > rarely (it's for things like kvmclock updates that write to guest memory
> > outside a vCPU), possibly a handful of times in the whole run of the VM.
> 
> I've summarized a list of callers who might dirty guest memory in the
> other thread, it seems to me that even the kvm clock is using per-vcpu
> contexts.
> 
> > 
> > >>> KVM_DIRTY_RING_MAX_ENTRIES is not part of UAPI.
> > >>> So how does userspace know what's legal?
> > >>> Do you expect it to just try?
> > >>
> > >> Yep that's what I thought. :)
> > 
> > We should return it for KVM_CHECK_EXTENSION.
> 
> OK.  I'll drop the versioning.
> 
> -- 
> Peter Xu


^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH RFC 04/15] KVM: Implement ring-based dirty memory tracking
  2019-12-15 17:21                       ` Peter Xu
@ 2019-12-16 10:08                         ` Paolo Bonzini
  2019-12-16 18:54                           ` Peter Xu
                                             ` (2 more replies)
  0 siblings, 3 replies; 123+ messages in thread
From: Paolo Bonzini @ 2019-12-16 10:08 UTC (permalink / raw)
  To: Peter Xu
  Cc: Sean Christopherson, linux-kernel, kvm, Dr . David Alan Gilbert,
	Vitaly Kuznetsov, Alex Williamson, Tian, Kevin

[Alex and Kevin: there are doubts below regarding dirty page tracking
from VFIO and mdev devices, which perhaps you can help with]

On 15/12/19 18:21, Peter Xu wrote:
>                 init_rmode_tss
>                     vmx_set_tss_addr
>                         kvm_vm_ioctl_set_tss_addr [*]
>                 init_rmode_identity_map
>                     vmx_create_vcpu [*]

These don't matter because their content is not visible to userspace
(the backing storage is mmap-ed by __x86_set_memory_region).  In fact, d

>                 vmx_write_pml_buffer
>                     kvm_arch_write_log_dirty [&]
>                 kvm_write_guest
>                     kvm_hv_setup_tsc_page
>                         kvm_guest_time_update [&]
>                     nested_flush_cached_shadow_vmcs12 [&]
>                     kvm_write_wall_clock [&]
>                     kvm_pv_clock_pairing [&]
>                     kvmgt_rw_gpa [?]

This then expands (partially) to

intel_gvt_hypervisor_write_gpa
    emulate_csb_update
        emulate_execlist_ctx_schedule_out
            complete_execlist_workload
                complete_current_workload
                     workload_thread
        emulate_execlist_ctx_schedule_in
            prepare_execlist_workload
                prepare_workload
                    dispatch_workload
                        workload_thread

So KVMGT is always writing to GPAs instead of IOVAs and basically
bypassing a guest IOMMU.  So here it would be better if kvmgt were
changed not to use kvm_write_guest (also because I'd probably have
nacked that if I had known :)).

As far as I know, there is some work on live migration with both VFIO
and mdev, and that probably includes some dirty page tracking API.
kvmgt could switch to that API, or there could be VFIO APIs similar to
kvm_write_guest but taking IOVAs instead of GPAs.  Advantage: this would
fix the GPA/IOVA confusion.  Disadvantage: userspace would lose the
tracking of writes from mdev devices.  Kevin, are these writes used in
any way?  Do the calls to intel_gvt_hypervisor_write_gpa covers all
writes from kvmgt vGPUs, or can the hardware write to memory as well
(which would be my guess if I didn't know anything about kvmgt, which I
pretty much don't)?

> We should only need to look at the leaves of the traces because
> they're where the dirty request starts.  I'm marking all the leaves
> with below criteria then it'll be easier to focus:
> 
> Cases with [*]: should not matter much
>            [&]: actually with a per-vcpu context in the upper layer
>            [?]: uncertain...
> 
> I'm a bit amazed after I took these notes, since I found that besides
> those that could probbaly be ignored (marked as [*]), most of the rest
> per-vm dirty requests are actually with a vcpu context.
> 
> Although now because we have kvm_get_running_vcpu() all cases for [&]
> should be fine without changing anything, but I tend to add another
> patch in the next post to convert all the [&] cases explicitly to pass
> vcpu pointer instead of kvm pointer to be clear if no one disagrees,
> then we verify that against kvm_get_running_vcpu().

This is a good idea but remember not to convert those to
kvm_vcpu_write_guest, because you _don't_ want these writes to touch
SMRAM (most of the addresses are OS-controlled rather than
firmware-controlled).

> init_rmode_tss or init_rmode_identity_map.  But I've marked them as
> unimportant because they should only happen once at boot.

We need to check if userspace can add an arbitrary number of entries by
calling KVM_SET_TSS_ADDR repeatedly.  I think it can; we'd have to
forbid multiple calls to KVM_SET_TSS_ADDR which is not a problem in general.

>>> If we're still with the rule in userspace that we first do RESET then
>>> collect and send the pages (just like what we've discussed before),
>>> then IMHO it's fine to have vcpu2 to skip the slow path?  Because
>>> RESET happens at "treat page as not dirty", then if we are sure that
>>> we only collect and send pages after that point, then the latest
>>> "write to page" data from vcpu2 won't be lost even if vcpu2 is not
>>> blocked by vcpu1's ring full?
>>
>> Good point, the race would become
>>
>>  	vCPU 1			vCPU 2		host
>>  	---------------------------------------------------------------
>>  	mark page dirty
>>  				write to page
>> 						reset rings
>> 						  wait for mmu lock
>>  	add page to ring
>> 	release mmu lock
>> 						  ...do reset...
>> 						  release mmu lock
>> 						page is now dirty
> 
> Hmm, the page will be dirty after the reset, but is that an issue?
> 
> Or, could you help me to identify what I've missed?

Nothing: the race is always solved in such a way that there's no issue.

>> I don't think that's possible, most writes won't come from a page fault
>> path and cannot retry.
> 
> Yep, maybe I should say it in the other way round: we only wait if
> kvm_get_running_vcpu() == NULL.  Then in somewhere near
> vcpu_enter_guest(), we add a check to wait if per-vcpu ring is full.
> Would that work?

Yes, that should work, especially if we know that kvmgt is the only case
that can wait.  And since:

1) kvmgt doesn't really need dirty page tracking (because VFIO devices
generally don't track dirty pages, and because kvmgt shouldn't be using
kvm_write_guest anyway)

2) the real mode TSS and identity map shouldn't even be tracked, as they
are invisible to userspace

it seems to me that kvm_get_running_vcpu() lets us get rid of the per-VM
ring altogether.

Paolo


^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH RFC 04/15] KVM: Implement ring-based dirty memory tracking
  2019-12-16  9:47             ` Michael S. Tsirkin
@ 2019-12-16 15:07               ` Peter Xu
  2019-12-16 15:33                 ` Michael S. Tsirkin
  0 siblings, 1 reply; 123+ messages in thread
From: Peter Xu @ 2019-12-16 15:07 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Paolo Bonzini, linux-kernel, kvm, Sean Christopherson,
	Dr . David Alan Gilbert, Vitaly Kuznetsov

On Mon, Dec 16, 2019 at 04:47:36AM -0500, Michael S. Tsirkin wrote:
> On Sun, Dec 15, 2019 at 12:33:02PM -0500, Peter Xu wrote:
> > On Thu, Dec 12, 2019 at 01:08:14AM +0100, Paolo Bonzini wrote:
> > > >>> What depends on what here? Looks suspicious ...
> > > >>
> > > >> Hmm, I think maybe it can be removed because the entry pointer
> > > >> reference below should be an ordering constraint already?
> > > 
> > > entry->xxx depends on ring->reset_index.
> > 
> > Yes that's true, but...
> > 
> >         entry = &ring->dirty_gfns[ring->reset_index & (ring->size - 1)];
> >         /* barrier? */
> >         next_slot = READ_ONCE(entry->slot);
> >         next_offset = READ_ONCE(entry->offset);
> > 
> > ... I think entry->xxx depends on entry first, then entry depends on
> > reset_index.  So it seems fine because all things have a dependency?
> 
> Is reset_index changed from another thread then?
> If yes then you want to read reset_index with READ_ONCE.
> That includes a dependency barrier.

There're a few readers, but only this function will change it
(kvm_dirty_ring_reset).  Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH RFC 04/15] KVM: Implement ring-based dirty memory tracking
  2019-12-16  9:29           ` Paolo Bonzini
@ 2019-12-16 15:26             ` Peter Xu
  2019-12-16 15:31               ` Paolo Bonzini
  0 siblings, 1 reply; 123+ messages in thread
From: Peter Xu @ 2019-12-16 15:26 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Christophe de Dinechin, linux-kernel, kvm, Sean Christopherson,
	Dr . David Alan Gilbert, Vitaly Kuznetsov

On Mon, Dec 16, 2019 at 10:29:36AM +0100, Paolo Bonzini wrote:
> On 14/12/19 17:26, Peter Xu wrote:
> > On Sat, Dec 14, 2019 at 08:57:26AM +0100, Paolo Bonzini wrote:
> >> On 13/12/19 21:23, Peter Xu wrote:
> >>>> What is the benefit of using u16 for that? That means with 4K pages, you
> >>>> can share at most 256M of dirty memory each time? That seems low to me,
> >>>> especially since it's sufficient to touch one byte in a page to dirty it.
> >>>>
> >>>> Actually, this is not consistent with the definition in the code ;-)
> >>>> So I'll assume it's actually u32.
> >>> Yes it's u32 now.  Actually I believe at least Paolo would prefer u16
> >>> more. :)
> >>
> >> It has to be u16, because it overlaps the padding of the first entry.
> > 
> > Hmm, could you explain?
> > 
> > Note that here what Christophe commented is on dirty_index,
> > reset_index of "struct kvm_dirty_ring", so imho it could really be
> > anything we want as long as it can store a u32 (which is the size of
> > the elements in kvm_dirty_ring_indexes).
> > 
> > If you were instead talking about the previous union definition of
> > "struct kvm_dirty_gfns" rather than "struct kvm_dirty_ring", iiuc I've
> > moved those indices out of it and defined kvm_dirty_ring_indexes which
> > we expose via kvm_run, so we don't have that limitation as well any
> > more?
> 
> Yeah, I meant that since the size has (had) to be u16 in the union, it
> need not be bigger in kvm_dirty_ring.
> 
> I don't think having more than 2^16 entries in the *per-CPU* ring buffer
> makes sense; lagging in recording dirty memory by more than 256 MiB per
> CPU would mean a large pause later on resetting the ring buffers (your
> KVM_CLEAR_DIRTY_LOG patches found the sweet spot to be around 1 GiB for
> the whole system).

That's right, 1G could probably be a "common flavor" for guests in
that case.

Though I wanted to use u64 only to prepare even better for potential
future changes, as long as it doesn't hurt much.  I'm just afraid 16
bits might not be big enough for this 64-bit world; meanwhile I'd
confess some of the requirements could be really unimaginable before we
meet them.  I'm trying to forge one here: what if a customer wants to
handle a 4G burst-dirtying workload during a migration (a mostly idle
guest apart from the burst IOs), while also wanting good responsiveness
during the burst?  In that case even a 256MiB ring would still need to
pause frequently for harvesting, while the case really suits an 8G ring
size.

My example could be nonsense actually, just to show that if we can
extend something to u64 from u16 without paying much, then why not. :-)

> 
> So I liked the union, but if you removed it you might as well align the
> producer and consumer indices to 64 bytes so that they are in separate
> cache lines.

Yeh that I can do.  Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH RFC 04/15] KVM: Implement ring-based dirty memory tracking
  2019-12-16 15:26             ` Peter Xu
@ 2019-12-16 15:31               ` Paolo Bonzini
  2019-12-16 15:43                 ` Peter Xu
  0 siblings, 1 reply; 123+ messages in thread
From: Paolo Bonzini @ 2019-12-16 15:31 UTC (permalink / raw)
  To: Peter Xu
  Cc: Christophe de Dinechin, linux-kernel, kvm, Sean Christopherson,
	Dr . David Alan Gilbert, Vitaly Kuznetsov

On 16/12/19 16:26, Peter Xu wrote:
> On Mon, Dec 16, 2019 at 10:29:36AM +0100, Paolo Bonzini wrote:
>> On 14/12/19 17:26, Peter Xu wrote:
>>> On Sat, Dec 14, 2019 at 08:57:26AM +0100, Paolo Bonzini wrote:
>>>> On 13/12/19 21:23, Peter Xu wrote:
>>>>>> What is the benefit of using u16 for that? That means with 4K pages, you
>>>>>> can share at most 256M of dirty memory each time? That seems low to me,
>>>>>> especially since it's sufficient to touch one byte in a page to dirty it.
>>>>>>
>>>>>> Actually, this is not consistent with the definition in the code ;-)
>>>>>> So I'll assume it's actually u32.
>>>>> Yes it's u32 now.  Actually I believe at least Paolo would prefer u16
>>>>> more. :)
>>>>
>>>> It has to be u16, because it overlaps the padding of the first entry.
>>>
>>> Hmm, could you explain?
>>>
>>> Note that here what Christophe commented is on dirty_index,
>>> reset_index of "struct kvm_dirty_ring", so imho it could really be
>>> anything we want as long as it can store a u32 (which is the size of
>>> the elements in kvm_dirty_ring_indexes).
>>>
>>> If you were instead talking about the previous union definition of
>>> "struct kvm_dirty_gfns" rather than "struct kvm_dirty_ring", iiuc I've
>>> moved those indices out of it and defined kvm_dirty_ring_indexes which
>>> we expose via kvm_run, so we don't have that limitation as well any
>>> more?
>>
>> Yeah, I meant that since the size has (had) to be u16 in the union, it
>> need not be bigger in kvm_dirty_ring.
>>
>> I don't think having more than 2^16 entries in the *per-CPU* ring buffer
>> makes sense; lagging in recording dirty memory by more than 256 MiB per
>> CPU would mean a large pause later on resetting the ring buffers (your
>> KVM_CLEAR_DIRTY_LOG patches found the sweet spot to be around 1 GiB for
>> the whole system).
> 
> That's right, 1G could probably be a "common flavor" for guests in
> that case.
> 
> Though I wanted to use u64 only because I wanted to prepare even
> better for future potential changes as long as it won't hurt much.

No u64, please.  u32 I can agree with, 16-bit *should* be enough but it
is a bit tight, so let's make it 32-bit if we drop the union idea.

Paolo


^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH RFC 04/15] KVM: Implement ring-based dirty memory tracking
  2019-12-16 15:07               ` Peter Xu
@ 2019-12-16 15:33                 ` Michael S. Tsirkin
  2019-12-16 15:47                   ` Peter Xu
  0 siblings, 1 reply; 123+ messages in thread
From: Michael S. Tsirkin @ 2019-12-16 15:33 UTC (permalink / raw)
  To: Peter Xu
  Cc: Paolo Bonzini, linux-kernel, kvm, Sean Christopherson,
	Dr . David Alan Gilbert, Vitaly Kuznetsov

On Mon, Dec 16, 2019 at 10:07:54AM -0500, Peter Xu wrote:
> On Mon, Dec 16, 2019 at 04:47:36AM -0500, Michael S. Tsirkin wrote:
> > On Sun, Dec 15, 2019 at 12:33:02PM -0500, Peter Xu wrote:
> > > On Thu, Dec 12, 2019 at 01:08:14AM +0100, Paolo Bonzini wrote:
> > > > >>> What depends on what here? Looks suspicious ...
> > > > >>
> > > > >> Hmm, I think maybe it can be removed because the entry pointer
> > > > >> reference below should be an ordering constraint already?
> > > > 
> > > > entry->xxx depends on ring->reset_index.
> > > 
> > > Yes that's true, but...
> > > 
> > >         entry = &ring->dirty_gfns[ring->reset_index & (ring->size - 1)];
> > >         /* barrier? */
> > >         next_slot = READ_ONCE(entry->slot);
> > >         next_offset = READ_ONCE(entry->offset);
> > > 
> > > ... I think entry->xxx depends on entry first, then entry depends on
> > > reset_index.  So it seems fine because all things have a dependency?
> > 
> > Is reset_index changed from another thread then?
> > If yes then you want to read reset_index with READ_ONCE.
> > That includes a dependency barrier.
> 
> There're a few readers, but only this function will change it
> (kvm_dirty_ring_reset).  Thanks,

Then you don't need any barriers in this function.
readers need at least READ_ONCE.

> -- 
> Peter Xu


^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH RFC 04/15] KVM: Implement ring-based dirty memory tracking
  2019-12-16 15:31               ` Paolo Bonzini
@ 2019-12-16 15:43                 ` Peter Xu
  0 siblings, 0 replies; 123+ messages in thread
From: Peter Xu @ 2019-12-16 15:43 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Christophe de Dinechin, linux-kernel, kvm, Sean Christopherson,
	Dr . David Alan Gilbert, Vitaly Kuznetsov

On Mon, Dec 16, 2019 at 04:31:50PM +0100, Paolo Bonzini wrote:
> No u64, please.  u32 I can agree with, 16-bit *should* be enough but it
> is a bit tight, so let's make it 32-bit if we drop the union idea.

Sure.

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH RFC 04/15] KVM: Implement ring-based dirty memory tracking
  2019-12-16 15:33                 ` Michael S. Tsirkin
@ 2019-12-16 15:47                   ` Peter Xu
  0 siblings, 0 replies; 123+ messages in thread
From: Peter Xu @ 2019-12-16 15:47 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Paolo Bonzini, linux-kernel, kvm, Sean Christopherson,
	Dr . David Alan Gilbert, Vitaly Kuznetsov

On Mon, Dec 16, 2019 at 10:33:42AM -0500, Michael S. Tsirkin wrote:
> On Mon, Dec 16, 2019 at 10:07:54AM -0500, Peter Xu wrote:
> > On Mon, Dec 16, 2019 at 04:47:36AM -0500, Michael S. Tsirkin wrote:
> > > On Sun, Dec 15, 2019 at 12:33:02PM -0500, Peter Xu wrote:
> > > > On Thu, Dec 12, 2019 at 01:08:14AM +0100, Paolo Bonzini wrote:
> > > > > >>> What depends on what here? Looks suspicious ...
> > > > > >>
> > > > > >> Hmm, I think maybe it can be removed because the entry pointer
> > > > > >> reference below should be an ordering constraint already?
> > > > > 
> > > > > entry->xxx depends on ring->reset_index.
> > > > 
> > > > Yes that's true, but...
> > > > 
> > > >         entry = &ring->dirty_gfns[ring->reset_index & (ring->size - 1)];
> > > >         /* barrier? */
> > > >         next_slot = READ_ONCE(entry->slot);
> > > >         next_offset = READ_ONCE(entry->offset);
> > > > 
> > > > ... I think entry->xxx depends on entry first, then entry depends on
> > > > reset_index.  So it seems fine because all things have a dependency?
> > > 
> > > Is reset_index changed from another thread then?
> > > If yes then you want to read reset_index with READ_ONCE.
> > > That includes a dependency barrier.
> > 
> > There're a few readers, but only this function will change it
> > (kvm_dirty_ring_reset).  Thanks,
> 
> Then you don't need any barriers in this function.
> readers need at least READ_ONCE.

In our case even a stale reset_index should not matter much here, imho,
because the worst case is that we read an old value and stop pushing to
a ring that is soft-full while it is just being reset (so at most an
extra userspace exit if the race happens).  But I agree it's clearer to
use READ_ONCE() on the readers.  Thanks!
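
For example, something like this on the reader side (a sketch only, not
claiming this is the final code):

	static inline u32 kvm_dirty_ring_used(struct kvm_dirty_ring *ring)
	{
		/*
		 * reset_index is advanced concurrently by the reset path,
		 * dirty_index by the vcpu pushing new entries.
		 */
		return READ_ONCE(ring->dirty_index) -
		       READ_ONCE(ring->reset_index);
	}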

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH RFC 04/15] KVM: Implement ring-based dirty memory tracking
  2019-12-16 10:08                         ` Paolo Bonzini
@ 2019-12-16 18:54                           ` Peter Xu
  2019-12-17  9:01                             ` Paolo Bonzini
  2019-12-17  2:28                           ` Tian, Kevin
       [not found]                           ` <AADFC41AFE54684AB9EE6CBC0274A5D19D645E5F@SHSMSX104.ccr.corp.intel.com>
  2 siblings, 1 reply; 123+ messages in thread
From: Peter Xu @ 2019-12-16 18:54 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sean Christopherson, linux-kernel, kvm, Dr . David Alan Gilbert,
	Vitaly Kuznetsov, Alex Williamson, Tian, Kevin

On Mon, Dec 16, 2019 at 11:08:15AM +0100, Paolo Bonzini wrote:
> > Although now because we have kvm_get_running_vcpu() all cases for [&]
> > should be fine without changing anything, but I tend to add another
> > patch in the next post to convert all the [&] cases explicitly to pass
> > vcpu pointer instead of kvm pointer to be clear if no one disagrees,
> > then we verify that against kvm_get_running_vcpu().
> 
> This is a good idea but remember not to convert those to
> kvm_vcpu_write_guest, because you _don't_ want these writes to touch
> SMRAM (most of the addresses are OS-controlled rather than
> firmware-controlled).

OK.  I think I only need to pass in vcpu* instead of kvm* in
kvm_write_guest_page(), just like kvm_vcpu_write_guest(), while still
keeping those writes restricted to address space id==0.
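
Something like the following, perhaps (a sketch only; the signatures
here are assumptions rather than the code I will post):

	/*
	 * Take a vcpu so the dirty gfn can go to the per-vcpu ring, but
	 * still resolve the gfn in address space 0 (unlike
	 * kvm_vcpu_write_guest(), which follows the vcpu's current
	 * address space and could touch SMRAM).
	 */
	int kvm_write_guest_page(struct kvm_vcpu *vcpu, gfn_t gfn,
				 const void *data, int offset, int len)
	{
		struct kvm_memslots *slots = __kvm_memslots(vcpu->kvm, 0);
		struct kvm_memory_slot *slot = __gfn_to_memslot(slots, gfn);

		return __kvm_write_guest_page(vcpu, slot, gfn, data,
					      offset, len);
	}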

> 
> > init_rmode_tss or init_rmode_identity_map.  But I've marked them as
> > unimportant because they should only happen once at boot.
> 
> We need to check if userspace can add an arbitrary number of entries by
> calling KVM_SET_TSS_ADDR repeatedly.  I think it can; we'd have to
> forbid multiple calls to KVM_SET_TSS_ADDR which is not a problem in general.

Will do that together with the series.  I can further change both of
these calls to not track dirty pages at all, which shouldn't be hard;
after all, userspace doesn't even know about them, as you mentioned
below.

Is there anything to explain what KVM_SET_TSS_ADDR is used for?  This
is the thing I found that is closest to useful (from api.txt):

        This ioctl is required on Intel-based hosts.  This is needed
        on Intel hardware because of a quirk in the virtualization
        implementation (see the internals documentation when it pops
        into existence).

So... has it really popped into existence somewhere?  It would be good
at least to know why it does not need to be migrated.

> >> I don't think that's possible, most writes won't come from a page fault
> >> path and cannot retry.
> > 
> > Yep, maybe I should say it in the other way round: we only wait if
> > kvm_get_running_vcpu() == NULL.  Then in somewhere near
> > vcpu_enter_guest(), we add a check to wait if per-vcpu ring is full.
> > Would that work?
> 
> Yes, that should work, especially if we know that kvmgt is the only case
> that can wait.  And since:
> 
> 1) kvmgt doesn't really need dirty page tracking (because VFIO devices
> generally don't track dirty pages, and because kvmgt shouldn't be using
> kvm_write_guest anyway)
> 
> 2) the real mode TSS and identity map shouldn't even be tracked, as they
> are invisible to userspace
> 
> it seems to me that kvm_get_running_vcpu() lets us get rid of the per-VM
> ring altogether.

Yes, it would be perfect if so.

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 123+ messages in thread

* RE: [PATCH RFC 04/15] KVM: Implement ring-based dirty memory tracking
  2019-12-16 10:08                         ` Paolo Bonzini
  2019-12-16 18:54                           ` Peter Xu
@ 2019-12-17  2:28                           ` Tian, Kevin
  2019-12-17 16:18                             ` Alex Williamson
       [not found]                           ` <AADFC41AFE54684AB9EE6CBC0274A5D19D645E5F@SHSMSX104.ccr.corp.intel.com>
  2 siblings, 1 reply; 123+ messages in thread
From: Tian, Kevin @ 2019-12-17  2:28 UTC (permalink / raw)
  To: Paolo Bonzini, Peter Xu
  Cc: Christopherson, Sean J, linux-kernel, kvm,
	Dr . David Alan Gilbert, Vitaly Kuznetsov, Alex Williamson, Wang,
	Zhenyu Z, Zhao, Yan Y

> From: Paolo Bonzini
> Sent: Monday, December 16, 2019 6:08 PM
> 
> [Alex and Kevin: there are doubts below regarding dirty page tracking
> from VFIO and mdev devices, which perhaps you can help with]
> 
> On 15/12/19 18:21, Peter Xu wrote:
> >                 init_rmode_tss
> >                     vmx_set_tss_addr
> >                         kvm_vm_ioctl_set_tss_addr [*]
> >                 init_rmode_identity_map
> >                     vmx_create_vcpu [*]
> 
> These don't matter because their content is not visible to userspace
> (the backing storage is mmap-ed by __x86_set_memory_region).  In fact, d
> 
> >                 vmx_write_pml_buffer
> >                     kvm_arch_write_log_dirty [&]
> >                 kvm_write_guest
> >                     kvm_hv_setup_tsc_page
> >                         kvm_guest_time_update [&]
> >                     nested_flush_cached_shadow_vmcs12 [&]
> >                     kvm_write_wall_clock [&]
> >                     kvm_pv_clock_pairing [&]
> >                     kvmgt_rw_gpa [?]
> 
> This then expands (partially) to
> 
> intel_gvt_hypervisor_write_gpa
>     emulate_csb_update
>         emulate_execlist_ctx_schedule_out
>             complete_execlist_workload
>                 complete_current_workload
>                      workload_thread
>         emulate_execlist_ctx_schedule_in
>             prepare_execlist_workload
>                 prepare_workload
>                     dispatch_workload
>                         workload_thread
> 
> So KVMGT is always writing to GPAs instead of IOVAs and basically
> bypassing a guest IOMMU.  So here it would be better if kvmgt was
> changed not use kvm_write_guest (also because I'd probably have nacked
> that if I had known :)).

I agree. 

> 
> As far as I know, there is some work on live migration with both VFIO
> and mdev, and that probably includes some dirty page tracking API.
> kvmgt could switch to that API, or there could be VFIO APIs similar to
> kvm_write_guest but taking IOVAs instead of GPAs.  Advantage: this would
> fix the GPA/IOVA confusion.  Disadvantage: userspace would lose the
> tracking of writes from mdev devices.  Kevin, are these writes used in
> any way?  Do the calls to intel_gvt_hypervisor_write_gpa covers all
> writes from kvmgt vGPUs, or can the hardware write to memory as well
> (which would be my guess if I didn't know anything about kvmgt, which I
> pretty much don't)?

intel_gvt_hypervisor_write_gpa covers all writes due to software mediation.

For hardware updates, the pages need to be mapped in the IOMMU through
vfio_pin_pages before any DMA happens.  The ongoing dirty tracking
effort in VFIO will treat every page pinned through that API as dirtied.

However, VFIO currently doesn't implement any vfio_read/write_guest
interface, and it doesn't make sense to use vfio_pin_pages for
software-dirtied pages, as pinning is unnecessary and heavyweight,
involving IOMMU invalidation.

Alex, if you are OK with it, we'll work on such an interface and move
kvmgt to use it.  After it's accepted, we can also mark pages dirty
through this new interface in Kirti's dirty page tracking series.
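
Something along these lines, perhaps (purely a hypothetical shape for
such an interface -- nothing like it exists in VFIO today):

	/*
	 * Hypothetical: mirror vfio_pin_pages()'s calling convention but
	 * access guest memory by IOVA without pinning, marking the pages
	 * dirty for migration as a side effect.
	 */
	extern int vfio_write_guest(struct device *dev, dma_addr_t iova,
				    const void *data, size_t len);
	extern int vfio_read_guest(struct device *dev, dma_addr_t iova,
				   void *data, size_t len);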

Thanks
Kevin

> 
> > We should only need to look at the leaves of the traces because
> > they're where the dirty request starts.  I'm marking all the leaves
> > with below criteria then it'll be easier to focus:
> >
> > Cases with [*]: should not matter much
> >            [&]: actually with a per-vcpu context in the upper layer
> >            [?]: uncertain...
> >
> > I'm a bit amazed after I took these notes, since I found that besides
> > those that could probbaly be ignored (marked as [*]), most of the rest
> > per-vm dirty requests are actually with a vcpu context.
> >
> > Although now because we have kvm_get_running_vcpu() all cases for [&]
> > should be fine without changing anything, but I tend to add another
> > patch in the next post to convert all the [&] cases explicitly to pass
> > vcpu pointer instead of kvm pointer to be clear if no one disagrees,
> > then we verify that against kvm_get_running_vcpu().
> 
> This is a good idea but remember not to convert those to
> kvm_vcpu_write_guest, because you _don't_ want these writes to touch
> SMRAM (most of the addresses are OS-controlled rather than
> firmware-controlled).
> 
> > init_rmode_tss or init_rmode_identity_map.  But I've marked them as
> > unimportant because they should only happen once at boot.
> 
> We need to check if userspace can add an arbitrary number of entries by
> calling KVM_SET_TSS_ADDR repeatedly.  I think it can; we'd have to
> forbid multiple calls to KVM_SET_TSS_ADDR which is not a problem in
> general.
> 
> >>> If we're still with the rule in userspace that we first do RESET then
> >>> collect and send the pages (just like what we've discussed before),
> >>> then IMHO it's fine to have vcpu2 to skip the slow path?  Because
> >>> RESET happens at "treat page as not dirty", then if we are sure that
> >>> we only collect and send pages after that point, then the latest
> >>> "write to page" data from vcpu2 won't be lost even if vcpu2 is not
> >>> blocked by vcpu1's ring full?
> >>
> >> Good point, the race would become
> >>
> >>  	vCPU 1			vCPU 2		host
> >>  	---------------------------------------------------------------
> >>  	mark page dirty
> >>  				write to page
> >> 						reset rings
> >> 						  wait for mmu lock
> >>  	add page to ring
> >> 	release mmu lock
> >> 						  ...do reset...
> >> 						  release mmu lock
> >> 						page is now dirty
> >
> > Hmm, the page will be dirty after the reset, but is that an issue?
> >
> > Or, could you help me to identify what I've missed?
> 
> Nothing: the race is always solved in such a way that there's no issue.
> 
> >> I don't think that's possible, most writes won't come from a page fault
> >> path and cannot retry.
> >
> > Yep, maybe I should say it in the other way round: we only wait if
> > kvm_get_running_vcpu() == NULL.  Then in somewhere near
> > vcpu_enter_guest(), we add a check to wait if per-vcpu ring is full.
> > Would that work?
> 
> Yes, that should work, especially if we know that kvmgt is the only case
> that can wait.  And since:
> 
> 1) kvmgt doesn't really need dirty page tracking (because VFIO devices
> generally don't track dirty pages, and because kvmgt shouldn't be using
> kvm_write_guest anyway)
> 
> 2) the real mode TSS and identity map shouldn't even be tracked, as they
> are invisible to userspace
> 
> it seems to me that kvm_get_running_vcpu() lets us get rid of the per-VM
> ring altogether.
> 
> Paolo


^ permalink raw reply	[flat|nested] 123+ messages in thread

* RE: [PATCH RFC 04/15] KVM: Implement ring-based dirty memory tracking
       [not found]                           ` <AADFC41AFE54684AB9EE6CBC0274A5D19D645E5F@SHSMSX104.ccr.corp.intel.com>
@ 2019-12-17  5:17                             ` Tian, Kevin
  2019-12-17  5:25                               ` Yan Zhao
  0 siblings, 1 reply; 123+ messages in thread
From: Tian, Kevin @ 2019-12-17  5:17 UTC (permalink / raw)
  To: 'Paolo Bonzini', Peter Xu
  Cc: Christopherson, Sean J, linux-kernel, kvm,
	Dr . David Alan Gilbert, Vitaly Kuznetsov, Alex Williamson, Wang,
	Zhenyu Z, Zhao, Yan Y

> From: Tian, Kevin
> Sent: Tuesday, December 17, 2019 10:29 AM
> 
> > From: Paolo Bonzini
> > Sent: Monday, December 16, 2019 6:08 PM
> >
> > [Alex and Kevin: there are doubts below regarding dirty page tracking
> > from VFIO and mdev devices, which perhaps you can help with]
> >
> > On 15/12/19 18:21, Peter Xu wrote:
> > >                 init_rmode_tss
> > >                     vmx_set_tss_addr
> > >                         kvm_vm_ioctl_set_tss_addr [*]
> > >                 init_rmode_identity_map
> > >                     vmx_create_vcpu [*]
> >
> > These don't matter because their content is not visible to userspace
> > (the backing storage is mmap-ed by __x86_set_memory_region).  In fact, d
> >
> > >                 vmx_write_pml_buffer
> > >                     kvm_arch_write_log_dirty [&]
> > >                 kvm_write_guest
> > >                     kvm_hv_setup_tsc_page
> > >                         kvm_guest_time_update [&]
> > >                     nested_flush_cached_shadow_vmcs12 [&]
> > >                     kvm_write_wall_clock [&]
> > >                     kvm_pv_clock_pairing [&]
> > >                     kvmgt_rw_gpa [?]
> >
> > This then expands (partially) to
> >
> > intel_gvt_hypervisor_write_gpa
> >     emulate_csb_update
> >         emulate_execlist_ctx_schedule_out
> >             complete_execlist_workload
> >                 complete_current_workload
> >                      workload_thread
> >         emulate_execlist_ctx_schedule_in
> >             prepare_execlist_workload
> >                 prepare_workload
> >                     dispatch_workload
> >                         workload_thread
> >
> > So KVMGT is always writing to GPAs instead of IOVAs and basically
> > bypassing a guest IOMMU.  So here it would be better if kvmgt was
> > changed not use kvm_write_guest (also because I'd probably have nacked
> > that if I had known :)).
> 
> I agree.
> 
> >
> > As far as I know, there is some work on live migration with both VFIO
> > and mdev, and that probably includes some dirty page tracking API.
> > kvmgt could switch to that API, or there could be VFIO APIs similar to
> > kvm_write_guest but taking IOVAs instead of GPAs.  Advantage: this would
> > fix the GPA/IOVA confusion.  Disadvantage: userspace would lose the
> > tracking of writes from mdev devices.  Kevin, are these writes used in
> > any way?  Do the calls to intel_gvt_hypervisor_write_gpa covers all
> > writes from kvmgt vGPUs, or can the hardware write to memory as well
> > (which would be my guess if I didn't know anything about kvmgt, which I
> > pretty much don't)?
> 
> intel_gvt_hypervisor_write_gpa covers all writes due to software mediation.
> 
> for hardware updates, it needs be mapped in IOMMU through
> vfio_pin_pages
> before any DMA happens. The ongoing dirty tracking effort in VFIO will take
> every pinned page through that API as dirtied.
> 
> However, currently VFIO doesn't implement any vfio_read/write_guest
> interface yet. and it doesn't make sense to use vfio_pin_pages for software
> dirtied pages, as pin is unnecessary and heavy involving iommu invalidation.

One correction: vfio_pin_pages doesn't involve iommu invalidation.  I just
meant that pinning the page is not necessary.  We just need a kvm-like,
hva-based interface for the access.

> 
> Alex, if you are OK we'll work on such interface and move kvmgt to use it.
> After it's accepted, we can also mark pages dirty through this new interface
> in Kirti's dirty page tracking series.
> 

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH RFC 04/15] KVM: Implement ring-based dirty memory tracking
  2019-12-17  5:17                             ` Tian, Kevin
@ 2019-12-17  5:25                               ` Yan Zhao
  2019-12-17 16:24                                 ` Alex Williamson
  0 siblings, 1 reply; 123+ messages in thread
From: Yan Zhao @ 2019-12-17  5:25 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: 'Paolo Bonzini',
	Peter Xu, Christopherson, Sean J, linux-kernel, kvm,
	Dr . David Alan Gilbert, Vitaly Kuznetsov, Alex Williamson, Wang,
	Zhenyu Z

On Tue, Dec 17, 2019 at 01:17:29PM +0800, Tian, Kevin wrote:
> > From: Tian, Kevin
> > Sent: Tuesday, December 17, 2019 10:29 AM
> > 
> > > From: Paolo Bonzini
> > > Sent: Monday, December 16, 2019 6:08 PM
> > >
> > > [Alex and Kevin: there are doubts below regarding dirty page tracking
> > > from VFIO and mdev devices, which perhaps you can help with]
> > >
> > > On 15/12/19 18:21, Peter Xu wrote:
> > > >                 init_rmode_tss
> > > >                     vmx_set_tss_addr
> > > >                         kvm_vm_ioctl_set_tss_addr [*]
> > > >                 init_rmode_identity_map
> > > >                     vmx_create_vcpu [*]
> > >
> > > These don't matter because their content is not visible to userspace
> > > (the backing storage is mmap-ed by __x86_set_memory_region).  In fact, d
> > >
> > > >                 vmx_write_pml_buffer
> > > >                     kvm_arch_write_log_dirty [&]
> > > >                 kvm_write_guest
> > > >                     kvm_hv_setup_tsc_page
> > > >                         kvm_guest_time_update [&]
> > > >                     nested_flush_cached_shadow_vmcs12 [&]
> > > >                     kvm_write_wall_clock [&]
> > > >                     kvm_pv_clock_pairing [&]
> > > >                     kvmgt_rw_gpa [?]
> > >
> > > This then expands (partially) to
> > >
> > > intel_gvt_hypervisor_write_gpa
> > >     emulate_csb_update
> > >         emulate_execlist_ctx_schedule_out
> > >             complete_execlist_workload
> > >                 complete_current_workload
> > >                      workload_thread
> > >         emulate_execlist_ctx_schedule_in
> > >             prepare_execlist_workload
> > >                 prepare_workload
> > >                     dispatch_workload
> > >                         workload_thread
> > >
> > > So KVMGT is always writing to GPAs instead of IOVAs and basically
> > > bypassing a guest IOMMU.  So here it would be better if kvmgt was
> > > changed not use kvm_write_guest (also because I'd probably have nacked
> > > that if I had known :)).
> > 
> > I agree.
> > 
> > >
> > > As far as I know, there is some work on live migration with both VFIO
> > > and mdev, and that probably includes some dirty page tracking API.
> > > kvmgt could switch to that API, or there could be VFIO APIs similar to
> > > kvm_write_guest but taking IOVAs instead of GPAs.  Advantage: this would
> > > fix the GPA/IOVA confusion.  Disadvantage: userspace would lose the
> > > tracking of writes from mdev devices.  Kevin, are these writes used in
> > > any way?  Do the calls to intel_gvt_hypervisor_write_gpa covers all
> > > writes from kvmgt vGPUs, or can the hardware write to memory as well
> > > (which would be my guess if I didn't know anything about kvmgt, which I
> > > pretty much don't)?
> > 
> > intel_gvt_hypervisor_write_gpa covers all writes due to software mediation.
> > 
> > for hardware updates, it needs be mapped in IOMMU through
> > vfio_pin_pages
> > before any DMA happens. The ongoing dirty tracking effort in VFIO will take
> > every pinned page through that API as dirtied.
> > 
> > However, currently VFIO doesn't implement any vfio_read/write_guest
> > interface yet. and it doesn't make sense to use vfio_pin_pages for software
> > dirtied pages, as pin is unnecessary and heavy involving iommu invalidation.
> 
> One correction. vfio_pin_pages doesn't involve iommu invalidation. I should
> just mean that pinning the page is not necessary. We just need a kvm-like
> interface based on hva to access.
>
And can we propose differentiating read and write when calling
vfio_pin_pages, e.g. vfio_pin_pages_read and vfio_pin_pages_write?
Otherwise, calling vfio_pin_pages will unnecessarily cause read-only pages
to be reported dirty, and sometimes reading guest pages is a way for the
device model to track dirty pages.
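
For illustration, a sketch of what that split might look like (hypothetical
prototypes, loosely modeled on the existing vfio_pin_pages(); the exact
parameters are assumptions):

    /*
     * Hypothetical variants letting the vendor driver state its intent,
     * so that read-only pinnings are not reported as dirty.
     * Illustrative prototypes only.
     */
    int vfio_pin_pages_read(struct device *dev, unsigned long *user_pfn,
                            int npage, unsigned long *phys_pfn);
    int vfio_pin_pages_write(struct device *dev, unsigned long *user_pfn,
                             int npage, unsigned long *phys_pfn);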

> > 
> > Alex, if you are OK we'll work on such interface and move kvmgt to use it.
> > After it's accepted, we can also mark pages dirty through this new interface
> > in Kirti's dirty page tracking series.
> > 

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH RFC 04/15] KVM: Implement ring-based dirty memory tracking
  2019-12-16 18:54                           ` Peter Xu
@ 2019-12-17  9:01                             ` Paolo Bonzini
  2019-12-17 16:24                               ` Peter Xu
  0 siblings, 1 reply; 123+ messages in thread
From: Paolo Bonzini @ 2019-12-17  9:01 UTC (permalink / raw)
  To: Peter Xu
  Cc: Sean Christopherson, linux-kernel, kvm, Dr . David Alan Gilbert,
	Vitaly Kuznetsov, Alex Williamson, Tian, Kevin

On 16/12/19 19:54, Peter Xu wrote:
> On Mon, Dec 16, 2019 at 11:08:15AM +0100, Paolo Bonzini wrote:
>>> Although now because we have kvm_get_running_vcpu() all cases for [&]
>>> should be fine without changing anything, but I tend to add another
>>> patch in the next post to convert all the [&] cases explicitly to pass
>>> vcpu pointer instead of kvm pointer to be clear if no one disagrees,
>>> then we verify that against kvm_get_running_vcpu().
>>
>> This is a good idea but remember not to convert those to
>> kvm_vcpu_write_guest, because you _don't_ want these writes to touch
>> SMRAM (most of the addresses are OS-controlled rather than
>> firmware-controlled).
> 
> OK.  I think I only need to pass in vcpu* instead of kvm* in
> kvm_write_guest_page() just like kvm_vcpu_write_guest(), however we
> still keep to only write to address space id==0 for that.

No, please pass it all the way down to the [&] functions but not to
kvm_write_guest_page.  Those should keep using vcpu->kvm.

>>> init_rmode_tss or init_rmode_identity_map.  But I've marked them as
>>> unimportant because they should only happen once at boot.
>>
>> We need to check if userspace can add an arbitrary number of entries by
>> calling KVM_SET_TSS_ADDR repeatedly.  I think it can; we'd have to
>> forbid multiple calls to KVM_SET_TSS_ADDR which is not a problem in general.
> 
> Will do that altogether with the series.  I can further change both of
> these calls to not track dirty at all, which shouldn't be hard, after
> all userspace didn't even know them, as you mentioned below.
> 
> Is there anything to explain what KVM_SET_TSS_ADDR is used for?  This
> is the thing I found that is closest to useful (from api.txt):

The best description is probably at https://lwn.net/Articles/658883/:

They are needed for unrestricted_guest=0. Remember that, in that case,
the VM always runs in protected mode and with paging enabled. In order
to emulate real mode you put the guest in a vm86 task, so you need some
place for a TSS and for a page table, and they must be in guest RAM
because the guest's TR and CR3 point to them. They are invisible to the
guest, because the STR and MOV-from-CR instructions are invalid in vm86
mode, but they must be there.

If you don't call KVM_SET_TSS_ADDR you actually get a complaint in
dmesg, and the TR stays at 0. I am not really sure what kind of bad
things can happen with unrestricted_guest=0, probably you just get a VM
Entry failure. The TSS takes 3 pages of memory. An interesting point is
that you actually don't need to set the TR selector to a valid value (as
you would do when running in "normal" vm86 mode), you can simply set the
base and limit registers that are hidden in the processor, and generally
inaccessible except through VMREAD/VMWRITE or system management mode. So
KVM needs to set up a TSS but not a GDT.

For paging, instead, 1 page is enough because we have only 4GB of memory
to address. KVM disables CR4.PAE (page address extensions, aka 8-byte
entries in each page directory or page table) and enables CR4.PSE (page
size extensions, aka 4MB huge pages support with 4-byte page directory
entries). One page then fits 1024 4-byte page directory entries, each
mapping a 4MB huge page, totaling exactly 4GB. Here, if you don't set it, the
page table is at address 0xFFFBC000. QEMU changes it to 0xFEFFC000 so
that the BIOS can be up to 16MB in size (the default only allows 256k
between 0xFFFC0000 and 0xFFFFFFFF).

The different handling, where only the page table has a default, is
unfortunate, but so goes life...
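
As a concrete illustration of that last point, the single identity-map page
boils down to something like this (a sketch that only approximates what
init_rmode_identity_map() writes; the flag names are the generic x86 ones):

    /*
     * One page of 1024 4-byte page directory entries, each a present,
     * writable 4MB (PSE) huge page, identity-mapping 0..4GB:
     * 1024 entries * 4MB = 4GB.
     */
    u32 pde[1024];
    int i;

    for (i = 0; i < 1024; i++)
            pde[i] = ((u32)i << 22) | _PAGE_PRESENT | _PAGE_RW | _PAGE_USER |
                     _PAGE_ACCESSED | _PAGE_DIRTY | _PAGE_PSE;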

> So... has it really popped into existance somewhere?  It would be good
> at least to know why it does not need to be migrated.

It does not need to be migrated just because the contents are constant.

Paolo


^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH RFC 04/15] KVM: Implement ring-based dirty memory tracking
  2019-12-14  7:57       ` Paolo Bonzini
  2019-12-14 16:26         ` Peter Xu
@ 2019-12-17 12:16         ` Christophe de Dinechin
  2019-12-17 12:19           ` Paolo Bonzini
  1 sibling, 1 reply; 123+ messages in thread
From: Christophe de Dinechin @ 2019-12-17 12:16 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Peter Xu, Christophe de Dinechin, linux-kernel, kvm,
	Sean Christopherson, Dr . David Alan Gilbert, Vitaly Kuznetsov



> On 14 Dec 2019, at 08:57, Paolo Bonzini <pbonzini@redhat.com> wrote:
> 
> On 13/12/19 21:23, Peter Xu wrote:
>>> What is the benefit of using u16 for that? That means with 4K pages, you
>>> can share at most 256M of dirty memory each time? That seems low to me,
>>> especially since it's sufficient to touch one byte in a page to dirty it.
>>> 
>>> Actually, this is not consistent with the definition in the code ;-)
>>> So I'll assume it's actually u32.
>> Yes it's u32 now.  Actually I believe at least Paolo would prefer u16
>> more. :)
> 
> It has to be u16, because it overlaps the padding of the first entry.

Wow, now that’s subtle.

That definitely needs a union with the padding to make this explicit.

(My guess is you do that to page-align the whole thing and avoid adding a
page just for the counters)
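
For readers following along, the layout being discussed was roughly of this
shape (a reconstruction of the earlier, since-scrapped version; field names
are approximate and not the final uapi):

    struct kvm_dirty_gfn {
            __u32 pad;      /* unused in every entry except the first */
            __u32 slot;
            __u64 offset;
    };

    /*
     * The two u16 ring indices live in the padding of dirty_gfns[0], so
     * the counters cost no extra page and the ring stays page-aligned.
     */
    union kvm_dirty_ring_page {
            struct {
                    __u16 avail_index;  /* written by the kernel */
                    __u16 fetch_index;  /* written by userspace */
            } indices;
            struct kvm_dirty_gfn dirty_gfns[0];
    };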

> 
> Paolo
> 
>> I think even u16 would be mostly enough (if you see, the maximum
>> allowed value currently is 64K entries only, not a big one).  Again,
>> the thing is that the userspace should be collecting the dirty bits,
>> so the ring shouldn't reach full easily.  Even if it does, we should
>> probably let it stop for a while as explained above.  It'll be
>> inefficient only if we set it to a too-small value, imho.
>> 
> 


^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH RFC 04/15] KVM: Implement ring-based dirty memory tracking
  2019-12-17 12:16         ` Christophe de Dinechin
@ 2019-12-17 12:19           ` Paolo Bonzini
  2019-12-17 15:38             ` Peter Xu
  0 siblings, 1 reply; 123+ messages in thread
From: Paolo Bonzini @ 2019-12-17 12:19 UTC (permalink / raw)
  To: Christophe de Dinechin
  Cc: Peter Xu, Christophe de Dinechin, linux-kernel, kvm,
	Sean Christopherson, Dr . David Alan Gilbert, Vitaly Kuznetsov

On 17/12/19 13:16, Christophe de Dinechin wrote:
> 
> 
>> On 14 Dec 2019, at 08:57, Paolo Bonzini <pbonzini@redhat.com> wrote:
>>
>> On 13/12/19 21:23, Peter Xu wrote:
>>>> What is the benefit of using u16 for that? That means with 4K pages, you
>>>> can share at most 256M of dirty memory each time? That seems low to me,
>>>> especially since it's sufficient to touch one byte in a page to dirty it.
>>>>
>>>> Actually, this is not consistent with the definition in the code ;-)
>>>> So I'll assume it's actually u32.
>>> Yes it's u32 now.  Actually I believe at least Paolo would prefer u16
>>> more. :)
>>
>> It has to be u16, because it overlaps the padding of the first entry.
> 
> Wow, now that’s subtle.
> 
> That definitely needs a union with the padding to make this explicit.
> 
> (My guess is you do that to page-align the whole thing and avoid adding a
> page just for the counters)

Yes, that was the idea but Peter decided to scrap it. :)

Paolo

>>
>> Paolo
>>
>>> I think even u16 would be mostly enough (if you see, the maximum
>>> allowed value currently is 64K entries only, not a big one).  Again,
>>> the thing is that the userspace should be collecting the dirty bits,
>>> so the ring shouldn't reach full easily.  Even if it does, we should
>>> probably let it stop for a while as explained above.  It'll be
>>> inefficient only if we set it to a too-small value, imho.
>>>
>>
> 


^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH RFC 04/15] KVM: Implement ring-based dirty memory tracking
  2019-12-17 12:19           ` Paolo Bonzini
@ 2019-12-17 15:38             ` Peter Xu
  2019-12-17 16:31               ` Paolo Bonzini
  0 siblings, 1 reply; 123+ messages in thread
From: Peter Xu @ 2019-12-17 15:38 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Christophe de Dinechin, Christophe de Dinechin, linux-kernel,
	kvm, Sean Christopherson, Dr . David Alan Gilbert,
	Vitaly Kuznetsov

On Tue, Dec 17, 2019 at 01:19:05PM +0100, Paolo Bonzini wrote:
> On 17/12/19 13:16, Christophe de Dinechin wrote:
> > 
> > 
> >> On 14 Dec 2019, at 08:57, Paolo Bonzini <pbonzini@redhat.com> wrote:
> >>
> >> On 13/12/19 21:23, Peter Xu wrote:
> >>>> What is the benefit of using u16 for that? That means with 4K pages, you
> >>>> can share at most 256M of dirty memory each time? That seems low to me,
> >>>> especially since it's sufficient to touch one byte in a page to dirty it.
> >>>>
> >>>> Actually, this is not consistent with the definition in the code ;-)
> >>>> So I'll assume it's actually u32.
> >>> Yes it's u32 now.  Actually I believe at least Paolo would prefer u16
> >>> more. :)
> >>
> >> It has to be u16, because it overlaps the padding of the first entry.
> > 
> > Wow, now that’s subtle.
> > 
> > That definitely needs a union with the padding to make this explicit.
> > 
> > (My guess is you do that to page-align the whole thing and avoid adding a
> > page just for the counters)

(Just to make sure this is clear... Paolo was talking about the
 previous version.  This version does not have this limitation because
 we don't have that union definition any more)

> 
> Yes, that was the idea but Peter decided to scrap it. :)

There's still time to persuade me to go back to it. :)

(Though, yes, I still like the current solution... if we can get rid of the
 only kvmgt ugliness, we can even throw away the per-vm ring with its
 "extra" 4k page.  Then I suppose it'll be even harder to persuade me :)

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH RFC 04/15] KVM: Implement ring-based dirty memory tracking
  2019-12-17  2:28                           ` Tian, Kevin
@ 2019-12-17 16:18                             ` Alex Williamson
  2019-12-17 16:30                               ` Paolo Bonzini
  0 siblings, 1 reply; 123+ messages in thread
From: Alex Williamson @ 2019-12-17 16:18 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Paolo Bonzini, Peter Xu, Christopherson, Sean J, linux-kernel,
	kvm, Dr . David Alan Gilbert, Vitaly Kuznetsov, Wang, Zhenyu Z,
	Zhao, Yan Y

On Tue, 17 Dec 2019 02:28:33 +0000
"Tian, Kevin" <kevin.tian@intel.com> wrote:

> > From: Paolo Bonzini
> > Sent: Monday, December 16, 2019 6:08 PM
> > 
> > [Alex and Kevin: there are doubts below regarding dirty page tracking
> > from VFIO and mdev devices, which perhaps you can help with]
> > 
> > On 15/12/19 18:21, Peter Xu wrote:  
> > >                 init_rmode_tss
> > >                     vmx_set_tss_addr
> > >                         kvm_vm_ioctl_set_tss_addr [*]
> > >                 init_rmode_identity_map
> > >                     vmx_create_vcpu [*]  
> > 
> > These don't matter because their content is not visible to userspace
> > (the backing storage is mmap-ed by __x86_set_memory_region).  In fact, d
> >   
> > >                 vmx_write_pml_buffer
> > >                     kvm_arch_write_log_dirty [&]
> > >                 kvm_write_guest
> > >                     kvm_hv_setup_tsc_page
> > >                         kvm_guest_time_update [&]
> > >                     nested_flush_cached_shadow_vmcs12 [&]
> > >                     kvm_write_wall_clock [&]
> > >                     kvm_pv_clock_pairing [&]
> > >                     kvmgt_rw_gpa [?]  
> > 
> > This then expands (partially) to
> > 
> > intel_gvt_hypervisor_write_gpa
> >     emulate_csb_update
> >         emulate_execlist_ctx_schedule_out
> >             complete_execlist_workload
> >                 complete_current_workload
> >                      workload_thread
> >         emulate_execlist_ctx_schedule_in
> >             prepare_execlist_workload
> >                 prepare_workload
> >                     dispatch_workload
> >                         workload_thread
> > 
> > So KVMGT is always writing to GPAs instead of IOVAs and basically
> > bypassing a guest IOMMU.  So here it would be better if kvmgt was
> > changed not use kvm_write_guest (also because I'd probably have nacked
> > that if I had known :)).  
> 
> I agree. 
> 
> > 
> > As far as I know, there is some work on live migration with both VFIO
> > and mdev, and that probably includes some dirty page tracking API.
> > kvmgt could switch to that API, or there could be VFIO APIs similar to
> > kvm_write_guest but taking IOVAs instead of GPAs.  Advantage: this would
> > fix the GPA/IOVA confusion.  Disadvantage: userspace would lose the
> > tracking of writes from mdev devices.  Kevin, are these writes used in
> > any way?  Do the calls to intel_gvt_hypervisor_write_gpa covers all
> > writes from kvmgt vGPUs, or can the hardware write to memory as well
> > (which would be my guess if I didn't know anything about kvmgt, which I
> > pretty much don't)?  
> 
> intel_gvt_hypervisor_write_gpa covers all writes due to software mediation.
> 
> for hardware updates, it needs be mapped in IOMMU through vfio_pin_pages 
> before any DMA happens. The ongoing dirty tracking effort in VFIO will take
> every pinned page through that API as dirtied.
> 
> However, currently VFIO doesn't implement any vfio_read/write_guest
> interface yet. and it doesn't make sense to use vfio_pin_pages for software
> dirtied pages, as pin is unnecessary and heavy involving iommu invalidation.
> 
> Alex, if you are OK we'll work on such interface and move kvmgt to use it.
> After it's accepted, we can also mark pages dirty through this new interface
> in Kirti's dirty page tracking series.

I'm not sure what you're asking for; is it an interface for the host
CPU to read/write the memory backing of a mapped IOVA range without
pinning pages?  Something like that would make sense for
an emulation model where a page does not need to be pinned for physical
DMA.  If you're asking more for an interface that understands the
userspace driver is a VM (ie. implied using a _guest postfix on the
function name) and knows about GPA mappings beyond the windows directly
mapped for device access, I'd not look fondly on such a request.
Thanks,

Alex


^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH RFC 04/15] KVM: Implement ring-based dirty memory tracking
  2019-12-17  9:01                             ` Paolo Bonzini
@ 2019-12-17 16:24                               ` Peter Xu
  2019-12-17 16:28                                 ` Paolo Bonzini
  0 siblings, 1 reply; 123+ messages in thread
From: Peter Xu @ 2019-12-17 16:24 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sean Christopherson, linux-kernel, kvm, Dr . David Alan Gilbert,
	Vitaly Kuznetsov, Alex Williamson, Tian, Kevin

On Tue, Dec 17, 2019 at 10:01:40AM +0100, Paolo Bonzini wrote:
> On 16/12/19 19:54, Peter Xu wrote:
> > On Mon, Dec 16, 2019 at 11:08:15AM +0100, Paolo Bonzini wrote:
> >>> Although now because we have kvm_get_running_vcpu() all cases for [&]
> >>> should be fine without changing anything, but I tend to add another
> >>> patch in the next post to convert all the [&] cases explicitly to pass
> >>> vcpu pointer instead of kvm pointer to be clear if no one disagrees,
> >>> then we verify that against kvm_get_running_vcpu().
> >>
> >> This is a good idea but remember not to convert those to
> >> kvm_vcpu_write_guest, because you _don't_ want these writes to touch
> >> SMRAM (most of the addresses are OS-controlled rather than
> >> firmware-controlled).
> > 
> > OK.  I think I only need to pass in vcpu* instead of kvm* in
> > kvm_write_guest_page() just like kvm_vcpu_write_guest(), however we
> > still keep to only write to address space id==0 for that.
> 
> No, please pass it all the way down to the [&] functions but not to
> kvm_write_guest_page.  Those should keep using vcpu->kvm.

Actually I even wanted to refactor these helpers.  I mean, we have two
sets of helpers now, kvm_[vcpu]_{read|write}*(): one set is per-vm, the
other set is per-vcpu.  IIUC the only difference between the two is
whether we should consider ((vcpu)->arch.hflags & HF_SMM_MASK) or
always just write to address space zero.  Could we unify them into a
single set of helpers (I'd just drop the *_vcpu_* helpers because the
name is longer to write) that always takes vcpu* as the first parameter?
Then we add another parameter "vcpu_smm" to show whether we want to
consider the HF_SMM_MASK flag.

Kvmgt is of course special here because it does not have a vcpu context,
but as we're going to rework that, I'd like to know whether you agree
with the above refactoring, leaving the kvmgt caller aside.
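
A minimal sketch of the kind of unification being floated here, assuming the
existing kvm_write_guest()/kvm_vcpu_write_guest() helpers (the unified name
and the vcpu_smm flag are hypothetical, not merged code):

    /*
     * Hypothetical unified helper: always takes the vcpu, and an
     * explicit flag says whether the SMM address space (HF_SMM_MASK)
     * should be honoured.
     */
    static int kvm_write_guest_unified(struct kvm_vcpu *vcpu, gpa_t gpa,
                                       const void *data, unsigned long len,
                                       bool vcpu_smm)
    {
            if (vcpu_smm)
                    /* behaves like today's kvm_vcpu_write_guest() */
                    return kvm_vcpu_write_guest(vcpu, gpa, data, len);

            /* always address space 0, like today's kvm_write_guest() */
            return kvm_write_guest(vcpu->kvm, gpa, data, len);
    }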

> 
> >>> init_rmode_tss or init_rmode_identity_map.  But I've marked them as
> >>> unimportant because they should only happen once at boot.
> >>
> >> We need to check if userspace can add an arbitrary number of entries by
> >> calling KVM_SET_TSS_ADDR repeatedly.  I think it can; we'd have to
> >> forbid multiple calls to KVM_SET_TSS_ADDR which is not a problem in general.
> > 
> > Will do that altogether with the series.  I can further change both of
> > these calls to not track dirty at all, which shouldn't be hard, after
> > all userspace didn't even know them, as you mentioned below.
> > 
> > Is there anything to explain what KVM_SET_TSS_ADDR is used for?  This
> > is the thing I found that is closest to useful (from api.txt):
> 
> The best description is probably at https://lwn.net/Articles/658883/:
> 
> They are needed for unrestricted_guest=0. Remember that, in that case,
> the VM always runs in protected mode and with paging enabled. In order
> to emulate real mode you put the guest in a vm86 task, so you need some
> place for a TSS and for a page table, and they must be in guest RAM
> because the guest's TR and CR3 points to it. They are invisible to the
> guest, because the STR and MOV-from-CR instructions are invalid in vm86
> mode, but it must be there.
> 
> If you don't call KVM_SET_TSS_ADDR you actually get a complaint in
> dmesg, and the TR stays at 0. I am not really sure what kind of bad
> things can happen with unrestricted_guest=0, probably you just get a VM
> Entry failure. The TSS takes 3 pages of memory. An interesting point is
> that you actually don't need to set the TR selector to a valid value (as
> you would do when running in "normal" vm86 mode), you can simply set the
> base and limit registers that are hidden in the processor, and generally
> inaccessible except through VMREAD/VMWRITE or system management mode. So
> KVM needs to set up a TSS but not a GDT.
> 
> For paging, instead, 1 page is enough because we have only 4GB of memory
> to address. KVM disables CR4.PAE (page address extensions, aka 8-byte
> entries in each page directory or page table) and enables CR4.PSE (page
> size extensions, aka 4MB huge pages support with 4-byte page directory
> entries). One page then fits 1024 4-byte page directory entries, each
> for a 4MB huge pages, totaling exactly 4GB. Here if you don't set it the
> page table is at address 0xFFFBC000. QEMU changes it to 0xFEFFC000 so
> that the BIOS can be up to 16MB in size (the default only allows 256k
> between 0xFFFC0000 and 0xFFFFFFFF).
> 
> The different handling, where only the page table has a default, is
> unfortunate, but so goes life...
> 
> > So... has it really popped into existance somewhere?  It would be good
> > at least to know why it does not need to be migrated.
> 
> It does not need to be migrated just because the contents are constant.

OK, thanks!  IIUC they should likely be all zeros then.

Do you think it's time to add most of these to kvm/api.txt? :)  I can
do that too if you like.

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH RFC 04/15] KVM: Implement ring-based dirty memory tracking
  2019-12-17  5:25                               ` Yan Zhao
@ 2019-12-17 16:24                                 ` Alex Williamson
  0 siblings, 0 replies; 123+ messages in thread
From: Alex Williamson @ 2019-12-17 16:24 UTC (permalink / raw)
  To: Yan Zhao
  Cc: Tian, Kevin, 'Paolo Bonzini',
	Peter Xu, Christopherson, Sean J, linux-kernel, kvm,
	Dr . David Alan Gilbert, Vitaly Kuznetsov, Wang, Zhenyu Z

On Tue, 17 Dec 2019 00:25:02 -0500
Yan Zhao <yan.y.zhao@intel.com> wrote:

> On Tue, Dec 17, 2019 at 01:17:29PM +0800, Tian, Kevin wrote:
> > > From: Tian, Kevin
> > > Sent: Tuesday, December 17, 2019 10:29 AM
> > >   
> > > > From: Paolo Bonzini
> > > > Sent: Monday, December 16, 2019 6:08 PM
> > > >
> > > > [Alex and Kevin: there are doubts below regarding dirty page tracking
> > > > from VFIO and mdev devices, which perhaps you can help with]
> > > >
> > > > On 15/12/19 18:21, Peter Xu wrote:  
> > > > >                 init_rmode_tss
> > > > >                     vmx_set_tss_addr
> > > > >                         kvm_vm_ioctl_set_tss_addr [*]
> > > > >                 init_rmode_identity_map
> > > > >                     vmx_create_vcpu [*]  
> > > >
> > > > These don't matter because their content is not visible to userspace
> > > > (the backing storage is mmap-ed by __x86_set_memory_region).  In fact, d
> > > >  
> > > > >                 vmx_write_pml_buffer
> > > > >                     kvm_arch_write_log_dirty [&]
> > > > >                 kvm_write_guest
> > > > >                     kvm_hv_setup_tsc_page
> > > > >                         kvm_guest_time_update [&]
> > > > >                     nested_flush_cached_shadow_vmcs12 [&]
> > > > >                     kvm_write_wall_clock [&]
> > > > >                     kvm_pv_clock_pairing [&]
> > > > >                     kvmgt_rw_gpa [?]  
> > > >
> > > > This then expands (partially) to
> > > >
> > > > intel_gvt_hypervisor_write_gpa
> > > >     emulate_csb_update
> > > >         emulate_execlist_ctx_schedule_out
> > > >             complete_execlist_workload
> > > >                 complete_current_workload
> > > >                      workload_thread
> > > >         emulate_execlist_ctx_schedule_in
> > > >             prepare_execlist_workload
> > > >                 prepare_workload
> > > >                     dispatch_workload
> > > >                         workload_thread
> > > >
> > > > So KVMGT is always writing to GPAs instead of IOVAs and basically
> > > > bypassing a guest IOMMU.  So here it would be better if kvmgt was
> > > > changed not use kvm_write_guest (also because I'd probably have nacked
> > > > that if I had known :)).  
> > > 
> > > I agree.
> > >   
> > > >
> > > > As far as I know, there is some work on live migration with both VFIO
> > > > and mdev, and that probably includes some dirty page tracking API.
> > > > kvmgt could switch to that API, or there could be VFIO APIs similar to
> > > > kvm_write_guest but taking IOVAs instead of GPAs.  Advantage: this would
> > > > fix the GPA/IOVA confusion.  Disadvantage: userspace would lose the
> > > > tracking of writes from mdev devices.  Kevin, are these writes used in
> > > > any way?  Do the calls to intel_gvt_hypervisor_write_gpa covers all
> > > > writes from kvmgt vGPUs, or can the hardware write to memory as well
> > > > (which would be my guess if I didn't know anything about kvmgt, which I
> > > > pretty much don't)?  
> > > 
> > > intel_gvt_hypervisor_write_gpa covers all writes due to software mediation.
> > > 
> > > for hardware updates, it needs be mapped in IOMMU through
> > > vfio_pin_pages
> > > before any DMA happens. The ongoing dirty tracking effort in VFIO will take
> > > every pinned page through that API as dirtied.
> > > 
> > > However, currently VFIO doesn't implement any vfio_read/write_guest
> > > interface yet. and it doesn't make sense to use vfio_pin_pages for software
> > > dirtied pages, as pin is unnecessary and heavy involving iommu invalidation.  
> > 
> > One correction. vfio_pin_pages doesn't involve iommu invalidation. I should
> > just mean that pinning the page is not necessary. We just need a kvm-like
> > interface based on hva to access.
> >  
> And can we propose to differentiate read and write when calling vfio_pin_pages, e.g.
> vfio_pin_pages_read, vfio_pin_pages_write? Otherwise, calling to
> vfio_pin_pages will unnecessarily cause read pages to be dirty and
> sometimes reading guest pages is a way for device model to track dirty
> pages.

Yes, I've discussed this with Kirti; when devices add more fine-grained
dirty tracking we'll probably need to extend the mdev pinned pages
interface to allow vendor drivers to indicate that a pinning is intended
to be used as read-only, and perhaps also a way to unpin, as clean, a
page that was pinned as read-write, if the device did not write to it.
So perhaps vfio_pin_pages_for_read() and vfio_unpin_pages_clean().  Thanks,

Alex


^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH RFC 04/15] KVM: Implement ring-based dirty memory tracking
  2019-12-17 16:24                               ` Peter Xu
@ 2019-12-17 16:28                                 ` Paolo Bonzini
  2019-12-18 21:58                                   ` Peter Xu
  0 siblings, 1 reply; 123+ messages in thread
From: Paolo Bonzini @ 2019-12-17 16:28 UTC (permalink / raw)
  To: Peter Xu
  Cc: Sean Christopherson, linux-kernel, kvm, Dr . David Alan Gilbert,
	Vitaly Kuznetsov, Alex Williamson, Tian, Kevin

On 17/12/19 17:24, Peter Xu wrote:
>> No, please pass it all the way down to the [&] functions but not to
>> kvm_write_guest_page.  Those should keep using vcpu->kvm.
> Actually I even wanted to refactor these helpers.  I mean, we have two
> sets of helpers now, kvm_[vcpu]_{read|write}*(), so one set is per-vm,
> the other set is per-vcpu.  IIUC the only difference of these two are
> whether we should consider ((vcpu)->arch.hflags & HF_SMM_MASK) or we
> just write to address space zero always.

Right.

> Could we unify them into a
> single set of helper (I'll just drop the *_vcpu_* helpers because it's
> longer when write) but we always pass in vcpu* as the first parameter?
> Then we add another parameter "vcpu_smm" to show whether we want to
> consider the HF_SMM_MASK flag.

You'd have to check through all KVM implementations whether you always
have the vCPU.  Also non-x86 doesn't have address spaces, and by the
time you add ", true" or ", false" it's longer than the "_vcpu_" you
have removed.  So, not a good idea in my opinion. :D

Paolo


^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH RFC 04/15] KVM: Implement ring-based dirty memory tracking
  2019-12-17 16:18                             ` Alex Williamson
@ 2019-12-17 16:30                               ` Paolo Bonzini
  2019-12-18  0:29                                 ` Tian, Kevin
  0 siblings, 1 reply; 123+ messages in thread
From: Paolo Bonzini @ 2019-12-17 16:30 UTC (permalink / raw)
  To: Alex Williamson, Tian, Kevin
  Cc: Peter Xu, Christopherson, Sean J, linux-kernel, kvm,
	Dr . David Alan Gilbert, Vitaly Kuznetsov, Wang, Zhenyu Z, Zhao,
	Yan Y

On 17/12/19 17:18, Alex Williamson wrote:
>>
>> Alex, if you are OK we'll work on such interface and move kvmgt to use it.
>> After it's accepted, we can also mark pages dirty through this new interface
>> in Kirti's dirty page tracking series.
> I'm not sure what you're asking for, is it an interface for the host
> CPU to read/write the memory backing of a mapped IOVA range without
> pinning pages?  That seems like something like that would make sense for
> an emulation model where a page does not need to be pinned for physical
> DMA.  If you're asking more for an interface that understands the
> userspace driver is a VM (ie. implied using a _guest postfix on the
> function name) and knows about GPA mappings beyond the windows directly
> mapped for device access, I'd not look fondly on such a request.

No, it would definitely be the former, using IOVAs to access guest
memory---kvmgt is currently doing the latter by calling into KVM, and
I'm not really fond of that either.

Paolo


^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH RFC 04/15] KVM: Implement ring-based dirty memory tracking
  2019-12-17 15:38             ` Peter Xu
@ 2019-12-17 16:31               ` Paolo Bonzini
  2019-12-17 16:42                 ` Peter Xu
  0 siblings, 1 reply; 123+ messages in thread
From: Paolo Bonzini @ 2019-12-17 16:31 UTC (permalink / raw)
  To: Peter Xu
  Cc: Christophe de Dinechin, Christophe de Dinechin, linux-kernel,
	kvm, Sean Christopherson, Dr . David Alan Gilbert,
	Vitaly Kuznetsov

On 17/12/19 16:38, Peter Xu wrote:
> There's still time to persuade me to going back to it. :)
> 
> (Though, yes I still like current solution... if we can get rid of the
>  only kvmgt ugliness, we can even throw away the per-vm ring with its
>  "extra" 4k page.  Then I suppose it'll be even harder to persuade me :)

Actually that's what convinced me in the first place, so let's
absolutely get rid of both the per-VM ring and the union.  Kevin and
Alex have answered and everybody seems to agree.
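
To sketch what dropping the per-VM ring means in code (illustrative only:
the push helper, its arguments and the dirty_ring field stand in for whatever
the series ends up naming; locking and slot bookkeeping are omitted):

    void mark_page_dirty_in_slot(struct kvm_memory_slot *memslot, gfn_t gfn)
    {
            struct kvm_vcpu *vcpu = kvm_get_running_vcpu();

            if (memslot && memslot->dirty_bitmap)
                    set_bit_le(gfn - memslot->base_gfn, memslot->dirty_bitmap);

            /*
             * With no per-VM ring, every dirty request is expected to
             * come with a running vCPU; push the gfn into that vCPU's
             * ring.
             */
            if (!WARN_ON_ONCE(!vcpu))
                    kvm_dirty_ring_push(&vcpu->dirty_ring, memslot, gfn);
    }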

Paolo


^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH RFC 04/15] KVM: Implement ring-based dirty memory tracking
  2019-12-17 16:31               ` Paolo Bonzini
@ 2019-12-17 16:42                 ` Peter Xu
  2019-12-17 16:48                   ` Paolo Bonzini
  0 siblings, 1 reply; 123+ messages in thread
From: Peter Xu @ 2019-12-17 16:42 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Christophe de Dinechin, Christophe de Dinechin, linux-kernel,
	kvm, Sean Christopherson, Dr . David Alan Gilbert,
	Vitaly Kuznetsov

On Tue, Dec 17, 2019 at 05:31:48PM +0100, Paolo Bonzini wrote:
> On 17/12/19 16:38, Peter Xu wrote:
> > There's still time to persuade me to going back to it. :)
> > 
> > (Though, yes I still like current solution... if we can get rid of the
> >  only kvmgt ugliness, we can even throw away the per-vm ring with its
> >  "extra" 4k page.  Then I suppose it'll be even harder to persuade me :)
> 
> Actually that's what convinced me in the first place, so let's
> absolutely get rid of both the per-VM ring and the union.  Kevin and
> Alex have answered and everybody seems to agree.

Yeah that'd be perfect.

However, I just noticed something... Note that we still haven't looked
into the non-x86 archs.  I think it's the same question as when I asked
whether we can unify the kvm[_vcpu]_write() interfaces and you'd like
me to read the non-x86 archs - I think it's time I read them, because
it's still possible that the non-x86 archs will still need the per-vm
ring... and that could be another problem if we eventually want to
spread the dirty ring idea outside of x86.

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH RFC 04/15] KVM: Implement ring-based dirty memory tracking
  2019-12-17 16:42                 ` Peter Xu
@ 2019-12-17 16:48                   ` Paolo Bonzini
  2019-12-17 19:41                     ` Peter Xu
  0 siblings, 1 reply; 123+ messages in thread
From: Paolo Bonzini @ 2019-12-17 16:48 UTC (permalink / raw)
  To: Peter Xu
  Cc: Christophe de Dinechin, Christophe de Dinechin, linux-kernel,
	kvm, Sean Christopherson, Dr . David Alan Gilbert,
	Vitaly Kuznetsov

On 17/12/19 17:42, Peter Xu wrote:
> 
> However I just noticed something... Note that we still didn't read
> into non-x86 archs, I think it's the same question as when I asked
> whether we can unify the kvm[_vcpu]_write() interfaces and you'd like
> me to read the non-x86 archs - I think it's time I read them, because
> it's still possible that non-x86 archs will still need the per-vm
> ring... then that could be another problem if we want to at last
> spread the dirty ring idea outside of x86.

We can take a look, but I think based on x86 experience it's okay if we
restrict dirty ring to arches that do no VM-wide accesses.

Paolo


^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH RFC 04/15] KVM: Implement ring-based dirty memory tracking
  2019-12-17 16:48                   ` Paolo Bonzini
@ 2019-12-17 19:41                     ` Peter Xu
  2019-12-18  0:33                       ` Paolo Bonzini
  0 siblings, 1 reply; 123+ messages in thread
From: Peter Xu @ 2019-12-17 19:41 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Christophe de Dinechin, Christophe de Dinechin, linux-kernel,
	kvm, Sean Christopherson, Dr . David Alan Gilbert,
	Vitaly Kuznetsov, David Hildenbrand, Eric Auger, Cornelia Huck

On Tue, Dec 17, 2019 at 05:48:58PM +0100, Paolo Bonzini wrote:
> On 17/12/19 17:42, Peter Xu wrote:
> > 
> > However I just noticed something... Note that we still didn't read
> > into non-x86 archs, I think it's the same question as when I asked
> > whether we can unify the kvm[_vcpu]_write() interfaces and you'd like
> > me to read the non-x86 archs - I think it's time I read them, because
> > it's still possible that non-x86 archs will still need the per-vm
> > ring... then that could be another problem if we want to at last
> > spread the dirty ring idea outside of x86.
> 
> We can take a look, but I think based on x86 experience it's okay if we
> restrict dirty ring to arches that do no VM-wide accesses.

Here it is - a quick update on callers of mark_page_dirty_in_slot().
The same reverse trace, but ignoring all common and x86 code paths
(which I covered in the other thread):

==================================

   mark_page_dirty_in_slot (non-x86)
        mark_page_dirty
            kvm_write_guest_page
                kvm_write_guest
                    kvm_write_guest_lock
                        vgic_its_save_ite [?]
                        vgic_its_save_dte [?]
                        vgic_its_save_cte [?]
                        vgic_its_save_collection_table [?]
                        vgic_v3_lpi_sync_pending_status [?]
                        vgic_v3_save_pending_tables [?]
                    kvmppc_rtas_hcall [&]
                    kvmppc_st [&]
                    access_guest [&]
                    put_guest_lc [&]
                    write_guest_lc [&]
                    write_guest_abs [&]
            mark_page_dirty
                _kvm_mips_map_page_fast [&]
                kvm_mips_map_page [&]
                kvmppc_mmu_map_page [&]
                kvmppc_copy_guest
                    kvmppc_h_page_init [&]
                kvmppc_xive_native_vcpu_eq_sync [&]
                adapter_indicators_set [?] (from kvm_set_irq)
                kvm_s390_sync_dirty_log [?]
                unpin_guest_page
                    unpin_blocks [&]
                    unpin_scb [&]

Cases with [*]: should not matter much
           [&]: should be able to change to per-vcpu context
           [?]: uncertain...

==================================

This time we've got 8 leaves with "[?]".

I'm starting with these:

        vgic_its_save_ite [?]
        vgic_its_save_dte [?]
        vgic_its_save_cte [?]
        vgic_its_save_collection_table [?]
        vgic_v3_lpi_sync_pending_status [?]
        vgic_v3_save_pending_tables [?]

These come from ARM-specific ioctls like KVM_DEV_ARM_ITS_SAVE_TABLES,
KVM_DEV_ARM_ITS_RESTORE_TABLES, and KVM_DEV_ARM_VGIC_SAVE_PENDING_TABLES.
IIUC ARM needed these to allow proper migration, which indeed does not
have a vcpu context.

(Though I'm a bit curious why ARM didn't simply migrate this
 information explicitly from userspace; instead it seems to me that
 ARM guests will dump something into guest RAM and then try to
 recover from that, which seems a bit weird)
 
Then it's this:

        adapter_indicators_set [?]

This is s390 specific and should come from kvm_set_irq.  I'm not
sure whether we can remove the mark_page_dirty() call here, if the data
is applied from another kernel structure (which should be migrated
properly IIUC).  But I might be completely wrong.

        kvm_s390_sync_dirty_log [?]
        
This is also s390 specific; it should be collecting dirty state from the
hardware PGSTE_UC_BIT.  No vcpu context for sure.

(I'd be glad too if anyone could hint me why x86 cannot use page table
 dirty bits for dirty tracking, if there's short answer...)

I think my conclusion so far...

  - for s390 I don't think we even need this dirty ring buffer thing,
    because I think hardware tracking should be more efficient, so we
    don't need to care much about s390 from the dirty ring's design
    perspective either,

  - for ARM, the no-vcpu-context dirty tracking probably needs to be
    considered, but hopefully that's a very special path so it rarely
    happens.  The bad thing is I didn't dig into how many pages will be
    dirtied when an ARM guest starts to dump all these things, so it
    could be a burst...  If it is, then there's a risk of triggering the
    ring full condition (which we wanted to avoid..)

I'm CCing Eric for ARM, Conny&David for s390, just in case there're
further inputs.

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 123+ messages in thread

* RE: [PATCH RFC 04/15] KVM: Implement ring-based dirty memory tracking
  2019-12-17 16:30                               ` Paolo Bonzini
@ 2019-12-18  0:29                                 ` Tian, Kevin
  0 siblings, 0 replies; 123+ messages in thread
From: Tian, Kevin @ 2019-12-18  0:29 UTC (permalink / raw)
  To: Paolo Bonzini, Alex Williamson
  Cc: Peter Xu, Christopherson, Sean J, linux-kernel, kvm,
	Dr . David Alan Gilbert, Vitaly Kuznetsov, Wang, Zhenyu Z, Zhao,
	Yan Y

> From: Paolo Bonzini <pbonzini@redhat.com>
> Sent: Wednesday, December 18, 2019 12:31 AM
> 
> On 17/12/19 17:18, Alex Williamson wrote:
> >>
> >> Alex, if you are OK we'll work on such interface and move kvmgt to use it.
> >> After it's accepted, we can also mark pages dirty through this new
> interface
> >> in Kirti's dirty page tracking series.
> > I'm not sure what you're asking for, is it an interface for the host
> > CPU to read/write the memory backing of a mapped IOVA range without
> > pinning pages?  That seems like something like that would make sense for
> > an emulation model where a page does not need to be pinned for physical
> > DMA.  If you're asking more for an interface that understands the
> > userspace driver is a VM (ie. implied using a _guest postfix on the
> > function name) and knows about GPA mappings beyond the windows
> directly
> > mapped for device access, I'd not look fondly on such a request.
> 
> No, it would definitely be the former, using IOVAs to access guest
> memory---kvmgt is currently doing the latter by calling into KVM, and
> I'm not really fond of that either.
> 

Exactly.  Let's work on the fix.

Thanks
Kevin

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH RFC 04/15] KVM: Implement ring-based dirty memory tracking
  2019-12-17 19:41                     ` Peter Xu
@ 2019-12-18  0:33                       ` Paolo Bonzini
  2019-12-18 16:32                         ` Peter Xu
  0 siblings, 1 reply; 123+ messages in thread
From: Paolo Bonzini @ 2019-12-18  0:33 UTC (permalink / raw)
  To: Peter Xu
  Cc: Christophe de Dinechin, Christophe de Dinechin, linux-kernel,
	kvm, Sean Christopherson, Dr . David Alan Gilbert,
	Vitaly Kuznetsov, David Hildenbrand, Eric Auger, Cornelia Huck

On 17/12/19 20:41, Peter Xu wrote:
> On Tue, Dec 17, 2019 at 05:48:58PM +0100, Paolo Bonzini wrote:
>> On 17/12/19 17:42, Peter Xu wrote:
>>>
>>> However I just noticed something... Note that we still didn't read
>>> into non-x86 archs, I think it's the same question as when I asked
>>> whether we can unify the kvm[_vcpu]_write() interfaces and you'd like
>>> me to read the non-x86 archs - I think it's time I read them, because
>>> it's still possible that non-x86 archs will still need the per-vm
>>> ring... then that could be another problem if we want to at last
>>> spread the dirty ring idea outside of x86.
>>
>> We can take a look, but I think based on x86 experience it's okay if we
>> restrict dirty ring to arches that do no VM-wide accesses.
> 
> Here it is - a quick update on callers of mark_page_dirty_in_slot().
> The same reverse trace, but ignoring all common and x86 code path
> (which I covered in the other thread):
> 
> ==================================
> 
>    mark_page_dirty_in_slot (non-x86)
>         mark_page_dirty
>             kvm_write_guest_page
>                 kvm_write_guest
>                     kvm_write_guest_lock
>                         vgic_its_save_ite [?]
>                         vgic_its_save_dte [?]
>                         vgic_its_save_cte [?]
>                         vgic_its_save_collection_table [?]
>                         vgic_v3_lpi_sync_pending_status [?]
>                         vgic_v3_save_pending_tables [?]
>                     kvmppc_rtas_hcall [&]
>                     kvmppc_st [&]
>                     access_guest [&]
>                     put_guest_lc [&]
>                     write_guest_lc [&]
>                     write_guest_abs [&]
>             mark_page_dirty
>                 _kvm_mips_map_page_fast [&]
>                 kvm_mips_map_page [&]
>                 kvmppc_mmu_map_page [&]
>                 kvmppc_copy_guest
>                     kvmppc_h_page_init [&]
>                 kvmppc_xive_native_vcpu_eq_sync [&]
>                 adapter_indicators_set [?] (from kvm_set_irq)
>                 kvm_s390_sync_dirty_log [?]
>                 unpin_guest_page
>                     unpin_blocks [&]
>                     unpin_scb [&]
> 
> Cases with [*]: should not matter much
>            [&]: should be able to change to per-vcpu context
>            [?]: uncertain...
> 
> ==================================
> 
> This time we've got 8 leaves with "[?]".
> 
> I'm starting with these:
> 
>         vgic_its_save_ite [?]
>         vgic_its_save_dte [?]
>         vgic_its_save_cte [?]
>         vgic_its_save_collection_table [?]
>         vgic_v3_lpi_sync_pending_status [?]
>         vgic_v3_save_pending_tables [?]
> 
> These come from ARM specific ioctls like KVM_DEV_ARM_ITS_SAVE_TABLES,
> KVM_DEV_ARM_ITS_RESTORE_TABLES, KVM_DEV_ARM_VGIC_SAVE_PENDING_TABLES.
> IIUC ARM needed these to allow proper migration which indeed does not
> have a vcpu context.
> 
> (Though I'm a bit curious why ARM didn't simply migrate these
>  information explicitly from userspace, instead it seems to me that
>  ARM guests will dump something into guest ram and then tries to
>  recover from that which seems to be a bit weird)
>  
> Then it's this:
> 
>         adapter_indicators_set [?]
> 
> This is s390 specific, which should come from kvm_set_irq.  I'm not
> sure whether we can remove the mark_page_dirty() call of this, if it's
> applied from another kernel structure (which should be migrated
> properly IIUC).  But I might be completely wrong.
> 
>         kvm_s390_sync_dirty_log [?]
>         
> This is also s390 specific, should be collecting from the hardware
> PGSTE_UC_BIT bit.  No vcpu context for sure.
> 
> (I'd be glad too if anyone could hint me why x86 cannot use page table
>  dirty bits for dirty tracking, if there's short answer...)

With PML it is.  Without PML, however, it would be much slower to
synchronize the dirty bitmap from KVM to userspace (one atomic operation
per page instead of one per 64 pages) and even impossible to have the
dirty ring.
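
For reference, a condensed sketch of the bitmap synchronization being
described, loosely following kvm_get_dirty_log_protect() (the helper name
is made up and locking is omitted):

    static void harvest_dirty_bitmap(struct kvm *kvm,
                                     struct kvm_memory_slot *memslot,
                                     unsigned long *dirty_bitmap,
                                     unsigned long *dirty_bitmap_buffer,
                                     unsigned long bytes)
    {
            unsigned long i, offset, mask;

            for (i = 0; i < bytes / sizeof(long); i++) {
                    /* one atomic op harvests and clears 64 pages at a time */
                    mask = xchg(&dirty_bitmap[i], 0);
                    if (!mask)
                            continue;

                    dirty_bitmap_buffer[i] = mask;  /* copied out to userspace */
                    offset = i * BITS_PER_LONG;
                    /* re-write-protect exactly the pages just harvested */
                    kvm_arch_mmu_enable_log_dirty_pt_masked(kvm, memslot,
                                                            offset, mask);
            }
    }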

> I think my conclusion so far...
> 
>   - for s390 I don't think we even need this dirty ring buffer thing,
>     because I think hardware trackings should be more efficient, then
>     we don't need to care much on that either from design-wise of
>     dirty ring,

I would be surprised if it's more efficient without something like PML,
but anyway the gist is correct---without write protection-based dirty
page logging, s390 cannot use the dirty page ring buffer.

>   - for ARM, those no-vcpu-context dirty tracking probably needs to be
>     considered, but hopefully that's a very special path so it rarely
>     happen.  The bad thing is I didn't dig how many pages will be
>     dirtied when ARM guest starts to dump all these things so it could
>     be a burst...  If it is, then there's risk to trigger the ring
>     full condition (which we wanted to avoid..)

It says all vCPU locks must be held, so it could just use any vCPU.  I
am not sure what's the upper limit on the number of entries, or even
whether userspace could just dirty those pages itself, or perhaps
whether there could be a different ioctl that gets the pages into
userspace memory (and then if needed userspace can copy them into guest
memory, I don't know why it is designed like that).

Paolo


^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH RFC 04/15] KVM: Implement ring-based dirty memory tracking
  2019-12-18  0:33                       ` Paolo Bonzini
@ 2019-12-18 16:32                         ` Peter Xu
  2019-12-18 16:41                           ` Paolo Bonzini
  0 siblings, 1 reply; 123+ messages in thread
From: Peter Xu @ 2019-12-18 16:32 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Christophe de Dinechin, Christophe de Dinechin, linux-kernel,
	kvm, Sean Christopherson, Dr . David Alan Gilbert,
	Vitaly Kuznetsov, David Hildenbrand, Eric Auger, Cornelia Huck

On Wed, Dec 18, 2019 at 01:33:01AM +0100, Paolo Bonzini wrote:
> On 17/12/19 20:41, Peter Xu wrote:
> > On Tue, Dec 17, 2019 at 05:48:58PM +0100, Paolo Bonzini wrote:
> >> On 17/12/19 17:42, Peter Xu wrote:
> >>>
> >>> However I just noticed something... Note that we still didn't read
> >>> into non-x86 archs, I think it's the same question as when I asked
> >>> whether we can unify the kvm[_vcpu]_write() interfaces and you'd like
> >>> me to read the non-x86 archs - I think it's time I read them, because
> >>> it's still possible that non-x86 archs will still need the per-vm
> >>> ring... then that could be another problem if we want to at last
> >>> spread the dirty ring idea outside of x86.
> >>
> >> We can take a look, but I think based on x86 experience it's okay if we
> >> restrict dirty ring to arches that do no VM-wide accesses.
> > 
> > Here it is - a quick update on callers of mark_page_dirty_in_slot().
> > The same reverse trace, but ignoring all common and x86 code path
> > (which I covered in the other thread):
> > 
> > ==================================
> > 
> >    mark_page_dirty_in_slot (non-x86)
> >         mark_page_dirty
> >             kvm_write_guest_page
> >                 kvm_write_guest
> >                     kvm_write_guest_lock
> >                         vgic_its_save_ite [?]
> >                         vgic_its_save_dte [?]
> >                         vgic_its_save_cte [?]
> >                         vgic_its_save_collection_table [?]
> >                         vgic_v3_lpi_sync_pending_status [?]
> >                         vgic_v3_save_pending_tables [?]
> >                     kvmppc_rtas_hcall [&]
> >                     kvmppc_st [&]
> >                     access_guest [&]
> >                     put_guest_lc [&]
> >                     write_guest_lc [&]
> >                     write_guest_abs [&]
> >             mark_page_dirty
> >                 _kvm_mips_map_page_fast [&]
> >                 kvm_mips_map_page [&]
> >                 kvmppc_mmu_map_page [&]
> >                 kvmppc_copy_guest
> >                     kvmppc_h_page_init [&]
> >                 kvmppc_xive_native_vcpu_eq_sync [&]
> >                 adapter_indicators_set [?] (from kvm_set_irq)
> >                 kvm_s390_sync_dirty_log [?]
> >                 unpin_guest_page
> >                     unpin_blocks [&]
> >                     unpin_scb [&]
> > 
> > Cases with [*]: should not matter much
> >            [&]: should be able to change to per-vcpu context
> >            [?]: uncertain...
> > 
> > ==================================
> > 
> > This time we've got 8 leaves with "[?]".
> > 
> > I'm starting with these:
> > 
> >         vgic_its_save_ite [?]
> >         vgic_its_save_dte [?]
> >         vgic_its_save_cte [?]
> >         vgic_its_save_collection_table [?]
> >         vgic_v3_lpi_sync_pending_status [?]
> >         vgic_v3_save_pending_tables [?]
> > 
> > These come from ARM specific ioctls like KVM_DEV_ARM_ITS_SAVE_TABLES,
> > KVM_DEV_ARM_ITS_RESTORE_TABLES, KVM_DEV_ARM_VGIC_SAVE_PENDING_TABLES.
> > IIUC ARM needed these to allow proper migration which indeed does not
> > have a vcpu context.
> > 
> > (Though I'm a bit curious why ARM didn't simply migrate these
> >  information explicitly from userspace, instead it seems to me that
> >  ARM guests will dump something into guest ram and then tries to
> >  recover from that which seems to be a bit weird)
> >  
> > Then it's this:
> > 
> >         adapter_indicators_set [?]
> > 
> > This is s390 specific, which should come from kvm_set_irq.  I'm not
> > sure whether we can remove the mark_page_dirty() call of this, if it's
> > applied from another kernel structure (which should be migrated
> > properly IIUC).  But I might be completely wrong.
> > 
> >         kvm_s390_sync_dirty_log [?]
> >         
> > This is also s390 specific, should be collecting from the hardware
> > PGSTE_UC_BIT bit.  No vcpu context for sure.
> > 
> > (I'd be glad too if anyone could hint me why x86 cannot use page table
> >  dirty bits for dirty tracking, if there's short answer...)
> 
> With PML it is.  Without PML, however, it would be much slower to
> synchronize the dirty bitmap from KVM to userspace (one atomic operation
> per page instead of one per 64 pages) and even impossible to have the
> dirty ring.

Indeed, however I think it'll be faster for the hardware to mark pages
as dirty.  So could it be a tradeoff between making the "collection"
faster and making "marking pages dirty" faster?  IMHO "marking pages
dirty" could be even more important sometimes because it affects
guest responsiveness (it blocks vcpu execution), while the collection
procedure can happen in parallel with that.

> 
> > I think my conclusion so far...
> > 
> >   - for s390 I don't think we even need this dirty ring buffer thing,
> >     because I think hardware trackings should be more efficient, then
> >     we don't need to care much on that either from design-wise of
> >     dirty ring,
> 
> I would be surprised if it's more efficient without something like PML,
> but anyway the gist is correct---without write protection-based dirty
> page logging, s390 cannot use the dirty page ring buffer.
> 
> >   - for ARM, those no-vcpu-context dirty tracking probably needs to be
> >     considered, but hopefully that's a very special path so it rarely
> >     happen.  The bad thing is I didn't dig how many pages will be
> >     dirtied when ARM guest starts to dump all these things so it could
> >     be a burst...  If it is, then there's risk to trigger the ring
> >     full condition (which we wanted to avoid..)
> 
> It says all vCPU locks must be held, so it could just use any vCPU.  I
> am not sure what's the upper limit on the number of entries, or even
> whether userspace could just dirty those pages itself, or perhaps
> whether there could be a different ioctl that gets the pages into
> userspace memory (and then if needed userspace can copy them into guest
> memory, I don't know why it is designed like that).

Yeah, that's true.  I'll see whether Eric has more updates on these...

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH RFC 04/15] KVM: Implement ring-based dirty memory tracking
  2019-12-18 16:32                         ` Peter Xu
@ 2019-12-18 16:41                           ` Paolo Bonzini
  0 siblings, 0 replies; 123+ messages in thread
From: Paolo Bonzini @ 2019-12-18 16:41 UTC (permalink / raw)
  To: Peter Xu
  Cc: Christophe de Dinechin, Christophe de Dinechin, linux-kernel,
	kvm, Sean Christopherson, Dr . David Alan Gilbert,
	Vitaly Kuznetsov, David Hildenbrand, Eric Auger, Cornelia Huck

On 18/12/19 17:32, Peter Xu wrote:
>> With PML it is.  Without PML, however, it would be much slower to
>> synchronize the dirty bitmap from KVM to userspace (one atomic operation
>> per page instead of one per 64 pages) and even impossible to have the
>> dirty ring.
>
> Indeed, however I think it'll be faster for the hardware to mark pages
> as dirty.  So could it be a tradeoff between making the "collection"
> faster and making "marking pages dirty" faster?  IMHO "marking pages
> dirty" could be even more important sometimes because it affects
> guest responsiveness (it blocks vcpu execution), while the collection
> procedure can happen in parallel with that.

The problem is that marking pages dirty will be many, many times
slower, because you don't have this

                        if (!dirty_bitmap[i])
                                continue;

and instead you have to scan the whole of the page tables even if only
a handful of bits are set (reading 4K of memory for every 2M of guest
RAM).  This can be quite bad for the TLB too.  It is certainly possible
that it turns out to be faster, but I would be quite surprised and, with
PML, the question is more or less moot.
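
As a rough illustration of the difference (a sketch only, not the actual
KVM code; the function and its arguments are made up):

        /*
         * Bitmap-based sync: one word covers 64 pages, so clean ranges
         * are skipped with a single comparison per 64 pages.
         */
        static void sync_dirty_bitmap(unsigned long *dirty_bitmap,
                                      unsigned long nr_pages)
        {
                unsigned long i, n = nr_pages / BITS_PER_LONG;

                for (i = 0; i < n; i++) {
                        if (!dirty_bitmap[i])   /* 64 clean pages, skip */
                                continue;
                        /*
                         * ... atomically fetch-and-clear the word and
                         * report the pages whose bits are set ...
                         */
                }
        }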

Thanks,

Paolo


^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH RFC 04/15] KVM: Implement ring-based dirty memory tracking
  2019-12-17 16:28                                 ` Paolo Bonzini
@ 2019-12-18 21:58                                   ` Peter Xu
  2019-12-18 22:24                                     ` Sean Christopherson
  0 siblings, 1 reply; 123+ messages in thread
From: Peter Xu @ 2019-12-18 21:58 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sean Christopherson, linux-kernel, kvm, Dr . David Alan Gilbert,
	Vitaly Kuznetsov, Alex Williamson, Tian, Kevin

On Tue, Dec 17, 2019 at 05:28:54PM +0100, Paolo Bonzini wrote:
> On 17/12/19 17:24, Peter Xu wrote:
> >> No, please pass it all the way down to the [&] functions but not to
> >> kvm_write_guest_page.  Those should keep using vcpu->kvm.
> > Actually I even wanted to refactor these helpers.  I mean, we have two
> > sets of helpers now, kvm_[vcpu]_{read|write}*(), so one set is per-vm,
> > the other set is per-vcpu.  IIUC the only difference of these two are
> > whether we should consider ((vcpu)->arch.hflags & HF_SMM_MASK) or we
> > just write to address space zero always.
> 
> Right.
> 
> > Could we unify them into a
> > single set of helper (I'll just drop the *_vcpu_* helpers because it's
> > longer when write) but we always pass in vcpu* as the first parameter?
> > Then we add another parameter "vcpu_smm" to show whether we want to
> > consider the HF_SMM_MASK flag.
> 
> You'd have to check through all KVM implementations whether you always
> have the vCPU.  Also non-x86 doesn't have address spaces, and by the
> time you add ", true" or ", false" it's longer than the "_vcpu_" you
> have removed.  So, not a good idea in my opinion. :D

Well, now I've changed my mind. :) (considering that we still have
many places that will not have vcpu*...)

I can simply add that "vcpu_smm" parameter to kvm_vcpu_write_*()
without removing the kvm_write_*() helpers.  Then I'll be able to
convert most of the kvm_write_*() (and related) callers to
kvm_vcpu_write*(..., vcpu_smm=false) calls where appropriate.

Would that be good?
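
Just to make that concrete, a sketch of what such a helper could look
like (hypothetical, not part of this series; kvm_vcpu_write_guest() and
kvm_write_guest() are the existing helpers it would wrap):

        static int kvm_vcpu_write_guest_smm(struct kvm_vcpu *vcpu, gpa_t gpa,
                                            const void *data,
                                            unsigned long len, bool vcpu_smm)
        {
                if (vcpu_smm)
                        /* honour the vCPU's SMM state (HF_SMM_MASK) */
                        return kvm_vcpu_write_guest(vcpu, gpa, data, len);

                /* behave like the per-VM helpers: address space zero */
                return kvm_write_guest(vcpu->kvm, gpa, data, len);
        }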

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH RFC 04/15] KVM: Implement ring-based dirty memory tracking
  2019-12-18 21:58                                   ` Peter Xu
@ 2019-12-18 22:24                                     ` Sean Christopherson
  2019-12-18 22:37                                       ` Paolo Bonzini
  0 siblings, 1 reply; 123+ messages in thread
From: Sean Christopherson @ 2019-12-18 22:24 UTC (permalink / raw)
  To: Peter Xu
  Cc: Paolo Bonzini, linux-kernel, kvm, Dr . David Alan Gilbert,
	Vitaly Kuznetsov, Alex Williamson, Tian, Kevin

On Wed, Dec 18, 2019 at 04:58:57PM -0500, Peter Xu wrote:
> On Tue, Dec 17, 2019 at 05:28:54PM +0100, Paolo Bonzini wrote:
> > On 17/12/19 17:24, Peter Xu wrote:
> > >> No, please pass it all the way down to the [&] functions but not to
> > >> kvm_write_guest_page.  Those should keep using vcpu->kvm.
> > > Actually I even wanted to refactor these helpers.  I mean, we have two
> > > sets of helpers now, kvm_[vcpu]_{read|write}*(), so one set is per-vm,
> > > the other set is per-vcpu.  IIUC the only difference of these two are
> > > whether we should consider ((vcpu)->arch.hflags & HF_SMM_MASK) or we
> > > just write to address space zero always.
> > 
> > Right.
> > 
> > > Could we unify them into a
> > > single set of helper (I'll just drop the *_vcpu_* helpers because it's
> > > longer when write) but we always pass in vcpu* as the first parameter?
> > > Then we add another parameter "vcpu_smm" to show whether we want to
> > > consider the HF_SMM_MASK flag.
> > 
> > You'd have to check through all KVM implementations whether you always
> > have the vCPU.  Also non-x86 doesn't have address spaces, and by the
> > time you add ", true" or ", false" it's longer than the "_vcpu_" you
> > have removed.  So, not a good idea in my opinion. :D
> 
> Well, now I've changed my mind. :) (considering that we still have
> many places that will not have vcpu*...)
> 
> I can simply add that "vcpu_smm" parameter to kvm_vcpu_write_*()
> without removing the kvm_write_*() helpers.  Then I'll be able to
> convert most of the kvm_write_*() (and related) callers to
> kvm_vcpu_write*(..., vcpu_smm=false) calls where appropriate.
> 
> Would that be good?

I've lost track of the problem you're trying to solve, but if you do
something like "vcpu_smm=false", explicitly pass an address space ID
instead of hardcoding x86 specific SMM crud, e.g.

	kvm_vcpu_write*(..., as_id=0);

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH RFC 04/15] KVM: Implement ring-based dirty memory tracking
  2019-12-18 22:24                                     ` Sean Christopherson
@ 2019-12-18 22:37                                       ` Paolo Bonzini
  2019-12-18 22:49                                         ` Peter Xu
  0 siblings, 1 reply; 123+ messages in thread
From: Paolo Bonzini @ 2019-12-18 22:37 UTC (permalink / raw)
  To: Sean Christopherson, Peter Xu
  Cc: linux-kernel, kvm, Dr . David Alan Gilbert, Vitaly Kuznetsov,
	Alex Williamson, Tian, Kevin

On 18/12/19 23:24, Sean Christopherson wrote:
> I've lost track of the problem you're trying to solve, but if you do
> something like "vcpu_smm=false", explicitly pass an address space ID
> instead of hardcoding x86 specific SMM crud, e.g.
> 
> 	kvm_vcpu_write*(..., as_id=0);

And the point of having kvm_vcpu_* vs. kvm_write_* was exactly to not
have to hardcode the address space ID.  If anything you could add a
__kvm_vcpu_write_* API that takes vcpu+as_id, but really I'd prefer to
keep kvm_get_running_vcpu() for now and then it can be refactored later.
 There are already way too many memory r/w APIs...
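
To sketch how that would be consumed in the common dirty helper
(simplified and hypothetical: kvm_dirty_ring_push() is a placeholder and
the real signatures in this series differ slightly):

        void mark_page_dirty_in_slot(struct kvm *kvm,
                                     struct kvm_memory_slot *memslot,
                                     gfn_t gfn)
        {
                /* Works even for callers without an explicit vCPU context. */
                struct kvm_vcpu *vcpu = kvm_get_running_vcpu();

                if (memslot && memslot->dirty_bitmap) {
                        unsigned long rel_gfn = gfn - memslot->base_gfn;

                        set_bit_le(rel_gfn, memslot->dirty_bitmap);

                        if (vcpu)
                                /* placeholder for the ring push */
                                kvm_dirty_ring_push(vcpu, memslot, rel_gfn);
                }
        }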

Paolo


^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH RFC 04/15] KVM: Implement ring-based dirty memory tracking
  2019-12-18 22:37                                       ` Paolo Bonzini
@ 2019-12-18 22:49                                         ` Peter Xu
  0 siblings, 0 replies; 123+ messages in thread
From: Peter Xu @ 2019-12-18 22:49 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sean Christopherson, linux-kernel, kvm, Dr . David Alan Gilbert,
	Vitaly Kuznetsov, Alex Williamson, Tian, Kevin

On Wed, Dec 18, 2019 at 11:37:31PM +0100, Paolo Bonzini wrote:
> On 18/12/19 23:24, Sean Christopherson wrote:
> > I've lost track of the problem you're trying to solve, but if you do
> > something like "vcpu_smm=false", explicitly pass an address space ID
> > instead of hardcoding x86 specific SMM crud, e.g.
> > 
> > 	kvm_vcpu_write*(..., as_id=0);
> 
> And the point of having kvm_vcpu_* vs. kvm_write_* was exactly to not
> have to hardcode the address space ID.  If anything you could add a
> __kvm_vcpu_write_* API that takes vcpu+as_id, but really I'd prefer to
> keep kvm_get_running_vcpu() for now and then it can be refactored later.
>  There are already way too many memory r/w APIs...

Yeah, actually that's why I wanted to start working on that, just in
case it could help to unify all of them some day (and since we did go a
few steps forward on that while discussing the dirty ring).  But yeah,
kvm_get_running_vcpu() for sure works for us already; let's go the
easy way this time.  Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH RFC 04/15] KVM: Implement ring-based dirty memory tracking
  2019-12-13 20:23     ` Peter Xu
  2019-12-14  7:57       ` Paolo Bonzini
@ 2019-12-20 18:19       ` Peter Xu
  1 sibling, 0 replies; 123+ messages in thread
From: Peter Xu @ 2019-12-20 18:19 UTC (permalink / raw)
  To: Christophe de Dinechin
  Cc: linux-kernel, kvm, Sean Christopherson, Paolo Bonzini,
	Dr . David Alan Gilbert, Vitaly Kuznetsov

On Fri, Dec 13, 2019 at 03:23:24PM -0500, Peter Xu wrote:
> > > +If one of the ring buffers is full, the guest will exit to userspace
> > > +with the exit reason set to KVM_EXIT_DIRTY_LOG_FULL, and the
> > > +KVM_RUN ioctl will return -EINTR. Once that happens, userspace
> > > +should pause all the vcpus, then harvest all the dirty pages and
> > > +rearm the dirty traps. It can unpause the guest after that.
> > 
> > Except for the condition above, why is it necessary to pause other VCPUs
> > than the one being harvested?
> 
> This is a good question.  Paolo could correct me if I'm wrong.
> 
> Firstly I think this should rarely happen if the userspace is
> collecting the dirty bits from time to time.  If it happens, we'll
> need to call KVM_RESET_DIRTY_RINGS to reset all the rings.  Then the
> question actually becomes to: Whether we'd like to have per-vcpu
> KVM_RESET_DIRTY_RINGS?

Hmm, rethinking this, I may have erroneously deduced something from
Christophe's question.  Christophe was asking why we kick the other
vcpus, which does not necessarily mean that the RESET needs to be
per-vcpu.

So now I tend to agree with Christophe that I can't find a reason why
we need to kick all the vcpus out.  Even if we need to do TLB flushing
for all vcpus on RESET, we can simply collect all the rings before
sending the RESET, so that's not really a reason to explicitly kick
them from userspace.  I plan to remove this sentence in the next
version (which is only a documentation update).
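
For reference, the userspace flow described above would roughly look
like this (a fragment only: vm_fd, vcpus, nr_vcpus and run are assumed
to exist, harvest_dirty_ring() stands in for walking one vCPU's
mmap()ed ring, and the exit/ioctl names are the ones used by this RFC):

        if (run->exit_reason == KVM_EXIT_DIRTY_LOG_FULL) {
                int i;

                /* Collect every vCPU's ring first ... */
                for (i = 0; i < nr_vcpus; i++)
                        harvest_dirty_ring(&vcpus[i]);

                /*
                 * ... then re-protect all the collected pages in one go;
                 * no need to kick the other vCPUs out just for this.
                 */
                ioctl(vm_fd, KVM_RESET_DIRTY_RINGS);
        }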

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 123+ messages in thread

* [PATCH RFC 01/15] KVM: Move running VCPU from ARM to common code
  2019-11-29 21:33 Peter Xu
@ 2019-11-29 21:33 ` Peter Xu
  0 siblings, 0 replies; 123+ messages in thread
From: Peter Xu @ 2019-11-29 21:33 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Sean Christopherson, Paolo Bonzini, Dr . David Alan Gilbert,
	peterx, Vitaly Kuznetsov

From: Paolo Bonzini <pbonzini@redhat.com>

For ring-based dirty log tracking, it will be more efficient to account
writes during schedule-out or schedule-in to the currently running VCPU.
We would like to do it even if the write doesn't use the current VCPU's
address space, as is the case for cached writes (see commit 4e335d9e7ddb,
"Revert "KVM: Support vCPU-based gfn->hva cache"", 2017-05-02).

Therefore, add a mechanism to track the currently-loaded kvm_vcpu struct.
There is already something similar in KVM/ARM; one important difference
is that kvm_arch_vcpu_{load,put} have two callers in virt/kvm/kvm_main.c:
we have to update both the architecture-independent vcpu_{load,put} and
the preempt notifiers.

Another change made in the process is to allow using kvm_get_running_vcpu()
in preemptible code.  This is allowed because preempt notifiers ensure
that the value does not change even after the VCPU thread is migrated.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 arch/arm/include/asm/kvm_host.h   |  2 --
 arch/arm64/include/asm/kvm_host.h |  2 --
 include/linux/kvm_host.h          |  3 +++
 virt/kvm/arm/arm.c                | 29 -----------------------------
 virt/kvm/arm/perf.c               |  6 +++---
 virt/kvm/arm/vgic/vgic-mmio.c     | 15 +++------------
 virt/kvm/kvm_main.c               | 25 ++++++++++++++++++++++++-
 7 files changed, 33 insertions(+), 49 deletions(-)

diff --git a/arch/arm/include/asm/kvm_host.h b/arch/arm/include/asm/kvm_host.h
index 556cd818eccf..abc3f6f3ad76 100644
--- a/arch/arm/include/asm/kvm_host.h
+++ b/arch/arm/include/asm/kvm_host.h
@@ -284,8 +284,6 @@ int kvm_arm_copy_reg_indices(struct kvm_vcpu *vcpu, u64 __user *indices);
 int kvm_age_hva(struct kvm *kvm, unsigned long start, unsigned long end);
 int kvm_test_age_hva(struct kvm *kvm, unsigned long hva);
 
-struct kvm_vcpu *kvm_arm_get_running_vcpu(void);
-struct kvm_vcpu __percpu **kvm_get_running_vcpus(void);
 void kvm_arm_halt_guest(struct kvm *kvm);
 void kvm_arm_resume_guest(struct kvm *kvm);
 
diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index b36dae9ee5f9..d97855e41469 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -446,8 +446,6 @@ int kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte);
 int kvm_age_hva(struct kvm *kvm, unsigned long start, unsigned long end);
 int kvm_test_age_hva(struct kvm *kvm, unsigned long hva);
 
-struct kvm_vcpu *kvm_arm_get_running_vcpu(void);
-struct kvm_vcpu * __percpu *kvm_get_running_vcpus(void);
 void kvm_arm_halt_guest(struct kvm *kvm);
 void kvm_arm_resume_guest(struct kvm *kvm);
 
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 7ed1e2f8641e..498a39462ac1 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1342,6 +1342,9 @@ static inline void kvm_vcpu_set_dy_eligible(struct kvm_vcpu *vcpu, bool val)
 }
 #endif /* CONFIG_HAVE_KVM_CPU_RELAX_INTERCEPT */
 
+struct kvm_vcpu *kvm_get_running_vcpu(void);
+struct kvm_vcpu __percpu **kvm_get_running_vcpus(void);
+
 #ifdef CONFIG_HAVE_KVM_IRQ_BYPASS
 bool kvm_arch_has_irq_bypass(void);
 int kvm_arch_irq_bypass_add_producer(struct irq_bypass_consumer *,
diff --git a/virt/kvm/arm/arm.c b/virt/kvm/arm/arm.c
index 12e0280291ce..1df9c39024fa 100644
--- a/virt/kvm/arm/arm.c
+++ b/virt/kvm/arm/arm.c
@@ -51,9 +51,6 @@ __asm__(".arch_extension	virt");
 DEFINE_PER_CPU(kvm_host_data_t, kvm_host_data);
 static DEFINE_PER_CPU(unsigned long, kvm_arm_hyp_stack_page);
 
-/* Per-CPU variable containing the currently running vcpu. */
-static DEFINE_PER_CPU(struct kvm_vcpu *, kvm_arm_running_vcpu);
-
 /* The VMID used in the VTTBR */
 static atomic64_t kvm_vmid_gen = ATOMIC64_INIT(1);
 static u32 kvm_next_vmid;
@@ -62,31 +59,8 @@ static DEFINE_SPINLOCK(kvm_vmid_lock);
 static bool vgic_present;
 
 static DEFINE_PER_CPU(unsigned char, kvm_arm_hardware_enabled);
-
-static void kvm_arm_set_running_vcpu(struct kvm_vcpu *vcpu)
-{
-	__this_cpu_write(kvm_arm_running_vcpu, vcpu);
-}
-
 DEFINE_STATIC_KEY_FALSE(userspace_irqchip_in_use);
 
-/**
- * kvm_arm_get_running_vcpu - get the vcpu running on the current CPU.
- * Must be called from non-preemptible context
- */
-struct kvm_vcpu *kvm_arm_get_running_vcpu(void)
-{
-	return __this_cpu_read(kvm_arm_running_vcpu);
-}
-
-/**
- * kvm_arm_get_running_vcpus - get the per-CPU array of currently running vcpus.
- */
-struct kvm_vcpu * __percpu *kvm_get_running_vcpus(void)
-{
-	return &kvm_arm_running_vcpu;
-}
-
 int kvm_arch_vcpu_should_kick(struct kvm_vcpu *vcpu)
 {
 	return kvm_vcpu_exiting_guest_mode(vcpu) == IN_GUEST_MODE;
@@ -406,7 +380,6 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 	vcpu->cpu = cpu;
 	vcpu->arch.host_cpu_context = &cpu_data->host_ctxt;
 
-	kvm_arm_set_running_vcpu(vcpu);
 	kvm_vgic_load(vcpu);
 	kvm_timer_vcpu_load(vcpu);
 	kvm_vcpu_load_sysregs(vcpu);
@@ -432,8 +405,6 @@ void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
 	kvm_vcpu_pmu_restore_host(vcpu);
 
 	vcpu->cpu = -1;
-
-	kvm_arm_set_running_vcpu(NULL);
 }
 
 static void vcpu_power_off(struct kvm_vcpu *vcpu)
diff --git a/virt/kvm/arm/perf.c b/virt/kvm/arm/perf.c
index 918cdc3839ea..d45b8b9a4415 100644
--- a/virt/kvm/arm/perf.c
+++ b/virt/kvm/arm/perf.c
@@ -13,14 +13,14 @@
 
 static int kvm_is_in_guest(void)
 {
-        return kvm_arm_get_running_vcpu() != NULL;
+        return kvm_get_running_vcpu() != NULL;
 }
 
 static int kvm_is_user_mode(void)
 {
 	struct kvm_vcpu *vcpu;
 
-	vcpu = kvm_arm_get_running_vcpu();
+	vcpu = kvm_get_running_vcpu();
 
 	if (vcpu)
 		return !vcpu_mode_priv(vcpu);
@@ -32,7 +32,7 @@ static unsigned long kvm_get_guest_ip(void)
 {
 	struct kvm_vcpu *vcpu;
 
-	vcpu = kvm_arm_get_running_vcpu();
+	vcpu = kvm_get_running_vcpu();
 
 	if (vcpu)
 		return *vcpu_pc(vcpu);
diff --git a/virt/kvm/arm/vgic/vgic-mmio.c b/virt/kvm/arm/vgic/vgic-mmio.c
index 0d090482720d..d656ebd5f9d4 100644
--- a/virt/kvm/arm/vgic/vgic-mmio.c
+++ b/virt/kvm/arm/vgic/vgic-mmio.c
@@ -190,15 +190,6 @@ unsigned long vgic_mmio_read_pending(struct kvm_vcpu *vcpu,
  * value later will give us the same value as we update the per-CPU variable
  * in the preempt notifier handlers.
  */
-static struct kvm_vcpu *vgic_get_mmio_requester_vcpu(void)
-{
-	struct kvm_vcpu *vcpu;
-
-	preempt_disable();
-	vcpu = kvm_arm_get_running_vcpu();
-	preempt_enable();
-	return vcpu;
-}
 
 /* Must be called with irq->irq_lock held */
 static void vgic_hw_irq_spending(struct kvm_vcpu *vcpu, struct vgic_irq *irq,
@@ -221,7 +212,7 @@ void vgic_mmio_write_spending(struct kvm_vcpu *vcpu,
 			      gpa_t addr, unsigned int len,
 			      unsigned long val)
 {
-	bool is_uaccess = !vgic_get_mmio_requester_vcpu();
+	bool is_uaccess = !kvm_get_running_vcpu();
 	u32 intid = VGIC_ADDR_TO_INTID(addr, 1);
 	int i;
 	unsigned long flags;
@@ -274,7 +265,7 @@ void vgic_mmio_write_cpending(struct kvm_vcpu *vcpu,
 			      gpa_t addr, unsigned int len,
 			      unsigned long val)
 {
-	bool is_uaccess = !vgic_get_mmio_requester_vcpu();
+	bool is_uaccess = !kvm_get_running_vcpu();
 	u32 intid = VGIC_ADDR_TO_INTID(addr, 1);
 	int i;
 	unsigned long flags;
@@ -335,7 +326,7 @@ static void vgic_mmio_change_active(struct kvm_vcpu *vcpu, struct vgic_irq *irq,
 				    bool active)
 {
 	unsigned long flags;
-	struct kvm_vcpu *requester_vcpu = vgic_get_mmio_requester_vcpu();
+	struct kvm_vcpu *requester_vcpu = kvm_get_running_vcpu();
 
 	raw_spin_lock_irqsave(&irq->irq_lock, flags);
 
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 00268290dcbd..fac0760c870e 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -108,6 +108,7 @@ struct kmem_cache *kvm_vcpu_cache;
 EXPORT_SYMBOL_GPL(kvm_vcpu_cache);
 
 static __read_mostly struct preempt_ops kvm_preempt_ops;
+static DEFINE_PER_CPU(struct kvm_vcpu *, kvm_running_vcpu);
 
 struct dentry *kvm_debugfs_dir;
 EXPORT_SYMBOL_GPL(kvm_debugfs_dir);
@@ -197,6 +198,8 @@ bool kvm_is_reserved_pfn(kvm_pfn_t pfn)
 void vcpu_load(struct kvm_vcpu *vcpu)
 {
 	int cpu = get_cpu();
+
+	__this_cpu_write(kvm_running_vcpu, vcpu);
 	preempt_notifier_register(&vcpu->preempt_notifier);
 	kvm_arch_vcpu_load(vcpu, cpu);
 	put_cpu();
@@ -208,6 +211,7 @@ void vcpu_put(struct kvm_vcpu *vcpu)
 	preempt_disable();
 	kvm_arch_vcpu_put(vcpu);
 	preempt_notifier_unregister(&vcpu->preempt_notifier);
+	__this_cpu_write(kvm_running_vcpu, NULL);
 	preempt_enable();
 }
 EXPORT_SYMBOL_GPL(vcpu_put);
@@ -4304,8 +4308,8 @@ static void kvm_sched_in(struct preempt_notifier *pn, int cpu)
 	WRITE_ONCE(vcpu->preempted, false);
 	WRITE_ONCE(vcpu->ready, false);
 
+	__this_cpu_write(kvm_running_vcpu, vcpu);
 	kvm_arch_sched_in(vcpu, cpu);
-
 	kvm_arch_vcpu_load(vcpu, cpu);
 }
 
@@ -4319,6 +4323,25 @@ static void kvm_sched_out(struct preempt_notifier *pn,
 		WRITE_ONCE(vcpu->ready, true);
 	}
 	kvm_arch_vcpu_put(vcpu);
+	__this_cpu_write(kvm_running_vcpu, NULL);
+}
+
+/**
+ * kvm_get_running_vcpu - get the vcpu running on the current CPU.
+ * Thanks to preempt notifiers, this can also be called from
+ * preemptible context.
+ */
+struct kvm_vcpu *kvm_get_running_vcpu(void)
+{
+        return __this_cpu_read(kvm_running_vcpu);
+}
+
+/**
+ * kvm_get_running_vcpus - get the per-CPU array of currently running vcpus.
+ */
+struct kvm_vcpu * __percpu *kvm_get_running_vcpus(void)
+{
+        return &kvm_running_vcpu;
 }
 
 static void check_processor_compat(void *rtn)
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 123+ messages in thread

* [PATCH RFC 01/15] KVM: Move running VCPU from ARM to common code
  2019-11-29 21:32 [PATCH RFC 00/15] KVM: Dirty ring interface Peter Xu
@ 2019-11-29 21:32 ` Peter Xu
  0 siblings, 0 replies; 123+ messages in thread
From: Peter Xu @ 2019-11-29 21:32 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Paolo Bonzini, Cao Lei, peterx, Dr . David Alan Gilbert,
	Sean Christopherson, Vitaly Kuznetsov

From: Paolo Bonzini <pbonzini@redhat.com>

For ring-based dirty log tracking, it will be more efficient to account
writes during schedule-out or schedule-in to the currently running VCPU.
We would like to do it even if the write doesn't use the current VCPU's
address space, as is the case for cached writes (see commit 4e335d9e7ddb,
"Revert "KVM: Support vCPU-based gfn->hva cache"", 2017-05-02).

Therefore, add a mechanism to track the currently-loaded kvm_vcpu struct.
There is already something similar in KVM/ARM; one important difference
is that kvm_arch_vcpu_{load,put} have two callers in virt/kvm/kvm_main.c:
we have to update both the architecture-independent vcpu_{load,put} and
the preempt notifiers.

Another change made in the process is to allow using kvm_get_running_vcpu()
in preemptible code.  This is allowed because preempt notifiers ensure
that the value does not change even after the VCPU thread is migrated.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 arch/arm/include/asm/kvm_host.h   |  2 --
 arch/arm64/include/asm/kvm_host.h |  2 --
 include/linux/kvm_host.h          |  3 +++
 virt/kvm/arm/arm.c                | 29 -----------------------------
 virt/kvm/arm/perf.c               |  6 +++---
 virt/kvm/arm/vgic/vgic-mmio.c     | 15 +++------------
 virt/kvm/kvm_main.c               | 25 ++++++++++++++++++++++++-
 7 files changed, 33 insertions(+), 49 deletions(-)

diff --git a/arch/arm/include/asm/kvm_host.h b/arch/arm/include/asm/kvm_host.h
index 556cd818eccf..abc3f6f3ad76 100644
--- a/arch/arm/include/asm/kvm_host.h
+++ b/arch/arm/include/asm/kvm_host.h
@@ -284,8 +284,6 @@ int kvm_arm_copy_reg_indices(struct kvm_vcpu *vcpu, u64 __user *indices);
 int kvm_age_hva(struct kvm *kvm, unsigned long start, unsigned long end);
 int kvm_test_age_hva(struct kvm *kvm, unsigned long hva);
 
-struct kvm_vcpu *kvm_arm_get_running_vcpu(void);
-struct kvm_vcpu __percpu **kvm_get_running_vcpus(void);
 void kvm_arm_halt_guest(struct kvm *kvm);
 void kvm_arm_resume_guest(struct kvm *kvm);
 
diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index b36dae9ee5f9..d97855e41469 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -446,8 +446,6 @@ int kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte);
 int kvm_age_hva(struct kvm *kvm, unsigned long start, unsigned long end);
 int kvm_test_age_hva(struct kvm *kvm, unsigned long hva);
 
-struct kvm_vcpu *kvm_arm_get_running_vcpu(void);
-struct kvm_vcpu * __percpu *kvm_get_running_vcpus(void);
 void kvm_arm_halt_guest(struct kvm *kvm);
 void kvm_arm_resume_guest(struct kvm *kvm);
 
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 7ed1e2f8641e..498a39462ac1 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1342,6 +1342,9 @@ static inline void kvm_vcpu_set_dy_eligible(struct kvm_vcpu *vcpu, bool val)
 }
 #endif /* CONFIG_HAVE_KVM_CPU_RELAX_INTERCEPT */
 
+struct kvm_vcpu *kvm_get_running_vcpu(void);
+struct kvm_vcpu __percpu **kvm_get_running_vcpus(void);
+
 #ifdef CONFIG_HAVE_KVM_IRQ_BYPASS
 bool kvm_arch_has_irq_bypass(void);
 int kvm_arch_irq_bypass_add_producer(struct irq_bypass_consumer *,
diff --git a/virt/kvm/arm/arm.c b/virt/kvm/arm/arm.c
index 12e0280291ce..1df9c39024fa 100644
--- a/virt/kvm/arm/arm.c
+++ b/virt/kvm/arm/arm.c
@@ -51,9 +51,6 @@ __asm__(".arch_extension	virt");
 DEFINE_PER_CPU(kvm_host_data_t, kvm_host_data);
 static DEFINE_PER_CPU(unsigned long, kvm_arm_hyp_stack_page);
 
-/* Per-CPU variable containing the currently running vcpu. */
-static DEFINE_PER_CPU(struct kvm_vcpu *, kvm_arm_running_vcpu);
-
 /* The VMID used in the VTTBR */
 static atomic64_t kvm_vmid_gen = ATOMIC64_INIT(1);
 static u32 kvm_next_vmid;
@@ -62,31 +59,8 @@ static DEFINE_SPINLOCK(kvm_vmid_lock);
 static bool vgic_present;
 
 static DEFINE_PER_CPU(unsigned char, kvm_arm_hardware_enabled);
-
-static void kvm_arm_set_running_vcpu(struct kvm_vcpu *vcpu)
-{
-	__this_cpu_write(kvm_arm_running_vcpu, vcpu);
-}
-
 DEFINE_STATIC_KEY_FALSE(userspace_irqchip_in_use);
 
-/**
- * kvm_arm_get_running_vcpu - get the vcpu running on the current CPU.
- * Must be called from non-preemptible context
- */
-struct kvm_vcpu *kvm_arm_get_running_vcpu(void)
-{
-	return __this_cpu_read(kvm_arm_running_vcpu);
-}
-
-/**
- * kvm_arm_get_running_vcpus - get the per-CPU array of currently running vcpus.
- */
-struct kvm_vcpu * __percpu *kvm_get_running_vcpus(void)
-{
-	return &kvm_arm_running_vcpu;
-}
-
 int kvm_arch_vcpu_should_kick(struct kvm_vcpu *vcpu)
 {
 	return kvm_vcpu_exiting_guest_mode(vcpu) == IN_GUEST_MODE;
@@ -406,7 +380,6 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 	vcpu->cpu = cpu;
 	vcpu->arch.host_cpu_context = &cpu_data->host_ctxt;
 
-	kvm_arm_set_running_vcpu(vcpu);
 	kvm_vgic_load(vcpu);
 	kvm_timer_vcpu_load(vcpu);
 	kvm_vcpu_load_sysregs(vcpu);
@@ -432,8 +405,6 @@ void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
 	kvm_vcpu_pmu_restore_host(vcpu);
 
 	vcpu->cpu = -1;
-
-	kvm_arm_set_running_vcpu(NULL);
 }
 
 static void vcpu_power_off(struct kvm_vcpu *vcpu)
diff --git a/virt/kvm/arm/perf.c b/virt/kvm/arm/perf.c
index 918cdc3839ea..d45b8b9a4415 100644
--- a/virt/kvm/arm/perf.c
+++ b/virt/kvm/arm/perf.c
@@ -13,14 +13,14 @@
 
 static int kvm_is_in_guest(void)
 {
-        return kvm_arm_get_running_vcpu() != NULL;
+        return kvm_get_running_vcpu() != NULL;
 }
 
 static int kvm_is_user_mode(void)
 {
 	struct kvm_vcpu *vcpu;
 
-	vcpu = kvm_arm_get_running_vcpu();
+	vcpu = kvm_get_running_vcpu();
 
 	if (vcpu)
 		return !vcpu_mode_priv(vcpu);
@@ -32,7 +32,7 @@ static unsigned long kvm_get_guest_ip(void)
 {
 	struct kvm_vcpu *vcpu;
 
-	vcpu = kvm_arm_get_running_vcpu();
+	vcpu = kvm_get_running_vcpu();
 
 	if (vcpu)
 		return *vcpu_pc(vcpu);
diff --git a/virt/kvm/arm/vgic/vgic-mmio.c b/virt/kvm/arm/vgic/vgic-mmio.c
index 0d090482720d..d656ebd5f9d4 100644
--- a/virt/kvm/arm/vgic/vgic-mmio.c
+++ b/virt/kvm/arm/vgic/vgic-mmio.c
@@ -190,15 +190,6 @@ unsigned long vgic_mmio_read_pending(struct kvm_vcpu *vcpu,
  * value later will give us the same value as we update the per-CPU variable
  * in the preempt notifier handlers.
  */
-static struct kvm_vcpu *vgic_get_mmio_requester_vcpu(void)
-{
-	struct kvm_vcpu *vcpu;
-
-	preempt_disable();
-	vcpu = kvm_arm_get_running_vcpu();
-	preempt_enable();
-	return vcpu;
-}
 
 /* Must be called with irq->irq_lock held */
 static void vgic_hw_irq_spending(struct kvm_vcpu *vcpu, struct vgic_irq *irq,
@@ -221,7 +212,7 @@ void vgic_mmio_write_spending(struct kvm_vcpu *vcpu,
 			      gpa_t addr, unsigned int len,
 			      unsigned long val)
 {
-	bool is_uaccess = !vgic_get_mmio_requester_vcpu();
+	bool is_uaccess = !kvm_get_running_vcpu();
 	u32 intid = VGIC_ADDR_TO_INTID(addr, 1);
 	int i;
 	unsigned long flags;
@@ -274,7 +265,7 @@ void vgic_mmio_write_cpending(struct kvm_vcpu *vcpu,
 			      gpa_t addr, unsigned int len,
 			      unsigned long val)
 {
-	bool is_uaccess = !vgic_get_mmio_requester_vcpu();
+	bool is_uaccess = !kvm_get_running_vcpu();
 	u32 intid = VGIC_ADDR_TO_INTID(addr, 1);
 	int i;
 	unsigned long flags;
@@ -335,7 +326,7 @@ static void vgic_mmio_change_active(struct kvm_vcpu *vcpu, struct vgic_irq *irq,
 				    bool active)
 {
 	unsigned long flags;
-	struct kvm_vcpu *requester_vcpu = vgic_get_mmio_requester_vcpu();
+	struct kvm_vcpu *requester_vcpu = kvm_get_running_vcpu();
 
 	raw_spin_lock_irqsave(&irq->irq_lock, flags);
 
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 00268290dcbd..fac0760c870e 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -108,6 +108,7 @@ struct kmem_cache *kvm_vcpu_cache;
 EXPORT_SYMBOL_GPL(kvm_vcpu_cache);
 
 static __read_mostly struct preempt_ops kvm_preempt_ops;
+static DEFINE_PER_CPU(struct kvm_vcpu *, kvm_running_vcpu);
 
 struct dentry *kvm_debugfs_dir;
 EXPORT_SYMBOL_GPL(kvm_debugfs_dir);
@@ -197,6 +198,8 @@ bool kvm_is_reserved_pfn(kvm_pfn_t pfn)
 void vcpu_load(struct kvm_vcpu *vcpu)
 {
 	int cpu = get_cpu();
+
+	__this_cpu_write(kvm_running_vcpu, vcpu);
 	preempt_notifier_register(&vcpu->preempt_notifier);
 	kvm_arch_vcpu_load(vcpu, cpu);
 	put_cpu();
@@ -208,6 +211,7 @@ void vcpu_put(struct kvm_vcpu *vcpu)
 	preempt_disable();
 	kvm_arch_vcpu_put(vcpu);
 	preempt_notifier_unregister(&vcpu->preempt_notifier);
+	__this_cpu_write(kvm_running_vcpu, NULL);
 	preempt_enable();
 }
 EXPORT_SYMBOL_GPL(vcpu_put);
@@ -4304,8 +4308,8 @@ static void kvm_sched_in(struct preempt_notifier *pn, int cpu)
 	WRITE_ONCE(vcpu->preempted, false);
 	WRITE_ONCE(vcpu->ready, false);
 
+	__this_cpu_write(kvm_running_vcpu, vcpu);
 	kvm_arch_sched_in(vcpu, cpu);
-
 	kvm_arch_vcpu_load(vcpu, cpu);
 }
 
@@ -4319,6 +4323,25 @@ static void kvm_sched_out(struct preempt_notifier *pn,
 		WRITE_ONCE(vcpu->ready, true);
 	}
 	kvm_arch_vcpu_put(vcpu);
+	__this_cpu_write(kvm_running_vcpu, NULL);
+}
+
+/**
+ * kvm_get_running_vcpu - get the vcpu running on the current CPU.
+ * Thanks to preempt notifiers, this can also be called from
+ * preemptible context.
+ */
+struct kvm_vcpu *kvm_get_running_vcpu(void)
+{
+        return __this_cpu_read(kvm_running_vcpu);
+}
+
+/**
+ * kvm_get_running_vcpus - get the per-CPU array of currently running vcpus.
+ */
+struct kvm_vcpu * __percpu *kvm_get_running_vcpus(void)
+{
+        return &kvm_running_vcpu;
 }
 
 static void check_processor_compat(void *rtn)
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 123+ messages in thread

end of thread, other threads:[~2019-12-20 18:19 UTC | newest]

Thread overview: 123+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-11-29 21:34 [PATCH RFC 00/15] KVM: Dirty ring interface Peter Xu
2019-11-29 21:34 ` [PATCH RFC 01/15] KVM: Move running VCPU from ARM to common code Peter Xu
2019-12-03 19:01   ` Sean Christopherson
2019-12-04  9:42     ` Paolo Bonzini
2019-12-09 22:05       ` Peter Xu
2019-11-29 21:34 ` [PATCH RFC 02/15] KVM: Add kvm/vcpu argument to mark_dirty_page_in_slot Peter Xu
2019-12-02 19:32   ` Sean Christopherson
2019-12-02 20:49     ` Peter Xu
2019-11-29 21:34 ` [PATCH RFC 03/15] KVM: Add build-time error check on kvm_run size Peter Xu
2019-12-02 19:30   ` Sean Christopherson
2019-12-02 20:53     ` Peter Xu
2019-12-02 22:19       ` Sean Christopherson
2019-12-02 22:40         ` Peter Xu
2019-12-03  5:50           ` Sean Christopherson
2019-12-03 13:41         ` Paolo Bonzini
2019-12-03 17:04           ` Peter Xu
2019-11-29 21:34 ` [PATCH RFC 04/15] KVM: Implement ring-based dirty memory tracking Peter Xu
2019-12-02 20:10   ` Sean Christopherson
2019-12-02 21:16     ` Peter Xu
2019-12-02 21:50       ` Sean Christopherson
2019-12-02 23:09         ` Peter Xu
2019-12-03 13:48         ` Paolo Bonzini
2019-12-03 18:46           ` Sean Christopherson
2019-12-04 10:05             ` Paolo Bonzini
2019-12-07  0:29               ` Sean Christopherson
2019-12-09  9:37                 ` Paolo Bonzini
2019-12-09 21:54               ` Peter Xu
2019-12-10 10:07                 ` Paolo Bonzini
2019-12-10 15:52                   ` Peter Xu
2019-12-10 17:09                     ` Paolo Bonzini
2019-12-15 17:21                       ` Peter Xu
2019-12-16 10:08                         ` Paolo Bonzini
2019-12-16 18:54                           ` Peter Xu
2019-12-17  9:01                             ` Paolo Bonzini
2019-12-17 16:24                               ` Peter Xu
2019-12-17 16:28                                 ` Paolo Bonzini
2019-12-18 21:58                                   ` Peter Xu
2019-12-18 22:24                                     ` Sean Christopherson
2019-12-18 22:37                                       ` Paolo Bonzini
2019-12-18 22:49                                         ` Peter Xu
2019-12-17  2:28                           ` Tian, Kevin
2019-12-17 16:18                             ` Alex Williamson
2019-12-17 16:30                               ` Paolo Bonzini
2019-12-18  0:29                                 ` Tian, Kevin
     [not found]                           ` <AADFC41AFE54684AB9EE6CBC0274A5D19D645E5F@SHSMSX104.ccr.corp.intel.com>
2019-12-17  5:17                             ` Tian, Kevin
2019-12-17  5:25                               ` Yan Zhao
2019-12-17 16:24                                 ` Alex Williamson
2019-12-03 19:13   ` Sean Christopherson
2019-12-04 10:14     ` Paolo Bonzini
2019-12-04 14:33       ` Sean Christopherson
2019-12-04 10:38   ` Jason Wang
2019-12-04 11:04     ` Paolo Bonzini
2019-12-04 19:52       ` Peter Xu
2019-12-05  6:51         ` Jason Wang
2019-12-05 12:08           ` Peter Xu
2019-12-05 13:12             ` Jason Wang
2019-12-10 13:25       ` Michael S. Tsirkin
2019-12-10 13:31         ` Paolo Bonzini
2019-12-10 16:02           ` Peter Xu
2019-12-10 21:53             ` Michael S. Tsirkin
2019-12-11  9:05               ` Paolo Bonzini
2019-12-11 13:04                 ` Michael S. Tsirkin
2019-12-11 14:54                   ` Peter Xu
2019-12-10 21:48           ` Michael S. Tsirkin
2019-12-11 12:53   ` Michael S. Tsirkin
2019-12-11 14:14     ` Paolo Bonzini
2019-12-11 20:59     ` Peter Xu
2019-12-11 22:57       ` Michael S. Tsirkin
2019-12-12  0:08         ` Paolo Bonzini
2019-12-12  7:36           ` Michael S. Tsirkin
2019-12-12  8:12             ` Paolo Bonzini
2019-12-12 10:38               ` Michael S. Tsirkin
2019-12-15 17:33           ` Peter Xu
2019-12-16  9:47             ` Michael S. Tsirkin
2019-12-16 15:07               ` Peter Xu
2019-12-16 15:33                 ` Michael S. Tsirkin
2019-12-16 15:47                   ` Peter Xu
2019-12-11 17:24   ` Christophe de Dinechin
2019-12-13 20:23     ` Peter Xu
2019-12-14  7:57       ` Paolo Bonzini
2019-12-14 16:26         ` Peter Xu
2019-12-16  9:29           ` Paolo Bonzini
2019-12-16 15:26             ` Peter Xu
2019-12-16 15:31               ` Paolo Bonzini
2019-12-16 15:43                 ` Peter Xu
2019-12-17 12:16         ` Christophe de Dinechin
2019-12-17 12:19           ` Paolo Bonzini
2019-12-17 15:38             ` Peter Xu
2019-12-17 16:31               ` Paolo Bonzini
2019-12-17 16:42                 ` Peter Xu
2019-12-17 16:48                   ` Paolo Bonzini
2019-12-17 19:41                     ` Peter Xu
2019-12-18  0:33                       ` Paolo Bonzini
2019-12-18 16:32                         ` Peter Xu
2019-12-18 16:41                           ` Paolo Bonzini
2019-12-20 18:19       ` Peter Xu
2019-11-29 21:34 ` [PATCH RFC 05/15] KVM: Make dirty ring exclusive to dirty bitmap log Peter Xu
2019-11-29 21:34 ` [PATCH RFC 06/15] KVM: Introduce dirty ring wait queue Peter Xu
2019-11-29 21:34 ` [PATCH RFC 07/15] KVM: X86: Implement ring-based dirty memory tracking Peter Xu
2019-11-29 21:34 ` [PATCH RFC 08/15] KVM: selftests: Always clear dirty bitmap after iteration Peter Xu
2019-11-29 21:34 ` [PATCH RFC 09/15] KVM: selftests: Sync uapi/linux/kvm.h to tools/ Peter Xu
2019-11-29 21:35 ` [PATCH RFC 10/15] KVM: selftests: Use a single binary for dirty/clear log test Peter Xu
2019-11-29 21:35 ` [PATCH RFC 11/15] KVM: selftests: Introduce after_vcpu_run hook for dirty " Peter Xu
2019-11-29 21:35 ` [PATCH RFC 12/15] KVM: selftests: Add dirty ring buffer test Peter Xu
2019-11-29 21:35 ` [PATCH RFC 13/15] KVM: selftests: Let dirty_log_test async for dirty ring test Peter Xu
2019-11-29 21:35 ` [PATCH RFC 14/15] KVM: selftests: Add "-c" parameter to dirty log test Peter Xu
2019-11-29 21:35 ` [PATCH RFC 15/15] KVM: selftests: Test dirty ring waitqueue Peter Xu
2019-11-30  8:29 ` [PATCH RFC 00/15] KVM: Dirty ring interface Paolo Bonzini
2019-12-02  2:13   ` Peter Xu
2019-12-03 13:59     ` Paolo Bonzini
2019-12-05 19:30       ` Peter Xu
2019-12-05 19:59         ` Paolo Bonzini
2019-12-05 20:52           ` Peter Xu
2019-12-02 20:21   ` Sean Christopherson
2019-12-02 20:43     ` Peter Xu
2019-12-04 10:39 ` Jason Wang
2019-12-04 19:33   ` Peter Xu
2019-12-05  6:49     ` Jason Wang
2019-12-11 13:41 ` Christophe de Dinechin
2019-12-11 14:16   ` Paolo Bonzini
2019-12-11 17:15     ` Peter Xu
  -- strict thread matches above, loose matches on Subject: below --
2019-11-29 21:33 Peter Xu
2019-11-29 21:33 ` [PATCH RFC 01/15] KVM: Move running VCPU from ARM to common code Peter Xu
2019-11-29 21:32 [PATCH RFC 00/15] KVM: Dirty ring interface Peter Xu
2019-11-29 21:32 ` [PATCH RFC 01/15] KVM: Move running VCPU from ARM to common code Peter Xu

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).