* [PATCH v3 00/21] KVM: Dirty ring interface
From: Peter Xu @ 2020-01-09 14:57 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: Christophe de Dinechin, Michael S. Tsirkin, Paolo Bonzini,
	Sean Christopherson, Yan Zhao, Alex Williamson, Jason Wang,
	Kevin Tian, Vitaly Kuznetsov, peterx, Dr. David Alan Gilbert

Branch is here: https://github.com/xzpeter/linux/tree/kvm-dirty-ring
(based on kvm/queue)

Please refer to either the previous cover letters or the documentation
update in patch 12 for the big picture.  Previous posts:

V1: https://lore.kernel.org/kvm/20191129213505.18472-1-peterx@redhat.com
V2: https://lore.kernel.org/kvm/20191221014938.58831-1-peterx@redhat.com

The major change in v3 is that we dropped the whole waitqueue and the
global lock.  With that, we have a clean per-vcpu ring and no default
ring any more.  The two kvmgt refactoring patches are also included
to show the dependency between the works.

Patchset layout:

Patch 1-2:         Picked up from kvmgt refactoring
Patch 3-6:         Small patches that are not directly related
                   (so they can be acked/nacked/picked standalone)
Patch 7-11:        Prepares for the dirty ring interface
Patch 12:          Major implementation
Patch 13-14:       Quick follow-ups for patch 12
Patch 15-21:       Test cases

V3 changelog:

- fail userspace writable maps on dirty ring ranges [Jason]
- commit message fixups [Paolo]
- change __x86_set_memory_region to return hva [Paolo]
- cacheline align for indices [Paolo, Jason]
- drop waitqueue, global lock, etc., include kvmgt rework patchset
- take lock for __x86_set_memory_region() (otherwise it triggers a
  lockdep in latest kvm/queue) [Paolo]
- check KVM_DIRTY_LOG_PAGE_OFFSET in kvm_vm_ioctl_enable_dirty_log_ring
- one more patch to drop x86_set_memory_region [Paolo]
- one more patch to remove extra srcu usage in init_rmode_identity_map()
- add some R-bs from Paolo

Please review, thanks.

Paolo Bonzini (1):
  KVM: Move running VCPU from ARM to common code

Peter Xu (18):
  KVM: Remove kvm_read_guest_atomic()
  KVM: Add build-time error check on kvm_run size
  KVM: X86: Change parameter for fast_page_fault tracepoint
  KVM: X86: Don't take srcu lock in init_rmode_identity_map()
  KVM: Cache as_id in kvm_memory_slot
  KVM: X86: Drop x86_set_memory_region()
  KVM: X86: Don't track dirty for KVM_SET_[TSS_ADDR|IDENTITY_MAP_ADDR]
  KVM: Pass in kvm pointer into mark_page_dirty_in_slot()
  KVM: X86: Implement ring-based dirty memory tracking
  KVM: Make dirty ring exclusive to dirty bitmap log
  KVM: Don't allocate dirty bitmap if dirty ring is enabled
  KVM: selftests: Always clear dirty bitmap after iteration
  KVM: selftests: Sync uapi/linux/kvm.h to tools/
  KVM: selftests: Use a single binary for dirty/clear log test
  KVM: selftests: Introduce after_vcpu_run hook for dirty log test
  KVM: selftests: Add dirty ring buffer test
  KVM: selftests: Let dirty_log_test async for dirty ring test
  KVM: selftests: Add "-c" parameter to dirty log test

Yan Zhao (2):
  vfio: introduce vfio_iova_rw to read/write a range of IOVAs
  drm/i915/gvt: substitute kvm_read/write_guest with vfio_iova_rw

 Documentation/virt/kvm/api.txt                |  96 ++++
 arch/arm/include/asm/kvm_host.h               |   2 -
 arch/arm64/include/asm/kvm_host.h             |   2 -
 arch/x86/include/asm/kvm_host.h               |   7 +-
 arch/x86/include/uapi/asm/kvm.h               |   1 +
 arch/x86/kvm/Makefile                         |   3 +-
 arch/x86/kvm/mmu/mmu.c                        |   6 +
 arch/x86/kvm/mmutrace.h                       |   9 +-
 arch/x86/kvm/svm.c                            |   3 +-
 arch/x86/kvm/vmx/vmx.c                        |  86 ++--
 arch/x86/kvm/x86.c                            |  43 +-
 drivers/gpu/drm/i915/gvt/kvmgt.c              |  25 +-
 drivers/vfio/vfio.c                           |  45 ++
 drivers/vfio/vfio_iommu_type1.c               |  81 ++++
 include/linux/kvm_dirty_ring.h                |  55 +++
 include/linux/kvm_host.h                      |  37 +-
 include/linux/vfio.h                          |   5 +
 include/trace/events/kvm.h                    |  78 ++++
 include/uapi/linux/kvm.h                      |  33 ++
 tools/include/uapi/linux/kvm.h                |  38 ++
 tools/testing/selftests/kvm/Makefile          |   2 -
 .../selftests/kvm/clear_dirty_log_test.c      |   2 -
 tools/testing/selftests/kvm/dirty_log_test.c  | 420 ++++++++++++++++--
 .../testing/selftests/kvm/include/kvm_util.h  |   4 +
 tools/testing/selftests/kvm/lib/kvm_util.c    |  72 +++
 .../selftests/kvm/lib/kvm_util_internal.h     |   3 +
 virt/kvm/arm/arch_timer.c                     |   2 +-
 virt/kvm/arm/arm.c                            |  29 --
 virt/kvm/arm/perf.c                           |   6 +-
 virt/kvm/arm/vgic/vgic-mmio.c                 |  15 +-
 virt/kvm/dirty_ring.c                         | 162 +++++++
 virt/kvm/kvm_main.c                           | 215 +++++++--
 32 files changed, 1379 insertions(+), 208 deletions(-)
 create mode 100644 include/linux/kvm_dirty_ring.h
 delete mode 100644 tools/testing/selftests/kvm/clear_dirty_log_test.c
 create mode 100644 virt/kvm/dirty_ring.c

-- 
2.24.1



* [PATCH v3 01/21] vfio: introduce vfio_iova_rw to read/write a range of IOVAs
From: Peter Xu @ 2020-01-09 14:57 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: Christophe de Dinechin, Michael S. Tsirkin, Paolo Bonzini,
	Sean Christopherson, Yan Zhao, Alex Williamson, Jason Wang,
	Kevin Tian, Vitaly Kuznetsov, peterx, Dr. David Alan Gilbert

From: Yan Zhao <yan.y.zhao@intel.com>

vfio_iova_rw() will read/write a range of userspace memory (from
device iova to iova + len - 1) into/from a kernel buffer without
pinning the userspace memory.

TODO: vfio needs to mark the iova dirty if vfio_iova_rw(write) is
called.
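
As an illustration only (a hypothetical caller, not part of this
patch), a vendor driver could read a 4-byte descriptor from guest
memory through its mdev device:

	u32 desc;
	int ret;

	/* write == false: copy from the IOVA range into the kernel buffer */
	ret = vfio_iova_rw(mdev_dev(mdev), iova, &desc, sizeof(desc), false);
	if (ret)
		return ret;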

Cc: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 drivers/vfio/vfio.c             | 45 ++++++++++++++++++
 drivers/vfio/vfio_iommu_type1.c | 81 +++++++++++++++++++++++++++++++++
 include/linux/vfio.h            |  5 ++
 3 files changed, 131 insertions(+)

diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
index c8482624ca34..36e91e647ed5 100644
--- a/drivers/vfio/vfio.c
+++ b/drivers/vfio/vfio.c
@@ -1961,6 +1961,51 @@ int vfio_unpin_pages(struct device *dev, unsigned long *user_pfn, int npage)
 }
 EXPORT_SYMBOL(vfio_unpin_pages);
 
+/*
+ * Read/write a range of IOVAs for a device into/from a kernel
+ * buffer, without pinning the userspace memory
+ * @dev [in]  : device
+ * @iova [in] : base IOVA of a userspace buffer
+ * @data [in] : pointer to kernel buffer
+ * @len [in]  : kernel buffer length
+ * @write [in]: true for write, false for read
+ * Return 0 on success or a negative errno on failure.
+ */
+int vfio_iova_rw(struct device *dev, unsigned long iova, void *data,
+		   unsigned long len, bool write)
+{
+	struct vfio_container *container;
+	struct vfio_group *group;
+	struct vfio_iommu_driver *driver;
+	int ret = 0;
+
+	if (!dev || !data || !len)
+		return -EINVAL;
+
+	group = vfio_group_get_from_dev(dev);
+	if (!group)
+		return -ENODEV;
+
+	ret = vfio_group_add_container_user(group);
+	if (ret)
+		goto out;
+
+	container = group->container;
+	driver = container->iommu_driver;
+
+	if (likely(driver && driver->ops->iova_rw))
+		ret = driver->ops->iova_rw(container->iommu_data,
+					   iova, data, len, write);
+	else
+		ret = -ENOTTY;
+
+	vfio_group_try_dissolve_container(group);
+out:
+	vfio_group_put(group);
+	return ret;
+}
+EXPORT_SYMBOL(vfio_iova_rw);
+
 static int vfio_register_iommu_notifier(struct vfio_group *group,
 					unsigned long *events,
 					struct notifier_block *nb)
diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 2ada8e6cdb88..aee191077235 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -27,6 +27,7 @@
 #include <linux/iommu.h>
 #include <linux/module.h>
 #include <linux/mm.h>
+#include <linux/mmu_context.h>
 #include <linux/rbtree.h>
 #include <linux/sched/signal.h>
 #include <linux/sched/mm.h>
@@ -2326,6 +2327,85 @@ static int vfio_iommu_type1_unregister_notifier(void *iommu_data,
 	return blocking_notifier_chain_unregister(&iommu->notifier, nb);
 }
 
+static int next_segment(unsigned long len, int offset)
+{
+	if (len > PAGE_SIZE - offset)
+		return PAGE_SIZE - offset;
+	else
+		return len;
+}
+
+static int vfio_iommu_type1_rw_iova_seg(struct vfio_iommu *iommu,
+					  unsigned long iova, void *data,
+					  unsigned long seg_len,
+					  unsigned long offset,
+					  bool write)
+{
+	struct mm_struct *mm;
+	unsigned long vaddr;
+	struct vfio_dma *dma;
+	bool kthread = current->mm == NULL;
+	int ret = 0;
+
+	dma = vfio_find_dma(iommu, iova, PAGE_SIZE);
+	if (!dma)
+		return -EINVAL;
+
+	mm = get_task_mm(dma->task);
+
+	if (!mm)
+		return -ENODEV;
+
+	if (kthread)
+		use_mm(mm);
+	else if (current->mm != mm) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	vaddr = dma->vaddr + iova - dma->iova + offset;
+
+	if (write)
+		ret = __copy_to_user((void __user *)vaddr, data, seg_len);
+	else
+		ret = __copy_from_user(data, (void __user *)vaddr, seg_len);
+	if (ret)
+		ret = -EFAULT;
+
+	if (kthread)
+		unuse_mm(mm);
+out:
+	mmput(mm);
+	return ret;
+}
+
+static int vfio_iommu_type1_iova_rw(void *iommu_data, unsigned long iova,
+				    void *data, unsigned long len, bool write)
+{
+	struct vfio_iommu *iommu = iommu_data;
+	int offset = iova & ~PAGE_MASK;
+	int seg_len;
+	int ret = 0;
+
+	iova = iova & PAGE_MASK;
+
+	mutex_lock(&iommu->lock);
+	while ((seg_len = next_segment(len, offset)) > 0) {
+		ret = vfio_iommu_type1_rw_iova_seg(iommu, iova, data,
+						   seg_len, offset, write);
+		if (ret)
+			break;
+
+		offset = 0;
+		len -= seg_len;
+		data += seg_len;
+		iova += PAGE_SIZE;
+	}
+
+	mutex_unlock(&iommu->lock);
+	return ret;
+}
+
 static const struct vfio_iommu_driver_ops vfio_iommu_driver_ops_type1 = {
 	.name			= "vfio-iommu-type1",
 	.owner			= THIS_MODULE,
@@ -2338,6 +2418,7 @@ static const struct vfio_iommu_driver_ops vfio_iommu_driver_ops_type1 = {
 	.unpin_pages		= vfio_iommu_type1_unpin_pages,
 	.register_notifier	= vfio_iommu_type1_register_notifier,
 	.unregister_notifier	= vfio_iommu_type1_unregister_notifier,
+	.iova_rw		= vfio_iommu_type1_iova_rw,
 };
 
 static int __init vfio_iommu_type1_init(void)
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index e42a711a2800..7bf18a31bbcf 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -82,6 +82,8 @@ struct vfio_iommu_driver_ops {
 					     struct notifier_block *nb);
 	int		(*unregister_notifier)(void *iommu_data,
 					       struct notifier_block *nb);
+	int		(*iova_rw)(void *iommu_data, unsigned long iova,
+				   void *data, unsigned long len, bool write);
 };
 
 extern int vfio_register_iommu_driver(const struct vfio_iommu_driver_ops *ops);
@@ -107,6 +109,9 @@ extern int vfio_pin_pages(struct device *dev, unsigned long *user_pfn,
 extern int vfio_unpin_pages(struct device *dev, unsigned long *user_pfn,
 			    int npage);
 
+extern int vfio_iova_rw(struct device *dev, unsigned long iova, void *data,
+			unsigned long len, bool write);
+
 /* each type has independent events */
 enum vfio_notify_type {
 	VFIO_IOMMU_NOTIFY = 0,
-- 
2.24.1



* [PATCH v3 02/21] drm/i915/gvt: substitute kvm_read/write_guest with vfio_iova_rw
From: Peter Xu @ 2020-01-09 14:57 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: Christophe de Dinechin, Michael S. Tsirkin, Paolo Bonzini,
	Sean Christopherson, Yan Zhao, Alex Williamson, Jason Wang,
	Kevin Tian, Vitaly Kuznetsov, peterx, Dr. David Alan Gilbert

From: Yan Zhao <yan.y.zhao@intel.com>

As a device model, it is better to read/write guest memory through the
vfio interface, so that vfio can maintain dirty info for device IOVAs.

Compared with the CPU-side interfaces kvm_read/write_guest(),
vfio_iova_rw() has ~600 more cpu cycles of overhead on average.
-------------------------------------
|    interface     | avg cpu cycles |
|-----------------------------------|
| kvm_write_guest  |      1546      |
|-----------------------------------|
| kvm_read_guest   |       686      |
|-----------------------------------|
| vfio_iova_rw(w)  |      2233      |
|-----------------------------------|
| vfio_iova_rw(r)  |      1262      |
-------------------------------------

A comparison of benchmark scores is as below:
---------------------------------------------------------
|  avg score  | kvm_read/write_guest   | vfio_iova_rw   |
---------------------------------------------------------
|   Glmark2   |         1132           |      1138.2    |
---------------------------------------------------------
|  Lightsmark |        61.558          |      61.538    |
---------------------------------------------------------
|  OpenArena  |        142.77          |      136.6     |
---------------------------------------------------------
|   Heaven    |         698            |      686.8     |
---------------------------------------------------------
No obvious performance degradation was found.

Cc: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
[peterx: pass in "write" to vfio_iova_rw(), suggested by Paolo]
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 drivers/gpu/drm/i915/gvt/kvmgt.c | 25 ++++++-------------------
 1 file changed, 6 insertions(+), 19 deletions(-)

diff --git a/drivers/gpu/drm/i915/gvt/kvmgt.c b/drivers/gpu/drm/i915/gvt/kvmgt.c
index 3259a1fa69e1..5fb82f285b98 100644
--- a/drivers/gpu/drm/i915/gvt/kvmgt.c
+++ b/drivers/gpu/drm/i915/gvt/kvmgt.c
@@ -1968,31 +1968,18 @@ static int kvmgt_rw_gpa(unsigned long handle, unsigned long gpa,
 			void *buf, unsigned long len, bool write)
 {
 	struct kvmgt_guest_info *info;
-	struct kvm *kvm;
-	int idx, ret;
-	bool kthread = current->mm == NULL;
+	int ret;
+	struct intel_vgpu *vgpu;
+	struct device *dev;
 
 	if (!handle_valid(handle))
 		return -ESRCH;
 
 	info = (struct kvmgt_guest_info *)handle;
-	kvm = info->kvm;
-
-	if (kthread) {
-		if (!mmget_not_zero(kvm->mm))
-			return -EFAULT;
-		use_mm(kvm->mm);
-	}
-
-	idx = srcu_read_lock(&kvm->srcu);
-	ret = write ? kvm_write_guest(kvm, gpa, buf, len) :
-		      kvm_read_guest(kvm, gpa, buf, len);
-	srcu_read_unlock(&kvm->srcu, idx);
+	vgpu = info->vgpu;
+	dev = mdev_dev(vgpu->vdev.mdev);
 
-	if (kthread) {
-		unuse_mm(kvm->mm);
-		mmput(kvm->mm);
-	}
+	ret = vfio_iova_rw(dev, gpa, buf, len, write);
 
 	return ret;
 }
-- 
2.24.1



* [PATCH v3 03/21] KVM: Remove kvm_read_guest_atomic()
From: Peter Xu @ 2020-01-09 14:57 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: Christophe de Dinechin, Michael S. Tsirkin, Paolo Bonzini,
	Sean Christopherson, Yan Zhao, Alex Williamson, Jason Wang,
	Kevin Tian, Vitaly Kuznetsov, peterx, Dr. David Alan Gilbert

Remove kvm_read_guest_atomic() because it's not used anywhere.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/linux/kvm_host.h |  2 --
 virt/kvm/kvm_main.c      | 11 -----------
 2 files changed, 13 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 528ab7a814ab..2337f9b6112c 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -725,8 +725,6 @@ void kvm_get_pfn(kvm_pfn_t pfn);
 
 int kvm_read_guest_page(struct kvm *kvm, gfn_t gfn, void *data, int offset,
 			int len);
-int kvm_read_guest_atomic(struct kvm *kvm, gpa_t gpa, void *data,
-			  unsigned long len);
 int kvm_read_guest(struct kvm *kvm, gpa_t gpa, void *data, unsigned long len);
 int kvm_read_guest_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
 			   void *data, unsigned long len);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 3aa21bec028d..24c9cf4c8a52 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2048,17 +2048,6 @@ static int __kvm_read_guest_atomic(struct kvm_memory_slot *slot, gfn_t gfn,
 	return 0;
 }
 
-int kvm_read_guest_atomic(struct kvm *kvm, gpa_t gpa, void *data,
-			  unsigned long len)
-{
-	gfn_t gfn = gpa >> PAGE_SHIFT;
-	struct kvm_memory_slot *slot = gfn_to_memslot(kvm, gfn);
-	int offset = offset_in_page(gpa);
-
-	return __kvm_read_guest_atomic(slot, gfn, data, offset, len);
-}
-EXPORT_SYMBOL_GPL(kvm_read_guest_atomic);
-
 int kvm_vcpu_read_guest_atomic(struct kvm_vcpu *vcpu, gpa_t gpa,
 			       void *data, unsigned long len)
 {
-- 
2.24.1



* [PATCH v3 04/21] KVM: Add build-time error check on kvm_run size
From: Peter Xu @ 2020-01-09 14:57 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: Christophe de Dinechin, Michael S. Tsirkin, Paolo Bonzini,
	Sean Christopherson, Yan Zhao, Alex Williamson, Jason Wang,
	Kevin Tian, Vitaly Kuznetsov, peterx, Dr. David Alan Gilbert

struct kvm_run is already going to reach 2400 bytes (which is over
half of the page size on 4K-page architectures), so add a build-time
check to catch the case where it overflows a page when new fields
are added.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 virt/kvm/kvm_main.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 24c9cf4c8a52..70b78ccaf3b5 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -338,6 +338,7 @@ int kvm_vcpu_init(struct kvm_vcpu *vcpu, struct kvm *kvm, unsigned id)
 	vcpu->pre_pcpu = -1;
 	INIT_LIST_HEAD(&vcpu->blocked_vcpu_list);
 
+	BUILD_BUG_ON(sizeof(struct kvm_run) > PAGE_SIZE);
 	page = alloc_page(GFP_KERNEL | __GFP_ZERO);
 	if (!page) {
 		r = -ENOMEM;
-- 
2.24.1



* [PATCH v3 05/21] KVM: X86: Change parameter for fast_page_fault tracepoint
From: Peter Xu @ 2020-01-09 14:57 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: Christophe de Dinechin, Michael S. Tsirkin, Paolo Bonzini,
	Sean Christopherson, Yan Zhao, Alex Williamson, Jason Wang,
	Kevin Tian, Vitaly Kuznetsov, peterx, Dr. David Alan Gilbert

It is clearer to dump the return value, which directly tells whether
we went through the fast path when handling the current page fault.
Remove the last two parameters, since the old/new sptes are already
dumped on the same line.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 arch/x86/kvm/mmutrace.h | 9 ++-------
 1 file changed, 2 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kvm/mmutrace.h b/arch/x86/kvm/mmutrace.h
index 3c6522b84ff1..456371406d2a 100644
--- a/arch/x86/kvm/mmutrace.h
+++ b/arch/x86/kvm/mmutrace.h
@@ -244,9 +244,6 @@ TRACE_EVENT(
 		  __entry->access)
 );
 
-#define __spte_satisfied(__spte)				\
-	(__entry->retry && is_writable_pte(__entry->__spte))
-
 TRACE_EVENT(
 	fast_page_fault,
 	TP_PROTO(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u32 error_code,
@@ -274,12 +271,10 @@ TRACE_EVENT(
 	),
 
 	TP_printk("vcpu %d gva %llx error_code %s sptep %p old %#llx"
-		  " new %llx spurious %d fixed %d", __entry->vcpu_id,
+		  " new %llx ret %d", __entry->vcpu_id,
 		  __entry->cr2_or_gpa, __print_flags(__entry->error_code, "|",
 		  kvm_mmu_trace_pferr_flags), __entry->sptep,
-		  __entry->old_spte, __entry->new_spte,
-		  __spte_satisfied(old_spte), __spte_satisfied(new_spte)
-	)
+		  __entry->old_spte, __entry->new_spte, __entry->retry)
 );
 
 TRACE_EVENT(
-- 
2.24.1



* [PATCH v3 06/21] KVM: X86: Don't take srcu lock in init_rmode_identity_map()
From: Peter Xu @ 2020-01-09 14:57 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: Christophe de Dinechin, Michael S. Tsirkin, Paolo Bonzini,
	Sean Christopherson, Yan Zhao, Alex Williamson, Jason Wang,
	Kevin Tian, Vitaly Kuznetsov, peterx, Dr. David Alan Gilbert

We've already got the slots_lock, so we should be safe without taking
the SRCU read lock.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 arch/x86/kvm/vmx/vmx.c | 10 +++-------
 1 file changed, 3 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index b5a0c2e05825..7add2fc8d8e9 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -3475,7 +3475,7 @@ static int init_rmode_tss(struct kvm *kvm)
 static int init_rmode_identity_map(struct kvm *kvm)
 {
 	struct kvm_vmx *kvm_vmx = to_kvm_vmx(kvm);
-	int i, idx, r = 0;
+	int i, r = 0;
 	kvm_pfn_t identity_map_pfn;
 	u32 tmp;
 
@@ -3483,7 +3483,7 @@ static int init_rmode_identity_map(struct kvm *kvm)
 	mutex_lock(&kvm->slots_lock);
 
 	if (likely(kvm_vmx->ept_identity_pagetable_done))
-		goto out2;
+		goto out;
 
 	if (!kvm_vmx->ept_identity_map_addr)
 		kvm_vmx->ept_identity_map_addr = VMX_EPT_IDENTITY_PAGETABLE_ADDR;
@@ -3492,9 +3492,8 @@ static int init_rmode_identity_map(struct kvm *kvm)
 	r = __x86_set_memory_region(kvm, IDENTITY_PAGETABLE_PRIVATE_MEMSLOT,
 				    kvm_vmx->ept_identity_map_addr, PAGE_SIZE);
 	if (r < 0)
-		goto out2;
+		goto out;
 
-	idx = srcu_read_lock(&kvm->srcu);
 	r = kvm_clear_guest_page(kvm, identity_map_pfn, 0, PAGE_SIZE);
 	if (r < 0)
 		goto out;
@@ -3510,9 +3509,6 @@ static int init_rmode_identity_map(struct kvm *kvm)
 	kvm_vmx->ept_identity_pagetable_done = true;
 
 out:
-	srcu_read_unlock(&kvm->srcu, idx);
-
-out2:
 	mutex_unlock(&kvm->slots_lock);
 	return r;
 }
-- 
2.24.1



* [PATCH v3 07/21] KVM: Cache as_id in kvm_memory_slot
From: Peter Xu @ 2020-01-09 14:57 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: Christophe de Dinechin, Michael S. Tsirkin, Paolo Bonzini,
	Sean Christopherson, Yan Zhao, Alex Williamson, Jason Wang,
	Kevin Tian, Vitaly Kuznetsov, peterx, Dr. David Alan Gilbert

Cache the address space ID just like the slot ID.  It will be used in
order to fill in the dirty ring entries.

Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Suggested-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/linux/kvm_host.h | 1 +
 virt/kvm/kvm_main.c      | 2 ++
 2 files changed, 3 insertions(+)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 2337f9b6112c..763adf8c47b0 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -348,6 +348,7 @@ struct kvm_memory_slot {
 	unsigned long userspace_addr;
 	u32 flags;
 	short id;
+	u8 as_id;
 };
 
 static inline unsigned long kvm_dirty_bitmap_bytes(struct kvm_memory_slot *memslot)
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 70b78ccaf3b5..1fd204f27028 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1040,6 +1040,8 @@ int __kvm_set_memory_region(struct kvm *kvm,
 
 	new = old = *slot;
 
+	BUILD_BUG_ON(U8_MAX < KVM_ADDRESS_SPACE_NUM);
+	new.as_id = as_id;
 	new.id = id;
 	new.base_gfn = base_gfn;
 	new.npages = npages;
-- 
2.24.1



* [PATCH v3 08/21] KVM: X86: Drop x86_set_memory_region()
From: Peter Xu @ 2020-01-09 14:57 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: Christophe de Dinechin, Michael S. Tsirkin, Paolo Bonzini,
	Sean Christopherson, Yan Zhao, Alex Williamson, Jason Wang,
	Kevin Tian, Vitaly Kuznetsov, peterx, Dr. David Alan Gilbert

The helper x86_set_memory_region() is only used in vmx_set_tss_addr()
and kvm_arch_destroy_vm().  Push the locking up into both callers.
With that, drop x86_set_memory_region().

This prepares __x86_set_memory_region() to return a mapped HVA,
because the HVA will need to remain protected by the lock even after
__x86_set_memory_region() returns.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 arch/x86/include/asm/kvm_host.h |  1 -
 arch/x86/kvm/vmx/vmx.c          |  7 +++++--
 arch/x86/kvm/x86.c              | 22 +++++++---------------
 3 files changed, 12 insertions(+), 18 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 159a28512e4c..eb6673c7d2e3 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1619,7 +1619,6 @@ void __kvm_request_immediate_exit(struct kvm_vcpu *vcpu);
 int kvm_is_in_guest(void);
 
 int __x86_set_memory_region(struct kvm *kvm, int id, gpa_t gpa, u32 size);
-int x86_set_memory_region(struct kvm *kvm, int id, gpa_t gpa, u32 size);
 bool kvm_vcpu_is_reset_bsp(struct kvm_vcpu *vcpu);
 bool kvm_vcpu_is_bsp(struct kvm_vcpu *vcpu);
 
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 7add2fc8d8e9..7e3d370209e0 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -4482,8 +4482,11 @@ static int vmx_set_tss_addr(struct kvm *kvm, unsigned int addr)
 	if (enable_unrestricted_guest)
 		return 0;
 
-	ret = x86_set_memory_region(kvm, TSS_PRIVATE_MEMSLOT, addr,
-				    PAGE_SIZE * 3);
+	mutex_lock(&kvm->slots_lock);
+	ret = __x86_set_memory_region(kvm, TSS_PRIVATE_MEMSLOT, addr,
+				      PAGE_SIZE * 3);
+	mutex_unlock(&kvm->slots_lock);
+
 	if (ret)
 		return ret;
 	to_kvm_vmx(kvm)->tss_addr = addr;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 93bbbce67a03..c4d3972dcd14 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -9636,18 +9636,6 @@ int __x86_set_memory_region(struct kvm *kvm, int id, gpa_t gpa, u32 size)
 }
 EXPORT_SYMBOL_GPL(__x86_set_memory_region);
 
-int x86_set_memory_region(struct kvm *kvm, int id, gpa_t gpa, u32 size)
-{
-	int r;
-
-	mutex_lock(&kvm->slots_lock);
-	r = __x86_set_memory_region(kvm, id, gpa, size);
-	mutex_unlock(&kvm->slots_lock);
-
-	return r;
-}
-EXPORT_SYMBOL_GPL(x86_set_memory_region);
-
 void kvm_arch_pre_destroy_vm(struct kvm *kvm)
 {
 	kvm_mmu_pre_destroy_vm(kvm);
@@ -9661,9 +9649,13 @@ void kvm_arch_destroy_vm(struct kvm *kvm)
 		 * unless the the memory map has changed due to process exit
 		 * or fd copying.
 		 */
-		x86_set_memory_region(kvm, APIC_ACCESS_PAGE_PRIVATE_MEMSLOT, 0, 0);
-		x86_set_memory_region(kvm, IDENTITY_PAGETABLE_PRIVATE_MEMSLOT, 0, 0);
-		x86_set_memory_region(kvm, TSS_PRIVATE_MEMSLOT, 0, 0);
+		mutex_lock(&kvm->slots_lock);
+		__x86_set_memory_region(kvm, APIC_ACCESS_PAGE_PRIVATE_MEMSLOT,
+					0, 0);
+		__x86_set_memory_region(kvm, IDENTITY_PAGETABLE_PRIVATE_MEMSLOT,
+					0, 0);
+		__x86_set_memory_region(kvm, TSS_PRIVATE_MEMSLOT, 0, 0);
+		mutex_unlock(&kvm->slots_lock);
 	}
 	if (kvm_x86_ops->vm_destroy)
 		kvm_x86_ops->vm_destroy(kvm);
-- 
2.24.1



* [PATCH v3 09/21] KVM: X86: Don't track dirty for KVM_SET_[TSS_ADDR|IDENTITY_MAP_ADDR]
From: Peter Xu @ 2020-01-09 14:57 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: Christophe de Dinechin, Michael S. Tsirkin, Paolo Bonzini,
	Sean Christopherson, Yan Zhao, Alex Williamson, Jason Wang,
	Kevin Tian, Vitaly Kuznetsov, peterx, Dr. David Alan Gilbert

On x86 there are currently three code paths that can dirty a page
without a vcpu context:

  - init_rmode_identity_map
  - init_rmode_tss
  - kvmgt_rw_gpa

init_rmode_identity_map and init_rmode_tss will be set up on the
destination VM no matter what (and the guest cannot even see them),
so it does not make sense to track them at all.

To achieve this, allow __x86_set_memory_region() to return to the
caller the userspace address that was just allocated.  Then, in both
functions, write directly to the userspace address instead of calling
the kvm_write_*() APIs.  We need to make sure the slots_lock is held
when accessing the userspace address.

Another trivial change is that we don't need to explicitly clear the
identity page table root in init_rmode_identity_map(), because we
will write the whole page with 4M huge page entries anyway.

Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 arch/x86/include/asm/kvm_host.h |  3 +-
 arch/x86/kvm/svm.c              |  3 +-
 arch/x86/kvm/vmx/vmx.c          | 68 ++++++++++++++++-----------------
 arch/x86/kvm/x86.c              | 18 +++++++--
 4 files changed, 51 insertions(+), 41 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index eb6673c7d2e3..f536d139b3d2 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1618,7 +1618,8 @@ void __kvm_request_immediate_exit(struct kvm_vcpu *vcpu);
 
 int kvm_is_in_guest(void);
 
-int __x86_set_memory_region(struct kvm *kvm, int id, gpa_t gpa, u32 size);
+int __x86_set_memory_region(struct kvm *kvm, int id, gpa_t gpa, u32 size,
+			    unsigned long *uaddr);
 bool kvm_vcpu_is_reset_bsp(struct kvm_vcpu *vcpu);
 bool kvm_vcpu_is_bsp(struct kvm_vcpu *vcpu);
 
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 8f1b715dfde8..03a344ce7b66 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -1698,7 +1698,8 @@ static int avic_init_access_page(struct kvm_vcpu *vcpu)
 	ret = __x86_set_memory_region(kvm,
 				      APIC_ACCESS_PAGE_PRIVATE_MEMSLOT,
 				      APIC_DEFAULT_PHYS_BASE,
-				      PAGE_SIZE);
+				      PAGE_SIZE,
+				      NULL);
 	if (ret)
 		goto out;
 
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 7e3d370209e0..62175a246bcc 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -3441,34 +3441,28 @@ static bool guest_state_valid(struct kvm_vcpu *vcpu)
 	return true;
 }
 
-static int init_rmode_tss(struct kvm *kvm)
+static int init_rmode_tss(struct kvm *kvm, unsigned long uaddr)
 {
-	gfn_t fn;
+	const void *zero_page = (const void *) __va(page_to_phys(ZERO_PAGE(0)));
 	u16 data = 0;
 	int idx, r;
 
-	idx = srcu_read_lock(&kvm->srcu);
-	fn = to_kvm_vmx(kvm)->tss_addr >> PAGE_SHIFT;
-	r = kvm_clear_guest_page(kvm, fn, 0, PAGE_SIZE);
-	if (r < 0)
-		goto out;
+	for (idx = 0; idx < 3; idx++) {
+		r = __copy_to_user((void __user *)uaddr + PAGE_SIZE * idx,
+				   zero_page, PAGE_SIZE);
+		if (r)
+			return -EFAULT;
+	}
+
 	data = TSS_BASE_SIZE + TSS_REDIRECTION_SIZE;
-	r = kvm_write_guest_page(kvm, fn++, &data,
-			TSS_IOPB_BASE_OFFSET, sizeof(u16));
-	if (r < 0)
-		goto out;
-	r = kvm_clear_guest_page(kvm, fn++, 0, PAGE_SIZE);
-	if (r < 0)
-		goto out;
-	r = kvm_clear_guest_page(kvm, fn, 0, PAGE_SIZE);
-	if (r < 0)
-		goto out;
+	r = __copy_to_user((void __user *)uaddr + TSS_IOPB_BASE_OFFSET,
+			   &data, sizeof(data));
+	if (r)
+		return -EFAULT;
+
 	data = ~0;
-	r = kvm_write_guest_page(kvm, fn, &data,
-				 RMODE_TSS_SIZE - 2 * PAGE_SIZE - 1,
-				 sizeof(u8));
-out:
-	srcu_read_unlock(&kvm->srcu, idx);
+	r = __copy_to_user((void __user *)uaddr + RMODE_TSS_SIZE - 1,
+			   &data, sizeof(u8)) ? -EFAULT : 0;
 	return r;
 }
 
@@ -3478,6 +3472,7 @@ static int init_rmode_identity_map(struct kvm *kvm)
 	int i, r = 0;
 	kvm_pfn_t identity_map_pfn;
 	u32 tmp;
+	unsigned long uaddr = 0;
 
 	/* Protect kvm_vmx->ept_identity_pagetable_done. */
 	mutex_lock(&kvm->slots_lock);
@@ -3490,21 +3485,21 @@ static int init_rmode_identity_map(struct kvm *kvm)
 	identity_map_pfn = kvm_vmx->ept_identity_map_addr >> PAGE_SHIFT;
 
 	r = __x86_set_memory_region(kvm, IDENTITY_PAGETABLE_PRIVATE_MEMSLOT,
-				    kvm_vmx->ept_identity_map_addr, PAGE_SIZE);
+				    kvm_vmx->ept_identity_map_addr, PAGE_SIZE,
+				    &uaddr);
 	if (r < 0)
 		goto out;
 
-	r = kvm_clear_guest_page(kvm, identity_map_pfn, 0, PAGE_SIZE);
-	if (r < 0)
-		goto out;
 	/* Set up identity-mapping pagetable for EPT in real mode */
 	for (i = 0; i < PT32_ENT_PER_PAGE; i++) {
 		tmp = (i << 22) + (_PAGE_PRESENT | _PAGE_RW | _PAGE_USER |
 			_PAGE_ACCESSED | _PAGE_DIRTY | _PAGE_PSE);
-		r = kvm_write_guest_page(kvm, identity_map_pfn,
-				&tmp, i * sizeof(tmp), sizeof(tmp));
-		if (r < 0)
+		r = __copy_to_user((void __user *)uaddr + i * sizeof(tmp),
+				   &tmp, sizeof(tmp));
+		if (r) {
+			r = -EFAULT;
 			goto out;
+		}
 	}
 	kvm_vmx->ept_identity_pagetable_done = true;
 
@@ -3537,7 +3532,7 @@ static int alloc_apic_access_page(struct kvm *kvm)
 	if (kvm->arch.apic_access_page_done)
 		goto out;
 	r = __x86_set_memory_region(kvm, APIC_ACCESS_PAGE_PRIVATE_MEMSLOT,
-				    APIC_DEFAULT_PHYS_BASE, PAGE_SIZE);
+				    APIC_DEFAULT_PHYS_BASE, PAGE_SIZE, NULL);
 	if (r)
 		goto out;
 
@@ -4478,19 +4473,22 @@ static int vmx_interrupt_allowed(struct kvm_vcpu *vcpu)
 static int vmx_set_tss_addr(struct kvm *kvm, unsigned int addr)
 {
 	int ret;
+	unsigned long uaddr = 0;
 
 	if (enable_unrestricted_guest)
 		return 0;
 
 	mutex_lock(&kvm->slots_lock);
 	ret = __x86_set_memory_region(kvm, TSS_PRIVATE_MEMSLOT, addr,
-				      PAGE_SIZE * 3);
-	mutex_unlock(&kvm->slots_lock);
-
+				      PAGE_SIZE * 3, &uaddr);
 	if (ret)
-		return ret;
+		goto out;
+
 	to_kvm_vmx(kvm)->tss_addr = addr;
-	return init_rmode_tss(kvm);
+	ret = init_rmode_tss(kvm, uaddr);
+out:
+	mutex_unlock(&kvm->slots_lock);
+	return ret;
 }
 
 static int vmx_set_identity_map_addr(struct kvm *kvm, u64 ident_addr)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index c4d3972dcd14..ff97782b3919 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -9584,7 +9584,15 @@ void kvm_arch_sync_events(struct kvm *kvm)
 	kvm_free_pit(kvm);
 }
 
-int __x86_set_memory_region(struct kvm *kvm, int id, gpa_t gpa, u32 size)
+/*
+ * If `uaddr' is specified, `*uaddr' will be filled with the userspace
+ * address that was just allocated.  `*uaddr' is only meaningful if
+ * the function returns zero, and it is only valid while either the
+ * slots_lock or the SRCU read lock is held.  Once the lock is
+ * released, the returned address becomes invalid.
+ */
+int __x86_set_memory_region(struct kvm *kvm, int id, gpa_t gpa, u32 size,
+			    unsigned long *uaddr)
 {
 	int i, r;
 	unsigned long hva;
@@ -9608,6 +9616,8 @@ int __x86_set_memory_region(struct kvm *kvm, int id, gpa_t gpa, u32 size)
 			      MAP_SHARED | MAP_ANONYMOUS, 0);
 		if (IS_ERR((void *)hva))
 			return PTR_ERR((void *)hva);
+		if (uaddr)
+			*uaddr = hva;
 	} else {
 		if (!slot->npages)
 			return 0;
@@ -9651,10 +9661,10 @@ void kvm_arch_destroy_vm(struct kvm *kvm)
 		 */
 		mutex_lock(&kvm->slots_lock);
 		__x86_set_memory_region(kvm, APIC_ACCESS_PAGE_PRIVATE_MEMSLOT,
-					0, 0);
+					0, 0, NULL);
 		__x86_set_memory_region(kvm, IDENTITY_PAGETABLE_PRIVATE_MEMSLOT,
-					0, 0);
-		__x86_set_memory_region(kvm, TSS_PRIVATE_MEMSLOT, 0, 0);
+					0, 0, NULL);
+		__x86_set_memory_region(kvm, TSS_PRIVATE_MEMSLOT, 0, 0, NULL);
 		mutex_unlock(&kvm->slots_lock);
 	}
 	if (kvm_x86_ops->vm_destroy)
-- 
2.24.1



* [PATCH v3 10/21] KVM: Pass in kvm pointer into mark_page_dirty_in_slot()
From: Peter Xu @ 2020-01-09 14:57 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: Christophe de Dinechin, Michael S. Tsirkin, Paolo Bonzini,
	Sean Christopherson, Yan Zhao, Alex Williamson, Jason Wang,
	Kevin Tian, Vitaly Kuznetsov, peterx, Dr. David Alan Gilbert

The context will be needed to implement the kvm dirty ring.

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 virt/kvm/kvm_main.c | 22 +++++++++++++---------
 1 file changed, 13 insertions(+), 9 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 1fd204f27028..028dfc27479b 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -144,7 +144,9 @@ static void hardware_disable_all(void);
 
 static void kvm_io_bus_destroy(struct kvm_io_bus *bus);
 
-static void mark_page_dirty_in_slot(struct kvm_memory_slot *memslot, gfn_t gfn);
+static void mark_page_dirty_in_slot(struct kvm *kvm,
+				    struct kvm_memory_slot *memslot,
+				    gfn_t gfn);
 
 __visible bool kvm_rebooting;
 EXPORT_SYMBOL_GPL(kvm_rebooting);
@@ -2062,7 +2064,8 @@ int kvm_vcpu_read_guest_atomic(struct kvm_vcpu *vcpu, gpa_t gpa,
 }
 EXPORT_SYMBOL_GPL(kvm_vcpu_read_guest_atomic);
 
-static int __kvm_write_guest_page(struct kvm_memory_slot *memslot, gfn_t gfn,
+static int __kvm_write_guest_page(struct kvm *kvm,
+				  struct kvm_memory_slot *memslot, gfn_t gfn,
 			          const void *data, int offset, int len)
 {
 	int r;
@@ -2074,7 +2077,7 @@ static int __kvm_write_guest_page(struct kvm_memory_slot *memslot, gfn_t gfn,
 	r = __copy_to_user((void __user *)addr + offset, data, len);
 	if (r)
 		return -EFAULT;
-	mark_page_dirty_in_slot(memslot, gfn);
+	mark_page_dirty_in_slot(kvm, memslot, gfn);
 	return 0;
 }
 
@@ -2083,7 +2086,7 @@ int kvm_write_guest_page(struct kvm *kvm, gfn_t gfn,
 {
 	struct kvm_memory_slot *slot = gfn_to_memslot(kvm, gfn);
 
-	return __kvm_write_guest_page(slot, gfn, data, offset, len);
+	return __kvm_write_guest_page(kvm, slot, gfn, data, offset, len);
 }
 EXPORT_SYMBOL_GPL(kvm_write_guest_page);
 
@@ -2092,7 +2095,7 @@ int kvm_vcpu_write_guest_page(struct kvm_vcpu *vcpu, gfn_t gfn,
 {
 	struct kvm_memory_slot *slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
 
-	return __kvm_write_guest_page(slot, gfn, data, offset, len);
+	return __kvm_write_guest_page(vcpu->kvm, slot, gfn, data, offset, len);
 }
 EXPORT_SYMBOL_GPL(kvm_vcpu_write_guest_page);
 
@@ -2206,7 +2209,7 @@ int kvm_write_guest_offset_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
 	r = __copy_to_user((void __user *)ghc->hva + offset, data, len);
 	if (r)
 		return -EFAULT;
-	mark_page_dirty_in_slot(ghc->memslot, gpa >> PAGE_SHIFT);
+	mark_page_dirty_in_slot(kvm, ghc->memslot, gpa >> PAGE_SHIFT);
 
 	return 0;
 }
@@ -2271,7 +2274,8 @@ int kvm_clear_guest(struct kvm *kvm, gpa_t gpa, unsigned long len)
 }
 EXPORT_SYMBOL_GPL(kvm_clear_guest);
 
-static void mark_page_dirty_in_slot(struct kvm_memory_slot *memslot,
+static void mark_page_dirty_in_slot(struct kvm *kvm,
+				    struct kvm_memory_slot *memslot,
 				    gfn_t gfn)
 {
 	if (memslot && memslot->dirty_bitmap) {
@@ -2286,7 +2290,7 @@ void mark_page_dirty(struct kvm *kvm, gfn_t gfn)
 	struct kvm_memory_slot *memslot;
 
 	memslot = gfn_to_memslot(kvm, gfn);
-	mark_page_dirty_in_slot(memslot, gfn);
+	mark_page_dirty_in_slot(kvm, memslot, gfn);
 }
 EXPORT_SYMBOL_GPL(mark_page_dirty);
 
@@ -2295,7 +2299,7 @@ void kvm_vcpu_mark_page_dirty(struct kvm_vcpu *vcpu, gfn_t gfn)
 	struct kvm_memory_slot *memslot;
 
 	memslot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
-	mark_page_dirty_in_slot(memslot, gfn);
+	mark_page_dirty_in_slot(vcpu->kvm, memslot, gfn);
 }
 EXPORT_SYMBOL_GPL(kvm_vcpu_mark_page_dirty);
 
-- 
2.24.1



* [PATCH v3 11/21] KVM: Move running VCPU from ARM to common code
From: Peter Xu @ 2020-01-09 14:57 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: Christophe de Dinechin, Michael S. Tsirkin, Paolo Bonzini,
	Sean Christopherson, Yan Zhao, Alex Williamson, Jason Wang,
	Kevin Tian, Vitaly Kuznetsov, peterx, Dr. David Alan Gilbert

From: Paolo Bonzini <pbonzini@redhat.com>

For ring-based dirty log tracking, it will be more efficient to account
writes during schedule-out or schedule-in to the currently running VCPU.
We would like to do it even if the write doesn't use the current VCPU's
address space, as is the case for cached writes (see commit 4e335d9e7ddb,
"Revert "KVM: Support vCPU-based gfn->hva cache"", 2017-05-02).

Therefore, add a mechanism to track the currently-loaded kvm_vcpu struct.
There is already something similar in KVM/ARM; one important difference
is that kvm_arch_vcpu_{load,put} have two callers in virt/kvm/kvm_main.c:
we have to update both the architecture-independent vcpu_{load,put} and
the preempt notifiers.

Another change made in the process is to allow using kvm_get_running_vcpu()
in preemptible code.  This is allowed because preempt notifiers ensure
that the value does not change even after the VCPU thread is migrated.
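
A minimal sketch (a hypothetical consumer, not part of this patch) of
how a write path can use the helper:

	struct kvm_vcpu *vcpu = kvm_get_running_vcpu();

	/* NULL when no vCPU of any VM is loaded on this physical CPU */
	if (vcpu && vcpu->kvm == kvm)
		account_write(vcpu, gfn);	/* hypothetical */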

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 arch/arm/include/asm/kvm_host.h   |  2 --
 arch/arm64/include/asm/kvm_host.h |  2 --
 include/linux/kvm_host.h          |  3 +++
 virt/kvm/arm/arch_timer.c         |  2 +-
 virt/kvm/arm/arm.c                | 29 -----------------------------
 virt/kvm/arm/perf.c               |  6 +++---
 virt/kvm/arm/vgic/vgic-mmio.c     | 15 +++------------
 virt/kvm/kvm_main.c               | 25 ++++++++++++++++++++++++-
 8 files changed, 34 insertions(+), 50 deletions(-)

diff --git a/arch/arm/include/asm/kvm_host.h b/arch/arm/include/asm/kvm_host.h
index 556cd818eccf..abc3f6f3ad76 100644
--- a/arch/arm/include/asm/kvm_host.h
+++ b/arch/arm/include/asm/kvm_host.h
@@ -284,8 +284,6 @@ int kvm_arm_copy_reg_indices(struct kvm_vcpu *vcpu, u64 __user *indices);
 int kvm_age_hva(struct kvm *kvm, unsigned long start, unsigned long end);
 int kvm_test_age_hva(struct kvm *kvm, unsigned long hva);
 
-struct kvm_vcpu *kvm_arm_get_running_vcpu(void);
-struct kvm_vcpu __percpu **kvm_get_running_vcpus(void);
 void kvm_arm_halt_guest(struct kvm *kvm);
 void kvm_arm_resume_guest(struct kvm *kvm);
 
diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index c61260cf63c5..12302f9035f9 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -446,8 +446,6 @@ int kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte);
 int kvm_age_hva(struct kvm *kvm, unsigned long start, unsigned long end);
 int kvm_test_age_hva(struct kvm *kvm, unsigned long hva);
 
-struct kvm_vcpu *kvm_arm_get_running_vcpu(void);
-struct kvm_vcpu * __percpu *kvm_get_running_vcpus(void);
 void kvm_arm_halt_guest(struct kvm *kvm);
 void kvm_arm_resume_guest(struct kvm *kvm);
 
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 763adf8c47b0..cbd633ece959 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1340,6 +1340,9 @@ static inline void kvm_vcpu_set_dy_eligible(struct kvm_vcpu *vcpu, bool val)
 }
 #endif /* CONFIG_HAVE_KVM_CPU_RELAX_INTERCEPT */
 
+struct kvm_vcpu *kvm_get_running_vcpu(void);
+struct kvm_vcpu __percpu **kvm_get_running_vcpus(void);
+
 #ifdef CONFIG_HAVE_KVM_IRQ_BYPASS
 bool kvm_arch_has_irq_bypass(void);
 int kvm_arch_irq_bypass_add_producer(struct irq_bypass_consumer *,
diff --git a/virt/kvm/arm/arch_timer.c b/virt/kvm/arm/arch_timer.c
index f182b2380345..63dd6f27997c 100644
--- a/virt/kvm/arm/arch_timer.c
+++ b/virt/kvm/arm/arch_timer.c
@@ -1022,7 +1022,7 @@ static bool timer_irqs_are_valid(struct kvm_vcpu *vcpu)
 
 bool kvm_arch_timer_get_input_level(int vintid)
 {
-	struct kvm_vcpu *vcpu = kvm_arm_get_running_vcpu();
+	struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
 	struct arch_timer_context *timer;
 
 	if (vintid == vcpu_vtimer(vcpu)->irq.irq)
diff --git a/virt/kvm/arm/arm.c b/virt/kvm/arm/arm.c
index 8de4daf25097..b00a9870e5ec 100644
--- a/virt/kvm/arm/arm.c
+++ b/virt/kvm/arm/arm.c
@@ -51,9 +51,6 @@ __asm__(".arch_extension	virt");
 DEFINE_PER_CPU(kvm_host_data_t, kvm_host_data);
 static DEFINE_PER_CPU(unsigned long, kvm_arm_hyp_stack_page);
 
-/* Per-CPU variable containing the currently running vcpu. */
-static DEFINE_PER_CPU(struct kvm_vcpu *, kvm_arm_running_vcpu);
-
 /* The VMID used in the VTTBR */
 static atomic64_t kvm_vmid_gen = ATOMIC64_INIT(1);
 static u32 kvm_next_vmid;
@@ -62,31 +59,8 @@ static DEFINE_SPINLOCK(kvm_vmid_lock);
 static bool vgic_present;
 
 static DEFINE_PER_CPU(unsigned char, kvm_arm_hardware_enabled);
-
-static void kvm_arm_set_running_vcpu(struct kvm_vcpu *vcpu)
-{
-	__this_cpu_write(kvm_arm_running_vcpu, vcpu);
-}
-
 DEFINE_STATIC_KEY_FALSE(userspace_irqchip_in_use);
 
-/**
- * kvm_arm_get_running_vcpu - get the vcpu running on the current CPU.
- * Must be called from non-preemptible context
- */
-struct kvm_vcpu *kvm_arm_get_running_vcpu(void)
-{
-	return __this_cpu_read(kvm_arm_running_vcpu);
-}
-
-/**
- * kvm_arm_get_running_vcpus - get the per-CPU array of currently running vcpus.
- */
-struct kvm_vcpu * __percpu *kvm_get_running_vcpus(void)
-{
-	return &kvm_arm_running_vcpu;
-}
-
 int kvm_arch_vcpu_should_kick(struct kvm_vcpu *vcpu)
 {
 	return kvm_vcpu_exiting_guest_mode(vcpu) == IN_GUEST_MODE;
@@ -406,7 +380,6 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 	vcpu->cpu = cpu;
 	vcpu->arch.host_cpu_context = &cpu_data->host_ctxt;
 
-	kvm_arm_set_running_vcpu(vcpu);
 	kvm_vgic_load(vcpu);
 	kvm_timer_vcpu_load(vcpu);
 	kvm_vcpu_load_sysregs(vcpu);
@@ -432,8 +405,6 @@ void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
 	kvm_vcpu_pmu_restore_host(vcpu);
 
 	vcpu->cpu = -1;
-
-	kvm_arm_set_running_vcpu(NULL);
 }
 
 static void vcpu_power_off(struct kvm_vcpu *vcpu)
diff --git a/virt/kvm/arm/perf.c b/virt/kvm/arm/perf.c
index 918cdc3839ea..d45b8b9a4415 100644
--- a/virt/kvm/arm/perf.c
+++ b/virt/kvm/arm/perf.c
@@ -13,14 +13,14 @@
 
 static int kvm_is_in_guest(void)
 {
-        return kvm_arm_get_running_vcpu() != NULL;
+        return kvm_get_running_vcpu() != NULL;
 }
 
 static int kvm_is_user_mode(void)
 {
 	struct kvm_vcpu *vcpu;
 
-	vcpu = kvm_arm_get_running_vcpu();
+	vcpu = kvm_get_running_vcpu();
 
 	if (vcpu)
 		return !vcpu_mode_priv(vcpu);
@@ -32,7 +32,7 @@ static unsigned long kvm_get_guest_ip(void)
 {
 	struct kvm_vcpu *vcpu;
 
-	vcpu = kvm_arm_get_running_vcpu();
+	vcpu = kvm_get_running_vcpu();
 
 	if (vcpu)
 		return *vcpu_pc(vcpu);
diff --git a/virt/kvm/arm/vgic/vgic-mmio.c b/virt/kvm/arm/vgic/vgic-mmio.c
index 0d090482720d..d656ebd5f9d4 100644
--- a/virt/kvm/arm/vgic/vgic-mmio.c
+++ b/virt/kvm/arm/vgic/vgic-mmio.c
@@ -190,15 +190,6 @@ unsigned long vgic_mmio_read_pending(struct kvm_vcpu *vcpu,
  * value later will give us the same value as we update the per-CPU variable
  * in the preempt notifier handlers.
  */
-static struct kvm_vcpu *vgic_get_mmio_requester_vcpu(void)
-{
-	struct kvm_vcpu *vcpu;
-
-	preempt_disable();
-	vcpu = kvm_arm_get_running_vcpu();
-	preempt_enable();
-	return vcpu;
-}
 
 /* Must be called with irq->irq_lock held */
 static void vgic_hw_irq_spending(struct kvm_vcpu *vcpu, struct vgic_irq *irq,
@@ -221,7 +212,7 @@ void vgic_mmio_write_spending(struct kvm_vcpu *vcpu,
 			      gpa_t addr, unsigned int len,
 			      unsigned long val)
 {
-	bool is_uaccess = !vgic_get_mmio_requester_vcpu();
+	bool is_uaccess = !kvm_get_running_vcpu();
 	u32 intid = VGIC_ADDR_TO_INTID(addr, 1);
 	int i;
 	unsigned long flags;
@@ -274,7 +265,7 @@ void vgic_mmio_write_cpending(struct kvm_vcpu *vcpu,
 			      gpa_t addr, unsigned int len,
 			      unsigned long val)
 {
-	bool is_uaccess = !vgic_get_mmio_requester_vcpu();
+	bool is_uaccess = !kvm_get_running_vcpu();
 	u32 intid = VGIC_ADDR_TO_INTID(addr, 1);
 	int i;
 	unsigned long flags;
@@ -335,7 +326,7 @@ static void vgic_mmio_change_active(struct kvm_vcpu *vcpu, struct vgic_irq *irq,
 				    bool active)
 {
 	unsigned long flags;
-	struct kvm_vcpu *requester_vcpu = vgic_get_mmio_requester_vcpu();
+	struct kvm_vcpu *requester_vcpu = kvm_get_running_vcpu();
 
 	raw_spin_lock_irqsave(&irq->irq_lock, flags);
 
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 028dfc27479b..5bbd8b8730fa 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -108,6 +108,7 @@ struct kmem_cache *kvm_vcpu_cache;
 EXPORT_SYMBOL_GPL(kvm_vcpu_cache);
 
 static __read_mostly struct preempt_ops kvm_preempt_ops;
+static DEFINE_PER_CPU(struct kvm_vcpu *, kvm_running_vcpu);
 
 struct dentry *kvm_debugfs_dir;
 EXPORT_SYMBOL_GPL(kvm_debugfs_dir);
@@ -199,6 +200,8 @@ bool kvm_is_reserved_pfn(kvm_pfn_t pfn)
 void vcpu_load(struct kvm_vcpu *vcpu)
 {
 	int cpu = get_cpu();
+
+	__this_cpu_write(kvm_running_vcpu, vcpu);
 	preempt_notifier_register(&vcpu->preempt_notifier);
 	kvm_arch_vcpu_load(vcpu, cpu);
 	put_cpu();
@@ -210,6 +213,7 @@ void vcpu_put(struct kvm_vcpu *vcpu)
 	preempt_disable();
 	kvm_arch_vcpu_put(vcpu);
 	preempt_notifier_unregister(&vcpu->preempt_notifier);
+	__this_cpu_write(kvm_running_vcpu, NULL);
 	preempt_enable();
 }
 EXPORT_SYMBOL_GPL(vcpu_put);
@@ -4297,8 +4301,8 @@ static void kvm_sched_in(struct preempt_notifier *pn, int cpu)
 	WRITE_ONCE(vcpu->preempted, false);
 	WRITE_ONCE(vcpu->ready, false);
 
+	__this_cpu_write(kvm_running_vcpu, vcpu);
 	kvm_arch_sched_in(vcpu, cpu);
-
 	kvm_arch_vcpu_load(vcpu, cpu);
 }
 
@@ -4312,6 +4316,25 @@ static void kvm_sched_out(struct preempt_notifier *pn,
 		WRITE_ONCE(vcpu->ready, true);
 	}
 	kvm_arch_vcpu_put(vcpu);
+	__this_cpu_write(kvm_running_vcpu, NULL);
+}
+
+/**
+ * kvm_get_running_vcpu - get the vcpu running on the current CPU.
+ * Thanks to preempt notifiers, this can also be called from
+ * preemptible context.
+ */
+struct kvm_vcpu *kvm_get_running_vcpu(void)
+{
+	return __this_cpu_read(kvm_running_vcpu);
+}
+
+/**
+ * kvm_get_running_vcpus - get the per-CPU array of currently running vcpus.
+ */
+struct kvm_vcpu * __percpu *kvm_get_running_vcpus(void)
+{
+	return &kvm_running_vcpu;
 }
 
 static void check_processor_compat(void *rtn)
-- 
2.24.1



* [PATCH v3 12/21] KVM: X86: Implement ring-based dirty memory tracking
From: Peter Xu @ 2020-01-09 14:57 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: Christophe de Dinechin, Michael S. Tsirkin, Paolo Bonzini,
	Sean Christopherson, Yan Zhao, Alex Williamson, Jason Wang,
	Kevin Tian, Vitaly Kuznetsov, peterx, Dr. David Alan Gilbert,
	Lei Cao

This patch is heavily based on previous work from Lei Cao
<lei.cao@stratus.com> and Paolo Bonzini <pbonzini@redhat.com>. [1]

KVM currently uses large bitmaps to track dirty memory.  These bitmaps
are copied to userspace when userspace queries KVM for its dirty page
information.  The use of bitmaps is mostly sufficient for live
migration, as large parts of memory are dirtied from one log-dirty
pass to another.  However, in a checkpointing system, the number of
dirty pages is small and in fact it is often bounded: the VM is
paused when it has dirtied a pre-defined number of pages. Traversing a
large, sparsely populated bitmap to find set bits is time-consuming,
as is copying the bitmap to user-space.

A similar issue exists for live migration when the guest memory is
huge while the dirty workload is light.  In that case, for each dirty
sync we need to pull the whole dirty bitmap to userspace and analyse
every bit, even if it is mostly zeros.
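
As a rough illustration (numbers assumed, not taken from the original
posting): a 1 TiB guest with 4 KiB pages needs a 32 MiB dirty bitmap
(2^28 pages at one bit each), which must be copied and scanned in full
on every sync even if only a few thousand pages are dirty.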

The preferred data structure for the above scenarios is a dense list
of guest frame numbers (GFNs).  This patch series stores the dirty
list in kernel memory that can be memory mapped into userspace to
allow speedy harvesting.

This patch enables the dirty ring for x86 only; however, it should be
easy to extend to other architectures.

[1] https://patchwork.kernel.org/patch/10471409/

Signed-off-by: Lei Cao <lei.cao@stratus.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 Documentation/virt/kvm/api.txt  |  89 ++++++++++++++++++
 arch/x86/include/asm/kvm_host.h |   3 +
 arch/x86/include/uapi/asm/kvm.h |   1 +
 arch/x86/kvm/Makefile           |   3 +-
 arch/x86/kvm/mmu/mmu.c          |   6 ++
 arch/x86/kvm/vmx/vmx.c          |   7 ++
 arch/x86/kvm/x86.c              |   9 ++
 include/linux/kvm_dirty_ring.h  |  55 +++++++++++
 include/linux/kvm_host.h        |  26 +++++
 include/trace/events/kvm.h      |  78 +++++++++++++++
 include/uapi/linux/kvm.h        |  33 +++++++
 virt/kvm/dirty_ring.c           | 162 ++++++++++++++++++++++++++++++++
 virt/kvm/kvm_main.c             | 137 ++++++++++++++++++++++++++-
 13 files changed, 606 insertions(+), 3 deletions(-)
 create mode 100644 include/linux/kvm_dirty_ring.h
 create mode 100644 virt/kvm/dirty_ring.c

diff --git a/Documentation/virt/kvm/api.txt b/Documentation/virt/kvm/api.txt
index ebb37b34dcfc..708c3e0f7eae 100644
--- a/Documentation/virt/kvm/api.txt
+++ b/Documentation/virt/kvm/api.txt
@@ -231,6 +231,7 @@ Based on their initialization different VMs may have different capabilities.
 It is thus encouraged to use the vm ioctl to query for capabilities (available
 with KVM_CAP_CHECK_EXTENSION_VM on the vm fd)
 
+
 4.5 KVM_GET_VCPU_MMAP_SIZE
 
 Capability: basic
@@ -243,6 +244,18 @@ The KVM_RUN ioctl (cf.) communicates with userspace via a shared
 memory region.  This ioctl returns the size of that region.  See the
 KVM_RUN documentation for details.
 
+Besides the size of the KVM_RUN communication region, other areas of
+the VCPU file descriptor can be mmap-ed, including:
+
+- if KVM_CAP_COALESCED_MMIO is available, a page at
+  KVM_COALESCED_MMIO_PAGE_OFFSET * PAGE_SIZE; for historical reasons,
+  this page is included in the result of KVM_GET_VCPU_MMAP_SIZE.
+  KVM_CAP_COALESCED_MMIO is not documented yet.
+
+- if KVM_CAP_DIRTY_LOG_RING is available, a number of pages at
+  KVM_DIRTY_LOG_PAGE_OFFSET * PAGE_SIZE.  For more information on
+  KVM_CAP_DIRTY_LOG_RING, see section 8.22.
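+
+For example (illustrative only, assuming "ring_bytes" is the ring
+size passed when enabling KVM_CAP_DIRTY_LOG_RING and "vcpu_fd" is a
+vcpu file descriptor), the dirty ring pages could be mapped
+read-only as:
+
+	ring_pages = mmap(NULL, ring_bytes, PROT_READ, MAP_SHARED,
+			  vcpu_fd, KVM_DIRTY_LOG_PAGE_OFFSET * PAGE_SIZE);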
+
 
 4.6 KVM_SET_MEMORY_REGION
 
@@ -5376,6 +5389,7 @@ CPU when the exception is taken. If this virtual SError is taken to EL1 using
 AArch64, this value will be reported in the ISS field of ESR_ELx.
 
 See KVM_CAP_VCPU_EVENTS for more details.
+
 8.20 KVM_CAP_HYPERV_SEND_IPI
 
 Architectures: x86
@@ -5383,6 +5397,7 @@ Architectures: x86
 This capability indicates that KVM supports paravirtualized Hyper-V IPI send
 hypercalls:
 HvCallSendSyntheticClusterIpi, HvCallSendSyntheticClusterIpiEx.
+
 8.21 KVM_CAP_HYPERV_DIRECT_TLBFLUSH
 
 Architecture: x86
@@ -5396,3 +5411,77 @@ handling by KVM (as some KVM hypercall may be mistakenly treated as TLB
 flush hypercalls by Hyper-V) so userspace should disable KVM identification
 in CPUID and only exposes Hyper-V identification. In this case, guest
 thinks it's running on Hyper-V and only use Hyper-V hypercalls.
+
+8.22 KVM_CAP_DIRTY_LOG_RING
+
+Architectures: x86
+Parameters: args[0] - size of the dirty log ring
+
+KVM is capable of tracking dirty memory using ring buffers that are
+mmap-ed into userspace; there is one dirty ring per vcpu.
+
+Internally, one dirty ring is defined as follows:
+
+struct kvm_dirty_ring {
+	u32 dirty_index;
+	u32 reset_index;
+	u32 size;
+	u32 soft_limit;
+	struct kvm_dirty_gfn *dirty_gfns;
+	struct kvm_dirty_ring_indices *indices;
+	int index;
+};
+
+Dirty GFNs (Guest Frame Numbers) are stored in the dirty_gfns array,
+with each dirty entry defined as:
+
+struct kvm_dirty_gfn {
+        __u32 pad;
+        __u32 slot; /* as_id | slot_id */
+        __u64 offset;
+};
+
+Most of the ring structure is used by KVM internally, while only the
+indices are exposed to userspace:
+
+struct kvm_dirty_ring_indices {
+	__u32 avail_index; /* set by kernel */
+	__u32 fetch_index; /* set by userspace */
+};
+
+The two indices in the ring buffer are free running counters.
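+
+Since the counters are free running (they are not wrapped at the
+ring size), the number of entries pending harvest is simply
+(avail_index - fetch_index), and entry N lives at
+dirty_gfns[N % (ring size in entries)].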
+
+Userspace calls KVM_ENABLE_CAP ioctl right after KVM_CREATE_VM ioctl
+to enable this capability for the new guest and set the size of the
+rings.  It is only allowed before creating any vCPU, and the size of
+the ring must be a power of two.  The larger the ring buffer, the
+less often it becomes full and forces the vcpu to exit to userspace.
+The optimal size depends on the workload, but it is recommended that
+it be at least 64 KiB (4096 entries).
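+
+As an illustration (not normative; "vm_fd" is the VM file descriptor
+and handle_error() stands for the VMM's error handling):
+
+	struct kvm_enable_cap cap = {
+		.cap = KVM_CAP_DIRTY_LOG_RING,
+		/* 4096 entries of 16 bytes each: 64 KiB, a power of two */
+		.args[0] = 4096 * sizeof(struct kvm_dirty_gfn),
+	};
+
+	if (ioctl(vm_fd, KVM_ENABLE_CAP, &cap))
+		handle_error();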
+
+Just like for dirty page bitmaps, the buffer tracks writes to
+all user memory regions for which the KVM_MEM_LOG_DIRTY_PAGES flag was
+set in KVM_SET_USER_MEMORY_REGION.  Once a memory region is registered
+with the flag set, userspace can start harvesting dirty pages from the
+ring buffer.
+
+To harvest the dirty pages, userspace accesses the mmaped ring buffer
+to read the dirty GFNs up to avail_index, and sets the fetch_index
+accordingly.  This can be done when the guest is running or paused,
+and dirty pages need not be collected all at once.  After processing
+one or more entries in the ring buffer, userspace calls the VM ioctl
+KVM_RESET_DIRTY_RINGS to notify the kernel that it has updated
+fetch_index and to mark those pages clean.  The ioctl must be called
+*before* reading the content of the dirty pages: otherwise, a write
+that lands after a page is copied but before the reset re-protects
+it could be lost.
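+
+As a sketch (not normative: handle_dirty_gfn() is a placeholder for
+whatever the VMM does with a dirty page, "gfns"/"nr_entries" describe
+the mmap-ed ring, and the read barrier needed after loading
+avail_index is omitted):
+
+	struct kvm_dirty_ring_indices *ix = &run->vcpu_ring_indices;
+	__u32 fetch = ix->fetch_index;
+	__u32 avail = ix->avail_index;
+
+	while (fetch != avail) {
+		struct kvm_dirty_gfn *e = &gfns[fetch % nr_entries];
+
+		handle_dirty_gfn(e->slot, e->offset);
+		fetch++;
+	}
+	ix->fetch_index = fetch;
+	ioctl(vm_fd, KVM_RESET_DIRTY_RINGS, 0);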
+
+However, there is a major difference compared to the
+KVM_GET_DIRTY_LOG interface: when reading the dirty ring from
+userspace, it is still possible that the kernel has not yet flushed
+the hardware dirty buffers into the kernel buffer (with
+KVM_GET_DIRTY_LOG, the ioctl itself did that flush).  To guarantee
+completeness, one needs to kick the vcpu out of guest mode for a
+hardware buffer flush (vmexit), to make sure that all the existing
+dirty gfns are flushed to the dirty rings.
+
+If one of the ring buffers is full, the guest will exit to userspace
+with the exit reason set to KVM_EXIT_DIRTY_RING_FULL, and the KVM_RUN
+ioctl will return zero to userspace.
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index f536d139b3d2..3fe18402e6a3 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1181,6 +1181,7 @@ struct kvm_x86_ops {
 					   struct kvm_memory_slot *slot,
 					   gfn_t offset, unsigned long mask);
 	int (*write_log_dirty)(struct kvm_vcpu *vcpu);
+	int (*cpu_dirty_log_size)(void);
 
 	/* pmu operations of sub-arch */
 	const struct kvm_pmu_ops *pmu_ops;
@@ -1666,4 +1667,6 @@ static inline int kvm_cpu_get_apicid(int mps_cpu)
 #define GET_SMSTATE(type, buf, offset)		\
 	(*(type *)((buf) + (offset) - 0x7e00))
 
+int kvm_cpu_dirty_log_size(void);
+
 #endif /* _ASM_X86_KVM_HOST_H */
diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
index 503d3f42da16..b59bf356c478 100644
--- a/arch/x86/include/uapi/asm/kvm.h
+++ b/arch/x86/include/uapi/asm/kvm.h
@@ -12,6 +12,7 @@
 
 #define KVM_PIO_PAGE_OFFSET 1
 #define KVM_COALESCED_MMIO_PAGE_OFFSET 2
+#define KVM_DIRTY_LOG_PAGE_OFFSET 64
 
 #define DE_VECTOR 0
 #define DB_VECTOR 1
diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile
index b19ef421084d..0acee817adfb 100644
--- a/arch/x86/kvm/Makefile
+++ b/arch/x86/kvm/Makefile
@@ -5,7 +5,8 @@ ccflags-y += -Iarch/x86/kvm
 KVM := ../../../virt/kvm
 
 kvm-y			+= $(KVM)/kvm_main.o $(KVM)/coalesced_mmio.o \
-				$(KVM)/eventfd.o $(KVM)/irqchip.o $(KVM)/vfio.o
+				$(KVM)/eventfd.o $(KVM)/irqchip.o $(KVM)/vfio.o \
+				$(KVM)/dirty_ring.o
 kvm-$(CONFIG_KVM_ASYNC_PF)	+= $(KVM)/async_pf.o
 
 kvm-y			+= x86.o emulate.o i8259.o irq.o lapic.o \
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 7269130ea5e2..621b842a9b7b 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1832,7 +1832,13 @@ int kvm_arch_write_log_dirty(struct kvm_vcpu *vcpu)
 {
 	if (kvm_x86_ops->write_log_dirty)
 		return kvm_x86_ops->write_log_dirty(vcpu);
+	return 0;
+}
 
+int kvm_cpu_dirty_log_size(void)
+{
+	if (kvm_x86_ops->cpu_dirty_log_size)
+		return kvm_x86_ops->cpu_dirty_log_size();
 	return 0;
 }
 
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 62175a246bcc..2151de89456d 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7689,6 +7689,7 @@ static __init int hardware_setup(void)
 		kvm_x86_ops->slot_disable_log_dirty = NULL;
 		kvm_x86_ops->flush_log_dirty = NULL;
 		kvm_x86_ops->enable_log_dirty_pt_masked = NULL;
+		kvm_x86_ops->cpu_dirty_log_size = NULL;
 	}
 
 	if (!cpu_has_vmx_preemption_timer())
@@ -7753,6 +7754,11 @@ static __exit void hardware_unsetup(void)
 	free_kvm_area();
 }
 
+static int vmx_cpu_dirty_log_size(void)
+{
+	return enable_pml ? PML_ENTITY_NUM : 0;
+}
+
 static struct kvm_x86_ops vmx_x86_ops __ro_after_init = {
 	.cpu_has_kvm_support = cpu_has_kvm_support,
 	.disabled_by_bios = vmx_disabled_by_bios,
@@ -7875,6 +7881,7 @@ static struct kvm_x86_ops vmx_x86_ops __ro_after_init = {
 	.flush_log_dirty = vmx_flush_log_dirty,
 	.enable_log_dirty_pt_masked = vmx_enable_log_dirty_pt_masked,
 	.write_log_dirty = vmx_write_pml_buffer,
+	.cpu_dirty_log_size = vmx_cpu_dirty_log_size,
 
 	.pre_block = vmx_pre_block,
 	.post_block = vmx_post_block,
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index ff97782b3919..9c3673592826 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -7998,6 +7998,15 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 
 	bool req_immediate_exit = false;
 
+	/* Forbid vmenter if vcpu dirty ring is soft-full */
+	if (unlikely(vcpu->kvm->dirty_ring_size &&
+		     kvm_dirty_ring_soft_full(&vcpu->dirty_ring))) {
+		vcpu->run->exit_reason = KVM_EXIT_DIRTY_RING_FULL;
+		trace_kvm_dirty_ring_exit(vcpu);
+		r = 0;
+		goto out;
+	}
+
 	if (kvm_request_pending(vcpu)) {
 		if (kvm_check_request(KVM_REQ_GET_VMCS12_PAGES, vcpu)) {
 			if (unlikely(!kvm_x86_ops->get_vmcs12_pages(vcpu))) {
diff --git a/include/linux/kvm_dirty_ring.h b/include/linux/kvm_dirty_ring.h
new file mode 100644
index 000000000000..d6fe9e1b7617
--- /dev/null
+++ b/include/linux/kvm_dirty_ring.h
@@ -0,0 +1,55 @@
+#ifndef KVM_DIRTY_RING_H
+#define KVM_DIRTY_RING_H
+
+/**
+ * kvm_dirty_ring: KVM internal dirty ring structure
+ *
+ * @dirty_index: free running counter that points to the next slot in
+ *               dirty_ring->dirty_gfns, where a new dirty page should go
+ * @reset_index: free running counter that points to the next dirty page
+ *               in dirty_ring->dirty_gfns for which dirty trap needs to
+ *               be reenabled
+ * @size:        size of the compact list, dirty_ring->dirty_gfns
+ * @soft_limit:  when the number of dirty pages in the list reaches this
+ *               limit, vcpu that owns this ring should exit to userspace
+ *               to allow userspace to harvest all the dirty pages
+ * @dirty_gfns:  the array to keep the dirty gfns
+ * @indices:     the pointer to the @kvm_dirty_ring_indices structure
+ *               of this specific ring
+ * @index:       index of this dirty ring
+ */
+struct kvm_dirty_ring {
+	u32 dirty_index;
+	u32 reset_index;
+	u32 size;
+	u32 soft_limit;
+	struct kvm_dirty_gfn *dirty_gfns;
+	struct kvm_dirty_ring_indices *indices;
+	int index;
+};
+
+u32 kvm_dirty_ring_get_rsvd_entries(void);
+int kvm_dirty_ring_alloc(struct kvm_dirty_ring *ring,
+			 struct kvm_dirty_ring_indices *indices,
+			 int index, u32 size);
+struct kvm_dirty_ring *kvm_dirty_ring_get(struct kvm *kvm);
+
+/*
+ * called with kvm->slots_lock held, returns the number of
+ * processed pages.
+ */
+int kvm_dirty_ring_reset(struct kvm *kvm, struct kvm_dirty_ring *ring);
+
+/*
+ * Push one dirty gfn onto the ring.  The caller must ensure the
+ * ring is not full: vcpus exit to userspace once the soft limit
+ * is reached, so the ring should never fill up completely.
+ */
+void kvm_dirty_ring_push(struct kvm_dirty_ring *ring, u32 slot, u64 offset);
+
+/* for use in vm_operations_struct */
+struct page *kvm_dirty_ring_get_page(struct kvm_dirty_ring *ring, u32 offset);
+
+void kvm_dirty_ring_free(struct kvm_dirty_ring *ring);
+bool kvm_dirty_ring_soft_full(struct kvm_dirty_ring *ring);
+
+#endif
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index cbd633ece959..c96161c6a0c9 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -34,6 +34,7 @@
 #include <linux/kvm_types.h>
 
 #include <asm/kvm_host.h>
+#include <linux/kvm_dirty_ring.h>
 
 #ifndef KVM_MAX_VCPU_ID
 #define KVM_MAX_VCPU_ID KVM_MAX_VCPUS
@@ -321,6 +322,7 @@ struct kvm_vcpu {
 	bool ready;
 	struct kvm_vcpu_arch arch;
 	struct dentry *debugfs_dentry;
+	struct kvm_dirty_ring dirty_ring;
 };
 
 static inline int kvm_vcpu_exiting_guest_mode(struct kvm_vcpu *vcpu)
@@ -502,6 +504,7 @@ struct kvm {
 	struct srcu_struct srcu;
 	struct srcu_struct irq_srcu;
 	pid_t userspace_pid;
+	u32 dirty_ring_size;
 };
 
 #define kvm_err(fmt, ...) \
@@ -831,6 +834,8 @@ void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm *kvm,
 					gfn_t gfn_offset,
 					unsigned long mask);
 
+void kvm_reset_dirty_gfn(struct kvm *kvm, u32 slot, u64 offset, u64 mask);
+
 int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm,
 				struct kvm_dirty_log *log);
 int kvm_vm_ioctl_clear_dirty_log(struct kvm *kvm,
@@ -1409,4 +1414,25 @@ int kvm_vm_create_worker_thread(struct kvm *kvm, kvm_vm_thread_fn_t thread_fn,
 				uintptr_t data, const char *name,
 				struct task_struct **thread_ptr);
 
+/*
+ * This defines how many reserved entries we want to keep before we
+ * kick the vcpu to the userspace to avoid dirty ring full.  This
+ * value can be tuned to higher if e.g. PML is enabled on the host.
+ */
+#define  KVM_DIRTY_RING_RSVD_ENTRIES  64
+
+/* Max number of entries allowed for each kvm dirty ring */
+#define  KVM_DIRTY_RING_MAX_ENTRIES  65536
+
+/*
+ * Archs need to define these macros after implementing the dirty ring
+ * feature.  KVM_DIRTY_LOG_PAGE_OFFSET should be defined as the
+ * starting page offset of the dirty ring structures, while
+ * KVM_DIRTY_RING_VERSION should be defined as >=1.  By default, this
+ * feature is off on all archs.
+ */
+#ifndef KVM_DIRTY_LOG_PAGE_OFFSET
+#define KVM_DIRTY_LOG_PAGE_OFFSET 0
+#endif
+
 #endif
diff --git a/include/trace/events/kvm.h b/include/trace/events/kvm.h
index 2c735a3e6613..3d850997940c 100644
--- a/include/trace/events/kvm.h
+++ b/include/trace/events/kvm.h
@@ -399,6 +399,84 @@ TRACE_EVENT(kvm_halt_poll_ns,
 #define trace_kvm_halt_poll_ns_shrink(vcpu_id, new, old) \
 	trace_kvm_halt_poll_ns(false, vcpu_id, new, old)
 
+TRACE_EVENT(kvm_dirty_ring_push,
+	TP_PROTO(struct kvm_dirty_ring *ring, u32 slot, u64 offset),
+	TP_ARGS(ring, slot, offset),
+
+	TP_STRUCT__entry(
+		__field(int, index)
+		__field(u32, dirty_index)
+		__field(u32, reset_index)
+		__field(u32, slot)
+		__field(u64, offset)
+	),
+
+	TP_fast_assign(
+		__entry->index          = ring->index;
+		__entry->dirty_index    = ring->dirty_index;
+		__entry->reset_index    = ring->reset_index;
+		__entry->slot           = slot;
+		__entry->offset         = offset;
+	),
+
+	TP_printk("ring %d: dirty 0x%x reset 0x%x "
+		  "slot %u offset 0x%llx (used %u)",
+		  __entry->index, __entry->dirty_index,
+		  __entry->reset_index,  __entry->slot, __entry->offset,
+		  __entry->dirty_index - __entry->reset_index)
+);
+
+TRACE_EVENT(kvm_dirty_ring_reset,
+	TP_PROTO(struct kvm_dirty_ring *ring),
+	TP_ARGS(ring),
+
+	TP_STRUCT__entry(
+		__field(int, index)
+		__field(u32, dirty_index)
+		__field(u32, reset_index)
+	),
+
+	TP_fast_assign(
+		__entry->index          = ring->index;
+		__entry->dirty_index    = ring->dirty_index;
+		__entry->reset_index    = ring->reset_index;
+	),
+
+	TP_printk("ring %d: dirty 0x%x reset 0x%x (used %u)",
+		  __entry->index, __entry->dirty_index, __entry->reset_index,
+		  __entry->dirty_index - __entry->reset_index)
+);
+
+TRACE_EVENT(kvm_dirty_ring_waitqueue,
+	TP_PROTO(bool enter),
+	TP_ARGS(enter),
+
+	TP_STRUCT__entry(
+	    __field(bool, enter)
+	),
+
+	TP_fast_assign(
+	    __entry->enter = enter;
+	),
+
+	TP_printk("%s", __entry->enter ? "wait" : "awake")
+);
+
+TRACE_EVENT(kvm_dirty_ring_exit,
+	TP_PROTO(struct kvm_vcpu *vcpu),
+	TP_ARGS(vcpu),
+
+	TP_STRUCT__entry(
+	    __field(int, vcpu_id)
+	),
+
+	TP_fast_assign(
+	    __entry->vcpu_id = vcpu->vcpu_id;
+	),
+
+	TP_printk("vcpu %d", __entry->vcpu_id)
+);
+
 #endif /* _TRACE_KVM_MAIN_H */
 
 /* This part must be outside protection */
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index f0a16b4adbbd..df4a1700ff1e 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -236,6 +236,7 @@ struct kvm_hyperv_exit {
 #define KVM_EXIT_IOAPIC_EOI       26
 #define KVM_EXIT_HYPERV           27
 #define KVM_EXIT_ARM_NISV         28
+#define KVM_EXIT_DIRTY_RING_FULL  29
 
 /* For KVM_EXIT_INTERNAL_ERROR */
 /* Emulate instruction failed. */
@@ -247,6 +248,13 @@ struct kvm_hyperv_exit {
 /* Encounter unexpected vm-exit reason */
 #define KVM_INTERNAL_ERROR_UNEXPECTED_EXIT_REASON	4
 
+struct kvm_dirty_ring_indices {
+	__u32 avail_index; /* set by kernel */
+	__u32 padding1;
+	__u32 fetch_index; /* set by userspace */
+	__u32 padding2;
+};
+
 /* for KVM_RUN, returned by mmap(vcpu_fd, offset=0) */
 struct kvm_run {
 	/* in */
@@ -421,6 +429,8 @@ struct kvm_run {
 		struct kvm_sync_regs regs;
 		char padding[SYNC_REGS_SIZE_BYTES];
 	} s;
+
+	struct kvm_dirty_ring_indices vcpu_ring_indices;
 };
 
 /* for KVM_REGISTER_COALESCED_MMIO / KVM_UNREGISTER_COALESCED_MMIO */
@@ -1009,6 +1019,7 @@ struct kvm_ppc_resize_hpt {
 #define KVM_CAP_PPC_GUEST_DEBUG_SSTEP 176
 #define KVM_CAP_ARM_NISV_TO_USER 177
 #define KVM_CAP_ARM_INJECT_EXT_DABT 178
+#define KVM_CAP_DIRTY_LOG_RING 179
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
@@ -1473,6 +1484,9 @@ struct kvm_enc_region {
 /* Available with KVM_CAP_ARM_SVE */
 #define KVM_ARM_VCPU_FINALIZE	  _IOW(KVMIO,  0xc2, int)
 
+/* Available with KVM_CAP_DIRTY_LOG_RING */
+#define KVM_RESET_DIRTY_RINGS     _IO(KVMIO, 0xc3)
+
 /* Secure Encrypted Virtualization command */
 enum sev_cmd_id {
 	/* Guest initialization commands */
@@ -1623,4 +1637,23 @@ struct kvm_hyperv_eventfd {
 #define KVM_HYPERV_CONN_ID_MASK		0x00ffffff
 #define KVM_HYPERV_EVENTFD_DEASSIGN	(1 << 0)
 
+/*
+ * The following are the requirements for supporting dirty log ring
+ * (by enabling KVM_DIRTY_LOG_PAGE_OFFSET).
+ *
+ * 1. Memory accesses by KVM should call kvm_vcpu_write_* instead
+ *    of kvm_write_* so that dirty gfns are pushed onto the dirty
+ *    ring of the vcpu that performs the access.
+ * 2. kvm_arch_mmu_enable_log_dirty_pt_masked should be defined for
+ *    enabling dirty logging.
+ * 3. There should not be a separate step to synchronize hardware
+ *    dirty bitmap with KVM's.
+ */
+
+struct kvm_dirty_gfn {
+	__u32 pad;
+	__u32 slot;
+	__u64 offset;
+};
+
 #endif /* __LINUX_KVM_H */
diff --git a/virt/kvm/dirty_ring.c b/virt/kvm/dirty_ring.c
new file mode 100644
index 000000000000..67ec5bbc21c0
--- /dev/null
+++ b/virt/kvm/dirty_ring.c
@@ -0,0 +1,162 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * KVM dirty ring implementation
+ *
+ * Copyright 2019 Red Hat, Inc.
+ */
+#include <linux/kvm_host.h>
+#include <linux/kvm.h>
+#include <linux/vmalloc.h>
+#include <linux/kvm_dirty_ring.h>
+#include <trace/events/kvm.h>
+
+int __weak kvm_cpu_dirty_log_size(void)
+{
+	return 0;
+}
+
+u32 kvm_dirty_ring_get_rsvd_entries(void)
+{
+	return KVM_DIRTY_RING_RSVD_ENTRIES + kvm_cpu_dirty_log_size();
+}
+
+static u32 kvm_dirty_ring_used(struct kvm_dirty_ring *ring)
+{
+	return READ_ONCE(ring->dirty_index) - READ_ONCE(ring->reset_index);
+}
+
+bool kvm_dirty_ring_soft_full(struct kvm_dirty_ring *ring)
+{
+	return kvm_dirty_ring_used(ring) >= ring->soft_limit;
+}
+
+bool kvm_dirty_ring_full(struct kvm_dirty_ring *ring)
+{
+	return kvm_dirty_ring_used(ring) >= ring->size;
+}
+
+struct kvm_dirty_ring *kvm_dirty_ring_get(struct kvm *kvm)
+{
+	struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
+
+	WARN_ON_ONCE(vcpu->kvm != kvm);
+
+	return &vcpu->dirty_ring;
+}
+
+int kvm_dirty_ring_alloc(struct kvm_dirty_ring *ring,
+			 struct kvm_dirty_ring_indices *indices,
+			 int index, u32 size)
+{
+	ring->dirty_gfns = vzalloc(size);
+	if (!ring->dirty_gfns)
+		return -ENOMEM;
+
+	ring->size = size / sizeof(struct kvm_dirty_gfn);
+	ring->soft_limit = ring->size - kvm_dirty_ring_get_rsvd_entries();
+	ring->dirty_index = 0;
+	ring->reset_index = 0;
+	ring->index = index;
+	ring->indices = indices;
+
+	return 0;
+}
+
+int kvm_dirty_ring_reset(struct kvm *kvm, struct kvm_dirty_ring *ring)
+{
+	u32 cur_slot, next_slot;
+	u64 cur_offset, next_offset;
+	unsigned long mask;
+	u32 fetch;
+	int count = 0;
+	struct kvm_dirty_gfn *entry;
+	struct kvm_dirty_ring_indices *indices = ring->indices;
+	bool first_round = true;
+
+	fetch = READ_ONCE(indices->fetch_index);
+
+	/*
+	 * fetch_index is written by userspace and cannot be trusted.
+	 * If it runs ahead of what the kernel has published (checked
+	 * below against the ring size), userspace must have written
+	 * a bogus fetch_index.
+	 */
+	if (fetch - ring->reset_index > ring->size)
+		return -EINVAL;
+
+	if (fetch == ring->reset_index)
+		return 0;
+
+	/* This is only needed to make compilers happy */
+	cur_slot = cur_offset = mask = 0;
+	while (ring->reset_index != fetch) {
+		entry = &ring->dirty_gfns[ring->reset_index & (ring->size - 1)];
+		next_slot = READ_ONCE(entry->slot);
+		next_offset = READ_ONCE(entry->offset);
+		ring->reset_index++;
+		count++;
+		/*
+		 * Try to coalesce the reset operations when the guest is
+		 * scanning pages in the same slot.
+		 */
+		if (!first_round && next_slot == cur_slot) {
+			s64 delta = next_offset - cur_offset;
+
+			if (delta >= 0 && delta < BITS_PER_LONG) {
+				mask |= 1ull << delta;
+				continue;
+			}
+
+			/*
+			 * Backwards visit: the new, lower offset becomes
+			 * the base of the mask, so the old mask must
+			 * still fit after being shifted up by -delta (no
+			 * bits may fall off the top).  Careful about
+			 * overflows!
+			 */
+			if (delta > -BITS_PER_LONG && delta < 0 &&
+			    (mask << -delta >> -delta) == mask) {
+				cur_offset = next_offset;
+				mask = (mask << -delta) | 1;
+				continue;
+			}
+		}
+		kvm_reset_dirty_gfn(kvm, cur_slot, cur_offset, mask);
+		cur_slot = next_slot;
+		cur_offset = next_offset;
+		mask = 1;
+		first_round = false;
+	}
+	kvm_reset_dirty_gfn(kvm, cur_slot, cur_offset, mask);
+
+	trace_kvm_dirty_ring_reset(ring);
+
+	return count;
+}
+
+void kvm_dirty_ring_push(struct kvm_dirty_ring *ring, u32 slot, u64 offset)
+{
+	struct kvm_dirty_gfn *entry;
+	struct kvm_dirty_ring_indices *indices = ring->indices;
+
+	/* Should never be full: vcpus exit to userspace at the soft limit */
+	WARN_ON_ONCE(kvm_dirty_ring_full(ring));
+
+	entry = &ring->dirty_gfns[ring->dirty_index & (ring->size - 1)];
+	entry->slot = slot;
+	entry->offset = offset;
+	/*
+	 * Make sure the data is filled in before we publish this to
+	 * the userspace program.  There is no kernel-side reader of
+	 * the ring; the paired read barrier is issued by userspace
+	 * before it reads the entries.
+	 */
+	smp_wmb();
+	ring->dirty_index++;
+	WRITE_ONCE(indices->avail_index, ring->dirty_index);
+
+	trace_kvm_dirty_ring_push(ring, slot, offset);
+}
+
+struct page *kvm_dirty_ring_get_page(struct kvm_dirty_ring *ring, u32 offset)
+{
+	return vmalloc_to_page((void *)ring->dirty_gfns + offset * PAGE_SIZE);
+}
+
+void kvm_dirty_ring_free(struct kvm_dirty_ring *ring)
+{
+	vfree(ring->dirty_gfns);
+	ring->dirty_gfns = NULL;
+}
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 5bbd8b8730fa..5e36792e15ae 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -64,6 +64,8 @@
 #define CREATE_TRACE_POINTS
 #include <trace/events/kvm.h>
 
+#include <linux/kvm_dirty_ring.h>
+
 /* Worst case buffer size needed for holding an integer. */
 #define ITOA_MAX_LEN 12
 
@@ -357,11 +359,22 @@ int kvm_vcpu_init(struct kvm_vcpu *vcpu, struct kvm *kvm, unsigned id)
 	vcpu->preempted = false;
 	vcpu->ready = false;
 
+	if (kvm->dirty_ring_size) {
+		r = kvm_dirty_ring_alloc(&vcpu->dirty_ring,
+					 &vcpu->run->vcpu_ring_indices,
+					 id, kvm->dirty_ring_size);
+		if (r)
+			goto fail_free_run;
+	}
+
 	r = kvm_arch_vcpu_init(vcpu);
 	if (r < 0)
-		goto fail_free_run;
+		goto fail_free_ring;
 	return 0;
 
+fail_free_ring:
+	if (kvm->dirty_ring_size)
+		kvm_dirty_ring_free(&vcpu->dirty_ring);
 fail_free_run:
 	free_page((unsigned long)vcpu->run);
 fail:
@@ -379,6 +392,8 @@ void kvm_vcpu_uninit(struct kvm_vcpu *vcpu)
 	put_pid(rcu_dereference_protected(vcpu->pid, 1));
 	kvm_arch_vcpu_uninit(vcpu);
 	free_page((unsigned long)vcpu->run);
+	if (vcpu->kvm->dirty_ring_size)
+		kvm_dirty_ring_free(&vcpu->dirty_ring);
 }
 EXPORT_SYMBOL_GPL(kvm_vcpu_uninit);
 
@@ -2284,8 +2299,13 @@ static void mark_page_dirty_in_slot(struct kvm *kvm,
 {
 	if (memslot && memslot->dirty_bitmap) {
 		unsigned long rel_gfn = gfn - memslot->base_gfn;
+		u32 slot = (memslot->as_id << 16) | memslot->id;
 
-		set_bit_le(rel_gfn, memslot->dirty_bitmap);
+		if (kvm->dirty_ring_size)
+			kvm_dirty_ring_push(kvm_dirty_ring_get(kvm),
+					    slot, rel_gfn);
+		else
+			set_bit_le(rel_gfn, memslot->dirty_bitmap);
 	}
 }
 
@@ -2632,6 +2652,16 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me, bool yield_to_kernel_mode)
 }
 EXPORT_SYMBOL_GPL(kvm_vcpu_on_spin);
 
+static bool kvm_page_in_dirty_ring(struct kvm *kvm, unsigned long pgoff)
+{
+	if (!KVM_DIRTY_LOG_PAGE_OFFSET)
+		return false;
+
+	return (pgoff >= KVM_DIRTY_LOG_PAGE_OFFSET) &&
+	    (pgoff < KVM_DIRTY_LOG_PAGE_OFFSET +
+	     kvm->dirty_ring_size / PAGE_SIZE);
+}
+
 static vm_fault_t kvm_vcpu_fault(struct vm_fault *vmf)
 {
 	struct kvm_vcpu *vcpu = vmf->vma->vm_file->private_data;
@@ -2647,6 +2677,10 @@ static vm_fault_t kvm_vcpu_fault(struct vm_fault *vmf)
 	else if (vmf->pgoff == KVM_COALESCED_MMIO_PAGE_OFFSET)
 		page = virt_to_page(vcpu->kvm->coalesced_mmio_ring);
 #endif
+	else if (kvm_page_in_dirty_ring(vcpu->kvm, vmf->pgoff))
+		page = kvm_dirty_ring_get_page(
+		    &vcpu->dirty_ring,
+		    vmf->pgoff - KVM_DIRTY_LOG_PAGE_OFFSET);
 	else
 		return kvm_arch_vcpu_fault(vcpu, vmf);
 	get_page(page);
@@ -2660,6 +2694,15 @@ static const struct vm_operations_struct kvm_vcpu_vm_ops = {
 
 static int kvm_vcpu_mmap(struct file *file, struct vm_area_struct *vma)
 {
+	struct kvm_vcpu *vcpu = file->private_data;
+	unsigned long pages = (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
+
+	/* Refuse to map any page of the dirty ring as writable */
+	if ((kvm_page_in_dirty_ring(vcpu->kvm, vma->vm_pgoff) ||
+	     kvm_page_in_dirty_ring(vcpu->kvm, vma->vm_pgoff + pages - 1)) &&
+	    vma->vm_flags & VM_WRITE)
+		return -EINVAL;
+
 	vma->vm_ops = &kvm_vcpu_vm_ops;
 	return 0;
 }
@@ -3242,12 +3285,97 @@ static long kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
 #endif
 	case KVM_CAP_NR_MEMSLOTS:
 		return KVM_USER_MEM_SLOTS;
+	case KVM_CAP_DIRTY_LOG_RING:
+#ifdef CONFIG_X86
+		return KVM_DIRTY_RING_MAX_ENTRIES;
+#else
+		return 0;
+#endif
 	default:
 		break;
 	}
 	return kvm_vm_ioctl_check_extension(kvm, arg);
 }
 
+void kvm_reset_dirty_gfn(struct kvm *kvm, u32 slot, u64 offset, u64 mask)
+{
+	struct kvm_memory_slot *memslot;
+	int as_id, id;
+
+	as_id = slot >> 16;
+	id = (u16)slot;
+	if (as_id >= KVM_ADDRESS_SPACE_NUM || id >= KVM_USER_MEM_SLOTS)
+		return;
+
+	memslot = id_to_memslot(__kvm_memslots(kvm, as_id), id);
+	if (offset >= memslot->npages)
+		return;
+
+	spin_lock(&kvm->mmu_lock);
+	kvm_arch_mmu_enable_log_dirty_pt_masked(kvm, memslot, offset, mask);
+	spin_unlock(&kvm->mmu_lock);
+}
+
+static int kvm_vm_ioctl_enable_dirty_log_ring(struct kvm *kvm, u32 size)
+{
+	int r;
+
+	if (!KVM_DIRTY_LOG_PAGE_OFFSET)
+		return -EINVAL;
+
+	/* the size should be a power of 2 */
+	if (!size || (size & (size - 1)))
+		return -EINVAL;
+
+	/* Must be large enough for the reserved entries and at least a page */
+	if (size < kvm_dirty_ring_get_rsvd_entries() *
+	    sizeof(struct kvm_dirty_gfn) || size < PAGE_SIZE)
+		return -EINVAL;
+
+	if (size > KVM_DIRTY_RING_MAX_ENTRIES *
+	    sizeof(struct kvm_dirty_gfn))
+		return -E2BIG;
+
+	/* The ring size can only be set once */
+	if (kvm->dirty_ring_size)
+		return -EINVAL;
+
+	mutex_lock(&kvm->lock);
+
+	if (kvm->created_vcpus) {
+		/* The size cannot be changed after vcpus are created */
+		r = -EINVAL;
+	} else {
+		kvm->dirty_ring_size = size;
+		r = 0;
+	}
+
+	mutex_unlock(&kvm->lock);
+	return r;
+}
+
+static int kvm_vm_ioctl_reset_dirty_pages(struct kvm *kvm)
+{
+	int i;
+	struct kvm_vcpu *vcpu;
+	int cleared = 0;
+
+	if (!kvm->dirty_ring_size)
+		return -EINVAL;
+
+	mutex_lock(&kvm->slots_lock);
+
+	kvm_for_each_vcpu(i, vcpu, kvm)
+		cleared += kvm_dirty_ring_reset(vcpu->kvm, &vcpu->dirty_ring);
+
+	mutex_unlock(&kvm->slots_lock);
+
+	if (cleared)
+		kvm_flush_remote_tlbs(kvm);
+
+	return cleared;
+}
+
 int __attribute__((weak)) kvm_vm_ioctl_enable_cap(struct kvm *kvm,
 						  struct kvm_enable_cap *cap)
 {
@@ -3265,6 +3393,8 @@ static int kvm_vm_ioctl_enable_cap_generic(struct kvm *kvm,
 		kvm->manual_dirty_log_protect = cap->args[0];
 		return 0;
 #endif
+	case KVM_CAP_DIRTY_LOG_RING:
+		return kvm_vm_ioctl_enable_dirty_log_ring(kvm, cap->args[0]);
 	default:
 		return kvm_vm_ioctl_enable_cap(kvm, cap);
 	}
@@ -3452,6 +3582,9 @@ static long kvm_vm_ioctl(struct file *filp,
 	case KVM_CHECK_EXTENSION:
 		r = kvm_vm_ioctl_check_extension_generic(kvm, arg);
 		break;
+	case KVM_RESET_DIRTY_RINGS:
+		r = kvm_vm_ioctl_reset_dirty_pages(kvm);
+		break;
 	default:
 		r = kvm_arch_vm_ioctl(filp, ioctl, arg);
 	}
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [PATCH v3 13/21] KVM: Make dirty ring exclusive to dirty bitmap log
  2020-01-09 14:57 [PATCH v3 00/21] KVM: Dirty ring interface Peter Xu
                   ` (11 preceding siblings ...)
  2020-01-09 14:57 ` [PATCH v3 12/21] KVM: X86: Implement ring-based dirty memory tracking Peter Xu
@ 2020-01-09 14:57 ` Peter Xu
  2020-01-09 14:57 ` [PATCH v3 14/21] KVM: Don't allocate dirty bitmap if dirty ring is enabled Peter Xu
                   ` (10 subsequent siblings)
  23 siblings, 0 replies; 82+ messages in thread
From: Peter Xu @ 2020-01-09 14:57 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: Christophe de Dinechin, Michael S . Tsirkin, Paolo Bonzini,
	Sean Christopherson, Yan Zhao, Alex Williamson, Jason Wang,
	Kevin Kevin, Vitaly Kuznetsov, peterx, Dr . David Alan Gilbert

There's no good reason to use both the dirty bitmap logging and the
new dirty ring buffer to track dirty bits.  Supporting both at the
same time would be possible, but it would complicate things while
helping little.  Let's simply make it the rule, before we enable the
dirty ring on any arch, that the two interfaces cannot be used
together.

The switch-over point is the enablement of the KVM_CAP_DIRTY_LOG_RING
capability: that is where we switch from the default dirty logging
way to the dirty ring way.  As long as kvm->dirty_ring_size is set up
correctly, the current virtual machine switches to the dirty ring
buffer mode once and for all.
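
To illustrate the intended semantics (a sketch, not part of the
patch; "vm_fd" is assumed to be the VM file descriptor):

	struct kvm_dirty_log log = { .slot = 0 };
	int ret;

	/* after KVM_ENABLE_CAP(KVM_CAP_DIRTY_LOG_RING) has succeeded */
	ret = ioctl(vm_fd, KVM_GET_DIRTY_LOG, &log);
	/* expected: ret == -1 with errno == EINVAL from now on */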

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 Documentation/virt/kvm/api.txt |  7 +++++++
 virt/kvm/kvm_main.c            | 12 ++++++++++++
 2 files changed, 19 insertions(+)

diff --git a/Documentation/virt/kvm/api.txt b/Documentation/virt/kvm/api.txt
index 708c3e0f7eae..be176d1dd91f 100644
--- a/Documentation/virt/kvm/api.txt
+++ b/Documentation/virt/kvm/api.txt
@@ -5485,3 +5485,10 @@ all the existing dirty gfns are flushed to the dirty rings.
 If one of the ring buffers is full, the guest will exit to userspace
 with the exit reason set to KVM_EXIT_DIRTY_RING_FULL, and the KVM_RUN
 ioctl will return zero to userspace.
+
+NOTE: the KVM_CAP_DIRTY_LOG_RING capability and the new ioctl
+KVM_RESET_DIRTY_RINGS are mutually exclusive with the existing
+KVM_GET_DIRTY_LOG interface.  After enabling KVM_CAP_DIRTY_LOG_RING
+with an acceptable dirty ring size, the virtual machine will switch
+to dirty ring tracking mode, and the KVM_GET_DIRTY_LOG and
+KVM_CLEAR_DIRTY_LOG ioctls will stop working.
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 5e36792e15ae..f0f766183cb2 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1211,6 +1211,10 @@ int kvm_get_dirty_log(struct kvm *kvm,
 	unsigned long n;
 	unsigned long any = 0;
 
+	/* Dirty ring tracking is exclusive to dirty log tracking */
+	if (kvm->dirty_ring_size)
+		return -EINVAL;
+
 	as_id = log->slot >> 16;
 	id = (u16)log->slot;
 	if (as_id >= KVM_ADDRESS_SPACE_NUM || id >= KVM_USER_MEM_SLOTS)
@@ -1268,6 +1272,10 @@ int kvm_get_dirty_log_protect(struct kvm *kvm,
 	unsigned long *dirty_bitmap;
 	unsigned long *dirty_bitmap_buffer;
 
+	/* Dirty ring tracking is exclusive to dirty log tracking */
+	if (kvm->dirty_ring_size)
+		return -EINVAL;
+
 	as_id = log->slot >> 16;
 	id = (u16)log->slot;
 	if (as_id >= KVM_ADDRESS_SPACE_NUM || id >= KVM_USER_MEM_SLOTS)
@@ -1339,6 +1347,10 @@ int kvm_clear_dirty_log_protect(struct kvm *kvm,
 	unsigned long *dirty_bitmap;
 	unsigned long *dirty_bitmap_buffer;
 
+	/* Dirty ring tracking is exclusive to dirty log tracking */
+	if (kvm->dirty_ring_size)
+		return -EINVAL;
+
 	as_id = log->slot >> 16;
 	id = (u16)log->slot;
 	if (as_id >= KVM_ADDRESS_SPACE_NUM || id >= KVM_USER_MEM_SLOTS)
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [PATCH v3 14/21] KVM: Don't allocate dirty bitmap if dirty ring is enabled
  2020-01-09 14:57 [PATCH v3 00/21] KVM: Dirty ring interface Peter Xu
                   ` (12 preceding siblings ...)
  2020-01-09 14:57 ` [PATCH v3 13/21] KVM: Make dirty ring exclusive to dirty bitmap log Peter Xu
@ 2020-01-09 14:57 ` Peter Xu
  2020-01-09 16:41   ` Peter Xu
  2020-01-09 14:57 ` [PATCH v3 15/21] KVM: selftests: Always clear dirty bitmap after iteration Peter Xu
                   ` (9 subsequent siblings)
  23 siblings, 1 reply; 82+ messages in thread
From: Peter Xu @ 2020-01-09 14:57 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: Christophe de Dinechin, Michael S . Tsirkin, Paolo Bonzini,
	Sean Christopherson, Yan Zhao, Alex Williamson, Jason Wang,
	Kevin Kevin, Vitaly Kuznetsov, peterx, Dr . David Alan Gilbert

Because the kvm dirty ring and the kvm dirty log are used in an
exclusive way, let's avoid creating the dirty_bitmap when the kvm
dirty ring is enabled.  Since the dirty_bitmap is now created
conditionally, we can no longer use it as a sign of "whether this
memory slot enabled dirty tracking"; change such users to check the
kvm memory slot flags instead.

Note that a kvm memory slot can still end up with a dirty_bitmap
allocated, _if_ the slot is created with the dirty tracking flag set
before the dirty ring is enabled.  This should not hurt much (e.g.,
the bitmap will always be freed if it is there), and real users
normally won't trigger it: the dirty tracking flag is in most cases
only applied to kvm slots when migration starts, which is far later
than kvm initialization (VM start).

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/linux/kvm_host.h | 5 +++++
 virt/kvm/kvm_main.c      | 5 +++--
 2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index c96161c6a0c9..ab2a169b1264 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -353,6 +353,11 @@ struct kvm_memory_slot {
 	u8 as_id;
 };
 
+static inline bool kvm_slot_dirty_track_enabled(struct kvm_memory_slot *slot)
+{
+	return slot->flags & KVM_MEM_LOG_DIRTY_PAGES;
+}
+
 static inline unsigned long kvm_dirty_bitmap_bytes(struct kvm_memory_slot *memslot)
 {
 	return ALIGN(memslot->npages, BITS_PER_LONG) / 8;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index f0f766183cb2..46da3169944f 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1120,7 +1120,8 @@ int __kvm_set_memory_region(struct kvm *kvm,
 	}
 
 	/* Allocate page dirty bitmap if needed */
-	if ((new.flags & KVM_MEM_LOG_DIRTY_PAGES) && !new.dirty_bitmap) {
+	if ((new.flags & KVM_MEM_LOG_DIRTY_PAGES) && !new.dirty_bitmap &&
+	    !kvm->dirty_ring_size) {
 		if (kvm_create_dirty_bitmap(&new) < 0)
 			goto out_free;
 	}
@@ -2309,7 +2310,7 @@ static void mark_page_dirty_in_slot(struct kvm *kvm,
 				    struct kvm_memory_slot *memslot,
 				    gfn_t gfn)
 {
-	if (memslot && memslot->dirty_bitmap) {
+	if (memslot && kvm_slot_dirty_track_enabled(memslot)) {
 		unsigned long rel_gfn = gfn - memslot->base_gfn;
 		u32 slot = (memslot->as_id << 16) | memslot->id;
 
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [PATCH v3 15/21] KVM: selftests: Always clear dirty bitmap after iteration
  2020-01-09 14:57 [PATCH v3 00/21] KVM: Dirty ring interface Peter Xu
                   ` (13 preceding siblings ...)
  2020-01-09 14:57 ` [PATCH v3 14/21] KVM: Don't allocate dirty bitmap if dirty ring is enabled Peter Xu
@ 2020-01-09 14:57 ` Peter Xu
  2020-01-09 14:57 ` [PATCH v3 16/21] KVM: selftests: Sync uapi/linux/kvm.h to tools/ Peter Xu
                   ` (8 subsequent siblings)
  23 siblings, 0 replies; 82+ messages in thread
From: Peter Xu @ 2020-01-09 14:57 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: Christophe de Dinechin, Michael S . Tsirkin, Paolo Bonzini,
	Sean Christopherson, Yan Zhao, Alex Williamson, Jason Wang,
	Kevin Kevin, Vitaly Kuznetsov, peterx, Dr . David Alan Gilbert

We never cleared the dirty bitmap before, because KVM_GET_DIRTY_LOG
clears it for us before copying the dirty log into it.  However, it
is better to clear it explicitly instead of assuming the kernel will
always do it for us.

More importantly, in the upcoming dirty ring tests we'll start to
fetch dirty pages from a ring buffer, so no one is going to clear the
dirty bitmap for us.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 tools/testing/selftests/kvm/dirty_log_test.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/testing/selftests/kvm/dirty_log_test.c b/tools/testing/selftests/kvm/dirty_log_test.c
index 5614222a6628..3c0ffd34b3b0 100644
--- a/tools/testing/selftests/kvm/dirty_log_test.c
+++ b/tools/testing/selftests/kvm/dirty_log_test.c
@@ -197,7 +197,7 @@ static void vm_dirty_log_verify(unsigned long *bmap)
 				    page);
 		}
 
-		if (test_bit_le(page, bmap)) {
+		if (test_and_clear_bit_le(page, bmap)) {
 			host_dirty_count++;
 			/*
 			 * If the bit is set, the value written onto
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [PATCH v3 16/21] KVM: selftests: Sync uapi/linux/kvm.h to tools/
  2020-01-09 14:57 [PATCH v3 00/21] KVM: Dirty ring interface Peter Xu
                   ` (14 preceding siblings ...)
  2020-01-09 14:57 ` [PATCH v3 15/21] KVM: selftests: Always clear dirty bitmap after iteration Peter Xu
@ 2020-01-09 14:57 ` Peter Xu
  2020-01-09 14:57 ` [PATCH v3 17/21] KVM: selftests: Use a single binary for dirty/clear log test Peter Xu
                   ` (7 subsequent siblings)
  23 siblings, 0 replies; 82+ messages in thread
From: Peter Xu @ 2020-01-09 14:57 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: Christophe de Dinechin, Michael S . Tsirkin, Paolo Bonzini,
	Sean Christopherson, Yan Zhao, Alex Williamson, Jason Wang,
	Kevin Kevin, Vitaly Kuznetsov, peterx, Dr . David Alan Gilbert

This will be needed to extend the kvm selftest program.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 tools/include/uapi/linux/kvm.h | 33 +++++++++++++++++++++++++++++
 1 file changed, 33 insertions(+)

diff --git a/tools/include/uapi/linux/kvm.h b/tools/include/uapi/linux/kvm.h
index f0a16b4adbbd..d2300a3cfbf0 100644
--- a/tools/include/uapi/linux/kvm.h
+++ b/tools/include/uapi/linux/kvm.h
@@ -236,6 +236,7 @@ struct kvm_hyperv_exit {
 #define KVM_EXIT_IOAPIC_EOI       26
 #define KVM_EXIT_HYPERV           27
 #define KVM_EXIT_ARM_NISV         28
+#define KVM_EXIT_DIRTY_RING_FULL  29
 
 /* For KVM_EXIT_INTERNAL_ERROR */
 /* Emulate instruction failed. */
@@ -247,6 +248,13 @@ struct kvm_hyperv_exit {
 /* Encounter unexpected vm-exit reason */
 #define KVM_INTERNAL_ERROR_UNEXPECTED_EXIT_REASON	4
 
+struct kvm_dirty_ring_indices {
+	__u32 avail_index; /* set by kernel */
+	__u32 padding1;
+	__u32 fetch_index; /* set by userspace */
+	__u32 padding2;
+};
+
 /* for KVM_RUN, returned by mmap(vcpu_fd, offset=0) */
 struct kvm_run {
 	/* in */
@@ -421,6 +429,8 @@ struct kvm_run {
 		struct kvm_sync_regs regs;
 		char padding[SYNC_REGS_SIZE_BYTES];
 	} s;
+
+	struct kvm_dirty_ring_indices vcpu_ring_indices;
 };
 
 /* for KVM_REGISTER_COALESCED_MMIO / KVM_UNREGISTER_COALESCED_MMIO */
@@ -1009,6 +1024,7 @@ struct kvm_ppc_resize_hpt {
 #define KVM_CAP_PPC_GUEST_DEBUG_SSTEP 176
 #define KVM_CAP_ARM_NISV_TO_USER 177
 #define KVM_CAP_ARM_INJECT_EXT_DABT 178
+#define KVM_CAP_DIRTY_LOG_RING 179
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
@@ -1473,6 +1489,9 @@ struct kvm_enc_region {
 /* Available with KVM_CAP_ARM_SVE */
 #define KVM_ARM_VCPU_FINALIZE	  _IOW(KVMIO,  0xc2, int)
 
+/* Available with KVM_CAP_DIRTY_LOG_RING */
+#define KVM_RESET_DIRTY_RINGS     _IO(KVMIO, 0xc3)
+
 /* Secure Encrypted Virtualization command */
 enum sev_cmd_id {
 	/* Guest initialization commands */
@@ -1623,4 +1642,23 @@ struct kvm_hyperv_eventfd {
 #define KVM_HYPERV_CONN_ID_MASK		0x00ffffff
 #define KVM_HYPERV_EVENTFD_DEASSIGN	(1 << 0)
 
+/*
+ * The following are the requirements for supporting dirty log ring
+ * (by enabling KVM_DIRTY_LOG_PAGE_OFFSET).
+ *
+ * 1. Memory accesses by KVM should call kvm_vcpu_write_* instead
+ *    of kvm_write_* so that dirty gfns are pushed onto the dirty
+ *    ring of the vcpu that performs the access.
+ * 2. kvm_arch_mmu_enable_log_dirty_pt_masked should be defined for
+ *    enabling dirty logging.
+ * 3. There should not be a separate step to synchronize hardware
+ *    dirty bitmap with KVM's.
+ */
+
+struct kvm_dirty_gfn {
+	__u32 pad;
+	__u32 slot;
+	__u64 offset;
+};
+
 #endif /* __LINUX_KVM_H */
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [PATCH v3 17/21] KVM: selftests: Use a single binary for dirty/clear log test
  2020-01-09 14:57 [PATCH v3 00/21] KVM: Dirty ring interface Peter Xu
                   ` (15 preceding siblings ...)
  2020-01-09 14:57 ` [PATCH v3 16/21] KVM: selftests: Sync uapi/linux/kvm.h to tools/ Peter Xu
@ 2020-01-09 14:57 ` Peter Xu
  2020-01-09 14:57 ` [PATCH v3 18/21] KVM: selftests: Introduce after_vcpu_run hook for dirty " Peter Xu
                   ` (6 subsequent siblings)
  23 siblings, 0 replies; 82+ messages in thread
From: Peter Xu @ 2020-01-09 14:57 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: Christophe de Dinechin, Michael S . Tsirkin, Paolo Bonzini,
	Sean Christopherson, Yan Zhao, Alex Williamson, Jason Wang,
	Kevin Kevin, Vitaly Kuznetsov, peterx, Dr . David Alan Gilbert

Remove the clear_dirty_log test and instead merge it into the
existing dirty_log_test.  It is cleaner to use a single binary for
both tests, and it also prepares for the upcoming dirty ring test.

The default test will still be the dirty_log test.  To run the clear
dirty log test, we need to specify "-M clear-log".

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 tools/testing/selftests/kvm/Makefile          |   2 -
 .../selftests/kvm/clear_dirty_log_test.c      |   2 -
 tools/testing/selftests/kvm/dirty_log_test.c  | 131 +++++++++++++++---
 3 files changed, 110 insertions(+), 25 deletions(-)
 delete mode 100644 tools/testing/selftests/kvm/clear_dirty_log_test.c

diff --git a/tools/testing/selftests/kvm/Makefile b/tools/testing/selftests/kvm/Makefile
index 3138a916574a..130a7b1c7ad6 100644
--- a/tools/testing/selftests/kvm/Makefile
+++ b/tools/testing/selftests/kvm/Makefile
@@ -26,11 +26,9 @@ TEST_GEN_PROGS_x86_64 += x86_64/vmx_dirty_log_test
 TEST_GEN_PROGS_x86_64 += x86_64/vmx_set_nested_state_test
 TEST_GEN_PROGS_x86_64 += x86_64/vmx_tsc_adjust_test
 TEST_GEN_PROGS_x86_64 += x86_64/xss_msr_test
-TEST_GEN_PROGS_x86_64 += clear_dirty_log_test
 TEST_GEN_PROGS_x86_64 += dirty_log_test
 TEST_GEN_PROGS_x86_64 += kvm_create_max_vcpus
 
-TEST_GEN_PROGS_aarch64 += clear_dirty_log_test
 TEST_GEN_PROGS_aarch64 += dirty_log_test
 TEST_GEN_PROGS_aarch64 += kvm_create_max_vcpus
 
diff --git a/tools/testing/selftests/kvm/clear_dirty_log_test.c b/tools/testing/selftests/kvm/clear_dirty_log_test.c
deleted file mode 100644
index 749336937d37..000000000000
--- a/tools/testing/selftests/kvm/clear_dirty_log_test.c
+++ /dev/null
@@ -1,2 +0,0 @@
-#define USE_CLEAR_DIRTY_LOG
-#include "dirty_log_test.c"
diff --git a/tools/testing/selftests/kvm/dirty_log_test.c b/tools/testing/selftests/kvm/dirty_log_test.c
index 3c0ffd34b3b0..a8ae8c0042a8 100644
--- a/tools/testing/selftests/kvm/dirty_log_test.c
+++ b/tools/testing/selftests/kvm/dirty_log_test.c
@@ -128,6 +128,66 @@ static uint64_t host_dirty_count;
 static uint64_t host_clear_count;
 static uint64_t host_track_next_count;
 
+enum log_mode_t {
+	/* Only use KVM_GET_DIRTY_LOG for logging */
+	LOG_MODE_DIRTY_LOG = 0,
+
+	/* Use both KVM_[GET|CLEAR]_DIRTY_LOG for logging */
+	LOG_MODE_CLEAR_LOG = 1,
+
+	LOG_MODE_NUM,
+};
+
+/* Mode of logging.  Default is LOG_MODE_DIRTY_LOG */
+static enum log_mode_t host_log_mode;
+
+static void clear_log_create_vm_done(struct kvm_vm *vm)
+{
+	struct kvm_enable_cap cap = {};
+
+	if (!kvm_check_cap(KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2)) {
+		fprintf(stderr, "KVM_CLEAR_DIRTY_LOG not available, skipping tests\n");
+		exit(KSFT_SKIP);
+	}
+
+	cap.cap = KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2;
+	cap.args[0] = 1;
+	vm_enable_cap(vm, &cap);
+}
+
+static void dirty_log_collect_dirty_pages(struct kvm_vm *vm, int slot,
+					  void *bitmap, uint32_t num_pages)
+{
+	kvm_vm_get_dirty_log(vm, slot, bitmap);
+}
+
+static void clear_log_collect_dirty_pages(struct kvm_vm *vm, int slot,
+					  void *bitmap, uint32_t num_pages)
+{
+	kvm_vm_get_dirty_log(vm, slot, bitmap);
+	kvm_vm_clear_dirty_log(vm, slot, bitmap, 0, num_pages);
+}
+
+struct log_mode {
+	const char *name;
+	/* Hook when the vm creation is done (before vcpu creation) */
+	void (*create_vm_done)(struct kvm_vm *vm);
+	/* Hook to collect the dirty pages into the bitmap provided */
+	void (*collect_dirty_pages) (struct kvm_vm *vm, int slot,
+				     void *bitmap, uint32_t num_pages);
+} log_modes[LOG_MODE_NUM] = {
+	{
+		.name = "dirty-log",
+		.create_vm_done = NULL,
+		.collect_dirty_pages = dirty_log_collect_dirty_pages,
+	},
+	{
+		.name = "clear-log",
+		.create_vm_done = clear_log_create_vm_done,
+		.collect_dirty_pages = clear_log_collect_dirty_pages,
+	},
+};
+
 /*
  * We use this bitmap to track some pages that should have its dirty
  * bit set in the _next_ iteration.  For example, if we detected the
@@ -137,6 +197,33 @@ static uint64_t host_track_next_count;
  */
 static unsigned long *host_bmap_track;
 
+static void log_modes_dump(void)
+{
+	int i;
+
+	for (i = 0; i < LOG_MODE_NUM; i++)
+		printf("%s, ", log_modes[i].name);
+	puts("\b\b  \b\b");
+}
+
+static void log_mode_create_vm_done(struct kvm_vm *vm)
+{
+	struct log_mode *mode = &log_modes[host_log_mode];
+
+	if (mode->create_vm_done)
+		mode->create_vm_done(vm);
+}
+
+static void log_mode_collect_dirty_pages(struct kvm_vm *vm, int slot,
+					 void *bitmap, uint32_t num_pages)
+{
+	struct log_mode *mode = &log_modes[host_log_mode];
+
+	TEST_ASSERT(mode->collect_dirty_pages != NULL,
+		    "collect_dirty_pages() is required for any log mode!");
+	mode->collect_dirty_pages(vm, slot, bitmap, num_pages);
+}
+
 static void generate_random_array(uint64_t *guest_array, uint64_t size)
 {
 	uint64_t i;
@@ -257,6 +344,7 @@ static struct kvm_vm *create_vm(enum vm_guest_mode mode, uint32_t vcpuid,
 #ifdef __x86_64__
 	vm_create_irqchip(vm);
 #endif
+	log_mode_create_vm_done(vm);
 	vm_vcpu_add_default(vm, vcpuid, guest_code);
 	return vm;
 }
@@ -316,14 +404,6 @@ static void run_test(enum vm_guest_mode mode, unsigned long iterations,
 	bmap = bitmap_alloc(host_num_pages);
 	host_bmap_track = bitmap_alloc(host_num_pages);
 
-#ifdef USE_CLEAR_DIRTY_LOG
-	struct kvm_enable_cap cap = {};
-
-	cap.cap = KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2;
-	cap.args[0] = 1;
-	vm_enable_cap(vm, &cap);
-#endif
-
 	/* Add an extra memory slot for testing dirty logging */
 	vm_userspace_mem_region_add(vm, VM_MEM_SRC_ANONYMOUS,
 				    guest_test_phys_mem,
@@ -364,11 +444,8 @@ static void run_test(enum vm_guest_mode mode, unsigned long iterations,
 	while (iteration < iterations) {
 		/* Give the vcpu thread some time to dirty some pages */
 		usleep(interval * 1000);
-		kvm_vm_get_dirty_log(vm, TEST_MEM_SLOT_INDEX, bmap);
-#ifdef USE_CLEAR_DIRTY_LOG
-		kvm_vm_clear_dirty_log(vm, TEST_MEM_SLOT_INDEX, bmap, 0,
-				       host_num_pages);
-#endif
+		log_mode_collect_dirty_pages(vm, TEST_MEM_SLOT_INDEX,
+					     bmap, host_num_pages);
 		vm_dirty_log_verify(bmap);
 		iteration++;
 		sync_global_to_guest(vm, iteration);
@@ -413,6 +490,9 @@ static void help(char *name)
 	       TEST_HOST_LOOP_INTERVAL);
 	printf(" -p: specify guest physical test memory offset\n"
 	       "     Warning: a low offset can conflict with the loaded test code.\n");
+	printf(" -M: specify the host logging mode "
+	       "(default: log-dirty).  Supported modes: \n\t");
+	log_modes_dump();
 	printf(" -m: specify the guest mode ID to test "
 	       "(default: test all supported modes)\n"
 	       "     This option may be used multiple times.\n"
@@ -437,13 +517,6 @@ int main(int argc, char *argv[])
 	unsigned int host_ipa_limit;
 #endif
 
-#ifdef USE_CLEAR_DIRTY_LOG
-	if (!kvm_check_cap(KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2)) {
-		fprintf(stderr, "KVM_CLEAR_DIRTY_LOG not available, skipping tests\n");
-		exit(KSFT_SKIP);
-	}
-#endif
-
 #ifdef __x86_64__
 	vm_guest_mode_params_init(VM_MODE_PXXV48_4K, true, true);
 #endif
@@ -463,7 +536,7 @@ int main(int argc, char *argv[])
 	vm_guest_mode_params_init(VM_MODE_P40V48_4K, true, true);
 #endif
 
-	while ((opt = getopt(argc, argv, "hi:I:p:m:")) != -1) {
+	while ((opt = getopt(argc, argv, "hi:I:p:m:M:")) != -1) {
 		switch (opt) {
 		case 'i':
 			iterations = strtol(optarg, NULL, 10);
@@ -485,6 +558,22 @@ int main(int argc, char *argv[])
 				    "Guest mode ID %d too big", mode);
 			vm_guest_mode_params[mode].enabled = true;
 			break;
+		case 'M':
+			for (i = 0; i < LOG_MODE_NUM; i++) {
+				if (!strcmp(optarg, log_modes[i].name)) {
+					DEBUG("Setting log mode to: '%s'\n",
+					      optarg);
+					host_log_mode = i;
+					break;
+				}
+			}
+			if (i == LOG_MODE_NUM) {
+				printf("Log mode '%s' is invalid.  "
+				       "Please choose from: ", optarg);
+				log_modes_dump();
+				exit(-1);
+			}
+			break;
 		case 'h':
 		default:
 			help(argv[0]);
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [PATCH v3 18/21] KVM: selftests: Introduce after_vcpu_run hook for dirty log test
  2020-01-09 14:57 [PATCH v3 00/21] KVM: Dirty ring interface Peter Xu
                   ` (16 preceding siblings ...)
  2020-01-09 14:57 ` [PATCH v3 17/21] KVM: selftests: Use a single binary for dirty/clear log test Peter Xu
@ 2020-01-09 14:57 ` Peter Xu
  2020-01-09 14:57 ` [PATCH v3 19/21] KVM: selftests: Add dirty ring buffer test Peter Xu
                   ` (5 subsequent siblings)
  23 siblings, 0 replies; 82+ messages in thread
From: Peter Xu @ 2020-01-09 14:57 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: Christophe de Dinechin, Michael S . Tsirkin, Paolo Bonzini,
	Sean Christopherson, Yan Zhao, Alex Williamson, Jason Wang,
	Kevin Kevin, Vitaly Kuznetsov, peterx, Dr . David Alan Gilbert

Provide a hook for the checks after vcpu_run() completes.  This is a
preparation for the dirty ring test, because we will need to take
care of another exit reason there.

While at it, drop pages_count, because we now have a better summary
with the statistics, and clean the code up a bit.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 tools/testing/selftests/kvm/dirty_log_test.c | 39 ++++++++++++--------
 1 file changed, 23 insertions(+), 16 deletions(-)

diff --git a/tools/testing/selftests/kvm/dirty_log_test.c b/tools/testing/selftests/kvm/dirty_log_test.c
index a8ae8c0042a8..3542311f56ff 100644
--- a/tools/testing/selftests/kvm/dirty_log_test.c
+++ b/tools/testing/selftests/kvm/dirty_log_test.c
@@ -168,6 +168,15 @@ static void clear_log_collect_dirty_pages(struct kvm_vm *vm, int slot,
 	kvm_vm_clear_dirty_log(vm, slot, bitmap, 0, num_pages);
 }
 
+static void default_after_vcpu_run(struct kvm_vm *vm)
+{
+	struct kvm_run *run = vcpu_state(vm, VCPU_ID);
+
+	TEST_ASSERT(get_ucall(vm, VCPU_ID, NULL) == UCALL_SYNC,
+		    "Invalid guest sync status: exit_reason=%s\n",
+		    exit_reason_str(run->exit_reason));
+}
+
 struct log_mode {
 	const char *name;
 	/* Hook when the vm creation is done (before vcpu creation) */
@@ -175,16 +184,20 @@ struct log_mode {
 	/* Hook to collect the dirty pages into the bitmap provided */
 	void (*collect_dirty_pages) (struct kvm_vm *vm, int slot,
 				     void *bitmap, uint32_t num_pages);
+	/* Hook to call after each vcpu run */
+	void (*after_vcpu_run)(struct kvm_vm *vm);
 } log_modes[LOG_MODE_NUM] = {
 	{
 		.name = "dirty-log",
 		.create_vm_done = NULL,
 		.collect_dirty_pages = dirty_log_collect_dirty_pages,
+		.after_vcpu_run = default_after_vcpu_run,
 	},
 	{
 		.name = "clear-log",
 		.create_vm_done = clear_log_create_vm_done,
 		.collect_dirty_pages = clear_log_collect_dirty_pages,
+		.after_vcpu_run = default_after_vcpu_run,
 	},
 };
 
@@ -224,6 +237,14 @@ static void log_mode_collect_dirty_pages(struct kvm_vm *vm, int slot,
 	mode->collect_dirty_pages(vm, slot, bitmap, num_pages);
 }
 
+static void log_mode_after_vcpu_run(struct kvm_vm *vm)
+{
+	struct log_mode *mode = &log_modes[host_log_mode];
+
+	if (mode->after_vcpu_run)
+		mode->after_vcpu_run(vm);
+}
+
 static void generate_random_array(uint64_t *guest_array, uint64_t size)
 {
 	uint64_t i;
@@ -237,31 +258,17 @@ static void *vcpu_worker(void *data)
 	int ret;
 	struct kvm_vm *vm = data;
 	uint64_t *guest_array;
-	uint64_t pages_count = 0;
-	struct kvm_run *run;
-
-	run = vcpu_state(vm, VCPU_ID);
 
 	guest_array = addr_gva2hva(vm, (vm_vaddr_t)random_array);
-	generate_random_array(guest_array, TEST_PAGES_PER_LOOP);
 
 	while (!READ_ONCE(host_quit)) {
+		generate_random_array(guest_array, TEST_PAGES_PER_LOOP);
 		/* Let the guest dirty the random pages */
 		ret = _vcpu_run(vm, VCPU_ID);
 		TEST_ASSERT(ret == 0, "vcpu_run failed: %d\n", ret);
-		if (get_ucall(vm, VCPU_ID, NULL) == UCALL_SYNC) {
-			pages_count += TEST_PAGES_PER_LOOP;
-			generate_random_array(guest_array, TEST_PAGES_PER_LOOP);
-		} else {
-			TEST_ASSERT(false,
-				    "Invalid guest sync status: "
-				    "exit_reason=%s\n",
-				    exit_reason_str(run->exit_reason));
-		}
+		log_mode_after_vcpu_run(vm);
 	}
 
-	DEBUG("Dirtied %"PRIu64" pages\n", pages_count);
-
 	return NULL;
 }
 
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [PATCH v3 19/21] KVM: selftests: Add dirty ring buffer test
  2020-01-09 14:57 [PATCH v3 00/21] KVM: Dirty ring interface Peter Xu
                   ` (17 preceding siblings ...)
  2020-01-09 14:57 ` [PATCH v3 18/21] KVM: selftests: Introduce after_vcpu_run hook for dirty " Peter Xu
@ 2020-01-09 14:57 ` Peter Xu
  2020-01-09 14:57 ` [PATCH v3 20/21] KVM: selftests: Let dirty_log_test async for dirty ring test Peter Xu
                   ` (4 subsequent siblings)
  23 siblings, 0 replies; 82+ messages in thread
From: Peter Xu @ 2020-01-09 14:57 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: Christophe de Dinechin, Michael S . Tsirkin, Paolo Bonzini,
	Sean Christopherson, Yan Zhao, Alex Williamson, Jason Wang,
	Kevin Kevin, Vitaly Kuznetsov, peterx, Dr . David Alan Gilbert

Add the initial dirty ring buffer test.

The current test implements the userspace dirty ring collection by
reaping the dirty ring only when the ring is full.

So it's still running synchronously like this:

            vcpu                             main thread

  1. vcpu dirties pages
  2. vcpu gets dirty ring full
     (userspace exit)

                                       3. main thread waits until full
                                          (so hardware buffers flushed)
                                       4. main thread collects
                                       5. main thread continues vcpu

  6. vcpu continues, goes back to 1

We can't collect the dirty bits directly during vcpu execution,
because otherwise we can't guarantee that the hardware dirty bits
have been flushed when we collect them; the test is very strict
about the dirty bits, so stale ones would fail the verify procedure
that runs later.  A follow-up patch will make this test asynchronous
like the existing dirty log test, by adding a vcpu kick mechanism.
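
In rough pseudo-C (a sketch only; collect_and_reset() stands for the
ring collection plus the KVM_RESET_DIRTY_RINGS call, and "run" is the
vcpu's struct kvm_run), the handshake maps to the steps above as:

	/* vcpu thread, after each KVM_RUN (steps 1-2, 6) */
	if (run->exit_reason == KVM_EXIT_DIRTY_RING_FULL) {
		sem_post(&dirty_ring_vcpu_stop);
		sem_wait(&dirty_ring_vcpu_cont);
	}

	/* main thread (steps 3-5) */
	sem_wait(&dirty_ring_vcpu_stop);
	collect_and_reset();
	sem_post(&dirty_ring_vcpu_cont);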

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 tools/testing/selftests/kvm/dirty_log_test.c  | 174 +++++++++++++++++-
 .../testing/selftests/kvm/include/kvm_util.h  |   3 +
 tools/testing/selftests/kvm/lib/kvm_util.c    |  64 +++++++
 .../selftests/kvm/lib/kvm_util_internal.h     |   3 +
 4 files changed, 242 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/kvm/dirty_log_test.c b/tools/testing/selftests/kvm/dirty_log_test.c
index 3542311f56ff..6a551f285dea 100644
--- a/tools/testing/selftests/kvm/dirty_log_test.c
+++ b/tools/testing/selftests/kvm/dirty_log_test.c
@@ -12,8 +12,10 @@
 #include <unistd.h>
 #include <time.h>
 #include <pthread.h>
+#include <semaphore.h>
 #include <linux/bitmap.h>
 #include <linux/bitops.h>
+#include <asm/barrier.h>
 
 #include "test_util.h"
 #include "kvm_util.h"
@@ -57,6 +59,8 @@
 # define test_and_clear_bit_le	test_and_clear_bit
 #endif
 
+#define TEST_DIRTY_RING_COUNT		1024
+
 /*
  * Guest/Host shared variables. Ensure addr_gva2hva() and/or
  * sync_global_to/from_guest() are used when accessing from
@@ -128,6 +132,10 @@ static uint64_t host_dirty_count;
 static uint64_t host_clear_count;
 static uint64_t host_track_next_count;
 
+/* Whether dirty ring reset is requested, or finished */
+static sem_t dirty_ring_vcpu_stop;
+static sem_t dirty_ring_vcpu_cont;
+
 enum log_mode_t {
 	/* Only use KVM_GET_DIRTY_LOG for logging */
 	LOG_MODE_DIRTY_LOG = 0,
@@ -135,6 +143,9 @@ enum log_mode_t {
 	/* Use both KVM_[GET|CLEAR]_DIRTY_LOG for logging */
 	LOG_MODE_CLEAR_LOG = 1,
 
+	/* Use dirty ring for logging */
+	LOG_MODE_DIRTY_RING = 2,
+
 	LOG_MODE_NUM,
 };
 
@@ -177,6 +188,118 @@ static void default_after_vcpu_run(struct kvm_vm *vm)
 		    exit_reason_str(run->exit_reason));
 }
 
+static void dirty_ring_create_vm_done(struct kvm_vm *vm)
+{
+	/*
+	 * Switch to dirty ring mode after VM creation but before any
+	 * vcpu is created.
+	 */
+	vm_enable_dirty_ring(vm, TEST_DIRTY_RING_COUNT *
+			     sizeof(struct kvm_dirty_gfn));
+}
+
+static uint32_t dirty_ring_collect_one(struct kvm_dirty_gfn *dirty_gfns,
+				       struct kvm_dirty_ring_indices *indices,
+				       int slot, void *bitmap,
+				       uint32_t num_pages, int index)
+{
+	struct kvm_dirty_gfn *cur;
+	uint32_t avail, fetch, count = 0;
+
+	/*
+	 * Userspace should keep its own copy of fetch_index; to keep
+	 * things simple here we just re-read it from the shared
+	 * indices.
+	 */
+	fetch = READ_ONCE(indices->fetch_index);
+	avail = READ_ONCE(indices->avail_index);
+
+	/* Make sure we read valid entries always */
+	rmb();
+
+	DEBUG("ring %d: fetch: 0x%x, avail: 0x%x\n", index, fetch, avail);
+
+	while (fetch != avail) {
+		cur = &dirty_gfns[fetch % TEST_DIRTY_RING_COUNT];
+		TEST_ASSERT(cur->pad == 0, "Padding is non-zero: 0x%x", cur->pad);
+		TEST_ASSERT(cur->slot == slot, "Slot number didn't match: "
+			    "%u != %u", cur->slot, slot);
+		TEST_ASSERT(cur->offset < num_pages, "Offset overflow: "
+			    "0x%llx >= 0x%x", cur->offset, num_pages);
+		DEBUG("fetch 0x%x offset 0x%llx\n", fetch, cur->offset);
+		set_bit(cur->offset, bitmap);
+		fetch++;
+		count++;
+	}
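+	/*
+	 * Publish the new fetch index so that KVM_RESET_DIRTY_RINGS
+	 * knows how far we have consumed the ring.
+	 */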
+	WRITE_ONCE(indices->fetch_index, fetch);
+
+	return count;
+}
+
+static void dirty_ring_collect_dirty_pages(struct kvm_vm *vm, int slot,
+					   void *bitmap, uint32_t num_pages)
+{
+	/* We only have one vcpu */
+	struct kvm_run *state = vcpu_state(vm, VCPU_ID);
+	uint32_t count = 0, cleared;
+
+	/*
+	 * Before fetching the dirty pages, we need a vmexit of the
+	 * worker vcpu to make sure the hardware dirty buffers were
+	 * flushed.  This is not needed for dirty-log/clear-log tests
+	 * because get dirty log will naturally do so.
+	 *
+	 * For now we do it in the simple way - we simply wait until
+	 * the vcpu uses up the soft dirty ring, then it'll always
+	 * do a vmexit to make sure that PML buffers will be flushed.
+	 * In real hypervisors, we probably need a vcpu kick or to
+	 * stop the vcpus (before the final sync) to make sure we'll
+	 * get all the existing dirty PFNs even cached in hardware.
+	 */
+	sem_wait(&dirty_ring_vcpu_stop);
+
+	/* Only have one vcpu */
+	count = dirty_ring_collect_one(vcpu_map_dirty_ring(vm, VCPU_ID),
+				       &state->vcpu_ring_indices,
+				       slot, bitmap, num_pages, VCPU_ID);
+
+	cleared = kvm_vm_reset_dirty_ring(vm);
+
+	/* Cleared pages should be the same as collected */
+	TEST_ASSERT(cleared == count, "Reset dirty pages (%u) mismatch "
+		    "with collected (%u)", cleared, count);
+
+	DEBUG("Notifying vcpu to continue\n");
+	sem_post(&dirty_ring_vcpu_cont);
+
+	DEBUG("Iteration %"PRIu64" collected %u pages\n", iteration, count);
+}
+
+static void dirty_ring_after_vcpu_run(struct kvm_vm *vm)
+{
+	struct kvm_run *run = vcpu_state(vm, VCPU_ID);
+
+	/* A ucall-sync or ring-full event is allowed */
+	if (get_ucall(vm, VCPU_ID, NULL) == UCALL_SYNC) {
+		/* We should allow this to continue */
+		;
+	} else if (run->exit_reason == KVM_EXIT_DIRTY_RING_FULL) {
+		sem_post(&dirty_ring_vcpu_stop);
+		DEBUG("vcpu stops because dirty ring full...\n");
+		sem_wait(&dirty_ring_vcpu_cont);
+		DEBUG("vcpu continues now.\n");
+	} else {
+		TEST_ASSERT(false, "Invalid guest sync status: "
+			    "exit_reason=%s\n",
+			    exit_reason_str(run->exit_reason));
+	}
+}
+
+static void dirty_ring_before_vcpu_join(void)
+{
+	/* Wake the vcpu one more time so it can observe host_quit and exit */
+	sem_post(&dirty_ring_vcpu_cont);
+}
+
 struct log_mode {
 	const char *name;
 	/* Hook when the vm creation is done (before vcpu creation) */
@@ -186,6 +309,7 @@ struct log_mode {
 				     void *bitmap, uint32_t num_pages);
 	/* Hook to call after each vcpu run */
 	void (*after_vcpu_run)(struct kvm_vm *vm);
+	void (*before_vcpu_join)(void);
 } log_modes[LOG_MODE_NUM] = {
 	{
 		.name = "dirty-log",
@@ -199,6 +323,13 @@ struct log_mode {
 		.collect_dirty_pages = clear_log_collect_dirty_pages,
 		.after_vcpu_run = default_after_vcpu_run,
 	},
+	{
+		.name = "dirty-ring",
+		.create_vm_done = dirty_ring_create_vm_done,
+		.collect_dirty_pages = dirty_ring_collect_dirty_pages,
+		.before_vcpu_join = dirty_ring_before_vcpu_join,
+		.after_vcpu_run = dirty_ring_after_vcpu_run,
+	},
 };
 
 /*
@@ -245,6 +376,14 @@ static void log_mode_after_vcpu_run(struct kvm_vm *vm)
 		mode->after_vcpu_run(vm);
 }
 
+static void log_mode_before_vcpu_join(void)
+{
+	struct log_mode *mode = &log_modes[host_log_mode];
+
+	if (mode->before_vcpu_join)
+		mode->before_vcpu_join();
+}
+
 static void generate_random_array(uint64_t *guest_array, uint64_t size)
 {
 	uint64_t i;
@@ -292,14 +431,41 @@ static void vm_dirty_log_verify(unsigned long *bmap)
 		}
 
 		if (test_and_clear_bit_le(page, bmap)) {
+			bool matched;
+
 			host_dirty_count++;
+
 			/*
 			 * If the bit is set, the value written onto
 			 * the corresponding page should be either the
 			 * previous iteration number or the current one.
+			 *
+			 * The (*value_ptr == iteration - 2) case is
+			 * special only for the dirty ring test, where
+			 * the page is the last page dirtied before a
+			 * kvm dirty ring full userspace exit of the
+			 * 2nd iteration; without this we would
+			 * probably fail on the 4th iteration.  Anyway,
+			 * let's just loosen the test case a little bit
+			 * for all modes, for simplicity.
 			 */
-			TEST_ASSERT(*value_ptr == iteration ||
-				    *value_ptr == iteration - 1,
+			matched = (*value_ptr == iteration ||
+				   *value_ptr == iteration - 1 ||
+				   *value_ptr == iteration - 2);
+
+			/*
+			 * This is the common path for dirty ring
+			 * where this page is exactly the last page
+			 * touched before KVM_EXIT_DIRTY_RING_FULL.
+			 * If it happens, we should expect it to be
+			 * there for the next round.
+			 */
+			if (host_log_mode == LOG_MODE_DIRTY_RING && !matched) {
+				set_bit_le(page, host_bmap_track);
+				continue;
+			}
+
+			TEST_ASSERT(matched,
 				    "Set page %"PRIu64" value %"PRIu64
 				    " incorrect (iteration=%"PRIu64")",
 				    page, *value_ptr, iteration);
@@ -460,6 +626,7 @@ static void run_test(enum vm_guest_mode mode, unsigned long iterations,
 
 	/* Tell the vcpu thread to quit */
 	host_quit = true;
+	log_mode_before_vcpu_join();
 	pthread_join(vcpu_thread, NULL);
 
 	DEBUG("Total bits checked: dirty (%"PRIu64"), clear (%"PRIu64"), "
@@ -524,6 +691,9 @@ int main(int argc, char *argv[])
 	unsigned int host_ipa_limit;
 #endif
 
+	sem_init(&dirty_ring_vcpu_stop, 0, 0);
+	sem_init(&dirty_ring_vcpu_cont, 0, 0);
+
 #ifdef __x86_64__
 	vm_guest_mode_params_init(VM_MODE_PXXV48_4K, true, true);
 #endif
diff --git a/tools/testing/selftests/kvm/include/kvm_util.h b/tools/testing/selftests/kvm/include/kvm_util.h
index 29cccaf96baf..4b78a8d3e773 100644
--- a/tools/testing/selftests/kvm/include/kvm_util.h
+++ b/tools/testing/selftests/kvm/include/kvm_util.h
@@ -67,6 +67,7 @@ enum vm_mem_backing_src_type {
 
 int kvm_check_cap(long cap);
 int vm_enable_cap(struct kvm_vm *vm, struct kvm_enable_cap *cap);
+void vm_enable_dirty_ring(struct kvm_vm *vm, uint32_t ring_size);
 
 struct kvm_vm *vm_create(enum vm_guest_mode mode, uint64_t phy_pages, int perm);
 struct kvm_vm *_vm_create(enum vm_guest_mode mode, uint64_t phy_pages, int perm);
@@ -76,6 +77,7 @@ void kvm_vm_release(struct kvm_vm *vmp);
 void kvm_vm_get_dirty_log(struct kvm_vm *vm, int slot, void *log);
 void kvm_vm_clear_dirty_log(struct kvm_vm *vm, int slot, void *log,
 			    uint64_t first_page, uint32_t num_pages);
+uint32_t kvm_vm_reset_dirty_ring(struct kvm_vm *vm);
 
 int kvm_memcmp_hva_gva(void *hva, struct kvm_vm *vm, const vm_vaddr_t gva,
 		       size_t len);
@@ -137,6 +139,7 @@ void vcpu_nested_state_get(struct kvm_vm *vm, uint32_t vcpuid,
 int vcpu_nested_state_set(struct kvm_vm *vm, uint32_t vcpuid,
 			  struct kvm_nested_state *state, bool ignore_error);
 #endif
+void *vcpu_map_dirty_ring(struct kvm_vm *vm, uint32_t vcpuid);
 
 const char *exit_reason_str(unsigned int exit_reason);
 
diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c b/tools/testing/selftests/kvm/lib/kvm_util.c
index 41cf45416060..81222e2f841e 100644
--- a/tools/testing/selftests/kvm/lib/kvm_util.c
+++ b/tools/testing/selftests/kvm/lib/kvm_util.c
@@ -85,6 +85,26 @@ int vm_enable_cap(struct kvm_vm *vm, struct kvm_enable_cap *cap)
 	return ret;
 }
 
+void vm_enable_dirty_ring(struct kvm_vm *vm, uint32_t ring_size)
+{
+	struct kvm_enable_cap cap = {};
+	int ret;
+
+	ret = kvm_check_cap(KVM_CAP_DIRTY_LOG_RING);
+
+	TEST_ASSERT(ret >= 0, "KVM_CAP_DIRTY_LOG_RING");
+
+	if (ret == 0) {
+		fprintf(stderr, "KVM does not support dirty ring, skipping tests\n");
+		exit(KSFT_SKIP);
+	}
+
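+	/* ring_size is in bytes; it must cover a power-of-two number of entries */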
+	cap.cap = KVM_CAP_DIRTY_LOG_RING;
+	cap.args[0] = ring_size;
+	vm_enable_cap(vm, &cap);
+	vm->dirty_ring_size = ring_size;
+}
+
 static void vm_open(struct kvm_vm *vm, int perm)
 {
 	vm->kvm_fd = open(KVM_DEV_PATH, perm);
@@ -297,6 +317,11 @@ void kvm_vm_clear_dirty_log(struct kvm_vm *vm, int slot, void *log,
 		    strerror(-ret));
 }
 
+uint32_t kvm_vm_reset_dirty_ring(struct kvm_vm *vm)
+{
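+	/* On success the ioctl returns the number of dirty gfn entries reset */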
+	return ioctl(vm->fd, KVM_RESET_DIRTY_RINGS);
+}
+
 /*
  * Userspace Memory Region Find
  *
@@ -408,6 +433,13 @@ static void vm_vcpu_rm(struct kvm_vm *vm, uint32_t vcpuid)
 	struct vcpu *vcpu = vcpu_find(vm, vcpuid);
 	int ret;
 
+	if (vcpu->dirty_gfns) {
+		ret = munmap(vcpu->dirty_gfns, vm->dirty_ring_size);
+		TEST_ASSERT(ret == 0, "munmap of VCPU dirty ring failed, "
+			    "rc: %i errno: %i", ret, errno);
+		vcpu->dirty_gfns = NULL;
+	}
+
 	ret = munmap(vcpu->state, sizeof(*vcpu->state));
 	TEST_ASSERT(ret == 0, "munmap of VCPU fd failed, rc: %i "
 		"errno: %i", ret, errno);
@@ -1409,6 +1441,37 @@ int _vcpu_ioctl(struct kvm_vm *vm, uint32_t vcpuid,
 	return ret;
 }
 
+void *vcpu_map_dirty_ring(struct kvm_vm *vm, uint32_t vcpuid)
+{
+	struct vcpu *vcpu;
+	uint32_t size = vm->dirty_ring_size;
+
+	TEST_ASSERT(size > 0, "Should enable dirty ring first");
+
+	vcpu = vcpu_find(vm, vcpuid);
+
+	TEST_ASSERT(vcpu, "Cannot find vcpu %u", vcpuid);
+
+	if (!vcpu->dirty_gfns) {
+		int prot = PROT_READ | PROT_WRITE;
+		void *addr;
+
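+		/* A writable mapping of the dirty ring should be rejected by KVM */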
+		addr = mmap(NULL, size, prot, MAP_SHARED, vcpu->fd,
+			    vm->page_size * KVM_DIRTY_LOG_PAGE_OFFSET);
+		TEST_ASSERT(addr == MAP_FAILED, "Dirty ring mapped writable");
+
+		prot = PROT_READ;
+		addr = mmap(NULL, size, prot, MAP_SHARED, vcpu->fd,
+			    vm->page_size * KVM_DIRTY_LOG_PAGE_OFFSET);
+		TEST_ASSERT(addr != MAP_FAILED, "Dirty ring map failed");
+
+		vcpu->dirty_gfns = addr;
+		vcpu->dirty_gfns_count = size / sizeof(struct kvm_dirty_gfn);
+	}
+
+	return vcpu->dirty_gfns;
+}
+
 /*
  * VM Ioctl
  *
@@ -1503,6 +1566,7 @@ static struct exit_reason {
 	{KVM_EXIT_INTERNAL_ERROR, "INTERNAL_ERROR"},
 	{KVM_EXIT_OSI, "OSI"},
 	{KVM_EXIT_PAPR_HCALL, "PAPR_HCALL"},
+	{KVM_EXIT_DIRTY_RING_FULL, "DIRTY_RING_FULL"},
 #ifdef KVM_EXIT_MEMORY_NOT_PRESENT
 	{KVM_EXIT_MEMORY_NOT_PRESENT, "MEMORY_NOT_PRESENT"},
 #endif
diff --git a/tools/testing/selftests/kvm/lib/kvm_util_internal.h b/tools/testing/selftests/kvm/lib/kvm_util_internal.h
index ac50c42750cf..87edcc6746a2 100644
--- a/tools/testing/selftests/kvm/lib/kvm_util_internal.h
+++ b/tools/testing/selftests/kvm/lib/kvm_util_internal.h
@@ -39,6 +39,8 @@ struct vcpu {
 	uint32_t id;
 	int fd;
 	struct kvm_run *state;
+	struct kvm_dirty_gfn *dirty_gfns;
+	uint32_t dirty_gfns_count;
 };
 
 struct kvm_vm {
@@ -61,6 +63,7 @@ struct kvm_vm {
 	vm_paddr_t pgd;
 	vm_vaddr_t gdt;
 	vm_vaddr_t tss;
+	uint32_t dirty_ring_size;
 };
 
 struct vcpu *vcpu_find(struct kvm_vm *vm, uint32_t vcpuid);
-- 
2.24.1



* [PATCH v3 20/21] KVM: selftests: Let dirty_log_test async for dirty ring test
  2020-01-09 14:57 [PATCH v3 00/21] KVM: Dirty ring interface Peter Xu
                   ` (18 preceding siblings ...)
  2020-01-09 14:57 ` [PATCH v3 19/21] KVM: selftests: Add dirty ring buffer test Peter Xu
@ 2020-01-09 14:57 ` Peter Xu
  2020-01-09 14:57 ` [PATCH v3 21/21] KVM: selftests: Add "-c" parameter to dirty log test Peter Xu
                   ` (3 subsequent siblings)
  23 siblings, 0 replies; 82+ messages in thread
From: Peter Xu @ 2020-01-09 14:57 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: Christophe de Dinechin, Michael S . Tsirkin, Paolo Bonzini,
	Sean Christopherson, Yan Zhao, Alex Williamson, Jason Wang,
	Kevin Kevin, Vitaly Kuznetsov, peterx, Dr . David Alan Gilbert

Previously the dirty ring test worked in a synchronous way, because
only with a vmexit (in this case the ring full event) do we know that
the hardware dirty bits have been flushed to the dirty ring.

This patch introduces a vcpu kick mechanism using SIGUSR1, which
guarantees a vmexit and hence the flushing of the hardware dirty
bits.  With that, we can keep the vcpu dirtying work asynchronous to
the whole collection procedure.  Still, we need to be careful in that
we can only collect asynchronously if the vcpu has not reached the
soft limit (no KVM_EXIT_DIRTY_RING_FULL).  Otherwise we must collect
the dirty bits before continuing the vcpu.

Further, increase the dirty ring size to the current maximum to make
sure we stress the no-ring-full case more, which should be the major
scenario when hypervisors like QEMU use this feature.
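
In rough pseudo-C, the collection path after this patch becomes the
following (names as in the diff below; collect_and_reset() is a
hypothetical stand-in for the actual collect/reset calls):

    vcpu_kick();                             /* SIGUSR1: KVM_RUN returns with EINTR */
    sem_wait_until(&dirty_ring_vcpu_stop);
    if (!dirty_ring_vcpu_ring_full)
            sem_post(&dirty_ring_vcpu_cont); /* not full: collect concurrently */
    collect_and_reset();                     /* hypothetical helper */
    if (dirty_ring_vcpu_ring_full)
            sem_post(&dirty_ring_vcpu_cont); /* full: must collect before resume */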

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 tools/testing/selftests/kvm/dirty_log_test.c  | 123 +++++++++++++-----
 .../testing/selftests/kvm/include/kvm_util.h  |   1 +
 tools/testing/selftests/kvm/lib/kvm_util.c    |   8 ++
 3 files changed, 103 insertions(+), 29 deletions(-)

diff --git a/tools/testing/selftests/kvm/dirty_log_test.c b/tools/testing/selftests/kvm/dirty_log_test.c
index 6a551f285dea..6da97e4a9408 100644
--- a/tools/testing/selftests/kvm/dirty_log_test.c
+++ b/tools/testing/selftests/kvm/dirty_log_test.c
@@ -13,6 +13,9 @@
 #include <time.h>
 #include <pthread.h>
 #include <semaphore.h>
+#include <sys/types.h>
+#include <signal.h>
+#include <errno.h>
 #include <linux/bitmap.h>
 #include <linux/bitops.h>
 #include <asm/barrier.h>
@@ -59,7 +62,9 @@
 # define test_and_clear_bit_le	test_and_clear_bit
 #endif
 
-#define TEST_DIRTY_RING_COUNT		1024
+#define TEST_DIRTY_RING_COUNT		65536
+
+#define SIG_IPI SIGUSR1
 
 /*
  * Guest/Host shared variables. Ensure addr_gva2hva() and/or
@@ -135,6 +140,12 @@ static uint64_t host_track_next_count;
 /* Whether dirty ring reset is requested, or finished */
 static sem_t dirty_ring_vcpu_stop;
 static sem_t dirty_ring_vcpu_cont;
+/*
+ * This is updated by the vcpu thread to tell the host whether it's a
+ * ring-full event.  It should only be read after a sem_wait() on
+ * dirty_ring_vcpu_stop and before the vcpu continues to run.
+ */
+static bool dirty_ring_vcpu_ring_full;
 
 enum log_mode_t {
 	/* Only use KVM_GET_DIRTY_LOG for logging */
@@ -151,6 +162,33 @@ enum log_mode_t {
 
 /* Mode of logging.  Default is LOG_MODE_DIRTY_LOG */
 static enum log_mode_t host_log_mode;
+pthread_t vcpu_thread;
+
+/* Only way to pass this to the signal handler */
+struct kvm_vm *current_vm;
+
+static void vcpu_sig_handler(int sig)
+{
+	TEST_ASSERT(sig == SIG_IPI, "unknown signal: %d", sig);
+}
+
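+/* Deliver SIG_IPI to the vcpu thread, forcing KVM_RUN to return with EINTR */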
+static void vcpu_kick(void)
+{
+	pthread_kill(vcpu_thread, SIG_IPI);
+}
+
+/*
+ * Since this test plays signal tricks, use a version of sem_wait()
+ * that retries when interrupted by a signal
+ */
+static void sem_wait_until(sem_t *sem)
+{
+	int ret;
+
+	do
+		ret = sem_wait(sem);
+	while (ret == -1 && errno == EINTR);
+}
 
 static void clear_log_create_vm_done(struct kvm_vm *vm)
 {
@@ -179,10 +217,13 @@ static void clear_log_collect_dirty_pages(struct kvm_vm *vm, int slot,
 	kvm_vm_clear_dirty_log(vm, slot, bitmap, 0, num_pages);
 }
 
-static void default_after_vcpu_run(struct kvm_vm *vm)
+static void default_after_vcpu_run(struct kvm_vm *vm, int ret, int err)
 {
 	struct kvm_run *run = vcpu_state(vm, VCPU_ID);
 
+	TEST_ASSERT(ret == 0 || (ret == -1 && err == EINTR),
+		    "vcpu run failed: errno=%d", err);
+
 	TEST_ASSERT(get_ucall(vm, VCPU_ID, NULL) == UCALL_SYNC,
 		    "Invalid guest sync status: exit_reason=%s\n",
 		    exit_reason_str(run->exit_reason));
@@ -235,27 +276,37 @@ static uint32_t dirty_ring_collect_one(struct kvm_dirty_gfn *dirty_gfns,
 	return count;
 }
 
+static void dirty_ring_wait_vcpu(void)
+{
+	/* This makes sure that the hardware PML cache is flushed */
+	vcpu_kick();
+	sem_wait_until(&dirty_ring_vcpu_stop);
+}
+
+static void dirty_ring_continue_vcpu(void)
+{
+	DEBUG("Notifying vcpu to continue\n");
+	sem_post(&dirty_ring_vcpu_cont);
+}
+
 static void dirty_ring_collect_dirty_pages(struct kvm_vm *vm, int slot,
 					   void *bitmap, uint32_t num_pages)
 {
 	/* We only have one vcpu */
 	struct kvm_run *state = vcpu_state(vm, VCPU_ID);
 	uint32_t count = 0, cleared;
+	bool continued_vcpu = false;
 
-	/*
-	 * Before fetching the dirty pages, we need a vmexit of the
-	 * worker vcpu to make sure the hardware dirty buffers were
-	 * flushed.  This is not needed for dirty-log/clear-log tests
-	 * because get dirty log will naturally do so.
-	 *
-	 * For now we do it in the simple way - we simply wait until
-	 * the vcpu uses up the soft dirty ring, then it'll always
-	 * do a vmexit to make sure that PML buffers will be flushed.
-	 * In real hypervisors, we probably need a vcpu kick or to
-	 * stop the vcpus (before the final sync) to make sure we'll
-	 * get all the existing dirty PFNs even cached in hardware.
-	 */
-	sem_wait(&dirty_ring_vcpu_stop);
+	dirty_ring_wait_vcpu();
+
+	if (!dirty_ring_vcpu_ring_full) {
+		/*
+		 * This is not a ring-full event, it's safe to allow
+		 * vcpu to continue
+		 */
+		dirty_ring_continue_vcpu();
+		continued_vcpu = true;
+	}
 
 	/* Only have one vcpu */
 	count = dirty_ring_collect_one(vcpu_map_dirty_ring(vm, VCPU_ID),
@@ -268,13 +319,16 @@ static void dirty_ring_collect_dirty_pages(struct kvm_vm *vm, int slot,
 	TEST_ASSERT(cleared == count, "Reset dirty pages (%u) mismatch "
 		    "with collected (%u)", cleared, count);
 
-	DEBUG("Notifying vcpu to continue\n");
-	sem_post(&dirty_ring_vcpu_cont);
+	if (!continued_vcpu) {
+		TEST_ASSERT(dirty_ring_vcpu_ring_full,
+			    "Didn't continue vcpu even without ring full");
+		dirty_ring_continue_vcpu();
+	}
 
 	DEBUG("Iteration %"PRIu64" collected %u pages\n", iteration, count);
 }
 
-static void dirty_ring_after_vcpu_run(struct kvm_vm *vm)
+static void dirty_ring_after_vcpu_run(struct kvm_vm *vm, int ret, int err)
 {
 	struct kvm_run *run = vcpu_state(vm, VCPU_ID);
 
@@ -282,10 +336,16 @@ static void dirty_ring_after_vcpu_run(struct kvm_vm *vm)
 	if (get_ucall(vm, VCPU_ID, NULL) == UCALL_SYNC) {
 		/* We should allow this to continue */
 		;
-	} else if (run->exit_reason == KVM_EXIT_DIRTY_RING_FULL) {
+	} else if (run->exit_reason == KVM_EXIT_DIRTY_RING_FULL ||
+		   (ret == -1 && err == EINTR)) {
+		/* Update the flag first before pause */
+		WRITE_ONCE(dirty_ring_vcpu_ring_full,
+			   run->exit_reason == KVM_EXIT_DIRTY_RING_FULL);
 		sem_post(&dirty_ring_vcpu_stop);
-		DEBUG("vcpu stops because dirty ring full...\n");
-		sem_wait(&dirty_ring_vcpu_cont);
+		DEBUG("vcpu stops because %s...\n",
+		      dirty_ring_vcpu_ring_full ?
+		      "dirty ring is full" : "vcpu is kicked out");
+		sem_wait_until(&dirty_ring_vcpu_cont);
 		DEBUG("vcpu continues now.\n");
 	} else {
 		TEST_ASSERT(false, "Invalid guest sync status: "
@@ -308,7 +368,7 @@ struct log_mode {
 	void (*collect_dirty_pages) (struct kvm_vm *vm, int slot,
 				     void *bitmap, uint32_t num_pages);
 	/* Hook to call after each vcpu run */
-	void (*after_vcpu_run)(struct kvm_vm *vm);
+	void (*after_vcpu_run)(struct kvm_vm *vm, int ret, int err);
 	void (*before_vcpu_join)(void);
 } log_modes[LOG_MODE_NUM] = {
 	{
@@ -368,12 +428,12 @@ static void log_mode_collect_dirty_pages(struct kvm_vm *vm, int slot,
 	mode->collect_dirty_pages(vm, slot, bitmap, num_pages);
 }
 
-static void log_mode_after_vcpu_run(struct kvm_vm *vm)
+static void log_mode_after_vcpu_run(struct kvm_vm *vm, int ret, int err)
 {
 	struct log_mode *mode = &log_modes[host_log_mode];
 
 	if (mode->after_vcpu_run)
-		mode->after_vcpu_run(vm);
+		mode->after_vcpu_run(vm, ret, err);
 }
 
 static void log_mode_before_vcpu_join(void)
@@ -397,15 +457,21 @@ static void *vcpu_worker(void *data)
 	int ret;
 	struct kvm_vm *vm = data;
 	uint64_t *guest_array;
+	struct sigaction sigact;
+
+	current_vm = vm;
+	memset(&sigact, 0, sizeof(sigact));
+	sigact.sa_handler = vcpu_sig_handler;
+	sigaction(SIG_IPI, &sigact, NULL);
 
 	guest_array = addr_gva2hva(vm, (vm_vaddr_t)random_array);
 
 	while (!READ_ONCE(host_quit)) {
+		/* Pick a new set of random pages for the guest to dirty */
 		generate_random_array(guest_array, TEST_PAGES_PER_LOOP);
 		/* Let the guest dirty the random pages */
-		ret = _vcpu_run(vm, VCPU_ID);
-		TEST_ASSERT(ret == 0, "vcpu_run failed: %d\n", ret);
-		log_mode_after_vcpu_run(vm);
+		ret = __vcpu_run(vm, VCPU_ID);
+		log_mode_after_vcpu_run(vm, ret, errno);
 	}
 
 	return NULL;
@@ -528,7 +594,6 @@ static struct kvm_vm *create_vm(enum vm_guest_mode mode, uint32_t vcpuid,
 static void run_test(enum vm_guest_mode mode, unsigned long iterations,
 		     unsigned long interval, uint64_t phys_offset)
 {
-	pthread_t vcpu_thread;
 	struct kvm_vm *vm;
 	unsigned long *bmap;
 
diff --git a/tools/testing/selftests/kvm/include/kvm_util.h b/tools/testing/selftests/kvm/include/kvm_util.h
index 4b78a8d3e773..e64fbfe6bbd5 100644
--- a/tools/testing/selftests/kvm/include/kvm_util.h
+++ b/tools/testing/selftests/kvm/include/kvm_util.h
@@ -115,6 +115,7 @@ vm_paddr_t addr_gva2gpa(struct kvm_vm *vm, vm_vaddr_t gva);
 struct kvm_run *vcpu_state(struct kvm_vm *vm, uint32_t vcpuid);
 void vcpu_run(struct kvm_vm *vm, uint32_t vcpuid);
 int _vcpu_run(struct kvm_vm *vm, uint32_t vcpuid);
+int __vcpu_run(struct kvm_vm *vm, uint32_t vcpuid);
 void vcpu_run_complete_io(struct kvm_vm *vm, uint32_t vcpuid);
 void vcpu_set_mp_state(struct kvm_vm *vm, uint32_t vcpuid,
 		       struct kvm_mp_state *mp_state);
diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c b/tools/testing/selftests/kvm/lib/kvm_util.c
index 81222e2f841e..12c83e2f3300 100644
--- a/tools/testing/selftests/kvm/lib/kvm_util.c
+++ b/tools/testing/selftests/kvm/lib/kvm_util.c
@@ -1187,6 +1187,14 @@ int _vcpu_run(struct kvm_vm *vm, uint32_t vcpuid)
 	return rc;
 }
 
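+/*
+ * Like _vcpu_run() but without asserting on the return value, so that
+ * the caller can observe -1 with errno set to EINTR when the vcpu
+ * thread gets kicked by a signal.
+ */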
+int __vcpu_run(struct kvm_vm *vm, uint32_t vcpuid)
+{
+	struct vcpu *vcpu = vcpu_find(vm, vcpuid);
+
+	TEST_ASSERT(vcpu != NULL, "vcpu not found, vcpuid: %u", vcpuid);
+	return ioctl(vcpu->fd, KVM_RUN, NULL);
+}
+
 void vcpu_run_complete_io(struct kvm_vm *vm, uint32_t vcpuid)
 {
 	struct vcpu *vcpu = vcpu_find(vm, vcpuid);
-- 
2.24.1



* [PATCH v3 21/21] KVM: selftests: Add "-c" parameter to dirty log test
  2020-01-09 14:57 [PATCH v3 00/21] KVM: Dirty ring interface Peter Xu
                   ` (19 preceding siblings ...)
  2020-01-09 14:57 ` [PATCH v3 20/21] KVM: selftests: Let dirty_log_test async for dirty ring test Peter Xu
@ 2020-01-09 14:57 ` Peter Xu
  2020-01-09 15:59 ` [PATCH v3 00/21] KVM: Dirty ring interface Michael S. Tsirkin
                   ` (2 subsequent siblings)
  23 siblings, 0 replies; 82+ messages in thread
From: Peter Xu @ 2020-01-09 14:57 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: Christophe de Dinechin, Michael S . Tsirkin, Paolo Bonzini,
	Sean Christopherson, Yan Zhao, Alex Williamson, Jason Wang,
	Kevin Kevin, Vitaly Kuznetsov, peterx, Dr . David Alan Gilbert

The new "-c" parameter overrides the default dirty ring size (in
number of entries).  With a bigger ring count we test the async path
of the dirty ring; with a smaller ring count we test the ring-full
code path.  Async is the default.

It has no use for non-dirty-ring tests.
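
For example (assuming the final test binary name, and the "-M" switch
introduced earlier in the series to select the log mode), both paths
could be exercised like:

    # small ring: stress the ring-full (synchronous) code path
    ./dirty_log_test -M dirty-ring -c 1024

    # big ring (the default): stress the async code path
    ./dirty_log_test -M dirty-ring -c 65536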

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 tools/testing/selftests/kvm/dirty_log_test.c | 13 ++++++++++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/tools/testing/selftests/kvm/dirty_log_test.c b/tools/testing/selftests/kvm/dirty_log_test.c
index 6da97e4a9408..fb6c33dbaf35 100644
--- a/tools/testing/selftests/kvm/dirty_log_test.c
+++ b/tools/testing/selftests/kvm/dirty_log_test.c
@@ -163,6 +163,7 @@ enum log_mode_t {
 /* Mode of logging.  Default is LOG_MODE_DIRTY_LOG */
 static enum log_mode_t host_log_mode;
 pthread_t vcpu_thread;
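+/* Dirty ring entry count; can be overridden with the "-c" parameter */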
+static uint32_t test_dirty_ring_count = TEST_DIRTY_RING_COUNT;
 
 /* Only way to pass this to the signal handler */
 struct kvm_vm *current_vm;
@@ -235,7 +236,7 @@ static void dirty_ring_create_vm_done(struct kvm_vm *vm)
 	 * Switch to dirty ring mode after VM creation but before any
 	 * vcpu is created.
 	 */
-	vm_enable_dirty_ring(vm, TEST_DIRTY_RING_COUNT *
+	vm_enable_dirty_ring(vm, test_dirty_ring_count *
 			     sizeof(struct kvm_dirty_gfn));
 }
 
@@ -260,7 +261,7 @@ static uint32_t dirty_ring_collect_one(struct kvm_dirty_gfn *dirty_gfns,
 	DEBUG("ring %d: fetch: 0x%x, avail: 0x%x\n", index, fetch, avail);
 
 	while (fetch != avail) {
-		cur = &dirty_gfns[fetch % TEST_DIRTY_RING_COUNT];
+		cur = &dirty_gfns[fetch % test_dirty_ring_count];
 		TEST_ASSERT(cur->pad == 0, "Padding is non-zero: 0x%x", cur->pad);
 		TEST_ASSERT(cur->slot == slot, "Slot number didn't match: "
 			    "%u != %u", cur->slot, slot);
@@ -723,6 +724,9 @@ static void help(char *name)
 	printf("usage: %s [-h] [-i iterations] [-I interval] "
 	       "[-p offset] [-m mode]\n", name);
 	puts("");
+	printf(" -c: specify dirty ring size, in number of entries\n");
+	printf("     (only useful for dirty-ring test; default: %"PRIu32")\n",
+	       TEST_DIRTY_RING_COUNT);
 	printf(" -i: specify iteration counts (default: %"PRIu64")\n",
 	       TEST_HOST_LOOP_N);
 	printf(" -I: specify interval in ms (default: %"PRIu64" ms)\n",
@@ -778,8 +782,11 @@ int main(int argc, char *argv[])
 	vm_guest_mode_params_init(VM_MODE_P40V48_4K, true, true);
 #endif
 
-	while ((opt = getopt(argc, argv, "hi:I:p:m:M:")) != -1) {
+	while ((opt = getopt(argc, argv, "c:hi:I:p:m:M:")) != -1) {
 		switch (opt) {
+		case 'c':
+			test_dirty_ring_count = strtol(optarg, NULL, 10);
+			break;
 		case 'i':
 			iterations = strtol(optarg, NULL, 10);
 			break;
-- 
2.24.1



* Re: [PATCH v3 00/21] KVM: Dirty ring interface
  2020-01-09 14:57 [PATCH v3 00/21] KVM: Dirty ring interface Peter Xu
                   ` (20 preceding siblings ...)
  2020-01-09 14:57 ` [PATCH v3 21/21] KVM: selftests: Add "-c" parameter to dirty log test Peter Xu
@ 2020-01-09 15:59 ` Michael S. Tsirkin
  2020-01-09 16:17   ` Peter Xu
  2020-01-09 16:47 ` Alex Williamson
  2020-01-19  9:11 ` Paolo Bonzini
  23 siblings, 1 reply; 82+ messages in thread
From: Michael S. Tsirkin @ 2020-01-09 15:59 UTC (permalink / raw)
  To: Peter Xu
  Cc: kvm, linux-kernel, Christophe de Dinechin, Paolo Bonzini,
	Sean Christopherson, Yan Zhao, Alex Williamson, Jason Wang,
	Kevin Kevin, Vitaly Kuznetsov, Dr . David Alan Gilbert

On Thu, Jan 09, 2020 at 09:57:08AM -0500, Peter Xu wrote:
> Branch is here: https://github.com/xzpeter/linux/tree/kvm-dirty-ring
> (based on kvm/queue)
> 
> Please refer to either the previous cover letters, or documentation
> update in patch 12 for the big picture.

I would rather you pasted it here. There's no way to respond otherwise.

For something that's presumably an optimization, isn't there
some kind of testing that can be done to show the benefits?
What kind of gain was observed?

I know it's mostly relevant for huge VMs, but OTOH these
probably use huge pages.




>  Previous posts:
> 
> V1: https://lore.kernel.org/kvm/20191129213505.18472-1-peterx@redhat.com
> V2: https://lore.kernel.org/kvm/20191221014938.58831-1-peterx@redhat.com
> 
> The major change in V3 is that we dropped the whole waitqueue and the
> global lock. With that, we have clean per-vcpu ring and no default
> ring any more.  The two kvmgt refactoring patches were also included
> to show the dependency of the works.
> 
> Patchset layout:
> 
> Patch 1-2:         Picked up from kvmgt refactoring
> Patch 3-6:         Small patches that are not directly related,
>                    (So can be acked/nacked/picked as standalone)
> Patch 7-11:        Prepares for the dirty ring interface
> Patch 12:          Major implementation
> Patch 13-14:       Quick follow-ups for patch 8
> Patch 15-21:       Test cases
> 
> V3 changelog:
> 
> - fail userspace writable maps on dirty ring ranges [Jason]
> - commit message fixups [Paolo]
> - change __x86_set_memory_region to return hva [Paolo]
> - cacheline align for indices [Paolo, Jason]
> - drop waitqueue, global lock, etc., include kvmgt rework patchset
> - take lock for __x86_set_memory_region() (otherwise it triggers a
>   lockdep in latest kvm/queue) [Paolo]
> - check KVM_DIRTY_LOG_PAGE_OFFSET in kvm_vm_ioctl_enable_dirty_log_ring
> - one more patch to drop x86_set_memory_region [Paolo]
> - one more patch to remove extra srcu usage in init_rmode_identity_map()
> - add some r-bs for Paolo
> 
> Please review, thanks.
> 
> Paolo Bonzini (1):
>   KVM: Move running VCPU from ARM to common code
> 
> Peter Xu (18):
>   KVM: Remove kvm_read_guest_atomic()
>   KVM: Add build-time error check on kvm_run size
>   KVM: X86: Change parameter for fast_page_fault tracepoint
>   KVM: X86: Don't take srcu lock in init_rmode_identity_map()
>   KVM: Cache as_id in kvm_memory_slot
>   KVM: X86: Drop x86_set_memory_region()
>   KVM: X86: Don't track dirty for KVM_SET_[TSS_ADDR|IDENTITY_MAP_ADDR]
>   KVM: Pass in kvm pointer into mark_page_dirty_in_slot()
>   KVM: X86: Implement ring-based dirty memory tracking
>   KVM: Make dirty ring exclusive to dirty bitmap log
>   KVM: Don't allocate dirty bitmap if dirty ring is enabled
>   KVM: selftests: Always clear dirty bitmap after iteration
>   KVM: selftests: Sync uapi/linux/kvm.h to tools/
>   KVM: selftests: Use a single binary for dirty/clear log test
>   KVM: selftests: Introduce after_vcpu_run hook for dirty log test
>   KVM: selftests: Add dirty ring buffer test
>   KVM: selftests: Let dirty_log_test async for dirty ring test
>   KVM: selftests: Add "-c" parameter to dirty log test
> 
> Yan Zhao (2):
>   vfio: introduce vfio_iova_rw to read/write a range of IOVAs
>   drm/i915/gvt: subsitute kvm_read/write_guest with vfio_iova_rw
> 
>  Documentation/virt/kvm/api.txt                |  96 ++++
>  arch/arm/include/asm/kvm_host.h               |   2 -
>  arch/arm64/include/asm/kvm_host.h             |   2 -
>  arch/x86/include/asm/kvm_host.h               |   7 +-
>  arch/x86/include/uapi/asm/kvm.h               |   1 +
>  arch/x86/kvm/Makefile                         |   3 +-
>  arch/x86/kvm/mmu/mmu.c                        |   6 +
>  arch/x86/kvm/mmutrace.h                       |   9 +-
>  arch/x86/kvm/svm.c                            |   3 +-
>  arch/x86/kvm/vmx/vmx.c                        |  86 ++--
>  arch/x86/kvm/x86.c                            |  43 +-
>  drivers/gpu/drm/i915/gvt/kvmgt.c              |  25 +-
>  drivers/vfio/vfio.c                           |  45 ++
>  drivers/vfio/vfio_iommu_type1.c               |  81 ++++
>  include/linux/kvm_dirty_ring.h                |  55 +++
>  include/linux/kvm_host.h                      |  37 +-
>  include/linux/vfio.h                          |   5 +
>  include/trace/events/kvm.h                    |  78 ++++
>  include/uapi/linux/kvm.h                      |  33 ++
>  tools/include/uapi/linux/kvm.h                |  38 ++
>  tools/testing/selftests/kvm/Makefile          |   2 -
>  .../selftests/kvm/clear_dirty_log_test.c      |   2 -
>  tools/testing/selftests/kvm/dirty_log_test.c  | 420 ++++++++++++++++--
>  .../testing/selftests/kvm/include/kvm_util.h  |   4 +
>  tools/testing/selftests/kvm/lib/kvm_util.c    |  72 +++
>  .../selftests/kvm/lib/kvm_util_internal.h     |   3 +
>  virt/kvm/arm/arch_timer.c                     |   2 +-
>  virt/kvm/arm/arm.c                            |  29 --
>  virt/kvm/arm/perf.c                           |   6 +-
>  virt/kvm/arm/vgic/vgic-mmio.c                 |  15 +-
>  virt/kvm/dirty_ring.c                         | 162 +++++++
>  virt/kvm/kvm_main.c                           | 215 +++++++--
>  32 files changed, 1379 insertions(+), 208 deletions(-)
>  create mode 100644 include/linux/kvm_dirty_ring.h
>  delete mode 100644 tools/testing/selftests/kvm/clear_dirty_log_test.c
>  create mode 100644 virt/kvm/dirty_ring.c
> 
> -- 
> 2.24.1



* Re: [PATCH v3 00/21] KVM: Dirty ring interface
  2020-01-09 15:59 ` [PATCH v3 00/21] KVM: Dirty ring interface Michael S. Tsirkin
@ 2020-01-09 16:17   ` Peter Xu
  2020-01-09 16:40     ` Michael S. Tsirkin
  0 siblings, 1 reply; 82+ messages in thread
From: Peter Xu @ 2020-01-09 16:17 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: kvm, linux-kernel, Christophe de Dinechin, Paolo Bonzini,
	Sean Christopherson, Yan Zhao, Alex Williamson, Jason Wang,
	Kevin Kevin, Vitaly Kuznetsov, Dr . David Alan Gilbert

On Thu, Jan 09, 2020 at 10:59:50AM -0500, Michael S. Tsirkin wrote:
> On Thu, Jan 09, 2020 at 09:57:08AM -0500, Peter Xu wrote:
> > Branch is here: https://github.com/xzpeter/linux/tree/kvm-dirty-ring
> > (based on kvm/queue)
> > 
> > Please refer to either the previous cover letters, or documentation
> > update in patch 12 for the big picture.
> 
> I would rather you pasted it here. There's no way to respond otherwise.

Sure, will do in the next post.

> 
> For something that's presumably an optimization, isn't there
> some kind of testing that can be done to show the benefits?
> What kind of gain was observed?

Since the interface seems to be settling soon, maybe it's time to
work on the QEMU part so I can give some numbers.  It would be
interesting to see the curves comparing dirty logging and dirty ring
even for some small VMs that run some workloads inside.

> 
> I know it's mostly relevant for huge VMs, but OTOH these
> probably use huge pages.

Yes huge VMs could benefit more, especially if the dirty rate is not
that high, I believe.  Though, could you elaborate on why huge pages
are special here?

Thanks,

-- 
Peter Xu



* Re: [PATCH v3 12/21] KVM: X86: Implement ring-based dirty memory tracking
  2020-01-09 14:57 ` [PATCH v3 12/21] KVM: X86: Implement ring-based dirty memory tracking Peter Xu
@ 2020-01-09 16:29   ` Michael S. Tsirkin
  2020-01-09 16:56     ` Alex Williamson
  2020-01-09 19:15     ` Peter Xu
  2020-01-11  4:49   ` kbuild test robot
                     ` (3 subsequent siblings)
  4 siblings, 2 replies; 82+ messages in thread
From: Michael S. Tsirkin @ 2020-01-09 16:29 UTC (permalink / raw)
  To: Peter Xu
  Cc: kvm, linux-kernel, Christophe de Dinechin, Paolo Bonzini,
	Sean Christopherson, Yan Zhao, Alex Williamson, Jason Wang,
	Kevin Kevin, Vitaly Kuznetsov, Dr . David Alan Gilbert, Lei Cao

On Thu, Jan 09, 2020 at 09:57:20AM -0500, Peter Xu wrote:
> This patch is heavily based on previous work from Lei Cao
> <lei.cao@stratus.com> and Paolo Bonzini <pbonzini@redhat.com>. [1]
> 
> KVM currently uses large bitmaps to track dirty memory.  These bitmaps
> are copied to userspace when userspace queries KVM for its dirty page
> information.  The use of bitmaps is mostly sufficient for live
> migration, as large parts of memory are dirtied from one log-dirty
> pass to another.  However, in a checkpointing system, the number of
> dirty pages is small and in fact it is often bounded---the VM is
> paused when it has dirtied a pre-defined number of pages. Traversing a
> large, sparsely populated bitmap to find set bits is time-consuming,
> as is copying the bitmap to user-space.
> 
> A similar issue will be there for live migration when the guest memory
> is huge while the page dirty procedure is trivial.  In that case for
> each dirty sync we need to pull the whole dirty bitmap to userspace
> and analyse every bit even if it's mostly zeros.
> 
> The preferred data structure for above scenarios is a dense list of
> guest frame numbers (GFN).

No longer, this uses an array of structs.

>  This patch series stores the dirty list in
> kernel memory that can be memory mapped into userspace to allow speedy
> harvesting.
> 
> This patch enables dirty ring for X86 only.  However it should be
> easily extended to other archs as well.
> 
> [1] https://patchwork.kernel.org/patch/10471409/
> 
> Signed-off-by: Lei Cao <lei.cao@stratus.com>
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  Documentation/virt/kvm/api.txt  |  89 ++++++++++++++++++
>  arch/x86/include/asm/kvm_host.h |   3 +
>  arch/x86/include/uapi/asm/kvm.h |   1 +
>  arch/x86/kvm/Makefile           |   3 +-
>  arch/x86/kvm/mmu/mmu.c          |   6 ++
>  arch/x86/kvm/vmx/vmx.c          |   7 ++
>  arch/x86/kvm/x86.c              |   9 ++
>  include/linux/kvm_dirty_ring.h  |  55 +++++++++++
>  include/linux/kvm_host.h        |  26 +++++
>  include/trace/events/kvm.h      |  78 +++++++++++++++
>  include/uapi/linux/kvm.h        |  33 +++++++
>  virt/kvm/dirty_ring.c           | 162 ++++++++++++++++++++++++++++++++
>  virt/kvm/kvm_main.c             | 137 ++++++++++++++++++++++++++-
>  13 files changed, 606 insertions(+), 3 deletions(-)
>  create mode 100644 include/linux/kvm_dirty_ring.h
>  create mode 100644 virt/kvm/dirty_ring.c
> 
> diff --git a/Documentation/virt/kvm/api.txt b/Documentation/virt/kvm/api.txt
> index ebb37b34dcfc..708c3e0f7eae 100644
> --- a/Documentation/virt/kvm/api.txt
> +++ b/Documentation/virt/kvm/api.txt
> @@ -231,6 +231,7 @@ Based on their initialization different VMs may have different capabilities.
>  It is thus encouraged to use the vm ioctl to query for capabilities (available
>  with KVM_CAP_CHECK_EXTENSION_VM on the vm fd)
>  
> +
>  4.5 KVM_GET_VCPU_MMAP_SIZE
>  
>  Capability: basic
> @@ -243,6 +244,18 @@ The KVM_RUN ioctl (cf.) communicates with userspace via a shared
>  memory region.  This ioctl returns the size of that region.  See the
>  KVM_RUN documentation for details.
>  
> +Besides the size of the KVM_RUN communication region, other areas of
> +the VCPU file descriptor can be mmap-ed, including:
> +
> +- if KVM_CAP_COALESCED_MMIO is available, a page at
> +  KVM_COALESCED_MMIO_PAGE_OFFSET * PAGE_SIZE; for historical reasons,
> +  this page is included in the result of KVM_GET_VCPU_MMAP_SIZE.
> +  KVM_CAP_COALESCED_MMIO is not documented yet.
> +
> +- if KVM_CAP_DIRTY_LOG_RING is available, a number of pages at
> +  KVM_DIRTY_LOG_PAGE_OFFSET * PAGE_SIZE.  For more information on
> +  KVM_CAP_DIRTY_LOG_RING, see section 8.3.
> +
>  
>  4.6 KVM_SET_MEMORY_REGION
>  
> @@ -5376,6 +5389,7 @@ CPU when the exception is taken. If this virtual SError is taken to EL1 using
>  AArch64, this value will be reported in the ISS field of ESR_ELx.
>  
>  See KVM_CAP_VCPU_EVENTS for more details.
> +
>  8.20 KVM_CAP_HYPERV_SEND_IPI
>  
>  Architectures: x86
> @@ -5383,6 +5397,7 @@ Architectures: x86
>  This capability indicates that KVM supports paravirtualized Hyper-V IPI send
>  hypercalls:
>  HvCallSendSyntheticClusterIpi, HvCallSendSyntheticClusterIpiEx.
> +
>  8.21 KVM_CAP_HYPERV_DIRECT_TLBFLUSH
>  
>  Architecture: x86
> @@ -5396,3 +5411,77 @@ handling by KVM (as some KVM hypercall may be mistakenly treated as TLB
>  flush hypercalls by Hyper-V) so userspace should disable KVM identification
>  in CPUID and only exposes Hyper-V identification. In this case, guest
>  thinks it's running on Hyper-V and only use Hyper-V hypercalls.
> +
> +8.22 KVM_CAP_DIRTY_LOG_RING
> +
> +Architectures: x86
> +Parameters: args[0] - size of the dirty log ring
> +
> +KVM is capable of tracking dirty memory using ring buffers that are
> +mmaped into userspace; there is one dirty ring per vcpu.
> +
> +One dirty ring is defined as below internally:
> +
> +struct kvm_dirty_ring {
> +	u32 dirty_index;
> +	u32 reset_index;
> +	u32 size;
> +	u32 soft_limit;
> +	struct kvm_dirty_gfn *dirty_gfns;
> +	struct kvm_dirty_ring_indices *indices;
> +	int index;
> +};
> +
> +Dirty GFNs (Guest Frame Numbers) are stored in the dirty_gfns array.
> +For each of the dirty entry it's defined as:
> +
> +struct kvm_dirty_gfn {
> +        __u32 pad;

How about sticking a length here?
This way huge pages can be dirtied in one go.

> +        __u32 slot; /* as_id | slot_id */
> +        __u64 offset;
> +};
> +
> +Most of the ring structure is used by KVM internally, while only the
> +indices are exposed to userspace:
> +
> +struct kvm_dirty_ring_indices {
> +	__u32 avail_index; /* set by kernel */
> +	__u32 fetch_index; /* set by userspace */
> +};
> +
> +The two indices in the ring buffer are free running counters.
> +
> +Userspace calls KVM_ENABLE_CAP ioctl right after KVM_CREATE_VM ioctl
> +to enable this capability for the new guest and set the size of the
> +rings.  It is only allowed before creating any vCPU, and the size of
> +the ring must be a power of two.


I know index design is popular, but testing with virtio showed
that it's better to just have a flags field marking
an entry as valid. In particular this gets rid of the
running counters and power of two limitations.
It also removes the need for a separate index page, which is nice.



>  The larger the ring buffer, the less
> +likely the ring is full and the VM is forced to exit to userspace. The
> +optimal size depends on the workload, but it is recommended that it be
> +at least 64 KiB (4096 entries).

Where's this number coming from? Given you have indices as well,
4K-sized rings are likely to cause cache contention.

> +Just like for dirty page bitmaps, the buffer tracks writes to
> +all user memory regions for which the KVM_MEM_LOG_DIRTY_PAGES flag was
> +set in KVM_SET_USER_MEMORY_REGION.  Once a memory region is registered
> +with the flag set, userspace can start harvesting dirty pages from the
> +ring buffer.
> +
> +To harvest the dirty pages, userspace accesses the mmaped ring buffer
> +to read the dirty GFNs up to avail_index, and sets the fetch_index
> +accordingly.  This can be done when the guest is running or paused,
> +and dirty pages need not be collected all at once.  After processing
> +one or more entries in the ring buffer, userspace calls the VM ioctl
> +KVM_RESET_DIRTY_RINGS to notify the kernel that it has updated
> +fetch_index and to mark those pages clean.  Therefore, the ioctl
> +must be called *before* reading the content of the dirty pages.
> +
> +However, there is a major difference compared to the
> +KVM_GET_DIRTY_LOG interface in that when reading the dirty ring from
> +userspace it's still possible that the kernel has not yet flushed the
> +hardware dirty buffers into the kernel buffer (which was previously
> +done by the KVM_GET_DIRTY_LOG ioctl).  To achieve that, one needs to
> +kick the vcpu out for a hardware buffer flush (vmexit) to make sure
> +all the existing dirty gfns are flushed to the dirty rings.
> +
> +If one of the ring buffers is full, the guest will exit to userspace
> +with the exit reason set to KVM_EXIT_DIRTY_RING_FULL, and the KVM_RUN
> +ioctl will return to userspace with zero.
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index f536d139b3d2..3fe18402e6a3 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1181,6 +1181,7 @@ struct kvm_x86_ops {
>  					   struct kvm_memory_slot *slot,
>  					   gfn_t offset, unsigned long mask);
>  	int (*write_log_dirty)(struct kvm_vcpu *vcpu);
> +	int (*cpu_dirty_log_size)(void);
>  
>  	/* pmu operations of sub-arch */
>  	const struct kvm_pmu_ops *pmu_ops;
> @@ -1666,4 +1667,6 @@ static inline int kvm_cpu_get_apicid(int mps_cpu)
>  #define GET_SMSTATE(type, buf, offset)		\
>  	(*(type *)((buf) + (offset) - 0x7e00))
>  
> +int kvm_cpu_dirty_log_size(void);
> +
>  #endif /* _ASM_X86_KVM_HOST_H */
> diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
> index 503d3f42da16..b59bf356c478 100644
> --- a/arch/x86/include/uapi/asm/kvm.h
> +++ b/arch/x86/include/uapi/asm/kvm.h
> @@ -12,6 +12,7 @@
>  
>  #define KVM_PIO_PAGE_OFFSET 1
>  #define KVM_COALESCED_MMIO_PAGE_OFFSET 2
> +#define KVM_DIRTY_LOG_PAGE_OFFSET 64
>  
>  #define DE_VECTOR 0
>  #define DB_VECTOR 1
> diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile
> index b19ef421084d..0acee817adfb 100644
> --- a/arch/x86/kvm/Makefile
> +++ b/arch/x86/kvm/Makefile
> @@ -5,7 +5,8 @@ ccflags-y += -Iarch/x86/kvm
>  KVM := ../../../virt/kvm
>  
>  kvm-y			+= $(KVM)/kvm_main.o $(KVM)/coalesced_mmio.o \
> -				$(KVM)/eventfd.o $(KVM)/irqchip.o $(KVM)/vfio.o
> +				$(KVM)/eventfd.o $(KVM)/irqchip.o $(KVM)/vfio.o \
> +				$(KVM)/dirty_ring.o
>  kvm-$(CONFIG_KVM_ASYNC_PF)	+= $(KVM)/async_pf.o
>  
>  kvm-y			+= x86.o emulate.o i8259.o irq.o lapic.o \
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 7269130ea5e2..621b842a9b7b 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -1832,7 +1832,13 @@ int kvm_arch_write_log_dirty(struct kvm_vcpu *vcpu)
>  {
>  	if (kvm_x86_ops->write_log_dirty)
>  		return kvm_x86_ops->write_log_dirty(vcpu);
> +	return 0;
> +}
>  
> +int kvm_cpu_dirty_log_size(void)
> +{
> +	if (kvm_x86_ops->cpu_dirty_log_size)
> +		return kvm_x86_ops->cpu_dirty_log_size();
>  	return 0;
>  }
>  
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 62175a246bcc..2151de89456d 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -7689,6 +7689,7 @@ static __init int hardware_setup(void)
>  		kvm_x86_ops->slot_disable_log_dirty = NULL;
>  		kvm_x86_ops->flush_log_dirty = NULL;
>  		kvm_x86_ops->enable_log_dirty_pt_masked = NULL;
> +		kvm_x86_ops->cpu_dirty_log_size = NULL;
>  	}
>  
>  	if (!cpu_has_vmx_preemption_timer())
> @@ -7753,6 +7754,11 @@ static __exit void hardware_unsetup(void)
>  	free_kvm_area();
>  }
>  
> +static int vmx_cpu_dirty_log_size(void)
> +{
> +	return enable_pml ? PML_ENTITY_NUM : 0;
> +}
> +
>  static struct kvm_x86_ops vmx_x86_ops __ro_after_init = {
>  	.cpu_has_kvm_support = cpu_has_kvm_support,
>  	.disabled_by_bios = vmx_disabled_by_bios,
> @@ -7875,6 +7881,7 @@ static struct kvm_x86_ops vmx_x86_ops __ro_after_init = {
>  	.flush_log_dirty = vmx_flush_log_dirty,
>  	.enable_log_dirty_pt_masked = vmx_enable_log_dirty_pt_masked,
>  	.write_log_dirty = vmx_write_pml_buffer,
> +	.cpu_dirty_log_size = vmx_cpu_dirty_log_size,
>  
>  	.pre_block = vmx_pre_block,
>  	.post_block = vmx_post_block,
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index ff97782b3919..9c3673592826 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -7998,6 +7998,15 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
>  
>  	bool req_immediate_exit = false;
>  
> +	/* Forbid vmenter if vcpu dirty ring is soft-full */
> +	if (unlikely(vcpu->kvm->dirty_ring_size &&
> +		     kvm_dirty_ring_soft_full(&vcpu->dirty_ring))) {
> +		vcpu->run->exit_reason = KVM_EXIT_DIRTY_RING_FULL;
> +		trace_kvm_dirty_ring_exit(vcpu);
> +		r = 0;
> +		goto out;
> +	}
> +
>  	if (kvm_request_pending(vcpu)) {
>  		if (kvm_check_request(KVM_REQ_GET_VMCS12_PAGES, vcpu)) {
>  			if (unlikely(!kvm_x86_ops->get_vmcs12_pages(vcpu))) {
> diff --git a/include/linux/kvm_dirty_ring.h b/include/linux/kvm_dirty_ring.h
> new file mode 100644
> index 000000000000..d6fe9e1b7617
> --- /dev/null
> +++ b/include/linux/kvm_dirty_ring.h
> @@ -0,0 +1,55 @@
> +#ifndef KVM_DIRTY_RING_H
> +#define KVM_DIRTY_RING_H
> +
> +/**
> + * kvm_dirty_ring: KVM internal dirty ring structure
> + *
> + * @dirty_index: free running counter that points to the next slot in
> + *               dirty_ring->dirty_gfns, where a new dirty page should go
> + * @reset_index: free running counter that points to the next dirty page
> + *               in dirty_ring->dirty_gfns for which dirty trap needs to
> + *               be reenabled
> + * @size:        size of the compact list, dirty_ring->dirty_gfns
> + * @soft_limit:  when the number of dirty pages in the list reaches this
> + *               limit, vcpu that owns this ring should exit to userspace
> + *               to allow userspace to harvest all the dirty pages
> + * @dirty_gfns:  the array to keep the dirty gfns
> + * @indices:     the pointer to the @kvm_dirty_ring_indices structure
> + *               of this specific ring
> + * @index:       index of this dirty ring
> + */
> +struct kvm_dirty_ring {
> +	u32 dirty_index;
> +	u32 reset_index;
> +	u32 size;
> +	u32 soft_limit;
> +	struct kvm_dirty_gfn *dirty_gfns;
> +	struct kvm_dirty_ring_indices *indices;
> +	int index;
> +};
> +
> +u32 kvm_dirty_ring_get_rsvd_entries(void);
> +int kvm_dirty_ring_alloc(struct kvm_dirty_ring *ring,
> +			 struct kvm_dirty_ring_indices *indices,
> +			 int index, u32 size);
> +struct kvm_dirty_ring *kvm_dirty_ring_get(struct kvm *kvm);
> +
> +/*
> + * called with kvm->slots_lock held, returns the number of
> + * processed pages.
> + */
> +int kvm_dirty_ring_reset(struct kvm *kvm, struct kvm_dirty_ring *ring);
> +
> +/*
> + * Push a dirty gfn onto the ring; the caller must make sure the
> + * ring is not full (vcpus exit to userspace when soft-full)
> + */
> +void kvm_dirty_ring_push(struct kvm_dirty_ring *ring, u32 slot, u64 offset);
> +
> +/* for use in vm_operations_struct */
> +struct page *kvm_dirty_ring_get_page(struct kvm_dirty_ring *ring, u32 offset);
> +
> +void kvm_dirty_ring_free(struct kvm_dirty_ring *ring);
> +bool kvm_dirty_ring_soft_full(struct kvm_dirty_ring *ring);
> +
> +#endif
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index cbd633ece959..c96161c6a0c9 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -34,6 +34,7 @@
>  #include <linux/kvm_types.h>
>  
>  #include <asm/kvm_host.h>
> +#include <linux/kvm_dirty_ring.h>
>  
>  #ifndef KVM_MAX_VCPU_ID
>  #define KVM_MAX_VCPU_ID KVM_MAX_VCPUS
> @@ -321,6 +322,7 @@ struct kvm_vcpu {
>  	bool ready;
>  	struct kvm_vcpu_arch arch;
>  	struct dentry *debugfs_dentry;
> +	struct kvm_dirty_ring dirty_ring;
>  };
>  
>  static inline int kvm_vcpu_exiting_guest_mode(struct kvm_vcpu *vcpu)
> @@ -502,6 +504,7 @@ struct kvm {
>  	struct srcu_struct srcu;
>  	struct srcu_struct irq_srcu;
>  	pid_t userspace_pid;
> +	u32 dirty_ring_size;
>  };
>  
>  #define kvm_err(fmt, ...) \
> @@ -831,6 +834,8 @@ void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm *kvm,
>  					gfn_t gfn_offset,
>  					unsigned long mask);
>  
> +void kvm_reset_dirty_gfn(struct kvm *kvm, u32 slot, u64 offset, u64 mask);
> +
>  int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm,
>  				struct kvm_dirty_log *log);
>  int kvm_vm_ioctl_clear_dirty_log(struct kvm *kvm,
> @@ -1409,4 +1414,25 @@ int kvm_vm_create_worker_thread(struct kvm *kvm, kvm_vm_thread_fn_t thread_fn,
>  				uintptr_t data, const char *name,
>  				struct task_struct **thread_ptr);
>  
> +/*
> + * This defines how many reserved entries we want to keep before we
> + * kick the vcpu to the userspace to avoid dirty ring full.  This
> + * value can be tuned to higher if e.g. PML is enabled on the host.
> + */
> +#define  KVM_DIRTY_RING_RSVD_ENTRIES  64
> +
> +/* Max number of entries allowed for each kvm dirty ring */
> +#define  KVM_DIRTY_RING_MAX_ENTRIES  65536
> +
> +/*
> + * Arch needs to define these macro after implementing the dirty ring
> + * feature.  KVM_DIRTY_LOG_PAGE_OFFSET should be defined as the
> + * starting page offset of the dirty ring structures, while
> + * KVM_DIRTY_RING_VERSION should be defined as >=1.  By default, this
> + * feature is off on all archs.
> + */
> +#ifndef KVM_DIRTY_LOG_PAGE_OFFSET
> +#define KVM_DIRTY_LOG_PAGE_OFFSET 0
> +#endif
> +
>  #endif
> diff --git a/include/trace/events/kvm.h b/include/trace/events/kvm.h
> index 2c735a3e6613..3d850997940c 100644
> --- a/include/trace/events/kvm.h
> +++ b/include/trace/events/kvm.h
> @@ -399,6 +399,84 @@ TRACE_EVENT(kvm_halt_poll_ns,
>  #define trace_kvm_halt_poll_ns_shrink(vcpu_id, new, old) \
>  	trace_kvm_halt_poll_ns(false, vcpu_id, new, old)
>  
> +TRACE_EVENT(kvm_dirty_ring_push,
> +	TP_PROTO(struct kvm_dirty_ring *ring, u32 slot, u64 offset),
> +	TP_ARGS(ring, slot, offset),
> +
> +	TP_STRUCT__entry(
> +		__field(int, index)
> +		__field(u32, dirty_index)
> +		__field(u32, reset_index)
> +		__field(u32, slot)
> +		__field(u64, offset)
> +	),
> +
> +	TP_fast_assign(
> +		__entry->index          = ring->index;
> +		__entry->dirty_index    = ring->dirty_index;
> +		__entry->reset_index    = ring->reset_index;
> +		__entry->slot           = slot;
> +		__entry->offset         = offset;
> +	),
> +
> +	TP_printk("ring %d: dirty 0x%x reset 0x%x "
> +		  "slot %u offset 0x%llx (used %u)",
> +		  __entry->index, __entry->dirty_index,
> +		  __entry->reset_index,  __entry->slot, __entry->offset,
> +		  __entry->dirty_index - __entry->reset_index)
> +);
> +
> +TRACE_EVENT(kvm_dirty_ring_reset,
> +	TP_PROTO(struct kvm_dirty_ring *ring),
> +	TP_ARGS(ring),
> +
> +	TP_STRUCT__entry(
> +		__field(int, index)
> +		__field(u32, dirty_index)
> +		__field(u32, reset_index)
> +	),
> +
> +	TP_fast_assign(
> +		__entry->index          = ring->index;
> +		__entry->dirty_index    = ring->dirty_index;
> +		__entry->reset_index    = ring->reset_index;
> +	),
> +
> +	TP_printk("ring %d: dirty 0x%x reset 0x%x (used %u)",
> +		  __entry->index, __entry->dirty_index, __entry->reset_index,
> +		  __entry->dirty_index - __entry->reset_index)
> +);
> +
> +TRACE_EVENT(kvm_dirty_ring_waitqueue,
> +	TP_PROTO(bool enter),
> +	TP_ARGS(enter),
> +
> +	TP_STRUCT__entry(
> +	    __field(bool, enter)
> +	),
> +
> +	TP_fast_assign(
> +	    __entry->enter = enter;
> +	),
> +
> +	TP_printk("%s", __entry->enter ? "wait" : "awake")
> +);
> +
> +TRACE_EVENT(kvm_dirty_ring_exit,
> +	TP_PROTO(struct kvm_vcpu *vcpu),
> +	TP_ARGS(vcpu),
> +
> +	TP_STRUCT__entry(
> +	    __field(int, vcpu_id)
> +	),
> +
> +	TP_fast_assign(
> +	    __entry->vcpu_id = vcpu->vcpu_id;
> +	),
> +
> +	TP_printk("vcpu %d", __entry->vcpu_id)
> +);
> +
>  #endif /* _TRACE_KVM_MAIN_H */
>  
>  /* This part must be outside protection */
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index f0a16b4adbbd..df4a1700ff1e 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -236,6 +236,7 @@ struct kvm_hyperv_exit {
>  #define KVM_EXIT_IOAPIC_EOI       26
>  #define KVM_EXIT_HYPERV           27
>  #define KVM_EXIT_ARM_NISV         28
> +#define KVM_EXIT_DIRTY_RING_FULL  29
>  
>  /* For KVM_EXIT_INTERNAL_ERROR */
>  /* Emulate instruction failed. */
> @@ -247,6 +248,13 @@ struct kvm_hyperv_exit {
>  /* Encounter unexpected vm-exit reason */
>  #define KVM_INTERNAL_ERROR_UNEXPECTED_EXIT_REASON	4
>  
> +struct kvm_dirty_ring_indices {
> +	__u32 avail_index; /* set by kernel */
> +	__u32 padding1;
> +	__u32 fetch_index; /* set by userspace */
> +	__u32 padding2;
> +};
> +
>  /* for KVM_RUN, returned by mmap(vcpu_fd, offset=0) */
>  struct kvm_run {
>  	/* in */
> @@ -421,6 +429,8 @@ struct kvm_run {
>  		struct kvm_sync_regs regs;
>  		char padding[SYNC_REGS_SIZE_BYTES];
>  	} s;
> +
> +	struct kvm_dirty_ring_indices vcpu_ring_indices;
>  };
>  
>  /* for KVM_REGISTER_COALESCED_MMIO / KVM_UNREGISTER_COALESCED_MMIO */
> @@ -1009,6 +1019,7 @@ struct kvm_ppc_resize_hpt {
>  #define KVM_CAP_PPC_GUEST_DEBUG_SSTEP 176
>  #define KVM_CAP_ARM_NISV_TO_USER 177
>  #define KVM_CAP_ARM_INJECT_EXT_DABT 178
> +#define KVM_CAP_DIRTY_LOG_RING 179
>  
>  #ifdef KVM_CAP_IRQ_ROUTING
>  
> @@ -1473,6 +1484,9 @@ struct kvm_enc_region {
>  /* Available with KVM_CAP_ARM_SVE */
>  #define KVM_ARM_VCPU_FINALIZE	  _IOW(KVMIO,  0xc2, int)
>  
> +/* Available with KVM_CAP_DIRTY_LOG_RING */
> +#define KVM_RESET_DIRTY_RINGS     _IO(KVMIO, 0xc3)
> +
>  /* Secure Encrypted Virtualization command */
>  enum sev_cmd_id {
>  	/* Guest initialization commands */
> @@ -1623,4 +1637,23 @@ struct kvm_hyperv_eventfd {
>  #define KVM_HYPERV_CONN_ID_MASK		0x00ffffff
>  #define KVM_HYPERV_EVENTFD_DEASSIGN	(1 << 0)
>  
> +/*
> + * The following are the requirements for supporting dirty log ring
> + * (by enabling KVM_DIRTY_LOG_PAGE_OFFSET).
> + *
> + * 1. Memory accesses by KVM should call kvm_vcpu_write_* instead
> + *    of kvm_write_* so that the global dirty ring is not filled up
> + *    too quickly.
> + * 2. kvm_arch_mmu_enable_log_dirty_pt_masked should be defined for
> + *    enabling dirty logging.
> + * 3. There should not be a separate step to synchronize hardware
> + *    dirty bitmap with KVM's.
> + */
> +
> +struct kvm_dirty_gfn {
> +	__u32 pad;
> +	__u32 slot;
> +	__u64 offset;
> +};
> +
>  #endif /* __LINUX_KVM_H */
> diff --git a/virt/kvm/dirty_ring.c b/virt/kvm/dirty_ring.c
> new file mode 100644
> index 000000000000..67ec5bbc21c0
> --- /dev/null
> +++ b/virt/kvm/dirty_ring.c
> @@ -0,0 +1,162 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +/*
> + * KVM dirty ring implementation
> + *
> + * Copyright 2019 Red Hat, Inc.
> + */
> +#include <linux/kvm_host.h>
> +#include <linux/kvm.h>
> +#include <linux/vmalloc.h>
> +#include <linux/kvm_dirty_ring.h>
> +#include <trace/events/kvm.h>
> +
> +int __weak kvm_cpu_dirty_log_size(void)
> +{
> +	return 0;
> +}
> +
> +u32 kvm_dirty_ring_get_rsvd_entries(void)
> +{
> +	return KVM_DIRTY_RING_RSVD_ENTRIES + kvm_cpu_dirty_log_size();
> +}
> +
> +static u32 kvm_dirty_ring_used(struct kvm_dirty_ring *ring)
> +{
> +	return READ_ONCE(ring->dirty_index) - READ_ONCE(ring->reset_index);
> +}
> +
> +bool kvm_dirty_ring_soft_full(struct kvm_dirty_ring *ring)
> +{
> +	return kvm_dirty_ring_used(ring) >= ring->soft_limit;
> +}
> +
> +bool kvm_dirty_ring_full(struct kvm_dirty_ring *ring)
> +{
> +	return kvm_dirty_ring_used(ring) >= ring->size;
> +}
> +
> +struct kvm_dirty_ring *kvm_dirty_ring_get(struct kvm *kvm)
> +{
> +	struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
> +
> +	WARN_ON_ONCE(vcpu->kvm != kvm);
> +
> +	return &vcpu->dirty_ring;
> +}
> +
> +int kvm_dirty_ring_alloc(struct kvm_dirty_ring *ring,
> +			 struct kvm_dirty_ring_indices *indices,
> +			 int index, u32 size)
> +{
> +	ring->dirty_gfns = vmalloc(size);
> +	if (!ring->dirty_gfns)
> +		return -ENOMEM;
> +	memset(ring->dirty_gfns, 0, size);
> +
> +	ring->size = size / sizeof(struct kvm_dirty_gfn);
> +	ring->soft_limit = ring->size - kvm_dirty_ring_get_rsvd_entries();
> +	ring->dirty_index = 0;
> +	ring->reset_index = 0;
> +	ring->index = index;
> +	ring->indices = indices;
> +
> +	return 0;
> +}
> +
> +int kvm_dirty_ring_reset(struct kvm *kvm, struct kvm_dirty_ring *ring)
> +{
> +	u32 cur_slot, next_slot;
> +	u64 cur_offset, next_offset;
> +	unsigned long mask;
> +	u32 fetch;
> +	int count = 0;
> +	struct kvm_dirty_gfn *entry;
> +	struct kvm_dirty_ring_indices *indices = ring->indices;
> +	bool first_round = true;
> +
> +	fetch = READ_ONCE(indices->fetch_index);

So this does not work if the data cache is virtually tagged, which to
the best of my knowledge isn't the case on any CPU kvm supports.
However it might not stay that way forever.  Worth at least a
comment.


> +
> +	/*
> +	 * Note that fetch_index is written by the userspace and
> +	 * should not be trusted.  If the sanity check below fails,
> +	 * the userspace has most probably written a bogus fetch_index.
> +	 */
> +	if (fetch - ring->reset_index > ring->size)
> +		return -EINVAL;
> +
> +	if (fetch == ring->reset_index)
> +		return 0;
> +
> +	/* This is only needed to make compilers happy */
> +	cur_slot = cur_offset = mask = 0;
> +	while (ring->reset_index != fetch) {
> +		entry = &ring->dirty_gfns[ring->reset_index & (ring->size - 1)];
> +		next_slot = READ_ONCE(entry->slot);
> +		next_offset = READ_ONCE(entry->offset);

What is this READ_ONCE doing?  Entries are only written by the
kernel, and under a lock.

> +		ring->reset_index++;
> +		count++;
> +		/*
> +		 * Try to coalesce the reset operations when the guest is
> +		 * scanning pages in the same slot.
> +		 */
> +		if (!first_round && next_slot == cur_slot) {
> +			s64 delta = next_offset - cur_offset;
> +
> +			if (delta >= 0 && delta < BITS_PER_LONG) {
> +				mask |= 1ull << delta;
> +				continue;
> +			}
> +
> +			/* Backwards visit, careful about overflows!  */
> +			if (delta > -BITS_PER_LONG && delta < 0 &&
> +			    (mask << -delta >> -delta) == mask) {
> +				cur_offset = next_offset;
> +				mask = (mask << -delta) | 1;
> +				continue;
> +			}
> +		}

Well, how important is this logic?  It will not be very effective on
an SMP system, so don't you need a per-cpu ring?



> +		kvm_reset_dirty_gfn(kvm, cur_slot, cur_offset, mask);
> +		cur_slot = next_slot;
> +		cur_offset = next_offset;
> +		mask = 1;
> +		first_round = false;
> +	}
> +	kvm_reset_dirty_gfn(kvm, cur_slot, cur_offset, mask);
> +
> +	trace_kvm_dirty_ring_reset(ring);
> +
> +	return count;
> +}
> +
> +void kvm_dirty_ring_push(struct kvm_dirty_ring *ring, u32 slot, u64 offset)
> +{
> +	struct kvm_dirty_gfn *entry;
> +	struct kvm_dirty_ring_indices *indices = ring->indices;
> +
> +	/* It should never get full */
> +	WARN_ON_ONCE(kvm_dirty_ring_full(ring));
> +
> +	entry = &ring->dirty_gfns[ring->dirty_index & (ring->size - 1)];
> +	entry->slot = slot;
> +	entry->offset = offset;
> +	/*
> +	 * Make sure the data is filled in before we publish this to
> +	 * the userspace program.  There's no paired kernel-side reader.
> +	 */
> +	smp_wmb();
> +	ring->dirty_index++;


Do I understand correctly that the ring is shared between CPUs?
If so, I don't see why it's safe for SMP guests.
Don't you need atomics or locking?


> +	WRITE_ONCE(indices->avail_index, ring->dirty_index);
> +
> +	trace_kvm_dirty_ring_push(ring, slot, offset);
> +}
> +
> +struct page *kvm_dirty_ring_get_page(struct kvm_dirty_ring *ring, u32 offset)
> +{
> +	return vmalloc_to_page((void *)ring->dirty_gfns + offset * PAGE_SIZE);
> +}
> +
> +void kvm_dirty_ring_free(struct kvm_dirty_ring *ring)
> +{
> +	vfree(ring->dirty_gfns);
> +	ring->dirty_gfns = NULL;
> +}
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 5bbd8b8730fa..5e36792e15ae 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -64,6 +64,8 @@
>  #define CREATE_TRACE_POINTS
>  #include <trace/events/kvm.h>
>  
> +#include <linux/kvm_dirty_ring.h>
> +
>  /* Worst case buffer size needed for holding an integer. */
>  #define ITOA_MAX_LEN 12
>  
> @@ -357,11 +359,22 @@ int kvm_vcpu_init(struct kvm_vcpu *vcpu, struct kvm *kvm, unsigned id)
>  	vcpu->preempted = false;
>  	vcpu->ready = false;
>  
> +	if (kvm->dirty_ring_size) {
> +		r = kvm_dirty_ring_alloc(&vcpu->dirty_ring,
> +					 &vcpu->run->vcpu_ring_indices,
> +					 id, kvm->dirty_ring_size);
> +		if (r)
> +			goto fail_free_run;
> +	}
> +
>  	r = kvm_arch_vcpu_init(vcpu);
>  	if (r < 0)
> -		goto fail_free_run;
> +		goto fail_free_ring;
>  	return 0;
>  
> +fail_free_ring:
> +	if (kvm->dirty_ring_size)
> +		kvm_dirty_ring_free(&vcpu->dirty_ring);
>  fail_free_run:
>  	free_page((unsigned long)vcpu->run);
>  fail:
> @@ -379,6 +392,8 @@ void kvm_vcpu_uninit(struct kvm_vcpu *vcpu)
>  	put_pid(rcu_dereference_protected(vcpu->pid, 1));
>  	kvm_arch_vcpu_uninit(vcpu);
>  	free_page((unsigned long)vcpu->run);
> +	if (vcpu->kvm->dirty_ring_size)
> +		kvm_dirty_ring_free(&vcpu->dirty_ring);
>  }
>  EXPORT_SYMBOL_GPL(kvm_vcpu_uninit);
>  
> @@ -2284,8 +2299,13 @@ static void mark_page_dirty_in_slot(struct kvm *kvm,
>  {
>  	if (memslot && memslot->dirty_bitmap) {
>  		unsigned long rel_gfn = gfn - memslot->base_gfn;
> +		u32 slot = (memslot->as_id << 16) | memslot->id;
>  
> -		set_bit_le(rel_gfn, memslot->dirty_bitmap);
> +		if (kvm->dirty_ring_size)
> +			kvm_dirty_ring_push(kvm_dirty_ring_get(kvm),
> +					    slot, rel_gfn);
> +		else
> +			set_bit_le(rel_gfn, memslot->dirty_bitmap);
>  	}
>  }
>  
> @@ -2632,6 +2652,16 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me, bool yield_to_kernel_mode)
>  }
>  EXPORT_SYMBOL_GPL(kvm_vcpu_on_spin);
>  
> +static bool kvm_page_in_dirty_ring(struct kvm *kvm, unsigned long pgoff)
> +{
> +	if (!KVM_DIRTY_LOG_PAGE_OFFSET)
> +		return false;
> +
> +	return (pgoff >= KVM_DIRTY_LOG_PAGE_OFFSET) &&
> +	    (pgoff < KVM_DIRTY_LOG_PAGE_OFFSET +
> +	     kvm->dirty_ring_size / PAGE_SIZE);
> +}
> +
>  static vm_fault_t kvm_vcpu_fault(struct vm_fault *vmf)
>  {
>  	struct kvm_vcpu *vcpu = vmf->vma->vm_file->private_data;
> @@ -2647,6 +2677,10 @@ static vm_fault_t kvm_vcpu_fault(struct vm_fault *vmf)
>  	else if (vmf->pgoff == KVM_COALESCED_MMIO_PAGE_OFFSET)
>  		page = virt_to_page(vcpu->kvm->coalesced_mmio_ring);
>  #endif
> +	else if (kvm_page_in_dirty_ring(vcpu->kvm, vmf->pgoff))
> +		page = kvm_dirty_ring_get_page(
> +		    &vcpu->dirty_ring,
> +		    vmf->pgoff - KVM_DIRTY_LOG_PAGE_OFFSET);
>  	else
>  		return kvm_arch_vcpu_fault(vcpu, vmf);
>  	get_page(page);
> @@ -2660,6 +2694,15 @@ static const struct vm_operations_struct kvm_vcpu_vm_ops = {
>  
>  static int kvm_vcpu_mmap(struct file *file, struct vm_area_struct *vma)
>  {
> +	struct kvm_vcpu *vcpu = file->private_data;
> +	unsigned long pages = (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
> +
> +	/* Refuse to map any page of the dirty ring writable */
> +	if ((kvm_page_in_dirty_ring(vcpu->kvm, vma->vm_pgoff) ||
> +	     kvm_page_in_dirty_ring(vcpu->kvm, vma->vm_pgoff + pages - 1)) &&
> +	    vma->vm_flags & VM_WRITE)
> +		return -EINVAL;
> +
>  	vma->vm_ops = &kvm_vcpu_vm_ops;
>  	return 0;
>  }
> @@ -3242,12 +3285,97 @@ static long kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
>  #endif
>  	case KVM_CAP_NR_MEMSLOTS:
>  		return KVM_USER_MEM_SLOTS;
> +	case KVM_CAP_DIRTY_LOG_RING:
> +#ifdef CONFIG_X86
> +		return KVM_DIRTY_RING_MAX_ENTRIES;
> +#else
> +		return 0;
> +#endif
>  	default:
>  		break;
>  	}
>  	return kvm_vm_ioctl_check_extension(kvm, arg);
>  }
>  
> +void kvm_reset_dirty_gfn(struct kvm *kvm, u32 slot, u64 offset, u64 mask)
> +{
> +	struct kvm_memory_slot *memslot;
> +	int as_id, id;
> +
> +	as_id = slot >> 16;
> +	id = (u16)slot;
> +	if (as_id >= KVM_ADDRESS_SPACE_NUM || id >= KVM_USER_MEM_SLOTS)
> +		return;
> +
> +	memslot = id_to_memslot(__kvm_memslots(kvm, as_id), id);
> +	if (offset >= memslot->npages)
> +		return;
> +
> +	spin_lock(&kvm->mmu_lock);
> +	kvm_arch_mmu_enable_log_dirty_pt_masked(kvm, memslot, offset, mask);
> +	spin_unlock(&kvm->mmu_lock);
> +}
> +
> +static int kvm_vm_ioctl_enable_dirty_log_ring(struct kvm *kvm, u32 size)
> +{
> +	int r;
> +
> +	if (!KVM_DIRTY_LOG_PAGE_OFFSET)
> +		return -EINVAL;
> +
> +	/* the size should be a power of 2 */
> +	if (!size || (size & (size - 1)))
> +		return -EINVAL;
> +
> +	/* Must be big enough for the reserved entries, and at least a page */
> +	if (size < kvm_dirty_ring_get_rsvd_entries() *
> +	    sizeof(struct kvm_dirty_gfn) || size < PAGE_SIZE)
> +		return -EINVAL;
> +
> +	if (size > KVM_DIRTY_RING_MAX_ENTRIES *
> +	    sizeof(struct kvm_dirty_gfn))
> +		return -E2BIG;
> +
> +	/* We only allow it to be set once */
> +	if (kvm->dirty_ring_size)
> +		return -EINVAL;
> +
> +	mutex_lock(&kvm->lock);
> +
> +	if (kvm->created_vcpus) {
> +		/* We don't allow changing the value after vcpus are created */
> +		r = -EINVAL;
> +	} else {
> +		kvm->dirty_ring_size = size;
> +		r = 0;
> +	}
> +
> +	mutex_unlock(&kvm->lock);
> +	return r;
> +}
> +
> +static int kvm_vm_ioctl_reset_dirty_pages(struct kvm *kvm)
> +{
> +	int i;
> +	struct kvm_vcpu *vcpu;
> +	int cleared = 0;
> +
> +	if (!kvm->dirty_ring_size)
> +		return -EINVAL;
> +
> +	mutex_lock(&kvm->slots_lock);
> +
> +	kvm_for_each_vcpu(i, vcpu, kvm)
> +		cleared += kvm_dirty_ring_reset(vcpu->kvm, &vcpu->dirty_ring);
> +
> +	mutex_unlock(&kvm->slots_lock);
> +
> +	if (cleared)
> +		kvm_flush_remote_tlbs(kvm);
> +
> +	return cleared;
> +}
> +
>  int __attribute__((weak)) kvm_vm_ioctl_enable_cap(struct kvm *kvm,
>  						  struct kvm_enable_cap *cap)
>  {
> @@ -3265,6 +3393,8 @@ static int kvm_vm_ioctl_enable_cap_generic(struct kvm *kvm,
>  		kvm->manual_dirty_log_protect = cap->args[0];
>  		return 0;
>  #endif
> +	case KVM_CAP_DIRTY_LOG_RING:
> +		return kvm_vm_ioctl_enable_dirty_log_ring(kvm, cap->args[0]);
>  	default:
>  		return kvm_vm_ioctl_enable_cap(kvm, cap);
>  	}
> @@ -3452,6 +3582,9 @@ static long kvm_vm_ioctl(struct file *filp,
>  	case KVM_CHECK_EXTENSION:
>  		r = kvm_vm_ioctl_check_extension_generic(kvm, arg);
>  		break;
> +	case KVM_RESET_DIRTY_RINGS:
> +		r = kvm_vm_ioctl_reset_dirty_pages(kvm);
> +		break;
>  	default:
>  		r = kvm_arch_vm_ioctl(filp, ioctl, arg);
>  	}
> -- 
> 2.24.1


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v3 00/21] KVM: Dirty ring interface
  2020-01-09 16:17   ` Peter Xu
@ 2020-01-09 16:40     ` Michael S. Tsirkin
  2020-01-09 17:08       ` Peter Xu
  0 siblings, 1 reply; 82+ messages in thread
From: Michael S. Tsirkin @ 2020-01-09 16:40 UTC (permalink / raw)
  To: Peter Xu
  Cc: kvm, linux-kernel, Christophe de Dinechin, Paolo Bonzini,
	Sean Christopherson, Yan Zhao, Alex Williamson, Jason Wang,
	Kevin Kevin, Vitaly Kuznetsov, Dr . David Alan Gilbert

On Thu, Jan 09, 2020 at 11:17:42AM -0500, Peter Xu wrote:
> On Thu, Jan 09, 2020 at 10:59:50AM -0500, Michael S. Tsirkin wrote:
> > On Thu, Jan 09, 2020 at 09:57:08AM -0500, Peter Xu wrote:
> > > Branch is here: https://github.com/xzpeter/linux/tree/kvm-dirty-ring
> > > (based on kvm/queue)
> > > 
> > > Please refer to either the previous cover letters, or documentation
> > > update in patch 12 for the big picture.
> > 
> > I would rather you pasted it here. There's no way to respond otherwise.
> 
> Sure, will do in the next post.
> 
> > 
> > For something that's presumably an optimization, isn't there
> > some kind of testing that can be done to show the benefits?
> > What kind of gain was observed?
> 
> Since the interface seems to settle soon, maybe it's time to work on
> the QEMU part so I can give some number.  It would be interesting to
> know the curves between dirty logging and dirty ring even for some
> small vms that have some workloads inside.
> 
> > 
> > I know it's mostly relevant for huge VMs, but OTOH these
> > probably use huge pages.
> 
> Yes huge VMs could benefit more, especially if the dirty rate is not
> that high, I believe.  Though, could you elaborate on why huge pages
> are special here?
> 
> Thanks,

With hugetlbfs there are fewer bits to test: e.g. with 2M pages a
single set bit marks 512 4K pages as dirty.  We do not take advantage
of this, but it looks like a rather obvious optimization.

> -- 
> Peter Xu


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v3 14/21] KVM: Don't allocate dirty bitmap if dirty ring is enabled
  2020-01-09 14:57 ` [PATCH v3 14/21] KVM: Don't allocate dirty bitmap if dirty ring is enabled Peter Xu
@ 2020-01-09 16:41   ` Peter Xu
  0 siblings, 0 replies; 82+ messages in thread
From: Peter Xu @ 2020-01-09 16:41 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: Christophe de Dinechin, Michael S . Tsirkin, Paolo Bonzini,
	Sean Christopherson, Yan Zhao, Alex Williamson, Jason Wang,
	Kevin Kevin, Vitaly Kuznetsov, Dr . David Alan Gilbert

On Thu, Jan 09, 2020 at 09:57:22AM -0500, Peter Xu wrote:
> Because the kvm dirty ring and the kvm dirty log are used in an
> exclusive way, let's avoid creating the dirty_bitmap when the kvm
> dirty ring is enabled.  In the meantime, since the dirty_bitmap will
> be conditionally created now, we can't use it as a sign of "whether
> this memory slot has dirty tracking enabled".  Change such users to
> check against the kvm memory slot flags.
> 
> Note that there can still be cases where a kvm memory slot gets its
> dirty_bitmap allocated: _if_ the memory slot is created before the
> dirty ring is enabled, and with the dirty tracking flag set, it will
> still carry the dirty_bitmap.  However that should not hurt much
> (e.g., the bitmaps will always be freed if they are there), and real
> users normally won't trigger it, because the dirty tracking flag is
> in most cases only applied to kvm slots right before migration
> starts, which is far later than kvm initialization (VM start).
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  include/linux/kvm_host.h | 5 +++++
>  virt/kvm/kvm_main.c      | 5 +++--
>  2 files changed, 8 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index c96161c6a0c9..ab2a169b1264 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -353,6 +353,11 @@ struct kvm_memory_slot {
>  	u8 as_id;
>  };
>  
> +static inline bool kvm_slot_dirty_track_enabled(struct kvm_memory_slot *slot)
> +{
> +	return slot->flags & KVM_MEM_LOG_DIRTY_PAGES;
> +}
> +
>  static inline unsigned long kvm_dirty_bitmap_bytes(struct kvm_memory_slot *memslot)
>  {
>  	return ALIGN(memslot->npages, BITS_PER_LONG) / 8;
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index f0f766183cb2..46da3169944f 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -1120,7 +1120,8 @@ int __kvm_set_memory_region(struct kvm *kvm,
>  	}
>  
>  	/* Allocate page dirty bitmap if needed */
> -	if ((new.flags & KVM_MEM_LOG_DIRTY_PAGES) && !new.dirty_bitmap) {
> +	if ((new.flags & KVM_MEM_LOG_DIRTY_PAGES) && !new.dirty_bitmap &&
> +	    !kvm->dirty_ring_size) {
>  		if (kvm_create_dirty_bitmap(&new) < 0)
>  			goto out_free;
>  	}
> @@ -2309,7 +2310,7 @@ static void mark_page_dirty_in_slot(struct kvm *kvm,
>  				    struct kvm_memory_slot *memslot,
>  				    gfn_t gfn)
>  {
> -	if (memslot && memslot->dirty_bitmap) {
> +	if (memslot && kvm_slot_dirty_track_enabled(memslot)) {
>  		unsigned long rel_gfn = gfn - memslot->base_gfn;
>  		u32 slot = (memslot->as_id << 16) | memslot->id;
>  
> -- 
> 2.24.1
> 

I think the below should be squashed into this patch as well:

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 621b842a9b7b..0806bd12d8ee 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1308,7 +1308,7 @@ static inline bool memslot_valid_for_gpte(struct kvm_memory_slot *slot,
 {
        if (!slot || slot->flags & KVM_MEMSLOT_INVALID)
                return false;
-       if (no_dirty_log && slot->dirty_bitmap)
+       if (no_dirty_log && kvm_slot_dirty_track_enabled(slot))
                return false;
  
        return true;

Thanks,

-- 
Peter Xu


^ permalink raw reply related	[flat|nested] 82+ messages in thread

* Re: [PATCH v3 00/21] KVM: Dirty ring interface
  2020-01-09 14:57 [PATCH v3 00/21] KVM: Dirty ring interface Peter Xu
                   ` (21 preceding siblings ...)
  2020-01-09 15:59 ` [PATCH v3 00/21] KVM: Dirty ring interface Michael S. Tsirkin
@ 2020-01-09 16:47 ` Alex Williamson
  2020-01-09 17:58   ` Peter Xu
  2020-01-19  9:11 ` Paolo Bonzini
  23 siblings, 1 reply; 82+ messages in thread
From: Alex Williamson @ 2020-01-09 16:47 UTC (permalink / raw)
  To: Peter Xu
  Cc: kvm, linux-kernel, Christophe de Dinechin, Michael S . Tsirkin,
	Paolo Bonzini, Sean Christopherson, Yan Zhao, Jason Wang,
	Kevin Kevin, Vitaly Kuznetsov, Dr . David Alan Gilbert,
	Kirti Wankhede

On Thu,  9 Jan 2020 09:57:08 -0500
Peter Xu <peterx@redhat.com> wrote:

> Branch is here: https://github.com/xzpeter/linux/tree/kvm-dirty-ring
> (based on kvm/queue)
> 
> Please refer to either the previous cover letters, or documentation
> update in patch 12 for the big picture.  Previous posts:
> 
> V1: https://lore.kernel.org/kvm/20191129213505.18472-1-peterx@redhat.com
> V2: https://lore.kernel.org/kvm/20191221014938.58831-1-peterx@redhat.com
> 
> The major change in V3 is that we dropped the whole waitqueue and the
> global lock. With that, we have clean per-vcpu ring and no default
> ring any more.  The two kvmgt refactoring patches were also included
> to show the dependency of the works.

Hi Peter,

Would you recommend this style of interface for vfio dirty page
tracking as well?  This mechanism seems very tuned to sparse page
dirtying; how well does it handle fully dirty, or even significantly
dirty, regions?  We also don't really have "active" dirty page
tracking in vfio; we simply assume that if a page is pinned or
otherwise mapped it's dirty, so I think we'd constantly be trying to
re-populate the dirty ring with pages that we've seen the user
consume, which doesn't seem like a good fit versus a bitmap solution.
Thanks,

Alex


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v3 12/21] KVM: X86: Implement ring-based dirty memory tracking
  2020-01-09 16:29   ` Michael S. Tsirkin
@ 2020-01-09 16:56     ` Alex Williamson
  2020-01-09 19:21       ` Peter Xu
  2020-01-09 19:15     ` Peter Xu
  1 sibling, 1 reply; 82+ messages in thread
From: Alex Williamson @ 2020-01-09 16:56 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Peter Xu, kvm, linux-kernel, Christophe de Dinechin,
	Paolo Bonzini, Sean Christopherson, Yan Zhao, Jason Wang,
	Kevin Kevin, Vitaly Kuznetsov, Dr . David Alan Gilbert, Lei Cao

On Thu, 9 Jan 2020 11:29:28 -0500
"Michael S. Tsirkin" <mst@redhat.com> wrote:

> On Thu, Jan 09, 2020 at 09:57:20AM -0500, Peter Xu wrote:
> > This patch is heavily based on previous work from Lei Cao
> > <lei.cao@stratus.com> and Paolo Bonzini <pbonzini@redhat.com>. [1]
> > 
> > KVM currently uses large bitmaps to track dirty memory.  These bitmaps
> > are copied to userspace when userspace queries KVM for its dirty page
> > information.  The use of bitmaps is mostly sufficient for live
> > migration, as large parts of memory are dirtied from one log-dirty
> > pass to another.  However, in a checkpointing system, the number of
> > dirty pages is small and in fact it is often bounded---the VM is
> > paused when it has dirtied a pre-defined number of pages. Traversing a
> > large, sparsely populated bitmap to find set bits is time-consuming,
> > as is copying the bitmap to user-space.
> > 
> > A similar issue will be there for live migration when the guest memory
> > is huge while the page dirty procedure is trivial.  In that case for
> > each dirty sync we need to pull the whole dirty bitmap to userspace
> > and analyse every bit even if it's mostly zeros.
> > 
> > The preferred data structure for above scenarios is a dense list of
> > guest frame numbers (GFN).  
> 
> No longer, this uses an array of structs.
> 
> >  This patch series stores the dirty list in
> > kernel memory that can be memory mapped into userspace to allow speedy
> > harvesting.
> > 
> > This patch enables dirty ring for X86 only.  However it should be
> > easily extended to other archs as well.
> > 
> > [1] https://patchwork.kernel.org/patch/10471409/
> > 
> > Signed-off-by: Lei Cao <lei.cao@stratus.com>
> > Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> >  Documentation/virt/kvm/api.txt  |  89 ++++++++++++++++++
> >  arch/x86/include/asm/kvm_host.h |   3 +
> >  arch/x86/include/uapi/asm/kvm.h |   1 +
> >  arch/x86/kvm/Makefile           |   3 +-
> >  arch/x86/kvm/mmu/mmu.c          |   6 ++
> >  arch/x86/kvm/vmx/vmx.c          |   7 ++
> >  arch/x86/kvm/x86.c              |   9 ++
> >  include/linux/kvm_dirty_ring.h  |  55 +++++++++++
> >  include/linux/kvm_host.h        |  26 +++++
> >  include/trace/events/kvm.h      |  78 +++++++++++++++
> >  include/uapi/linux/kvm.h        |  33 +++++++
> >  virt/kvm/dirty_ring.c           | 162 ++++++++++++++++++++++++++++++++
> >  virt/kvm/kvm_main.c             | 137 ++++++++++++++++++++++++++-
> >  13 files changed, 606 insertions(+), 3 deletions(-)
> >  create mode 100644 include/linux/kvm_dirty_ring.h
> >  create mode 100644 virt/kvm/dirty_ring.c
> > 
> > diff --git a/Documentation/virt/kvm/api.txt b/Documentation/virt/kvm/api.txt
> > index ebb37b34dcfc..708c3e0f7eae 100644
> > --- a/Documentation/virt/kvm/api.txt
> > +++ b/Documentation/virt/kvm/api.txt
> > @@ -231,6 +231,7 @@ Based on their initialization different VMs may have different capabilities.
> >  It is thus encouraged to use the vm ioctl to query for capabilities (available
> >  with KVM_CAP_CHECK_EXTENSION_VM on the vm fd)
> >  
> > +
> >  4.5 KVM_GET_VCPU_MMAP_SIZE
> >  
> >  Capability: basic
> > @@ -243,6 +244,18 @@ The KVM_RUN ioctl (cf.) communicates with userspace via a shared
> >  memory region.  This ioctl returns the size of that region.  See the
> >  KVM_RUN documentation for details.
> >  
> > +Besides the size of the KVM_RUN communication region, other areas of
> > +the VCPU file descriptor can be mmap-ed, including:
> > +
> > +- if KVM_CAP_COALESCED_MMIO is available, a page at
> > +  KVM_COALESCED_MMIO_PAGE_OFFSET * PAGE_SIZE; for historical reasons,
> > +  this page is included in the result of KVM_GET_VCPU_MMAP_SIZE.
> > +  KVM_CAP_COALESCED_MMIO is not documented yet.
> > +
> > +- if KVM_CAP_DIRTY_LOG_RING is available, a number of pages at
> > +  KVM_DIRTY_LOG_PAGE_OFFSET * PAGE_SIZE.  For more information on
> > +  KVM_CAP_DIRTY_LOG_RING, see section 8.3.
> > +
> >  
> >  4.6 KVM_SET_MEMORY_REGION
> >  
> > @@ -5376,6 +5389,7 @@ CPU when the exception is taken. If this virtual SError is taken to EL1 using
> >  AArch64, this value will be reported in the ISS field of ESR_ELx.
> >  
> >  See KVM_CAP_VCPU_EVENTS for more details.
> > +
> >  8.20 KVM_CAP_HYPERV_SEND_IPI
> >  
> >  Architectures: x86
> > @@ -5383,6 +5397,7 @@ Architectures: x86
> >  This capability indicates that KVM supports paravirtualized Hyper-V IPI send
> >  hypercalls:
> >  HvCallSendSyntheticClusterIpi, HvCallSendSyntheticClusterIpiEx.
> > +
> >  8.21 KVM_CAP_HYPERV_DIRECT_TLBFLUSH
> >  
> >  Architecture: x86
> > @@ -5396,3 +5411,77 @@ handling by KVM (as some KVM hypercall may be mistakenly treated as TLB
> >  flush hypercalls by Hyper-V) so userspace should disable KVM identification
> >  in CPUID and only exposes Hyper-V identification. In this case, guest
> >  thinks it's running on Hyper-V and only use Hyper-V hypercalls.
> > +
> > +8.22 KVM_CAP_DIRTY_LOG_RING
> > +
> > +Architectures: x86
> > +Parameters: args[0] - size of the dirty log ring
> > +
> > +KVM is capable of tracking dirty memory using ring buffers that are
> > +mmaped into userspace; there is one dirty ring per vcpu.
> > +
> > +One dirty ring is defined as below internally:
> > +
> > +struct kvm_dirty_ring {
> > +	u32 dirty_index;
> > +	u32 reset_index;
> > +	u32 size;
> > +	u32 soft_limit;
> > +	struct kvm_dirty_gfn *dirty_gfns;
> > +	struct kvm_dirty_ring_indices *indices;
> > +	int index;
> > +};
> > +
> > +Dirty GFNs (Guest Frame Numbers) are stored in the dirty_gfns array.
> > +Each dirty entry is defined as:
> > +
> > +struct kvm_dirty_gfn {
> > +        __u32 pad;  
> 
> How about sticking a length here?
> This way huge pages can be dirtied in one go.

Not just huge pages, but any contiguous range of dirty pages could be
reported far more concisely.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v3 00/21] KVM: Dirty ring interface
  2020-01-09 16:40     ` Michael S. Tsirkin
@ 2020-01-09 17:08       ` Peter Xu
  2020-01-09 19:08         ` Michael S. Tsirkin
  0 siblings, 1 reply; 82+ messages in thread
From: Peter Xu @ 2020-01-09 17:08 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: kvm, linux-kernel, Christophe de Dinechin, Paolo Bonzini,
	Sean Christopherson, Yan Zhao, Alex Williamson, Jason Wang,
	Kevin Kevin, Vitaly Kuznetsov, Dr . David Alan Gilbert

On Thu, Jan 09, 2020 at 11:40:23AM -0500, Michael S. Tsirkin wrote:

[...]

> > > I know it's mostly relevant for huge VMs, but OTOH these
> > > probably use huge pages.
> > 
> > Yes huge VMs could benefit more, especially if the dirty rate is not
> > that high, I believe.  Though, could you elaborate on why huge pages
> > are special here?
> > 
> > Thanks,
> 
> With hugetlbfs there are less bits to test: e.g. with 2M pages a single
> bit set marks 512 pages as dirty.  We do not take advantage of this
> but it looks like a rather obvious optimization.

Right, but isn't that the trade-off between the granularity of dirty
tracking and how easy it is to collect the dirty bits?  Say, it'll be
nearly impossible to migrate 1G-huge-page-backed guests if we track
dirty bits at huge page granularity, since each touch of guest memory
will cause another 1G of memory to be transferred even if most of the
content is the same.  2M can be somewhere in the middle, but the same
write amplification issue still exists.

PS. That seems to be another topic after all, separate from the dirty
ring series, because we'd need to change our tracking policy first if
we want to track dirty state with huge pages; with that done, the
dirty ring could start to leverage kvm_dirty_gfn.pad to store the
page size, guarded by another new kvm cap, when we really want it.
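
(For illustration only, one hypothetical way the field could end up
being reserved; nothing like this is in the series:)

struct kvm_dirty_gfn {
	__u32 order;	/* hypothetical: dirty range is (4K << order) */
	__u32 slot;	/* as_id | slot_id */
	__u64 offset;
};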

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v3 00/21] KVM: Dirty ring interface
  2020-01-09 16:47 ` Alex Williamson
@ 2020-01-09 17:58   ` Peter Xu
  2020-01-09 19:13     ` Michael S. Tsirkin
  0 siblings, 1 reply; 82+ messages in thread
From: Peter Xu @ 2020-01-09 17:58 UTC (permalink / raw)
  To: Alex Williamson
  Cc: kvm, linux-kernel, Christophe de Dinechin, Michael S . Tsirkin,
	Paolo Bonzini, Sean Christopherson, Yan Zhao, Jason Wang,
	Kevin Kevin, Vitaly Kuznetsov, Dr . David Alan Gilbert,
	Kirti Wankhede

On Thu, Jan 09, 2020 at 09:47:11AM -0700, Alex Williamson wrote:
> On Thu,  9 Jan 2020 09:57:08 -0500
> Peter Xu <peterx@redhat.com> wrote:
> 
> > Branch is here: https://github.com/xzpeter/linux/tree/kvm-dirty-ring
> > (based on kvm/queue)
> > 
> > Please refer to either the previous cover letters, or documentation
> > update in patch 12 for the big picture.  Previous posts:
> > 
> > V1: https://lore.kernel.org/kvm/20191129213505.18472-1-peterx@redhat.com
> > V2: https://lore.kernel.org/kvm/20191221014938.58831-1-peterx@redhat.com
> > 
> > The major change in V3 is that we dropped the whole waitqueue and the
> > global lock. With that, we have clean per-vcpu ring and no default
> > ring any more.  The two kvmgt refactoring patches were also included
> > to show the dependency of the works.
> 
> Hi Peter,

Hi, Alex,

> 
> Would you recommend this style of interface for vfio dirty page
> tracking as well?  This mechanism seems very tuned to sparse page
> dirtying, how well does it handle fully dirty, or even significantly
> dirty regions?

That's truly the point why I think the dirty bitmap can still be used
and should be kept.  IIUC the dirty ring idea started from COLO,
where (1) the dirty rate is very low, and (2) syncs happen
frequently.  That's perfect ground for the dirty ring.  However it
surely does not mean that the dirty ring can solve all the issues.
As you said, I believe the fully dirty case is the other extreme,
where the dirty bitmap could perform better.

> We also don't really have "active" dirty page tracking
> in vfio, we simply assume that if a page is pinned or otherwise mapped
> that it's dirty, so I think we'd constantly be trying to re-populate
> the dirty ring with pages that we've seen the user consume, which
> doesn't seem like a good fit versus a bitmap solution.  Thanks,

Right, so I confess I don't know whether the dirty ring is the ideal
solution for vfio either.  Actually if we're tracking by page maps or
pinnings, then IMHO it also means that it could be more suitable to
use a modified version of the dirty ring buffer (as you suggested in
the other thread), in that we could track dirty state using an
(addr, len) range rather than a single page address.  That could be
hard for KVM because in KVM pages will mostly be trapped at 4K
granularity in page faults, and it'll also be hard to merge
contiguous entries with previous ones because the userspace could be
reading the entries (so after we publish the previous 4K dirty page,
we should not modify that entry any more).  VFIO should not have this
restriction because the marking of a dirty page range can be atomic
when the range of pages is mapped or pinned.
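
(To illustrate the shape of such an entry, a hypothetical sketch
rather than anything this series proposes:)

/* hypothetical vfio-style dirty record covering a whole range */
struct dirty_range {
	__u64 iova;	/* start of the range, known at map/pin time */
	__u64 len;	/* length in bytes, written atomically with iova */
};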

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v3 00/21] KVM: Dirty ring interface
  2020-01-09 17:08       ` Peter Xu
@ 2020-01-09 19:08         ` Michael S. Tsirkin
  2020-01-09 19:39           ` Peter Xu
  0 siblings, 1 reply; 82+ messages in thread
From: Michael S. Tsirkin @ 2020-01-09 19:08 UTC (permalink / raw)
  To: Peter Xu
  Cc: kvm, linux-kernel, Christophe de Dinechin, Paolo Bonzini,
	Sean Christopherson, Yan Zhao, Alex Williamson, Jason Wang,
	Kevin Kevin, Vitaly Kuznetsov, Dr . David Alan Gilbert

On Thu, Jan 09, 2020 at 12:08:49PM -0500, Peter Xu wrote:
> On Thu, Jan 09, 2020 at 11:40:23AM -0500, Michael S. Tsirkin wrote:
> 
> [...]
> 
> > > > I know it's mostly relevant for huge VMs, but OTOH these
> > > > probably use huge pages.
> > > 
> > > Yes huge VMs could benefit more, especially if the dirty rate is not
> > > that high, I believe.  Though, could you elaborate on why huge pages
> > > are special here?
> > > 
> > > Thanks,
> > 
> > With hugetlbfs there are less bits to test: e.g. with 2M pages a single
> > bit set marks 512 pages as dirty.  We do not take advantage of this
> > but it looks like a rather obvious optimization.
> 
> Right, but isn't that the trade-off between granularity of dirty
> tracking and how easy it is to collect the dirty bits?  Say, it'll be
> merely impossible to migrate 1G-huge-page-backed guests if we track
> dirty bits using huge page granularity, since each touch of guest
> memory will cause another 1G memory to be transferred even if most of
> the content is the same.  2M can be somewhere in the middle, but still
> the same write amplify issue exists.
>

OK, I see I was unclear.

IIUC at the moment KVM never uses huge pages if any part of the huge
page is tracked.  But if all parts of the page are written to, then a
huge page is used.

In that situation the whole huge page is dirty and needs to be
migrated.

> PS. that seems to be another topic after all besides the dirty ring
> series because we need to change our policy first if we want to track
> it with huge pages; with that, for dirty ring we can start to leverage
> the kvm_dirty_gfn.pad to store the page size with another new kvm cap
> when we really want.
> 
> Thanks,

Seems like leaking an implementation detail into the UAPI to me.


> -- 
> Peter Xu


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v3 00/21] KVM: Dirty ring interface
  2020-01-09 17:58   ` Peter Xu
@ 2020-01-09 19:13     ` Michael S. Tsirkin
  2020-01-09 19:23       ` Peter Xu
  2020-01-09 20:51       ` Paolo Bonzini
  0 siblings, 2 replies; 82+ messages in thread
From: Michael S. Tsirkin @ 2020-01-09 19:13 UTC (permalink / raw)
  To: Peter Xu
  Cc: Alex Williamson, kvm, linux-kernel, Christophe de Dinechin,
	Paolo Bonzini, Sean Christopherson, Yan Zhao, Jason Wang,
	Kevin Kevin, Vitaly Kuznetsov, Dr . David Alan Gilbert,
	Kirti Wankhede

On Thu, Jan 09, 2020 at 12:58:08PM -0500, Peter Xu wrote:
> On Thu, Jan 09, 2020 at 09:47:11AM -0700, Alex Williamson wrote:
> > On Thu,  9 Jan 2020 09:57:08 -0500
> > Peter Xu <peterx@redhat.com> wrote:
> > 
> > > Branch is here: https://github.com/xzpeter/linux/tree/kvm-dirty-ring
> > > (based on kvm/queue)
> > > 
> > > Please refer to either the previous cover letters, or documentation
> > > update in patch 12 for the big picture.  Previous posts:
> > > 
> > > V1: https://lore.kernel.org/kvm/20191129213505.18472-1-peterx@redhat.com
> > > V2: https://lore.kernel.org/kvm/20191221014938.58831-1-peterx@redhat.com
> > > 
> > > The major change in V3 is that we dropped the whole waitqueue and the
> > > global lock. With that, we have clean per-vcpu ring and no default
> > > ring any more.  The two kvmgt refactoring patches were also included
> > > to show the dependency of the works.
> > 
> > Hi Peter,
> 
> Hi, Alex,
> 
> > 
> > Would you recommend this style of interface for vfio dirty page
> > tracking as well?  This mechanism seems very tuned to sparse page
> > dirtying, how well does it handle fully dirty, or even significantly
> > dirty regions?
> 
> That's truely the point why I think the dirty bitmap can still be used
> and should be kept.  IIUC the dirty ring starts from COLO where (1)
> dirty rate is very low, and (2) sync happens frequently.  That's a
> perfect ground for dirty ring.  However it for sure does not mean that
> dirty ring can solve all the issues.  As you said, I believe the full
> dirty is another extreme in that dirty bitmap could perform better.
> 
> > We also don't really have "active" dirty page tracking
> > in vfio, we simply assume that if a page is pinned or otherwise mapped
> > that it's dirty, so I think we'd constantly be trying to re-populate
> > the dirty ring with pages that we've seen the user consume, which
> > doesn't seem like a good fit versus a bitmap solution.  Thanks,
> 
> Right, so I confess I don't know whether dirty ring is the ideal
> solutioon for vfio either.  Actually if we're tracking by page maps or
> pinnings, then IMHO it also means that it could be more suitable to
> use an modified version of dirty ring buffer (as you suggested in the
> other thread), in that we can track dirty using (addr, len) range
> rather than a single page address.  That could be hard for KVM because
> in KVM the page will be mostly trapped in 4K granularity in page
> faults, and it'll also be hard to merge continuous entries with
> previous ones because the userspace could be reading the entries (so
> after we publish the previous 4K dirty page, we should not modify the
> entry any more).

An easy way would be to keep a couple of entries around, not pushing
them into the ring until later.  In fact, deferring the queue write
until there's a bunch of data to push is a very handy optimization.

When building UAPIs it makes sense to try and keep them generic
rather than tying them to a given implementation.

That's one of the reasons I called for using something resembling
vring_packed_desc.


> VFIO should not have this restriction because the
> marking of dirty page range can be atomic when the range of pages are
> mapped or pinned.
> 
> Thanks,
> 
> -- 
> Peter Xu


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v3 12/21] KVM: X86: Implement ring-based dirty memory tracking
  2020-01-09 16:29   ` Michael S. Tsirkin
  2020-01-09 16:56     ` Alex Williamson
@ 2020-01-09 19:15     ` Peter Xu
  2020-01-09 19:35       ` Michael S. Tsirkin
  2020-01-19  9:09       ` Paolo Bonzini
  1 sibling, 2 replies; 82+ messages in thread
From: Peter Xu @ 2020-01-09 19:15 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: kvm, linux-kernel, Christophe de Dinechin, Paolo Bonzini,
	Sean Christopherson, Yan Zhao, Alex Williamson, Jason Wang,
	Kevin Kevin, Vitaly Kuznetsov, Dr . David Alan Gilbert, Lei Cao

On Thu, Jan 09, 2020 at 11:29:28AM -0500, Michael S. Tsirkin wrote:
> On Thu, Jan 09, 2020 at 09:57:20AM -0500, Peter Xu wrote:
> > This patch is heavily based on previous work from Lei Cao
> > <lei.cao@stratus.com> and Paolo Bonzini <pbonzini@redhat.com>. [1]
> > 
> > KVM currently uses large bitmaps to track dirty memory.  These bitmaps
> > are copied to userspace when userspace queries KVM for its dirty page
> > information.  The use of bitmaps is mostly sufficient for live
> > migration, as large parts of memory are dirtied from one log-dirty
> > pass to another.  However, in a checkpointing system, the number of
> > dirty pages is small and in fact it is often bounded---the VM is
> > paused when it has dirtied a pre-defined number of pages. Traversing a
> > large, sparsely populated bitmap to find set bits is time-consuming,
> > as is copying the bitmap to user-space.
> > 
> > A similar issue will be there for live migration when the guest memory
> > is huge while the page dirty procedure is trivial.  In that case for
> > each dirty sync we need to pull the whole dirty bitmap to userspace
> > and analyse every bit even if it's mostly zeros.
> > 
> > The preferred data structure for above scenarios is a dense list of
> > guest frame numbers (GFN).
> 
> No longer, this uses an array of structs.

(IMHO it's more or less a wording thing, because it's still an array
 of GFNs behind it...)

[...]

> > +Dirty GFNs (Guest Frame Numbers) are stored in the dirty_gfns array.
> > +Each dirty entry is defined as:
> > +
> > +struct kvm_dirty_gfn {
> > +        __u32 pad;
> 
> How about sticking a length here?
> This way huge pages can be dirtied in one go.

As we've discussed previously, current KVM tracks dirty pages at 4K
granularity only, so it seems to be something not easily covered in
this series.

We probably need to justify having KVM track huge pages first, or at
least establish a trend that we're going to do that; then we can
properly reserve the field here.

> 
> > +        __u32 slot; /* as_id | slot_id */
> > +        __u64 offset;
> > +};
> > +
> > +Most of the ring structure is used by KVM internally, while only the
> > +indices are exposed to userspace:
> > +
> > +struct kvm_dirty_ring_indices {
> > +	__u32 avail_index; /* set by kernel */
> > +	__u32 fetch_index; /* set by userspace */
> > +};
> > +
> > +The two indices in the ring buffer are free running counters.
> > +
> > +Userspace calls KVM_ENABLE_CAP ioctl right after KVM_CREATE_VM ioctl
> > +to enable this capability for the new guest and set the size of the
> > +rings.  It is only allowed before creating any vCPU, and the size of
> > +the ring must be a power of two.
> 
> 
> I know index design is popular, but testing with virtio showed
> that it's better to just have a flags field marking
> an entry as valid. In particular this gets rid of the
> running counters and power of two limitations.
> It also removes the need for a separate index page, which is nice.

Firstly, note that the separate index page has already been dropped
since V2, so we don't need to worry about that.

Regarding dropping the indices: I feel like it can be done, though we
probably need two extra bits for each GFN entry, for example:

  - Bit 0 of the GFN address to show whether this is a valid publish
    of a dirty gfn

  - Bit 1 of the GFN address to show whether the entry has been
    collected by the userspace

We can also use the padding field, but just want to show the idea
first.

Then each GFN can go through state changes like this (where "00b"
stands for the "bit1 bit0" values):

  00b (invalid GFN) ->
    01b (valid gfn published by kernel, which is dirty) ->
      10b (gfn dirty page collected by userspace) ->
        00b (gfn reset by kernel, so goes back to invalid gfn)

And we should always guarantee that both the userspace and KVM walk
the GFN array in a linear manner; for example, KVM must publish a new
GFN with bit 0 set right after the previous GFN it published.  Vice
versa for the userspace when it collects a dirty GFN and sets bit 1.

Michael, do you mean something like this?
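
(A rough sketch of the states, assuming we reuse the pad field as a
flags field; the flag names are hypothetical and not part of this
series:)

/* hypothetical per-entry flags replacing the shared indices */
#define KVM_DIRTY_GFN_F_DIRTY  (1 << 0)	/* set by kernel on publish */
#define KVM_DIRTY_GFN_F_RESET  (1 << 1)	/* set by userspace on collect */

/*
 * 00b: entry is invalid/free, reusable by the kernel
 * 01b: kernel published a dirty gfn, waiting for userspace
 * 10b: userspace collected it, waiting for the kernel reset
 */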

I think it should work logically; however IIUC it can expose more
security risks, in that the dirty ring is different from virtio:
userspace is not trusted here, while for virtio both sides (the
hypervisor and the guest driver) are trusted.  The above means we
would need the following changes for the new design:

  - Allow the GFN array to be mapped as writable by userspace (so
    that userspace can set bit 1),

  - The userspace must be trusted to follow the design (just imagine
    what happens if the userspace overwrites a GFN when it sets bit 1
    on a valid dirty gfn entry?  KVM could wrongly unprotect a page
    for the guest...).

While if we use the indices, we restrict the userspace to writing one
index only (the fetch_index).  That's all it can do to mess things up
(and it can't even do that as long as we properly validate the
fetch_index when it is read, which only happens during
KVM_RESET_DIRTY_RINGS and is very rare).  From that pov, it seems the
indices solution still has its benefits.
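
(As a concrete example of why the free-running counters stay safe
across u32 wraparound in that sanity check:)

	u32 reset_index = 0xffffff00, fetch = 0x00000300, size = 4096;

	/* unsigned subtraction wraps, giving the true distance 0x400 */
	if (fetch - reset_index > size)	/* 1024 <= 4096, so this passes */
		return -EINVAL;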

> 
> 
> 
> >  The larger the ring buffer, the less
> > +likely the ring is full and the VM is forced to exit to userspace. The
> > +optimal size depends on the workload, but it is recommended that it be
> > +at least 64 KiB (4096 entries).
> 
> > Where's this number coming from?  Given you have indices as well,
> > 4K-sized rings are likely to cause cache contention.

I think we've had some similar discussion in previous versions on the
size of the ring.  Again, imho it's really something that may not
have a definitive answer, as long as the ring is big enough (4K
entries should be).

Regarding the cache contention: could you explain more?  Do you have
a suggestion on the ring size instead, considering the issue?

[...]

> > +int kvm_dirty_ring_reset(struct kvm *kvm, struct kvm_dirty_ring *ring)
> > +{
> > +	u32 cur_slot, next_slot;
> > +	u64 cur_offset, next_offset;
> > +	unsigned long mask;
> > +	u32 fetch;
> > +	int count = 0;
> > +	struct kvm_dirty_gfn *entry;
> > +	struct kvm_dirty_ring_indices *indices = ring->indices;
> > +	bool first_round = true;
> > +
> > +	fetch = READ_ONCE(indices->fetch_index);
> 
> So this does not work if the data cache is virtually tagged, which
> to the best of my knowledge isn't the case on any CPU kvm supports.
> However it might not stay that way forever.  Worth at least a
> comment.

This is the read side.  IIUC even with virtually tagged archs, we
should do the flushing on the write side rather than the read side,
and that should be enough?

Also, I believe this is similar to the question that Jason asked on
V2.  Sorry, I should have mentioned this earlier, but I didn't
address it in this series because if we need to do so we probably
need to do it KVM-wide, rather than only in this series.  I feel like
it's missing probably only because all existing KVM-supported archs
do not have virtually tagged caches, as you mentioned.  If so, I
would prefer to ignore that issue until KVM starts to support such an
arch.

> 
> 
> > +
> > +	/*
> > +	 * Note that fetch_index is written by the userspace and
> > +	 * should not be trusted.  If the sanity check below fails,
> > +	 * the userspace has most probably written a bogus fetch_index.
> > +	 */
> > +	if (fetch - ring->reset_index > ring->size)
> > +		return -EINVAL;
> > +
> > +	if (fetch == ring->reset_index)
> > +		return 0;
> > +
> > +	/* This is only needed to make compilers happy */
> > +	cur_slot = cur_offset = mask = 0;
> > +	while (ring->reset_index != fetch) {
> > +		entry = &ring->dirty_gfns[ring->reset_index & (ring->size - 1)];
> > +		next_slot = READ_ONCE(entry->slot);
> > +		next_offset = READ_ONCE(entry->offset);
> 
> What is this READ_ONCE doing?  Entries are only written by the
> kernel, and under a lock.

The entries are written in kvm_dirty_ring_push(), where no lock is
held (there is one wmb() though, to guarantee the ordering of the
entry writes against the index update).

With the wmb(), the write side guarantees the data makes it to
memory.  For the read side here, I think it's still good to have
READ_ONCE() to make sure we read from memory?
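
(For completeness, the pairing I have in mind on the userspace side;
a minimal sketch where collect() is a placeholder, and a real program
would use its own equivalents of the kernel barrier macros:)

	/* userspace harvest loop, paired with the kernel's smp_wmb() */
	avail = READ_ONCE(indices->avail_index);
	smp_rmb();		/* read the index before the entries */
	while (fetch != avail) {
		entry = &dirty_gfns[fetch & (size - 1)];
		collect(entry->slot, entry->offset);
		fetch++;
	}
	WRITE_ONCE(indices->fetch_index, fetch);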

> 
> > +		ring->reset_index++;
> > +		count++;
> > +		/*
> > +		 * Try to coalesce the reset operations when the guest is
> > +		 * scanning pages in the same slot.
> > +		 */
> > +		if (!first_round && next_slot == cur_slot) {
> > +			s64 delta = next_offset - cur_offset;
> > +
> > +			if (delta >= 0 && delta < BITS_PER_LONG) {
> > +				mask |= 1ull << delta;
> > +				continue;
> > +			}
> > +
> > +			/* Backwards visit, careful about overflows!  */
> > +			if (delta > -BITS_PER_LONG && delta < 0 &&
> > +			    (mask << -delta >> -delta) == mask) {
> > +				cur_offset = next_offset;
> > +				mask = (mask << -delta) | 1;
> > +				continue;
> > +			}
> > +		}
> 
> Well, how important is this logic?  It will not be very effective
> on an SMP system, so don't you need a per-cpu ring?

It's my fault to have omitted the high-level design in the cover
letter, but we do have per-vcpu rings now.  Actually that's all we
have (we dropped the per-vm ring already), so ring access does not
need a lock any more.

This logic is useful because kvm_reset_dirty_gfn() internally calls
kvm_arch_mmu_enable_log_dirty_pt_masked(), which supports masks, so
it's good to do the reset for contiguous pages (or pages that are
close enough) in a single shot.
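
(A worked example of the coalescing in the reset loop above: four
entries in the same slot with offsets 16, 17, 19, 15 fold into a
single final call:)

	/*
	 *   16 -> cur_offset = 16, mask = 00001b
	 *   17 -> delta =  1,      mask = 00011b
	 *   19 -> delta =  3,      mask = 01011b
	 *   15 -> delta = -1, rebase: cur_offset = 15, mask = 10111b
	 *
	 * so kvm_reset_dirty_gfn(kvm, slot, 15, 0b10111) resets all
	 * four pages in one shot.
	 */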

> 
> 
> 
> > +		kvm_reset_dirty_gfn(kvm, cur_slot, cur_offset, mask);
> > +		cur_slot = next_slot;
> > +		cur_offset = next_offset;
> > +		mask = 1;
> > +		first_round = false;
> > +	}
> > +	kvm_reset_dirty_gfn(kvm, cur_slot, cur_offset, mask);
> > +
> > +	trace_kvm_dirty_ring_reset(ring);
> > +
> > +	return count;
> > +}
> > +
> > +void kvm_dirty_ring_push(struct kvm_dirty_ring *ring, u32 slot, u64 offset)
> > +{
> > +	struct kvm_dirty_gfn *entry;
> > +	struct kvm_dirty_ring_indices *indices = ring->indices;
> > +
> > +	/* It should never get full */
> > +	WARN_ON_ONCE(kvm_dirty_ring_full(ring));
> > +
> > +	entry = &ring->dirty_gfns[ring->dirty_index & (ring->size - 1)];
> > +	entry->slot = slot;
> > +	entry->offset = offset;
> > +	/*
> > +	 * Make sure the data is filled in before we publish this to
> > +	 * the userspace program.  There's no paired kernel-side reader.
> > +	 */
> > +	smp_wmb();
> > +	ring->dirty_index++;
> 
> 
> Do I understand correctly that the ring is shared between CPUs?
> If so, I don't see why it's safe for SMP guests.
> Don't you need atomics or locking?

No, it's per-vcpu.

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v3 12/21] KVM: X86: Implement ring-based dirty memory tracking
  2020-01-09 16:56     ` Alex Williamson
@ 2020-01-09 19:21       ` Peter Xu
  2020-01-09 19:36         ` Michael S. Tsirkin
  0 siblings, 1 reply; 82+ messages in thread
From: Peter Xu @ 2020-01-09 19:21 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Michael S. Tsirkin, kvm, linux-kernel, Christophe de Dinechin,
	Paolo Bonzini, Sean Christopherson, Yan Zhao, Jason Wang,
	Kevin Kevin, Vitaly Kuznetsov, Dr . David Alan Gilbert, Lei Cao

On Thu, Jan 09, 2020 at 09:56:10AM -0700, Alex Williamson wrote:

[...]

> > > +Dirty GFNs (Guest Frame Numbers) are stored in the dirty_gfns array.
> > > +Each dirty entry is defined as:
> > > +
> > > +struct kvm_dirty_gfn {
> > > +        __u32 pad;  
> > 
> > How about sticking a length here?
> > This way huge pages can be dirtied in one go.
> 
> Not just huge pages, but any contiguous range of dirty pages could be
> reported far more concisely.  Thanks,

I replied in the other thread on why I thought KVM might not suit
that (while vfio may).

Actually we could even do that for KVM, as long as we keep a per-vcpu
last-dirtied GFN range cache (so we don't publish a dirty GFN right
after it's dirtied); then we grow that cached range as long as the
contiguous next/previous page is dirtied.  If we find that the
current dirty GFN is not contiguous with the cached range, we publish
the cached range and let the new GFN start a fresh last-dirtied GFN
range cache.

However I am not sure how much we'd gain from it.  Maybe we can do
that when we have a real use case for it.  For now I'm not sure
whether it would be worth the effort.
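
(A minimal sketch of such a cache; the fields, the dirty_cache member
and the kvm_dirty_ring_push_range() helper are all hypothetical, and
it only grows forward for brevity:)

/* hypothetical per-vcpu cache of the last-dirtied gfn range */
struct dirty_range_cache {
	u32 slot;
	u64 first;	/* first gfn of the cached range */
	u64 nr;		/* number of contiguous dirty gfns */
};

static void track_dirty_gfn(struct kvm_vcpu *vcpu, u32 slot, u64 gfn)
{
	struct dirty_range_cache *c = &vcpu->dirty_cache;

	if (c->nr && slot == c->slot && gfn == c->first + c->nr) {
		c->nr++;	/* contiguous: grow, publish later */
		return;
	}
	if (c->nr)		/* broken run: publish the old range */
		kvm_dirty_ring_push_range(vcpu, c->slot, c->first, c->nr);
	c->slot = slot;
	c->first = gfn;
	c->nr = 1;
}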

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v3 00/21] KVM: Dirty ring interface
  2020-01-09 19:13     ` Michael S. Tsirkin
@ 2020-01-09 19:23       ` Peter Xu
  2020-01-09 19:37         ` Michael S. Tsirkin
  2020-01-09 20:51       ` Paolo Bonzini
  1 sibling, 1 reply; 82+ messages in thread
From: Peter Xu @ 2020-01-09 19:23 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Alex Williamson, kvm, linux-kernel, Christophe de Dinechin,
	Paolo Bonzini, Sean Christopherson, Yan Zhao, Jason Wang,
	Kevin Kevin, Vitaly Kuznetsov, Dr . David Alan Gilbert,
	Kirti Wankhede

On Thu, Jan 09, 2020 at 02:13:54PM -0500, Michael S. Tsirkin wrote:
> On Thu, Jan 09, 2020 at 12:58:08PM -0500, Peter Xu wrote:
> > On Thu, Jan 09, 2020 at 09:47:11AM -0700, Alex Williamson wrote:
> > > On Thu,  9 Jan 2020 09:57:08 -0500
> > > Peter Xu <peterx@redhat.com> wrote:
> > > 
> > > > Branch is here: https://github.com/xzpeter/linux/tree/kvm-dirty-ring
> > > > (based on kvm/queue)
> > > > 
> > > > Please refer to either the previous cover letters, or documentation
> > > > update in patch 12 for the big picture.  Previous posts:
> > > > 
> > > > V1: https://lore.kernel.org/kvm/20191129213505.18472-1-peterx@redhat.com
> > > > V2: https://lore.kernel.org/kvm/20191221014938.58831-1-peterx@redhat.com
> > > > 
> > > > The major change in V3 is that we dropped the whole waitqueue and the
> > > > global lock. With that, we have clean per-vcpu ring and no default
> > > > ring any more.  The two kvmgt refactoring patches were also included
> > > > to show the dependency of the works.
> > > 
> > > Hi Peter,
> > 
> > Hi, Alex,
> > 
> > > 
> > > Would you recommend this style of interface for vfio dirty page
> > > tracking as well?  This mechanism seems very tuned to sparse page
> > > dirtying, how well does it handle fully dirty, or even significantly
> > > dirty regions?
> > 
> > That's truely the point why I think the dirty bitmap can still be used
> > and should be kept.  IIUC the dirty ring starts from COLO where (1)
> > dirty rate is very low, and (2) sync happens frequently.  That's a
> > perfect ground for dirty ring.  However it for sure does not mean that
> > dirty ring can solve all the issues.  As you said, I believe the full
> > dirty is another extreme in that dirty bitmap could perform better.
> > 
> > > We also don't really have "active" dirty page tracking
> > > in vfio, we simply assume that if a page is pinned or otherwise mapped
> > > that it's dirty, so I think we'd constantly be trying to re-populate
> > > the dirty ring with pages that we've seen the user consume, which
> > > doesn't seem like a good fit versus a bitmap solution.  Thanks,
> > 
> > Right, so I confess I don't know whether dirty ring is the ideal
> > solutioon for vfio either.  Actually if we're tracking by page maps or
> > pinnings, then IMHO it also means that it could be more suitable to
> > use an modified version of dirty ring buffer (as you suggested in the
> > other thread), in that we can track dirty using (addr, len) range
> > rather than a single page address.  That could be hard for KVM because
> > in KVM the page will be mostly trapped in 4K granularity in page
> > faults, and it'll also be hard to merge continuous entries with
> > previous ones because the userspace could be reading the entries (so
> > after we publish the previous 4K dirty page, we should not modify the
> > entry any more).
> 
> An easy way would be to keep a couple of entries around, not pushing
> them into the ring until later.  In fact deferring queue write until
> there's a bunch of data to be pushed is a very handy optimization.

I feel like I proposed a similar thing in the other thread. :-)

> 
> When building UAPI's it makes sense to try and keep them generic
> rather than tying them to a given implementation.
> 
> That's one of the reasons I called for using something
> resembling vring_packed_desc.

But again, I just want to make sure I don't over-engineer...

I'll wait for further feedback from others for this.

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v3 12/21] KVM: X86: Implement ring-based dirty memory tracking
  2020-01-09 19:15     ` Peter Xu
@ 2020-01-09 19:35       ` Michael S. Tsirkin
  2020-01-09 20:19         ` Peter Xu
  2020-01-14 20:01         ` Peter Xu
  2020-01-19  9:09       ` Paolo Bonzini
  1 sibling, 2 replies; 82+ messages in thread
From: Michael S. Tsirkin @ 2020-01-09 19:35 UTC (permalink / raw)
  To: Peter Xu
  Cc: kvm, linux-kernel, Christophe de Dinechin, Paolo Bonzini,
	Sean Christopherson, Yan Zhao, Alex Williamson, Jason Wang,
	Kevin Kevin, Vitaly Kuznetsov, Dr . David Alan Gilbert, Lei Cao

On Thu, Jan 09, 2020 at 02:15:14PM -0500, Peter Xu wrote:
> On Thu, Jan 09, 2020 at 11:29:28AM -0500, Michael S. Tsirkin wrote:
> > On Thu, Jan 09, 2020 at 09:57:20AM -0500, Peter Xu wrote:
> > > This patch is heavily based on previous work from Lei Cao
> > > <lei.cao@stratus.com> and Paolo Bonzini <pbonzini@redhat.com>. [1]
> > > 
> > > KVM currently uses large bitmaps to track dirty memory.  These bitmaps
> > > are copied to userspace when userspace queries KVM for its dirty page
> > > information.  The use of bitmaps is mostly sufficient for live
> > > migration, as large parts of memory are dirtied from one log-dirty
> > > pass to another.  However, in a checkpointing system, the number of
> > > dirty pages is small and in fact it is often bounded---the VM is
> > > paused when it has dirtied a pre-defined number of pages. Traversing a
> > > large, sparsely populated bitmap to find set bits is time-consuming,
> > > as is copying the bitmap to user-space.
> > > 
> > > A similar issue will be there for live migration when the guest memory
> > > is huge while the page dirty procedure is trivial.  In that case for
> > > each dirty sync we need to pull the whole dirty bitmap to userspace
> > > and analyse every bit even if it's mostly zeros.
> > > 
> > > The preferred data structure for above scenarios is a dense list of
> > > guest frame numbers (GFN).
> > 
> > No longer, this uses an array of structs.
> 
> (IMHO it's more or less a wording thing, because it's still an array
>  of GFNs behind it...)
> 
> [...]
> 
> > > +Dirty GFNs (Guest Frame Numbers) are stored in the dirty_gfns array.
> > > +Each dirty entry is defined as:
> > > +
> > > +struct kvm_dirty_gfn {
> > > +        __u32 pad;
> > 
> > How about sticking a length here?
> > This way huge pages can be dirtied in one go.
> 
> As we've discussed previously, current KVM tracks dirty in 4K page
> only, so it seems to be something that is not easily covered in this
> series.
> 
> We probably need to justify having KVM track huge pages first,
> or at least show a trend that we're going to do that, then we can
> properly reserve it here.
> 
> > 
> > > +        __u32 slot; /* as_id | slot_id */
> > > +        __u64 offset;
> > > +};
> > > +
> > > +Most of the ring structure is used by KVM internally, while only the
> > > +indices are exposed to userspace:
> > > +
> > > +struct kvm_dirty_ring_indices {
> > > +	__u32 avail_index; /* set by kernel */
> > > +	__u32 fetch_index; /* set by userspace */
> > > +};
> > > +
> > > +The two indices in the ring buffer are free-running counters.
> > > +
> > > +Userspace calls KVM_ENABLE_CAP ioctl right after KVM_CREATE_VM ioctl
> > > +to enable this capability for the new guest and set the size of the
> > > +rings.  It is only allowed before creating any vCPU, and the size of
> > > +the ring must be a power of two.
> > 
> > 
> > I know index design is popular, but testing with virtio showed
> > that it's better to just have a flags field marking
> > an entry as valid. In particular this gets rid of the
> > running counters and power of two limitations.
> > It also removes the need for a separate index page, which is nice.
> 
> Firstly, note that the separate index page has already been dropped
> since V2, so we don't need to worry about that.

A changelog would be nice.
So now, how does userspace tell kvm it's done with the ring?

> Regarding dropping the indices: I feel like it can be done, though we
> probably need two extra bits for each GFN entry, for example:
> 
>   - Bit 0 of the GFN address to show whether this is a valid publish
>     of dirty gfn
> 
>   - Bit 1 of the GFN address to show whether this is collected by the
>     user


I wonder whether you will end up reinventing virtio.
You are already pretty close with avail/used bits in flags.
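
For comparison, a sketch of what such a flags-based entry could look
like (hypothetical layout, not what this series posts):

struct kvm_dirty_gfn {
        __u32 flags;    /* bit 0: set by KVM when the GFN is published,
                         * bit 1: set by userspace once collected;
                         * KVM clears both when the slot is reset */
        __u32 slot;     /* as_id | slot_id */
        __u64 offset;
};

Full/empty would then be decided by the flags of the next slot rather
than by comparing free-running counters, much like the avail/used bits
of a virtio packed ring.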



> We can also use the padding field, but I just want to show the idea
> first.
> 
> Then for each GFN we can go through state changes like this (things
> like "00b" stands for "bit1 bit0" values):
> 
>   00b (invalid GFN) ->
>     01b (valid gfn published by kernel, which is dirty) ->
>       10b (gfn dirty page collected by userspace) ->
>         00b (gfn reset by kernel, so goes back to invalid gfn)
> 
> And we should always guarantee that both userspace and KVM walk
> the GFN array in a linear manner; for example, KVM must publish a new
> GFN with bit 0 set right after the previously published GFN.  The same
> goes for userspace when it collects a dirty GFN and marks bit 1.
> 
> Michael, do you mean something like this?
> 
> I think it should work logically; however, IIUC it can expose more
> security risks, say, dirty ring is different from virtio in that
> userspace is not trusted,

In what sense?

> while for virtio, both sides (hypervisor,
> and the guest driver) are trusted.

What gave you the impression guest is trusted in virtio?


>  Above means we need to do these to
> change to the new design:
> 
>   - Allow the GFN array to be mapped as writable by userspace (so that
>     userspace can publish bit 1),
> 
>   - The userspace must be trusted to follow the design (just imagine
>     what happens if userspace overwrites a GFN when it publishes bit 1
>     over a valid dirty gfn entry?  KVM could wrongly unprotect a page
>     for the guest...).

You mean protect, right?  So what?

> While if we use the indices, we restrict userspace to writing one
> index only (which is the reset_index).  That's all it
> can do to mess things up (and it never can, as long as we properly
> validate the reset_index when it is read, which only happens during
> KVM_RESET_DIRTY_RINGS and is very rare).  From that pov, it seems the
> indices solution still has its benefits.

So if you mess up the index how is this different?

I agree RO page kind of feels safer generally though.

I will have to re-read how the ring works though;
my comments were based on the old assumption of an mmapped
page with indices.



> > 
> > 
> > 
> > >  The larger the ring buffer, the less
> > > +likely the ring is full and the VM is forced to exit to userspace. The
> > > +optimal size depends on the workload, but it is recommended that it be
> > > +at least 64 KiB (4096 entries).
> > 
> > Where's this number coming from? Given you have indices as well,
> > 4K size rings is likely to cause cache contention.
> 
> I think we've had some similar discussion in previous versions on the
> size of ring.  Again imho it's really something without a clear-cut
> answer, as long as it's big enough (4K should be).
> 
> Regarding the cache contention: could you explain more?

4K is a whole cache way; 64K is 16 ways.  If anything else is on a hot
path then you are pushing everything out of cache.  I'd need to re-read
how the indices work to see whether an index is on the hot path or not.
If it is, your structure won't fit in L1 cache, which is not great.


>  Do you
> have a suggestion on the ring size instead, considering this issue?
> 
> [...]

I'll have to re-learn how things work with the indices gone
from shared memory.

> > > +int kvm_dirty_ring_reset(struct kvm *kvm, struct kvm_dirty_ring *ring)
> > > +{
> > > +	u32 cur_slot, next_slot;
> > > +	u64 cur_offset, next_offset;
> > > +	unsigned long mask;
> > > +	u32 fetch;
> > > +	int count = 0;
> > > +	struct kvm_dirty_gfn *entry;
> > > +	struct kvm_dirty_ring_indices *indices = ring->indices;
> > > +	bool first_round = true;
> > > +
> > > +	fetch = READ_ONCE(indices->fetch_index);
> > 
> > So this does not work if the data cache is virtually tagged.
> > Which to the best of my knowledge isn't the case on any
> > CPU kvm supports. However it might not stay being the
> > case forever. Worth at least commenting.
> 
> This is the read side.  IIUC even with virtually tagged archs, we
> should do the flushing on the write side rather than the read side,
> and that should be enough?

No.
See e.g.  Documentation/core-api/cachetlb.rst

  ``void flush_dcache_page(struct page *page)``

        Any time the kernel writes to a page cache page, _OR_
        the kernel is about to read from a page cache page and
        user space shared/writable mappings of this page potentially
        exist, this routine is called.


> Also, I believe this is the similar question that Jason has asked in
> V2.  Sorry I should mention this earlier, but I didn't address that in
> this series because if we need to do so we probably need to do it
> kvm-wise, rather than only in this series.

You need to document these things.

>  I feel like it's missing
> probably only because all existing KVM-supported archs do not have
> virtually tagged caches, as you mentioned.

But is that a fact? ARM has such a variety of CPUs,
I can't really tell. Did you research this to make sure?

> If so, I would prefer if you
> can allow me to ignore that issue until KVM starts to support such an
> arch.

Document limitations pls.  Don't ignore them.

> > 
> > 
> > > +
> > > +	/*
> > > +	 * Note that fetch_index is written by the userspace, which
> > > +	 * should not be trusted.  If this happens, then it's probably
> > > +	 * that the userspace has written a wrong fetch_index.
> > > +	 */
> > > +	if (fetch - ring->reset_index > ring->size)
> > > +		return -EINVAL;
> > > +
> > > +	if (fetch == ring->reset_index)
> > > +		return 0;
> > > +
> > > +	/* This is only needed to make compilers happy */
> > > +	cur_slot = cur_offset = mask = 0;
> > > +	while (ring->reset_index != fetch) {
> > > +		entry = &ring->dirty_gfns[ring->reset_index & (ring->size - 1)];
> > > +		next_slot = READ_ONCE(entry->slot);
> > > +		next_offset = READ_ONCE(entry->offset);
> > 
> > What is this READ_ONCE doing? Entries are only written by kernel
> > and it's under lock.
> 
> The entries are written in kvm_dirty_ring_push() where there is
> no lock (there's one wmb() though to guarantee ordering of these
> and the index update).
> 
> With the wmb(), the write side should guarantee to make it to memory.
> For the read side here, I think it's still good to have it to make
> sure we read from memory?
> 
> > 
> > > +		ring->reset_index++;
> > > +		count++;
> > > +		/*
> > > +		 * Try to coalesce the reset operations when the guest is
> > > +		 * scanning pages in the same slot.
> > > +		 */
> > > +		if (!first_round && next_slot == cur_slot) {
> > > +			s64 delta = next_offset - cur_offset;
> > > +
> > > +			if (delta >= 0 && delta < BITS_PER_LONG) {
> > > +				mask |= 1ull << delta;
> > > +				continue;
> > > +			}
> > > +
> > > +			/* Backwards visit, careful about overflows!  */
> > > +			if (delta > -BITS_PER_LONG && delta < 0 &&
> > > +			    (mask << -delta >> -delta) == mask) {
> > > +				cur_offset = next_offset;
> > > +				mask = (mask << -delta) | 1;
> > > +				continue;
> > > +			}
> > > +		}
> > 
> > Well how important is this logic? Because it will not be
> > too effective on an SMP system, so don't you need a per-cpu ring?
> 
> It's my fault to have omitted the high-level design from the cover
> letter, but we do have a per-vcpu ring now.  Actually that's all we
> have (we dropped the per-vm ring already), so ring access does not
> need a lock any more.
> 
> This logic is good because kvm_reset_dirty_gfn() ends up in
> kvm_arch_mmu_enable_log_dirty_pt_masked(), which supports masks, so it
> would be good to do the reset for contiguous pages (or pages that are
> close enough) in a single shot.
> 
> > 
> > 
> > 
> > > +		kvm_reset_dirty_gfn(kvm, cur_slot, cur_offset, mask);
> > > +		cur_slot = next_slot;
> > > +		cur_offset = next_offset;
> > > +		mask = 1;
> > > +		first_round = false;
> > > +	}
> > > +	kvm_reset_dirty_gfn(kvm, cur_slot, cur_offset, mask);
> > > +
> > > +	trace_kvm_dirty_ring_reset(ring);
> > > +
> > > +	return count;
> > > +}
> > > +
> > > +void kvm_dirty_ring_push(struct kvm_dirty_ring *ring, u32 slot, u64 offset)
> > > +{
> > > +	struct kvm_dirty_gfn *entry;
> > > +	struct kvm_dirty_ring_indices *indices = ring->indices;
> > > +
> > > +	/* It should never get full */
> > > +	WARN_ON_ONCE(kvm_dirty_ring_full(ring));
> > > +
> > > +	entry = &ring->dirty_gfns[ring->dirty_index & (ring->size - 1)];
> > > +	entry->slot = slot;
> > > +	entry->offset = offset;
> > > +	/*
> > > +	 * Make sure the data is filled in before we publish this to
> > > +	 * the userspace program.  There's no paired kernel-side reader.
> > > +	 */
> > > +	smp_wmb();
> > > +	ring->dirty_index++;
> > 
> > 
> > Do I understand it correctly that the ring is shared between CPUs?
> > If so I don't understand why it's safe for SMP guests.
> > Don't you need atomics or locking?
> 
> No, it's per-vcpu.
> 
> Thanks,
> 
> -- 
> Peter Xu


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v3 12/21] KVM: X86: Implement ring-based dirty memory tracking
  2020-01-09 19:21       ` Peter Xu
@ 2020-01-09 19:36         ` Michael S. Tsirkin
  0 siblings, 0 replies; 82+ messages in thread
From: Michael S. Tsirkin @ 2020-01-09 19:36 UTC (permalink / raw)
  To: Peter Xu
  Cc: Alex Williamson, kvm, linux-kernel, Christophe de Dinechin,
	Paolo Bonzini, Sean Christopherson, Yan Zhao, Jason Wang,
	Kevin Kevin, Vitaly Kuznetsov, Dr . David Alan Gilbert, Lei Cao

On Thu, Jan 09, 2020 at 02:21:16PM -0500, Peter Xu wrote:
> On Thu, Jan 09, 2020 at 09:56:10AM -0700, Alex Williamson wrote:
> 
> [...]
> 
> > > > +Dirty GFNs (Guest Frame Numbers) are stored in the dirty_gfns array.
> > > > +Each dirty entry is defined as:
> > > > +
> > > > +struct kvm_dirty_gfn {
> > > > +        __u32 pad;  
> > > 
> > > How about sticking a length here?
> > > This way huge pages can be dirtied in one go.
> > 
> > Not just huge pages, but any contiguous range of dirty pages could be
> > reported far more concisely.  Thanks,
> 
> I replied in the other thread on why I thought KVM might not suit
> that (while vfio may).
> 
> Actually we can even do that for KVM as long as we keep a per-vcpu
> last-dirtied GFN range cache (so we don't publish a dirty GFN right
> after it's dirtied), then we grow that cached range as long as
> the contiguous next/previous page is dirtied.  If the
> current dirty GFN is not contiguous with the cached range, we publish
> the cached range and let the new GFN start a new last-dirtied
> GFN range cache.
> 
> However I am not sure how much we'll gain from it.  Maybe we can do
> that when we have a real use case for it.  For now I'm not sure
> whether it would be worth the effort.
> 
> Thanks,

I agree for the implementation but I think UAPI should support that
from the ground up so we don't need to support two kinds of formats.
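
For example the entry format could carry a length from day one
(hypothetical layout, just to illustrate):

struct kvm_dirty_range {
        __u32 slot;     /* as_id | slot_id */
        __u32 npages;   /* always 1 until an implementation coalesces */
        __u64 offset;   /* first dirty page within the slot */
};

Consumers written against that format would keep working unchanged if
the kernel later learns to publish multi-page ranges.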

> -- 
> Peter Xu


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v3 00/21] KVM: Dirty ring interface
  2020-01-09 19:23       ` Peter Xu
@ 2020-01-09 19:37         ` Michael S. Tsirkin
  0 siblings, 0 replies; 82+ messages in thread
From: Michael S. Tsirkin @ 2020-01-09 19:37 UTC (permalink / raw)
  To: Peter Xu
  Cc: Alex Williamson, kvm, linux-kernel, Christophe de Dinechin,
	Paolo Bonzini, Sean Christopherson, Yan Zhao, Jason Wang,
	Kevin Kevin, Vitaly Kuznetsov, Dr . David Alan Gilbert,
	Kirti Wankhede

On Thu, Jan 09, 2020 at 02:23:18PM -0500, Peter Xu wrote:
> On Thu, Jan 09, 2020 at 02:13:54PM -0500, Michael S. Tsirkin wrote:
> > On Thu, Jan 09, 2020 at 12:58:08PM -0500, Peter Xu wrote:
> > > On Thu, Jan 09, 2020 at 09:47:11AM -0700, Alex Williamson wrote:
> > > > On Thu,  9 Jan 2020 09:57:08 -0500
> > > > Peter Xu <peterx@redhat.com> wrote:
> > > > 
> > > > > Branch is here: https://github.com/xzpeter/linux/tree/kvm-dirty-ring
> > > > > (based on kvm/queue)
> > > > > 
> > > > > Please refer to either the previous cover letters, or documentation
> > > > > update in patch 12 for the big picture.  Previous posts:
> > > > > 
> > > > > V1: https://lore.kernel.org/kvm/20191129213505.18472-1-peterx@redhat.com
> > > > > V2: https://lore.kernel.org/kvm/20191221014938.58831-1-peterx@redhat.com
> > > > > 
> > > > > The major change in V3 is that we dropped the whole waitqueue and the
> > > > > global lock. With that, we have clean per-vcpu ring and no default
> > > > > ring any more.  The two kvmgt refactoring patches were also included
> > > > > to show the dependency of the works.
> > > > 
> > > > Hi Peter,
> > > 
> > > Hi, Alex,
> > > 
> > > > 
> > > > Would you recommend this style of interface for vfio dirty page
> > > > tracking as well?  This mechanism seems very tuned to sparse page
> > > > dirtying, how well does it handle fully dirty, or even significantly
> > > > dirty regions?
> > > 
> > > That's truly the point why I think the dirty bitmap can still be used
> > > and should be kept.  IIUC the dirty ring idea starts from COLO where (1)
> > > the dirty rate is very low, and (2) sync happens frequently.  That's a
> > > perfect ground for the dirty ring.  However it for sure does not mean that
> > > the dirty ring can solve all the issues.  As you said, I believe the fully
> > > dirty case is the other extreme, where the dirty bitmap could perform better.
> > > 
> > > > We also don't really have "active" dirty page tracking
> > > > in vfio, we simply assume that if a page is pinned or otherwise mapped
> > > > that it's dirty, so I think we'd constantly be trying to re-populate
> > > > the dirty ring with pages that we've seen the user consume, which
> > > > doesn't seem like a good fit versus a bitmap solution.  Thanks,
> > > 
> > > Right, so I confess I don't know whether dirty ring is the ideal
> > > solution for vfio either.  Actually if we're tracking by page maps or
> > > pinnings, then IMHO it also means that it could be more suitable to
> > > use a modified version of the dirty ring buffer (as you suggested in the
> > > other thread), in that we can track dirty using (addr, len) range
> > > rather than a single page address.  That could be hard for KVM because
> > > in KVM the page will be mostly trapped in 4K granularity in page
> > > faults, and it'll also be hard to merge continuous entries with
> > > previous ones because the userspace could be reading the entries (so
> > > after we publish the previous 4K dirty page, we should not modify the
> > > entry any more).
> > 
> > An easy way would be to keep a couple of entries around, not pushing
> > them into the ring until later.  In fact deferring queue write until
> > there's a bunch of data to be pushed is a very handy optimization.
> 
> I feel like I proposed a similar thing in the other thread. :-)
> 
> > 
> > When building UAPI's it makes sense to try and keep them generic
> > rather than tying them to a given implementation.
> > 
> > That's one of the reasons I called for using something
> > resembling vring_packed_desc.
> 
> But again, I just want to make sure I don't over-engineer...


You will know when you start profiling in earnest.

> I'll wait for further feedback from others for this.
> 
> Thanks,
> 
> -- 
> Peter Xu


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v3 00/21] KVM: Dirty ring interface
  2020-01-09 19:08         ` Michael S. Tsirkin
@ 2020-01-09 19:39           ` Peter Xu
  2020-01-09 20:42             ` Paolo Bonzini
  2020-01-09 22:28             ` Michael S. Tsirkin
  0 siblings, 2 replies; 82+ messages in thread
From: Peter Xu @ 2020-01-09 19:39 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: kvm, linux-kernel, Christophe de Dinechin, Paolo Bonzini,
	Sean Christopherson, Yan Zhao, Alex Williamson, Jason Wang,
	Kevin Kevin, Vitaly Kuznetsov, Dr . David Alan Gilbert

On Thu, Jan 09, 2020 at 02:08:52PM -0500, Michael S. Tsirkin wrote:
> On Thu, Jan 09, 2020 at 12:08:49PM -0500, Peter Xu wrote:
> > On Thu, Jan 09, 2020 at 11:40:23AM -0500, Michael S. Tsirkin wrote:
> > 
> > [...]
> > 
> > > > > I know it's mostly relevant for huge VMs, but OTOH these
> > > > > probably use huge pages.
> > > > 
> > > > Yes huge VMs could benefit more, especially if the dirty rate is not
> > > > that high, I believe.  Though, could you elaborate on why huge pages
> > > > are special here?
> > > > 
> > > > Thanks,
> > > 
> > > With hugetlbfs there are fewer bits to test: e.g. with 2M pages a single
> > > bit set marks 512 pages as dirty.  We do not take advantage of this
> > > but it looks like a rather obvious optimization.
> > 
> > Right, but isn't that the trade-off between granularity of dirty
> > tracking and how easy it is to collect the dirty bits?  Say, it'll be
> > nearly impossible to migrate 1G-huge-page-backed guests if we track
> > dirty bits using huge page granularity, since each touch of guest
> > memory will cause another 1G of memory to be transferred even if most
> > of the content is the same.  2M can be somewhere in the middle, but
> > still the same write amplification issue exists.
> >
> 
> OK I see I'm unclear.
> 
> IIUC at the moment KVM never uses huge pages if any part of the huge page is
> tracked.

To be more precise - I think it's per-memslot.  Say, if the memslot is
dirty tracked, then no huge pages are used on the host for that memslot
(even if the guest uses huge pages over it).

> But if all parts of the page are written to then huge page
> is used.

I'm not sure of this... I think it's still in 4K granularity.

> 
> In this situation the whole huge page is dirty and needs to be migrated.

Note that in QEMU we always migrate pages in 4K for x86, iiuc (please
refer to ram_save_host_page() in QEMU).

> 
> > PS. that seems to be another topic after all besides the dirty ring
> > series because we need to change our policy first if we want to track
> > it with huge pages; with that, for dirty ring we can start to leverage
> > the kvm_dirty_gfn.pad to store the page size with another new kvm cap
> > when we really want.
> > 
> > Thanks,
> 
> Seems like leaking implementation detail to UAPI to me.

I'd say it's not the only place we have an assumption at least (please
also refer to uffd_msg.pagefault.address).  IMHO it's not something
wrong because interfaces can be extended, but I am open to extending
kvm_dirty_gfn to cover a length/size or make the pad larger (as long
as Paolo is fine with this).

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v3 12/21] KVM: X86: Implement ring-based dirty memory tracking
  2020-01-09 19:35       ` Michael S. Tsirkin
@ 2020-01-09 20:19         ` Peter Xu
  2020-01-09 22:18           ` Michael S. Tsirkin
  2020-01-14 20:01         ` Peter Xu
  1 sibling, 1 reply; 82+ messages in thread
From: Peter Xu @ 2020-01-09 20:19 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: kvm, linux-kernel, Christophe de Dinechin, Paolo Bonzini,
	Sean Christopherson, Yan Zhao, Alex Williamson, Jason Wang,
	Kevin Kevin, Vitaly Kuznetsov, Dr . David Alan Gilbert, Lei Cao

On Thu, Jan 09, 2020 at 02:35:46PM -0500, Michael S. Tsirkin wrote:

[...]

> > > I know index design is popular, but testing with virtio showed
> > > that it's better to just have a flags field marking
> > > an entry as valid. In particular this gets rid of the
> > > running counters and power of two limitations.
> > > It also removes the need for a separate index page, which is nice.
> > 
> > Firstly, note that the separate index page has already been dropped
> > since V2, so we don't need to worry about that.
> 
> A changelog would be nice.

Actually I mentioned it in V2:

https://lore.kernel.org/kvm/20191221014938.58831-1-peterx@redhat.com/

There's a section "Per-vm ring is dropped".  But it's indeed hidden
behind the fact that the index page was bound to the per-vm ring...
I'll try to be clearer in the cover letter in the future.

> So now, how does userspace tell kvm it's done with the ring?

It rarely needs to, unless the ring reaches soft-full; in that case
the vcpu's KVM_RUN will return with KVM_EXIT_DIRTY_RING_FULL.  After
collecting entries, userspace publishes the new fetch_index and calls
the KVM_RESET_DIRTY_RINGS ioctl so KVM can reclaim the entries.
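
Roughly, the collection side looks like this (a userspace sketch
against this version's layout; the helper names and atomic builtins
are only illustrative, and error handling is omitted):

        /* per vcpu; "fetch" is a local free-running counter that
         * persists across calls, ring_size is a power of two */
        uint32_t avail = __atomic_load_n(&indices->avail_index,
                                         __ATOMIC_ACQUIRE);

        while (fetch != avail) {
                struct kvm_dirty_gfn *e =
                        &dirty_gfns[fetch & (ring_size - 1)];

                collect_dirty_page(e->slot, e->offset);
                fetch++;
        }
        /* publish how far we got, then let KVM reclaim the entries */
        __atomic_store_n(&indices->fetch_index, fetch, __ATOMIC_RELEASE);
        ioctl(vm_fd, KVM_RESET_DIRTY_RINGS, 0);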

> 
> > Regarding dropping the indices: I feel like it can be done, though we
> > probably need two extra bits for each GFN entry, for example:
> > 
> >   - Bit 0 of the GFN address to show whether this is a valid publish
> >     of dirty gfn
> > 
> >   - Bit 1 of the GFN address to show whether this is collected by the
> >     user
> 
> 
> I wonder whether you will end up reinventing virtio.
> You are already pretty close with avail/used bits in flags.
> 
> 
> 
> > We can also use the padding field, but I just want to show the idea
> > first.
> > 
> > Then for each GFN we can go through state changes like this (things
> > like "00b" stands for "bit1 bit0" values):
> > 
> >   00b (invalid GFN) ->
> >     01b (valid gfn published by kernel, which is dirty) ->
> >       10b (gfn dirty page collected by userspace) ->
> >         00b (gfn reset by kernel, so goes back to invalid gfn)
> > 
> > And we should always guarantee that both userspace and KVM walk
> > the GFN array in a linear manner; for example, KVM must publish a new
> > GFN with bit 0 set right after the previously published GFN.  The same
> > goes for userspace when it collects a dirty GFN and marks bit 1.
> > 
> > Michael, do you mean something like this?
> > 
> > I think it should work logically; however, IIUC it can expose more
> > security risks, say, dirty ring is different from virtio in that
> > userspace is not trusted,
> 
> In what sense?

In the sense of general syscalls?  Like, we shouldn't allow the kernel
to break and go wild no matter what the userspace does?

> 
> > while for virtio, both sides (hypervisor,
> > and the guest driver) are trusted.
> 
> What gave you the impression guest is trusted in virtio?

Hmm... maybe when I know virtio can bypass vIOMMU as long as it
doesn't provide IOMMU_PLATFORM flag? :)

I think it's logical to trust a virtio guest kernel driver, could you
guide me on what I've missed?

> 
> 
> >  Above means we need to do these to
> > change to the new design:
> > 
> >   - Allow the GFN array to be mapped as writable by userspace (so that
> >     userspace can publish bit 1),
> > 
> >   - The userspace must be trusted to follow the design (just imagine
> >     what happens if userspace overwrites a GFN when it publishes bit 1
> >     over a valid dirty gfn entry?  KVM could wrongly unprotect a page
> >     for the guest...).
> 
> You mean protect, right?  So what?

Yes, I mean with that, more things are uncertain from userspace.  It
seems easier to me that we restrict the userspace with one index.

> 
> > While if we use the indices, we restrict userspace to writing one
> > index only (which is the reset_index).  That's all it
> > can do to mess things up (and it never can, as long as we properly
> > validate the reset_index when it is read, which only happens during
> > KVM_RESET_DIRTY_RINGS and is very rare).  From that pov, it seems the
> > indices solution still has its benefits.
> 
> So if you mess up the index how is this different?

We can't mess up much with that.  We simply check fetch_index (sorry I
meant this when I said reset_index, anyway it's the only index that we
expose to userspace) to make sure:

  reset_index <= fetch_index <= dirty_index

Otherwise we fail the ioctl.  With that, we're 100% safe.
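
Spelled out (a sketch; the posted patch uses the simpler
"fetch - reset_index > size" bound, this form just encodes the
invariant directly):

        u32 fetch = READ_ONCE(indices->fetch_index);

        /* free-running u32 counters make this wraparound-safe */
        if (fetch - ring->reset_index >
            ring->dirty_index - ring->reset_index)
                return -EINVAL;         /* outside [reset, dirty] */

Everything else in the ring stays read-only to userspace.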

> 
> I agree RO page kind of feels safer generally though.
> 
> I will have to re-read how the ring works though;
> my comments were based on the old assumption of an mmapped
> page with indices.

Yes, sorry again for a bad cover letter.

It's basically the same as before, just that we only have a per-vcpu
ring now, and the indices are exposed from kvm_run so we don't need
the extra page, but we still expose them via mmap.
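
From the userspace side it ends up roughly as below (a sketch; the
exact kvm_run field name for the indices is assumed here):

        struct kvm_run *run = mmap(NULL, run_len,
                                   PROT_READ | PROT_WRITE,
                                   MAP_SHARED, vcpu_fd, 0);
        /* indices: run->vcpu_ring_indices.{avail,fetch}_index */

        struct kvm_dirty_gfn *gfns =
                mmap(NULL, ring_bytes, PROT_READ, MAP_SHARED, vcpu_fd,
                     KVM_DIRTY_LOG_PAGE_OFFSET * page_size);
        /* read-only: writable maps of the ring range are refused */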

> 
> 
> 
> > > 
> > > 
> > > 
> > > >  The larger the ring buffer, the less
> > > > +likely the ring is full and the VM is forced to exit to userspace. The
> > > > +optimal size depends on the workload, but it is recommended that it be
> > > > +at least 64 KiB (4096 entries).
> > > 
> > > Where's this number coming from? Given you have indices as well,
> > > 4K size rings is likely to cause cache contention.
> > 
> > I think we've had some similar discussion in previous versions on the
> > size of ring.  Again imho it's really something without a clear-cut
> > answer, as long as it's big enough (4K should be).
> > 
> > Regarding the cache contention: could you explain more?
> 
> 4K is a whole cache way; 64K is 16 ways.  If anything else is on a hot
> path then you are pushing everything out of cache.  I'd need to re-read
> how the indices work to see whether an index is on the hot path or not.
> If it is, your structure won't fit in L1 cache, which is not great.

I'm not sure whether I get the point correctly, but logically we
shouldn't read the whole ring buffer at once, only partly (just
like when we say the ring shouldn't even reach soft-full).  Even if we
read the whole ring, I don't see a difference here compared to when
we read a huge array of data (e.g. "char buf[65536]") in any program
that covers a 64K range - I don't see a good way to fix this other
than reading the whole chunk in.  It seems to be common in programs
with big datasets.

[...]

> > > > +int kvm_dirty_ring_reset(struct kvm *kvm, struct kvm_dirty_ring *ring)
> > > > +{
> > > > +	u32 cur_slot, next_slot;
> > > > +	u64 cur_offset, next_offset;
> > > > +	unsigned long mask;
> > > > +	u32 fetch;
> > > > +	int count = 0;
> > > > +	struct kvm_dirty_gfn *entry;
> > > > +	struct kvm_dirty_ring_indices *indices = ring->indices;
> > > > +	bool first_round = true;
> > > > +
> > > > +	fetch = READ_ONCE(indices->fetch_index);
> > > 
> > > So this does not work if the data cache is virtually tagged.
> > > Which to the best of my knowledge isn't the case on any
> > > CPU kvm supports. However it might not stay being the
> > > case forever. Worth at least commenting.
> > 
> > This is the read side.  IIUC even with virtually tagged archs, we
> > should do the flushing on the write side rather than the read side,
> > and that should be enough?
> 
> No.
> See e.g.  Documentation/core-api/cachetlb.rst
> 
>   ``void flush_dcache_page(struct page *page)``
> 
>         Any time the kernel writes to a page cache page, _OR_
>         the kernel is about to read from a page cache page and
>         user space shared/writable mappings of this page potentially
>         exist, this routine is called.

But I don't understand why.  I feel like on such an arch even
userspace must flush the cache after publishing data onto shared
memory, otherwise if the shared memory is between two userspace
processes they'll get inconsistent state.  With that, I'm confused
about why the read side needs to flush it again.
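
(If we ever do need it, I suppose the rule you quoted would translate
to something like this on the read side -- a sketch only, where
"indices_page" is assumed to be the struct page backing the shared
indices:)

        /* Userspace may have written fetch_index through its own,
         * differently-tagged mapping; flush before reading.  This is
         * a no-op on every arch KVM supports today. */
        flush_dcache_page(indices_page);
        fetch = READ_ONCE(indices->fetch_index);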

> 
> 
> > Also, I believe this is the similar question that Jason has asked in
> > V2.  Sorry I should mention this earlier, but I didn't address that in
> > this series because if we need to do so we probably need to do it
> > kvm-wise, rather than only in this series.
> 
> You need to document these things.
> 
> >  I feel like it's missing
> > probably only because all existing KVM-supported archs do not have
> > virtually tagged caches, as you mentioned.
> 
> But is that a fact? ARM has such a variety of CPUs,
> I can't really tell. Did you research this to make sure?

I didn't.  I only tried to find all callers of flush_dcache_page()
through the whole Linux tree and I cannot see any kvm related code.
To make this simple, let me address the dcache flushing issue in the
next post.

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v3 00/21] KVM: Dirty ring interface
  2020-01-09 19:39           ` Peter Xu
@ 2020-01-09 20:42             ` Paolo Bonzini
  2020-01-09 22:28             ` Michael S. Tsirkin
  1 sibling, 0 replies; 82+ messages in thread
From: Paolo Bonzini @ 2020-01-09 20:42 UTC (permalink / raw)
  To: Peter Xu, Michael S. Tsirkin
  Cc: kvm, linux-kernel, Christophe de Dinechin, Sean Christopherson,
	Yan Zhao, Alex Williamson, Jason Wang, Kevin Kevin,
	Vitaly Kuznetsov, Dr . David Alan Gilbert

On 09/01/20 20:39, Peter Xu wrote:
>>
>> IIUC at the moment KVM never uses huge pages if any part of the huge page is
>> tracked.
>
> To be more precise - I think it's per-memslot.  Say, if the memslot is
> dirty tracked, then no huge page on the host on that memslot (even if
> guest used huge page over that).
> 
>> But if all parts of the page are written to then huge page
>> is used.
>
> I'm not sure of this... I think it's still in 4K granularity.

Right.  Dirty tracking always uses 4K page size.

Paolo


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v3 00/21] KVM: Dirty ring interface
  2020-01-09 19:13     ` Michael S. Tsirkin
  2020-01-09 19:23       ` Peter Xu
@ 2020-01-09 20:51       ` Paolo Bonzini
  2020-01-09 22:21         ` Michael S. Tsirkin
  1 sibling, 1 reply; 82+ messages in thread
From: Paolo Bonzini @ 2020-01-09 20:51 UTC (permalink / raw)
  To: Michael S. Tsirkin, Peter Xu
  Cc: Alex Williamson, kvm, linux-kernel, Christophe de Dinechin,
	Sean Christopherson, Yan Zhao, Jason Wang, Kevin Kevin,
	Vitaly Kuznetsov, Dr . David Alan Gilbert, Kirti Wankhede

On 09/01/20 20:13, Michael S. Tsirkin wrote:
> That's one of the reasons I called for using something
> resembling vring_packed_desc.

In principle it could make sense to use the ring-wrap detection
mechanism from vring_packed_desc instead of the producer/consumer
indices.  However, the element address/length indirection is unnecessary.

Also, unlike virtio, KVM needs to know if there are N free entries (N is
~512) before running a guest.  I'm not sure if that is possible with
ring-wrap counters, while it's trivial with producer/consumer indices.
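
For the record, with the indices that check is a one-liner (a sketch;
the helper name is hypothetical, the fields are as in the posted
patch):

static bool kvm_dirty_ring_has_room(struct kvm_dirty_ring *ring, u32 n)
{
        /* free-running u32 counters make this wraparound-safe */
        return ring->size - (ring->dirty_index - ring->reset_index) >= n;
}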

Paolo


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v3 12/21] KVM: X86: Implement ring-based dirty memory tracking
  2020-01-09 20:19         ` Peter Xu
@ 2020-01-09 22:18           ` Michael S. Tsirkin
  2020-01-10 15:29             ` Peter Xu
  0 siblings, 1 reply; 82+ messages in thread
From: Michael S. Tsirkin @ 2020-01-09 22:18 UTC (permalink / raw)
  To: Peter Xu
  Cc: kvm, linux-kernel, Christophe de Dinechin, Paolo Bonzini,
	Sean Christopherson, Yan Zhao, Alex Williamson, Jason Wang,
	Kevin Kevin, Vitaly Kuznetsov, Dr . David Alan Gilbert, Lei Cao

On Thu, Jan 09, 2020 at 03:19:16PM -0500, Peter Xu wrote:
> > > while for virtio, both sides (hypervisor,
> > > and the guest driver) are trusted.
> > 
> > What gave you the impression guest is trusted in virtio?
> 
> Hmm... maybe when I know virtio can bypass vIOMMU as long as it
> doesn't provide IOMMU_PLATFORM flag? :)

If guest driver does not provide IOMMU_PLATFORM, and device does,
then negotiation fails.

> I think it's logical to trust a virtio guest kernel driver, could you
> guide me on what I've missed?


The guest driver is assumed to be part of the guest kernel.  It can't
do anything the kernel can't do anyway.

> > 
> > 
> > >  Above means we need to do these to
> > > change to the new design:
> > > 
> > >   - Allow the GFN array to be mapped as writable by userspace (so that
> > >     userspace can publish bit 1),
> > > 
> > >   - The userspace must be trusted to follow the design (just imagine
> > >     what happens if userspace overwrites a GFN when it publishes bit 1
> > >     over a valid dirty gfn entry?  KVM could wrongly unprotect a page
> > >     for the guest...).
> > 
> > You mean protect, right?  So what?
> 
> Yes, I mean with that, more things are uncertain from userspace.  It
> seems easier to me that we restrict the userspace with one index.

Dunno how to treat vague statements like this.  You need to be specific
with threat models. Otherwise there's no way to tell whether code is
secure.

> > 
> > > While if we use the indices, we restrict userspace to writing one
> > > index only (which is the reset_index).  That's all it
> > > can do to mess things up (and it never can, as long as we properly
> > > validate the reset_index when it is read, which only happens during
> > > KVM_RESET_DIRTY_RINGS and is very rare).  From that pov, it seems the
> > > indices solution still has its benefits.
> > 
> > So if you mess up the index how is this different?
> 
> We can't mess up much with that.  We simply check fetch_index (sorry I
> meant this when I said reset_index, anyway it's the only index that we
> expose to userspace) to make sure:
> 
>   reset_index <= fetch_index <= dirty_index
> 
> Otherwise we fail the ioctl.  With that, we're 100% safe.

Safe from what?  Userspace can mess up guest memory trivially,
for example skip sending some memory or send junk.

> > 
> > I agree RO page kind of feels safer generally though.
> > 
> > I will have to re-read how the ring works though;
> > my comments were based on the old assumption of an mmapped
> > page with indices.
> 
> Yes, sorry again for a bad cover letter.
> 
> > It's basically the same as before, just that we only have a per-vcpu
> > ring now, and the indices are exposed from kvm_run so we don't need
> > the extra page, but we still expose them via mmap.

So that's why changelogs are useful.
Can you please write a changelog for this version so I don't
need to re-read all of it? Thanks!

> > 
> > 
> > 
> > > > 
> > > > 
> > > > 
> > > > >  The larger the ring buffer, the less
> > > > > +likely the ring is full and the VM is forced to exit to userspace. The
> > > > > +optimal size depends on the workload, but it is recommended that it be
> > > > > +at least 64 KiB (4096 entries).
> > > > 
> > > > Where's this number coming from? Given you have indices as well,
> > > > 4K size rings is likely to cause cache contention.
> > > 
> > > I think we've had some similar discussion in previous versions on the
> > > size of ring.  Again imho it's really something without a clear-cut
> > > answer, as long as it's big enough (4K should be).
> > > 
> > > Regarding the cache contention: could you explain more?
> > 
> > 4K is a whole cache way; 64K is 16 ways.  If anything else is on a hot
> > path then you are pushing everything out of cache.  I'd need to re-read
> > how the indices work to see whether an index is on the hot path or not.
> > If it is, your structure won't fit in L1 cache, which is not great.
> 
> I'm not sure whether I get the point correctly, but logically we
> shouldn't read the whole ring buffer at once, only partly (just
> like when we say the ring shouldn't even reach soft-full).  Even if we
> read the whole ring, I don't see a difference here compared to when
> we read a huge array of data (e.g. "char buf[65536]") in any program
> that covers a 64K range - I don't see a good way to fix this other
> than reading the whole chunk in.  It seems to be common in programs
> with big datasets.
> 
> [...]
> 
> > > > > +int kvm_dirty_ring_reset(struct kvm *kvm, struct kvm_dirty_ring *ring)
> > > > > +{
> > > > > +	u32 cur_slot, next_slot;
> > > > > +	u64 cur_offset, next_offset;
> > > > > +	unsigned long mask;
> > > > > +	u32 fetch;
> > > > > +	int count = 0;
> > > > > +	struct kvm_dirty_gfn *entry;
> > > > > +	struct kvm_dirty_ring_indices *indices = ring->indices;
> > > > > +	bool first_round = true;
> > > > > +
> > > > > +	fetch = READ_ONCE(indices->fetch_index);
> > > > 
> > > > So this does not work if the data cache is virtually tagged.
> > > > Which to the best of my knowledge isn't the case on any
> > > > CPU kvm supports. However it might not stay being the
> > > > case forever. Worth at least commenting.
> > > 
> > > This is the read side.  IIUC even with virtually tagged archs, we
> > > should do the flushing on the write side rather than the read side,
> > > and that should be enough?
> > 
> > No.
> > See e.g.  Documentation/core-api/cachetlb.rst
> > 
> >   ``void flush_dcache_page(struct page *page)``
> > 
> >         Any time the kernel writes to a page cache page, _OR_
> >         the kernel is about to read from a page cache page and
> >         user space shared/writable mappings of this page potentially
> >         exist, this routine is called.
> 
> But I don't understand why.  I feel like on such an arch even
> userspace must flush the cache after publishing data onto shared
> memory, otherwise if the shared memory is between two userspace
> processes they'll get inconsistent state.  With that, I'm confused
> about why the read side needs to flush it again.
> 
> > 
> > 
> > > Also, I believe this is the similar question that Jason has asked in
> > > V2.  Sorry I should mention this earlier, but I didn't address that in
> > > this series because if we need to do so we probably need to do it
> > > kvm-wise, rather than only in this series.
> > 
> > You need to document these things.
> > 
> > >  I feel like it's missing
> > > probably only because all existing KVM-supported archs do not have
> > > virtually tagged caches, as you mentioned.
> > 
> > But is that a fact? ARM has such a variety of CPUs,
> > I can't really tell. Did you research this to make sure?
> 
> I didn't.  I only tried to find all callers of flush_dcache_page()
> through the whole Linux tree and I cannot see any kvm related code.
> To make this simple, let me address the dcache flushing issue in the
> next post.
> 
> Thanks,
> 
> -- 
> Peter Xu


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v3 00/21] KVM: Dirty ring interface
  2020-01-09 20:51       ` Paolo Bonzini
@ 2020-01-09 22:21         ` Michael S. Tsirkin
  0 siblings, 0 replies; 82+ messages in thread
From: Michael S. Tsirkin @ 2020-01-09 22:21 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Peter Xu, Alex Williamson, kvm, linux-kernel,
	Christophe de Dinechin, Sean Christopherson, Yan Zhao,
	Jason Wang, Kevin Kevin, Vitaly Kuznetsov,
	Dr . David Alan Gilbert, Kirti Wankhede

On Thu, Jan 09, 2020 at 09:51:50PM +0100, Paolo Bonzini wrote:
> On 09/01/20 20:13, Michael S. Tsirkin wrote:
> > That's one of the reasons I called for using something
> > resembling vring_packed_desc.
> 
> In principle it could make sense to use the ring-wrap detection
> mechanism from vring_packed_desc instead of the producer/consumer
> indices.  However, the element address/length indirection is unnecessary.
> 
> Also, unlike virtio, KVM needs to know if there are N free entries (N is
> ~512) before running a guest.  I'm not sure if that is possible with
> ring-wrap counters, while it's trivial with producer/consumer indices.
> 
> Paolo

Yes it's easy: just check whether the current entry + 500 has been
consumed.  Unless scatter/gather is used, but then the answer is
simple - just don't use it :)
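
I.e. something like this (a sketch with hypothetical names, using
per-entry state instead of indices):

static bool ring_has_room(struct ring *r, u32 n)
{
        /* n entries are free iff the slot n-1 ahead of the producer
         * has already been consumed and reset back to "invalid" */
        struct entry *e = &r->entries[(r->prod + n - 1) % r->size];

        return READ_ONCE(e->flags) == ENTRY_INVALID;
}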

-- 
MST


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v3 00/21] KVM: Dirty ring interface
  2020-01-09 19:39           ` Peter Xu
  2020-01-09 20:42             ` Paolo Bonzini
@ 2020-01-09 22:28             ` Michael S. Tsirkin
  2020-01-10 15:10               ` Peter Xu
  1 sibling, 1 reply; 82+ messages in thread
From: Michael S. Tsirkin @ 2020-01-09 22:28 UTC (permalink / raw)
  To: Peter Xu
  Cc: kvm, linux-kernel, Christophe de Dinechin, Paolo Bonzini,
	Sean Christopherson, Yan Zhao, Alex Williamson, Jason Wang,
	Kevin Kevin, Vitaly Kuznetsov, Dr . David Alan Gilbert

On Thu, Jan 09, 2020 at 02:39:49PM -0500, Peter Xu wrote:
> On Thu, Jan 09, 2020 at 02:08:52PM -0500, Michael S. Tsirkin wrote:
> > On Thu, Jan 09, 2020 at 12:08:49PM -0500, Peter Xu wrote:
> > > On Thu, Jan 09, 2020 at 11:40:23AM -0500, Michael S. Tsirkin wrote:
> > > 
> > > [...]
> > > 
> > > > > > I know it's mostly relevant for huge VMs, but OTOH these
> > > > > > probably use huge pages.
> > > > > 
> > > > > Yes huge VMs could benefit more, especially if the dirty rate is not
> > > > > that high, I believe.  Though, could you elaborate on why huge pages
> > > > > are special here?
> > > > > 
> > > > > Thanks,
> > > > 
> > > > > With hugetlbfs there are fewer bits to test: e.g. with 2M pages a single
> > > > bit set marks 512 pages as dirty.  We do not take advantage of this
> > > > but it looks like a rather obvious optimization.
> > > 
> > > Right, but isn't that the trade-off between granularity of dirty
> > > tracking and how easy it is to collect the dirty bits?  Say, it'll be
> > > nearly impossible to migrate 1G-huge-page-backed guests if we track
> > > dirty bits using huge page granularity, since each touch of guest
> > > memory will cause another 1G of memory to be transferred even if most
> > > of the content is the same.  2M can be somewhere in the middle, but
> > > still the same write amplification issue exists.
> > >
> > 
> > OK I see I'm unclear.
> > 
> > IIUC at the moment KVM never uses huge pages if any part of the huge page is
> > tracked.
> 
> To be more precise - I think it's per-memslot.  Say, if the memslot is
> dirty tracked, then no huge pages are used on the host for that memslot
> (even if the guest uses huge pages over it).

Yea ... so does it make sense to make this implementation detail
leak through UAPI?

> > But if all parts of the page are written to then huge page
> > is used.
> 
> I'm not sure of this... I think it's still in 4K granularity.
> 
> > 
> > In this situation the whole huge page is dirty and needs to be migrated.
> 
> Note that in QEMU we always migrate pages in 4K for x86, iiuc (please
> refer to ram_save_host_page() in QEMU).
> 
> > 
> > > PS. that seems to be another topic after all besides the dirty ring
> > > series because we need to change our policy first if we want to track
> > > it with huge pages; with that, for dirty ring we can start to leverage
> > > the kvm_dirty_gfn.pad to store the page size with another new kvm cap
> > > when we really want.
> > > 
> > > Thanks,
> > 
> > Seems like leaking implementation detail to UAPI to me.
> 
> I'd say it's not the only place we have an assumption at least (please
> also refer to uffd_msg.pagefault.address).  IMHO it's not something
> wrong because interfaces can be extended, but I am open to extending
> kvm_dirty_gfn to cover a length/size or make the pad larger (as long
> as Paolo is fine with this).
> 
> Thanks,
> 
> -- 
> Peter Xu


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v3 00/21] KVM: Dirty ring interface
  2020-01-09 22:28             ` Michael S. Tsirkin
@ 2020-01-10 15:10               ` Peter Xu
  0 siblings, 0 replies; 82+ messages in thread
From: Peter Xu @ 2020-01-10 15:10 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: kvm, linux-kernel, Christophe de Dinechin, Paolo Bonzini,
	Sean Christopherson, Yan Zhao, Alex Williamson, Jason Wang,
	Kevin Kevin, Vitaly Kuznetsov, Dr . David Alan Gilbert

On Thu, Jan 09, 2020 at 05:28:36PM -0500, Michael S. Tsirkin wrote:
> On Thu, Jan 09, 2020 at 02:39:49PM -0500, Peter Xu wrote:
> > On Thu, Jan 09, 2020 at 02:08:52PM -0500, Michael S. Tsirkin wrote:
> > > On Thu, Jan 09, 2020 at 12:08:49PM -0500, Peter Xu wrote:
> > > > On Thu, Jan 09, 2020 at 11:40:23AM -0500, Michael S. Tsirkin wrote:
> > > > 
> > > > [...]
> > > > 
> > > > > > > I know it's mostly relevant for huge VMs, but OTOH these
> > > > > > > probably use huge pages.
> > > > > > 
> > > > > > Yes huge VMs could benefit more, especially if the dirty rate is not
> > > > > > that high, I believe.  Though, could you elaborate on why huge pages
> > > > > > are special here?
> > > > > > 
> > > > > > Thanks,
> > > > > 
> > > > > With hugetlbfs there are fewer bits to test: e.g. with 2M pages a single
> > > > > bit set marks 512 pages as dirty.  We do not take advantage of this
> > > > > but it looks like a rather obvious optimization.
> > > > 
> > > > Right, but isn't that the trade-off between granularity of dirty
> > > > tracking and how easy it is to collect the dirty bits?  Say, it'll be
> > > > nearly impossible to migrate 1G-huge-page-backed guests if we track
> > > > dirty bits using huge page granularity, since each touch of guest
> > > > memory will cause another 1G of memory to be transferred even if most
> > > > of the content is the same.  2M can be somewhere in the middle, but
> > > > still the same write amplification issue exists.
> > > >
> > > 
> > > OK I see I'm unclear.
> > > 
> > > IIUC at the moment KVM never uses huge pages if any part of the huge page is
> > > tracked.
> > 
> > To be more precise - I think it's per-memslot.  Say, if the memslot is
> > dirty tracked, then no huge pages are used on the host for that memslot
> > (even if the guest uses huge pages over it).
> 
> Yea ... so does it make sense to make this implementation detail
> leak through UAPI?

I think that's not a leak of internal implementation detail; we just
define the interface such that the address for each kvm_dirty_gfn is
always host-page aligned (by default that means no huge pages) and
points to a single host page, that's all.  The host page size is
always visible to userspace after all, so imho it's fine.  Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v3 12/21] KVM: X86: Implement ring-based dirty memory tracking
  2020-01-09 22:18           ` Michael S. Tsirkin
@ 2020-01-10 15:29             ` Peter Xu
  2020-01-12  6:24               ` Michael S. Tsirkin
  0 siblings, 1 reply; 82+ messages in thread
From: Peter Xu @ 2020-01-10 15:29 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: kvm, linux-kernel, Christophe de Dinechin, Paolo Bonzini,
	Sean Christopherson, Yan Zhao, Alex Williamson, Jason Wang,
	Kevin Kevin, Vitaly Kuznetsov, Dr . David Alan Gilbert, Lei Cao

On Thu, Jan 09, 2020 at 05:18:24PM -0500, Michael S. Tsirkin wrote:
> On Thu, Jan 09, 2020 at 03:19:16PM -0500, Peter Xu wrote:
> > > > while for virtio, both sides (hypervisor,
> > > > and the guest driver) are trusted.
> > > 
> > > What gave you the impression guest is trusted in virtio?
> > 
> > Hmm... maybe when I know virtio can bypass vIOMMU as long as it
> > doesn't provide IOMMU_PLATFORM flag? :)
> 
> If guest driver does not provide IOMMU_PLATFORM, and device does,
> then negotiation fails.

I mean it's still possible to specify "!IOMMU_PLATFORM" for the virtio
device even if vIOMMU is enabled in the guest (rather than via the
negotiation procedure).  Again I think it's fair, for the same
reason that we tend to make "iommu=pt" the default for all the
kernel drivers: we should trust all the drivers as much as the kernel
itself.  The only thing we want to protect using vIOMMU is the
userspace driver because we do have a line between the userspace and
the kernel, and IMHO it's the same thing here for the new kvm
interface.

> 
> > I think it's logical to trust a virtio guest kernel driver, could you
> > guide me on what I've missed?
> 
> 
> The guest driver is assumed to be part of the guest kernel.  It can't
> do anything the kernel can't do anyway.

Right, I think all things belonging to the kernel have the same
level of trust.  However again, userspace should be treated
differently, and that's why I tend to prefer the index solution where
we expose less for userspace to write (reads are far safer compared to
writes from userspace).

> 
> > > 
> > > 
> > > >  Above means we need to do these to
> > > > change to the new design:
> > > > 
> > > >   - Allow the GFN array to be mapped as writable by userspace (so that
> > > >     userspace can publish bit 1),
> > > > 
> > > >   - The userspace must be trusted to follow the design (just imagine
> > > >     what happens if userspace overwrites a GFN when it publishes bit 1
> > > >     over a valid dirty gfn entry?  KVM could wrongly unprotect a page
> > > >     for the guest...).
> > > 
> > > You mean protect, right?  So what?
> > 
> > Yes, I mean with that, more things are uncertain from userspace.  It
> > seems easier to me that we restrict the userspace with one index.
> 
> Dunno how to treat vague statements like this.  You need to be specific
> with threat models. Otherwise there's no way to tell whether code is
> secure.
> 
> > > 
> > > > While if we use the indices, we restrict userspace to writing one
> > > > index only (which is the reset_index).  That's all it
> > > > can do to mess things up (and it never can, as long as we properly
> > > > validate the reset_index when it is read, which only happens during
> > > > KVM_RESET_DIRTY_RINGS and is very rare).  From that pov, it seems the
> > > > indices solution still has its benefits.
> > > 
> > > So if you mess up index how is this different?
> > 
> > We can't mess up much with that.  We simply check fetch_index (sorry I
> > meant this when I said reset_index, anyway it's the only index that we
> > expose to userspace) to make sure:
> > 
> >   reset_index <= fetch_index <= dirty_index
> > 
> > Otherwise we fail the ioctl.  With that, we're 100% safe.
> 
> Safe from what?  Userspace can mess up guest memory trivially,
> for example skip sending some memory or send junk.

Yes, QEMU can mess the guest up, but it should never mess the host up,
am I right?  Regarding QEMU as userspace, KVM should see it as
untrusted too from the host's point of view.  However guest security
is another thing, imho.

> 
> > > 
> > > I agree RO page kind of feels safer generally though.
> > > 
> > > I will have to re-read how the ring works though;
> > > my comments were based on the old assumption of an mmapped
> > > page with indices.
> > 
> > Yes, sorry again for a bad cover letter.
> > 
> > > It's basically the same as before, just that we only have a per-vcpu
> > > ring now, and the indices are exposed from kvm_run so we don't need
> > > the extra page, but we still expose them via mmap.
> 
> So that's why changelogs are useful.
> Can you please write a changelog for this version so I don't
> need to re-read all of it? Thanks!

Sure, actually I've got a changelog in the cover letter for this
version [1]... it's:

V3 changelog:

- fail userspace writable maps on dirty ring ranges [Jason]
- commit message fixups [Paolo]
- change __x86_set_memory_region to return hva [Paolo]
- cacheline align for indices [Paolo, Jason]
- drop waitqueue, global lock, etc., include kvmgt rework patchset
- take lock for __x86_set_memory_region() (otherwise it triggers a
  lockdep in latest kvm/queue) [Paolo]
- check KVM_DIRTY_LOG_PAGE_OFFSET in kvm_vm_ioctl_enable_dirty_log_ring
- one more patch to drop x86_set_memory_region [Paolo]
- one more patch to remove extra srcu usage in init_rmode_identity_map()
- add some r-bs for Paolo

I didn't have a detailed changelog for v2 because it could be a long
list of trivial details which can hide the major things, but I've
got a small write-up in the cover letter trying to mention the major
changes [2].

Again, I'm very sorry for either missing a complete changelog in v2,
or the high-level overview of v3 in the cover letter.  I'll make it
better in v4.

Thanks,

[1] https://lore.kernel.org/kvm/20200109145729.32898-1-peterx@redhat.com/
[2] https://lore.kernel.org/kvm/20191220211634.51231-1-peterx@redhat.com/

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v3 12/21] KVM: X86: Implement ring-based dirty memory tracking
  2020-01-09 14:57 ` [PATCH v3 12/21] KVM: X86: Implement ring-based dirty memory tracking Peter Xu
  2020-01-09 16:29   ` Michael S. Tsirkin
@ 2020-01-11  4:49   ` kbuild test robot
  2020-01-11 23:19   ` kbuild test robot
                     ` (2 subsequent siblings)
  4 siblings, 0 replies; 82+ messages in thread
From: kbuild test robot @ 2020-01-11  4:49 UTC (permalink / raw)
  To: Peter Xu
  Cc: kbuild-all, kvm, linux-kernel, Christophe de Dinechin,
	Michael S . Tsirkin, Paolo Bonzini, Sean Christopherson,
	Yan Zhao, Alex Williamson, Jason Wang, Kevin Kevin,
	Vitaly Kuznetsov, peterx, Dr . David Alan Gilbert, Lei Cao

[-- Attachment #1: Type: text/plain, Size: 2144 bytes --]

Hi Peter,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on kvm/linux-next]
[also build test ERROR on next-20200110]
[cannot apply to kvmarm/next vfio/next v5.5-rc5]
[if your patch is applied to the wrong git tree, please drop us a note to help
improve the system. BTW, we also suggest to use '--base' option to specify the
base tree in git format-patch, please see https://stackoverflow.com/a/37406982]

url:    https://github.com/0day-ci/linux/commits/Peter-Xu/KVM-Dirty-ring-interface/20200110-152053
base:   https://git.kernel.org/pub/scm/virt/kvm/kvm.git linux-next
config: s390-alldefconfig (attached as .config)
compiler: s390-linux-gcc (GCC) 7.5.0
reproduce:
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        GCC_VERSION=7.5.0 make.cross ARCH=s390 

If you fix the issue, kindly add following tag
Reported-by: kbuild test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

   arch/s390/../../virt/kvm/kvm_main.o: In function `mark_page_dirty_in_slot':
>> kvm_main.c:(.text+0x4d6): undefined reference to `kvm_dirty_ring_get'
>> kvm_main.c:(.text+0x4f0): undefined reference to `kvm_dirty_ring_push'
   arch/s390/../../virt/kvm/kvm_main.o: In function `kvm_vcpu_init':
>> kvm_main.c:(.text+0x1fe6): undefined reference to `kvm_dirty_ring_alloc'
>> kvm_main.c:(.text+0x204c): undefined reference to `kvm_dirty_ring_free'
   arch/s390/../../virt/kvm/kvm_main.o: In function `kvm_vcpu_uninit':
   kvm_main.c:(.text+0x20c0): undefined reference to `kvm_dirty_ring_free'
   arch/s390/../../virt/kvm/kvm_main.o: In function `kvm_reset_dirty_gfn':
>> kvm_main.c:(.text+0x6650): undefined reference to `kvm_arch_mmu_enable_log_dirty_pt_masked'
   arch/s390/../../virt/kvm/kvm_main.o: In function `kvm_vm_ioctl':
>> kvm_main.c:(.text+0x6b58): undefined reference to `kvm_dirty_ring_reset'

---
0-DAY kernel test infrastructure                 Open Source Technology Center
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 7828 bytes --]

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v3 12/21] KVM: X86: Implement ring-based dirty memory tracking
  2020-01-09 14:57 ` [PATCH v3 12/21] KVM: X86: Implement ring-based dirty memory tracking Peter Xu
  2020-01-09 16:29   ` Michael S. Tsirkin
  2020-01-11  4:49   ` kbuild test robot
@ 2020-01-11 23:19   ` kbuild test robot
  2020-01-15  6:47   ` Michael S. Tsirkin
  2020-01-16  8:38   ` Michael S. Tsirkin
  4 siblings, 0 replies; 82+ messages in thread
From: kbuild test robot @ 2020-01-11 23:19 UTC (permalink / raw)
  To: Peter Xu
  Cc: kbuild-all, kvm, linux-kernel, Christophe de Dinechin,
	Michael S . Tsirkin, Paolo Bonzini, Sean Christopherson,
	Yan Zhao, Alex Williamson, Jason Wang, Kevin Kevin,
	Vitaly Kuznetsov, peterx, Dr . David Alan Gilbert, Lei Cao

[-- Attachment #1: Type: text/plain, Size: 1687 bytes --]

Hi Peter,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on kvm/linux-next]
[also build test ERROR on next-20200110]
[cannot apply to kvmarm/next vfio/next v5.5-rc5]
[if your patch is applied to the wrong git tree, please drop us a note to help
improve the system. BTW, we also suggest using the '--base' option to specify
the base tree in git format-patch; please see https://stackoverflow.com/a/37406982]

url:    https://github.com/0day-ci/linux/commits/Peter-Xu/KVM-Dirty-ring-interface/20200110-152053
base:   https://git.kernel.org/pub/scm/virt/kvm/kvm.git linux-next
config: powerpc-defconfig (attached as .config)
compiler: powerpc64-linux-gcc (GCC) 7.5.0
reproduce:
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        GCC_VERSION=7.5.0 make.cross ARCH=powerpc 

If you fix the issue, kindly add the following tag
Reported-by: kbuild test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

>> ERROR: ".kvm_arch_mmu_enable_log_dirty_pt_masked" [arch/powerpc/kvm/kvm.ko] undefined!
>> ERROR: ".kvm_dirty_ring_push" [arch/powerpc/kvm/kvm.ko] undefined!
>> ERROR: ".kvm_dirty_ring_free" [arch/powerpc/kvm/kvm.ko] undefined!
>> ERROR: ".kvm_dirty_ring_get" [arch/powerpc/kvm/kvm.ko] undefined!
>> ERROR: ".kvm_dirty_ring_reset" [arch/powerpc/kvm/kvm.ko] undefined!
>> ERROR: ".kvm_dirty_ring_alloc" [arch/powerpc/kvm/kvm.ko] undefined!

---
0-DAY kernel test infrastructure                 Open Source Technology Center
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 25639 bytes --]

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v3 12/21] KVM: X86: Implement ring-based dirty memory tracking
  2020-01-10 15:29             ` Peter Xu
@ 2020-01-12  6:24               ` Michael S. Tsirkin
  0 siblings, 0 replies; 82+ messages in thread
From: Michael S. Tsirkin @ 2020-01-12  6:24 UTC (permalink / raw)
  To: Peter Xu
  Cc: kvm, linux-kernel, Christophe de Dinechin, Paolo Bonzini,
	Sean Christopherson, Yan Zhao, Alex Williamson, Jason Wang,
	Kevin Kevin, Vitaly Kuznetsov, Dr . David Alan Gilbert, Lei Cao

On Fri, Jan 10, 2020 at 10:29:59AM -0500, Peter Xu wrote:
> On Thu, Jan 09, 2020 at 05:18:24PM -0500, Michael S. Tsirkin wrote:
> > On Thu, Jan 09, 2020 at 03:19:16PM -0500, Peter Xu wrote:
> > > > > while for virtio, both sides (hypervisor,
> > > > > and the guest driver) are trusted.
> > > > 
> > > > What gave you the impression guest is trusted in virtio?
> > > 
> > > Hmm... maybe when I know virtio can bypass vIOMMU as long as it
> > > doesn't provide IOMMU_PLATFORM flag? :)
> > 
> > If guest driver does not provide IOMMU_PLATFORM, and device does,
> > then negotiation fails.
> 
> I mean it's still possible to specify "!IOMMU_PLATFORM" for the virtio
> device even if vIOMMU is enabled in the guest (rather than via the
> negotiation procedures).  Again I think it's fair, for the same
> reason that we tend to make "iommu=pt" the default for all the
> kernel drivers: we should trust the drivers as much as the kernel
> itself.  The only thing we want to protect with vIOMMU is the
> userspace driver, because we do have a line between userspace and
> the kernel, and IMHO it's the same thing here for the new kvm
> interface.
> 
> > 
> > > I think it's logical to trust a virtio guest kernel driver, could you
> > > guide me on what I've missed?
> > 
> > 
> > guest driver is assumed to be part of guest kernel. It can't
> > do anything kernel can't do anyway.
> 
> Right, I think everything that belongs to the kernel has the same
> level of trust.  However, again, userspace should be treated
> differently, and that's why I tend to prefer the index solution,
> where we expose less to userspace for writing (reads are far safer
> compared to writes from userspace).

You are mixing up different kinds of userspace here. vIOMMU
protects the guest kernel from guest userspace.
Protecting the guest kernel against userspace hypervisors
(e.g. QEMU) is mostly futile.


> > 
> > > > 
> > > > 
> > > > >  Above means we need to do these to
> > > > > change to the new design:
> > > > > 
> > > > >   - Allow the GFN array to be mapped as writable by userspace (so that
> > > > >     userspace can publish bit 2),
> > > > > 
> > > > >   - The userspace must be trusted to follow the design (just imagine
> > > > >     what if the userspace overwrites a GFN when it publishes bit 2
> > > > >     over a valid dirty gfn entry?  KVM could wrongly unprotect a page
> > > > >     for the guest...).
> > > > 
> > > > You mean protect, right?  So what?
> > > 
> > > Yes, I mean with that, more things are uncertain from userspace.  It
> > > seems easier to me that we restrict the userspace with one index.
> > 
> > Dunno how to treat vague statements like this.  You need to be specific
> > with threat models. Otherwise there's no way to tell whether code is
> > secure.
> > 
> > > > 
> > > > > While if we use the indices, we restrict the userspace to only be able
> > > > > to write to one index only (which is the reset_index).  That's all it
> > > > > can do to mess things up (and it could never as long as we properly
> > > > > validate the reset_index when read, which only happens during
> > > > > KVM_RESET_DIRTY_RINGS and is very rare).  From that pov, it seems the
> > > > > indices solution still has its benefits.
> > > > 
> > > > So if you mess up index how is this different?
> > > 
> > > We can't mess up much with that.  We simply check fetch_index (sorry I
> > > meant this when I said reset_index, anyway it's the only index that we
> > > expose to userspace) to make sure:
> > > 
> > >   reset_index <= fetch_index <= dirty_index
> > > 
> > > Otherwise we fail the ioctl.  With that, we're 100% safe.
> > 
> > safe from what? userspace can mess up guest memory trivially.
> > for example skip sending some memory or send junk.
> 
> Yes, QEMU can mess the guest up, but it should never mess the host up,
> am I right?  Regarding QEMU as userspace, KVM should see it as
> untrusted as well from the host's perspective.  However guest security
> is another thing, imho.
> 
> > 
> > > > 
> > > > I agree RO page kind of feels safer generally though.
> > > > 
> > > > I will have to re-read how does the ring works though,
> > > > my comments were based on the old assumption of mmaped
> > > > page with indices.
> > > 
> > > Yes, sorry again for a bad cover letter.
> > > 
> > > It's basically the same as before, just that we only have per-vcpu
> > > ring now, and the indices are exposed from kvm_run so we don't need
> > > the extra page, but we still expose that via mmap.
> > 
> > So that's why changelogs are useful.
> > Can you please write a changelog for this version so I don't
> > need to re-read all of it? Thanks!
> 
> Sure, actually I've got a changelog in the cover letter for this
> version [1]... it's:
> 
> V3 changelog:
> 
> - fail userspace writable maps on dirty ring ranges [Jason]
> - commit message fixups [Paolo]
> - change __x86_set_memory_region to return hva [Paolo]
> - cacheline align for indices [Paolo, Jason]
> - drop waitqueue, global lock, etc., include kvmgt rework patchset
> - take lock for __x86_set_memory_region() (otherwise it triggers a
>   lockdep in latest kvm/queue) [Paolo]
> - check KVM_DIRTY_LOG_PAGE_OFFSET in kvm_vm_ioctl_enable_dirty_log_ring
> - one more patch to drop x86_set_memory_region [Paolo]
> - one more patch to remove extra srcu usage in init_rmode_identity_map()
> - add some r-bs for Paolo
> 
> I didn't have a detailed changelog for v2 because it could have been
> a long list of trivial details that hide the major things, but I've
> got a small write-up in the cover letter trying to mention the major
> changes [2].
> 
> Again, I'm very sorry for missing a complete changelog in v2, and a
> high-level overview of v3 in the cover letter.  I'll make it
> better in v4.
> 
> Thanks,
> 
> [1] https://lore.kernel.org/kvm/20200109145729.32898-1-peterx@redhat.com/
> [2] https://lore.kernel.org/kvm/20191220211634.51231-1-peterx@redhat.com/
> 
> -- 
> Peter Xu


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v3 12/21] KVM: X86: Implement ring-based dirty memory tracking
  2020-01-09 19:35       ` Michael S. Tsirkin
  2020-01-09 20:19         ` Peter Xu
@ 2020-01-14 20:01         ` Peter Xu
  2020-01-15  6:50           ` Michael S. Tsirkin
  1 sibling, 1 reply; 82+ messages in thread
From: Peter Xu @ 2020-01-14 20:01 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: kvm, linux-kernel, Christophe de Dinechin, Paolo Bonzini,
	Sean Christopherson, Yan Zhao, Alex Williamson, Jason Wang,
	Kevin Kevin, Vitaly Kuznetsov, Dr . David Alan Gilbert, Lei Cao,
	Andrew Jones

On Thu, Jan 09, 2020 at 02:35:46PM -0500, Michael S. Tsirkin wrote:
>   ``void flush_dcache_page(struct page *page)``
> 
>         Any time the kernel writes to a page cache page, _OR_
>         the kernel is about to read from a page cache page and
>         user space shared/writable mappings of this page potentially
>         exist, this routine is called.
> 
> 
> > Also, I believe this is a similar question to the one Jason asked
> > in V2.  Sorry, I should have mentioned this earlier, but I didn't
> > address that in
> > this series because if we need to do so we probably need to do it
> > kvm-wise, rather than only in this series.
> 
> You need to document these things.
> 
> >  I feel like it's missing
> > probably only because all existing KVM supported archs do not have
> > virtual-tagged caches as you mentioned.
> 
> But is that a fact? ARM has such a variety of CPUs,
> I can't really tell. Did you research this to make sure?
> 
> > If so, I would prefer if you
> > can allow me to ignore that issue until KVM starts to support such an
> > arch.
> 
> Document limitations pls.  Don't ignore them.

Hi, Michael,

I failed to find a good place to document flush_dcache_page()
for KVM.  Could you give me a suggestion?

And I don't know whether there are any ARM hosts that require
flush_dcache_page().  I think not, because again I haven't seen any
caller of flush_dcache_page() in KVM code yet.  Otherwise I think we
should at least call it before the kernel reads kvm_run, or after
publishing data to kvm_run.  However I'm also CCing Drew on this.
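
To make that concrete, here is a rough sketch (not part of this series,
and the helper names are made up) of how I'd expect such calls to look
on a virtually-tagged host, given that kvm_run is a single page today:

	/* Hypothetical sketch only: bracket kernel accesses to the
	 * kvm_run page with dcache flushes, so that the kernel and
	 * userspace mappings of the page stay coherent. */
	static void kvm_run_publish(struct kvm_vcpu *vcpu)
	{
		/* ... kernel writes vcpu->run fields here ... */
		flush_dcache_page(virt_to_page(vcpu->run));
	}

	static void kvm_run_peek(struct kvm_vcpu *vcpu)
	{
		flush_dcache_page(virt_to_page(vcpu->run));
		/* ... kernel reads vcpu->run fields here ... */
	}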

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v3 12/21] KVM: X86: Implement ring-based dirty memory tracking
  2020-01-09 14:57 ` [PATCH v3 12/21] KVM: X86: Implement ring-based dirty memory tracking Peter Xu
                     ` (2 preceding siblings ...)
  2020-01-11 23:19   ` kbuild test robot
@ 2020-01-15  6:47   ` Michael S. Tsirkin
  2020-01-15 15:27     ` Peter Xu
  2020-01-16  8:38   ` Michael S. Tsirkin
  4 siblings, 1 reply; 82+ messages in thread
From: Michael S. Tsirkin @ 2020-01-15  6:47 UTC (permalink / raw)
  To: Peter Xu
  Cc: kvm, linux-kernel, Christophe de Dinechin, Paolo Bonzini,
	Sean Christopherson, Yan Zhao, Alex Williamson, Jason Wang,
	Kevin Kevin, Vitaly Kuznetsov, Dr . David Alan Gilbert, Lei Cao

On Thu, Jan 09, 2020 at 09:57:20AM -0500, Peter Xu wrote:
> This patch is heavily based on previous work from Lei Cao
> <lei.cao@stratus.com> and Paolo Bonzini <pbonzini@redhat.com>. [1]
> 
> KVM currently uses large bitmaps to track dirty memory.  These bitmaps
> are copied to userspace when userspace queries KVM for its dirty page
> information.  The use of bitmaps is mostly sufficient for live
> migration, as large parts of memory are dirtied from one log-dirty
> pass to another.  However, in a checkpointing system, the number of
> dirty pages is small and in fact it is often bounded---the VM is
> paused when it has dirtied a pre-defined number of pages. Traversing a
> large, sparsely populated bitmap to find set bits is time-consuming,
> as is copying the bitmap to user-space.
> 
> A similar issue will be there for live migration when the guest memory
> is huge while the page dirty procedure is trivial.  In that case for
> each dirty sync we need to pull the whole dirty bitmap to userspace
> and analyse every bit even if it's mostly zeros.
> 
> The preferred data structure for above scenarios is a dense list of
> guest frame numbers (GFN).  This patch series stores the dirty list in
> kernel memory that can be memory mapped into userspace to allow speedy
> harvesting.
> 
> This patch enables the dirty ring for X86 only.  However it should be
> easy to extend it to other archs as well.
> 
> [1] https://patchwork.kernel.org/patch/10471409/
> 
> Signed-off-by: Lei Cao <lei.cao@stratus.com>
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  Documentation/virt/kvm/api.txt  |  89 ++++++++++++++++++
>  arch/x86/include/asm/kvm_host.h |   3 +
>  arch/x86/include/uapi/asm/kvm.h |   1 +
>  arch/x86/kvm/Makefile           |   3 +-
>  arch/x86/kvm/mmu/mmu.c          |   6 ++
>  arch/x86/kvm/vmx/vmx.c          |   7 ++
>  arch/x86/kvm/x86.c              |   9 ++
>  include/linux/kvm_dirty_ring.h  |  55 +++++++++++
>  include/linux/kvm_host.h        |  26 +++++
>  include/trace/events/kvm.h      |  78 +++++++++++++++
>  include/uapi/linux/kvm.h        |  33 +++++++
>  virt/kvm/dirty_ring.c           | 162 ++++++++++++++++++++++++++++++++
>  virt/kvm/kvm_main.c             | 137 ++++++++++++++++++++++++++-
>  13 files changed, 606 insertions(+), 3 deletions(-)
>  create mode 100644 include/linux/kvm_dirty_ring.h
>  create mode 100644 virt/kvm/dirty_ring.c
> 
> diff --git a/Documentation/virt/kvm/api.txt b/Documentation/virt/kvm/api.txt
> index ebb37b34dcfc..708c3e0f7eae 100644
> --- a/Documentation/virt/kvm/api.txt
> +++ b/Documentation/virt/kvm/api.txt
> @@ -231,6 +231,7 @@ Based on their initialization different VMs may have different capabilities.
>  It is thus encouraged to use the vm ioctl to query for capabilities (available
>  with KVM_CAP_CHECK_EXTENSION_VM on the vm fd)
>  
> +
>  4.5 KVM_GET_VCPU_MMAP_SIZE
>  
>  Capability: basic
> @@ -243,6 +244,18 @@ The KVM_RUN ioctl (cf.) communicates with userspace via a shared
>  memory region.  This ioctl returns the size of that region.  See the
>  KVM_RUN documentation for details.
>  
> +Besides the size of the KVM_RUN communication region, other areas of
> +the VCPU file descriptor can be mmap-ed, including:
> +
> +- if KVM_CAP_COALESCED_MMIO is available, a page at
> +  KVM_COALESCED_MMIO_PAGE_OFFSET * PAGE_SIZE; for historical reasons,
> +  this page is included in the result of KVM_GET_VCPU_MMAP_SIZE.
> +  KVM_CAP_COALESCED_MMIO is not documented yet.
> +
> +- if KVM_CAP_DIRTY_LOG_RING is available, a number of pages at
> +  KVM_DIRTY_LOG_PAGE_OFFSET * PAGE_SIZE.  For more information on
> +  KVM_CAP_DIRTY_LOG_RING, see section 8.3.
> +
>  
>  4.6 KVM_SET_MEMORY_REGION
>  
> @@ -5376,6 +5389,7 @@ CPU when the exception is taken. If this virtual SError is taken to EL1 using
>  AArch64, this value will be reported in the ISS field of ESR_ELx.
>  
>  See KVM_CAP_VCPU_EVENTS for more details.
> +
>  8.20 KVM_CAP_HYPERV_SEND_IPI
>  
>  Architectures: x86
> @@ -5383,6 +5397,7 @@ Architectures: x86
>  This capability indicates that KVM supports paravirtualized Hyper-V IPI send
>  hypercalls:
>  HvCallSendSyntheticClusterIpi, HvCallSendSyntheticClusterIpiEx.
> +
>  8.21 KVM_CAP_HYPERV_DIRECT_TLBFLUSH
>  
>  Architecture: x86
> @@ -5396,3 +5411,77 @@ handling by KVM (as some KVM hypercall may be mistakenly treated as TLB
>  flush hypercalls by Hyper-V) so userspace should disable KVM identification
>  in CPUID and only exposes Hyper-V identification. In this case, guest
>  thinks it's running on Hyper-V and only use Hyper-V hypercalls.
> +
> +8.22 KVM_CAP_DIRTY_LOG_RING
> +
> +Architectures: x86
> +Parameters: args[0] - size of the dirty log ring
> +
> +KVM is capable of tracking dirty memory using ring buffers that are
> +mmaped into userspace; there is one dirty ring per vcpu.
> +
> +One dirty ring is defined as below internally:
> +
> +struct kvm_dirty_ring {
> +	u32 dirty_index;
> +	u32 reset_index;
> +	u32 size;
> +	u32 soft_limit;
> +	struct kvm_dirty_gfn *dirty_gfns;
> +	struct kvm_dirty_ring_indices *indices;
> +	int index;
> +};
> +
> +Dirty GFNs (Guest Frame Numbers) are stored in the dirty_gfns array.
> +For each of the dirty entry it's defined as:
> +
> +struct kvm_dirty_gfn {
> +        __u32 pad;
> +        __u32 slot; /* as_id | slot_id */
> +        __u64 offset;
> +};
> +
> +Most of the ring structure is used by KVM internally, while only the
> +indices are exposed to userspace:
> +
> +struct kvm_dirty_ring_indices {
> +	__u32 avail_index; /* set by kernel */
> +	__u32 fetch_index; /* set by userspace */
> +};
> +
> +The two indices in the ring buffer are free running counters.
> +
> +Userspace calls KVM_ENABLE_CAP ioctl right after KVM_CREATE_VM ioctl
> +to enable this capability for the new guest and set the size of the
> +rings.  It is only allowed before creating any vCPU, and the size of
> +the ring must be a power of two.  The larger the ring buffer, the less
> +likely the ring is full and the VM is forced to exit to userspace. The
> +optimal size depends on the workload, but it is recommended that it be
> +at least 64 KiB (4096 entries).
> +
> +Just like for dirty page bitmaps, the buffer tracks writes to
> +all user memory regions for which the KVM_MEM_LOG_DIRTY_PAGES flag was
> +set in KVM_SET_USER_MEMORY_REGION.  Once a memory region is registered
> +with the flag set, userspace can start harvesting dirty pages from the
> +ring buffer.
> +
> +To harvest the dirty pages, userspace accesses the mmaped ring buffer
> +to read the dirty GFNs up to avail_index, and sets the fetch_index
> +accordingly.  This can be done when the guest is running or paused,
> +and dirty pages need not be collected all at once.  After processing
> +one or more entries in the ring buffer, userspace calls the VM ioctl
> +KVM_RESET_DIRTY_RINGS to notify the kernel that it has updated
> +fetch_index and to mark those pages clean.  Therefore, the ioctl
> +must be called *before* reading the content of the dirty pages.
> +
> +However, there is a major difference compared to the
> +KVM_GET_DIRTY_LOG interface in that when reading the dirty ring from
> +userspace it's still possible that the kernel has not yet flushed the
> +hardware dirty buffers into the kernel buffer (which was previously
> +done by the KVM_GET_DIRTY_LOG ioctl).  To achieve that, one needs to
> +kick the vcpu out for a hardware buffer flush (vmexit) to make sure
> +all the existing dirty gfns are flushed to the dirty rings.
> +
> +If one of the ring buffers is full, the guest will exit to userspace
> +with the exit reason set to KVM_EXIT_DIRTY_RING_FULL, and the KVM_RUN
> +ioctl will return to userspace with zero.
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index f536d139b3d2..3fe18402e6a3 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1181,6 +1181,7 @@ struct kvm_x86_ops {
>  					   struct kvm_memory_slot *slot,
>  					   gfn_t offset, unsigned long mask);
>  	int (*write_log_dirty)(struct kvm_vcpu *vcpu);
> +	int (*cpu_dirty_log_size)(void);
>  
>  	/* pmu operations of sub-arch */
>  	const struct kvm_pmu_ops *pmu_ops;
> @@ -1666,4 +1667,6 @@ static inline int kvm_cpu_get_apicid(int mps_cpu)
>  #define GET_SMSTATE(type, buf, offset)		\
>  	(*(type *)((buf) + (offset) - 0x7e00))
>  
> +int kvm_cpu_dirty_log_size(void);
> +
>  #endif /* _ASM_X86_KVM_HOST_H */
> diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
> index 503d3f42da16..b59bf356c478 100644
> --- a/arch/x86/include/uapi/asm/kvm.h
> +++ b/arch/x86/include/uapi/asm/kvm.h
> @@ -12,6 +12,7 @@
>  
>  #define KVM_PIO_PAGE_OFFSET 1
>  #define KVM_COALESCED_MMIO_PAGE_OFFSET 2
> +#define KVM_DIRTY_LOG_PAGE_OFFSET 64
>  
>  #define DE_VECTOR 0
>  #define DB_VECTOR 1
> diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile
> index b19ef421084d..0acee817adfb 100644
> --- a/arch/x86/kvm/Makefile
> +++ b/arch/x86/kvm/Makefile
> @@ -5,7 +5,8 @@ ccflags-y += -Iarch/x86/kvm
>  KVM := ../../../virt/kvm
>  
>  kvm-y			+= $(KVM)/kvm_main.o $(KVM)/coalesced_mmio.o \
> -				$(KVM)/eventfd.o $(KVM)/irqchip.o $(KVM)/vfio.o
> +				$(KVM)/eventfd.o $(KVM)/irqchip.o $(KVM)/vfio.o \
> +				$(KVM)/dirty_ring.o
>  kvm-$(CONFIG_KVM_ASYNC_PF)	+= $(KVM)/async_pf.o
>  
>  kvm-y			+= x86.o emulate.o i8259.o irq.o lapic.o \
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 7269130ea5e2..621b842a9b7b 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -1832,7 +1832,13 @@ int kvm_arch_write_log_dirty(struct kvm_vcpu *vcpu)
>  {
>  	if (kvm_x86_ops->write_log_dirty)
>  		return kvm_x86_ops->write_log_dirty(vcpu);
> +	return 0;
> +}
>  
> +int kvm_cpu_dirty_log_size(void)
> +{
> +	if (kvm_x86_ops->cpu_dirty_log_size)
> +		return kvm_x86_ops->cpu_dirty_log_size();
>  	return 0;
>  }
>  
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 62175a246bcc..2151de89456d 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -7689,6 +7689,7 @@ static __init int hardware_setup(void)
>  		kvm_x86_ops->slot_disable_log_dirty = NULL;
>  		kvm_x86_ops->flush_log_dirty = NULL;
>  		kvm_x86_ops->enable_log_dirty_pt_masked = NULL;
> +		kvm_x86_ops->cpu_dirty_log_size = NULL;
>  	}
>  
>  	if (!cpu_has_vmx_preemption_timer())
> @@ -7753,6 +7754,11 @@ static __exit void hardware_unsetup(void)
>  	free_kvm_area();
>  }
>  
> +static int vmx_cpu_dirty_log_size(void)
> +{
> +	return enable_pml ? PML_ENTITY_NUM : 0;
> +}
> +
>  static struct kvm_x86_ops vmx_x86_ops __ro_after_init = {
>  	.cpu_has_kvm_support = cpu_has_kvm_support,
>  	.disabled_by_bios = vmx_disabled_by_bios,
> @@ -7875,6 +7881,7 @@ static struct kvm_x86_ops vmx_x86_ops __ro_after_init = {
>  	.flush_log_dirty = vmx_flush_log_dirty,
>  	.enable_log_dirty_pt_masked = vmx_enable_log_dirty_pt_masked,
>  	.write_log_dirty = vmx_write_pml_buffer,
> +	.cpu_dirty_log_size = vmx_cpu_dirty_log_size,
>  
>  	.pre_block = vmx_pre_block,
>  	.post_block = vmx_post_block,
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index ff97782b3919..9c3673592826 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -7998,6 +7998,15 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
>  
>  	bool req_immediate_exit = false;
>  
> +	/* Forbid vmenter if vcpu dirty ring is soft-full */
> +	if (unlikely(vcpu->kvm->dirty_ring_size &&
> +		     kvm_dirty_ring_soft_full(&vcpu->dirty_ring))) {
> +		vcpu->run->exit_reason = KVM_EXIT_DIRTY_RING_FULL;
> +		trace_kvm_dirty_ring_exit(vcpu);
> +		r = 0;
> +		goto out;
> +	}
> +
>  	if (kvm_request_pending(vcpu)) {
>  		if (kvm_check_request(KVM_REQ_GET_VMCS12_PAGES, vcpu)) {
>  			if (unlikely(!kvm_x86_ops->get_vmcs12_pages(vcpu))) {
> diff --git a/include/linux/kvm_dirty_ring.h b/include/linux/kvm_dirty_ring.h
> new file mode 100644
> index 000000000000..d6fe9e1b7617
> --- /dev/null
> +++ b/include/linux/kvm_dirty_ring.h
> @@ -0,0 +1,55 @@
> +#ifndef KVM_DIRTY_RING_H
> +#define KVM_DIRTY_RING_H
> +
> +/**
> + * kvm_dirty_ring: KVM internal dirty ring structure
> + *
> + * @dirty_index: free running counter that points to the next slot in
> + *               dirty_ring->dirty_gfns, where a new dirty page should go
> + * @reset_index: free running counter that points to the next dirty page
> + *               in dirty_ring->dirty_gfns for which dirty trap needs to
> + *               be reenabled
> + * @size:        size of the compact list, dirty_ring->dirty_gfns
> + * @soft_limit:  when the number of dirty pages in the list reaches this
> + *               limit, vcpu that owns this ring should exit to userspace
> + *               to allow userspace to harvest all the dirty pages
> + * @dirty_gfns:  the array to keep the dirty gfns
> + * @indices:     the pointer to the @kvm_dirty_ring_indices structure
> + *               of this specific ring
> + * @index:       index of this dirty ring
> + */
> +struct kvm_dirty_ring {
> +	u32 dirty_index;
> +	u32 reset_index;
> +	u32 size;
> +	u32 soft_limit;
> +	struct kvm_dirty_gfn *dirty_gfns;

Here would be a good place to document that accessing a
shared page like this is only safe if the architecture is physically
tagged.
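
Something along these lines, say (wording is only a sketch):

	/*
	 * dirty_gfns is also mapped into userspace via mmap(), while
	 * the kernel accesses it through this pointer.  That is only
	 * safe on physically tagged cache architectures; a virtually
	 * tagged cache would need flush_dcache_page() calls around
	 * the accesses.
	 */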

> +	struct kvm_dirty_ring_indices *indices;
> +	int index;
> +};
> +
> +u32 kvm_dirty_ring_get_rsvd_entries(void);
> +int kvm_dirty_ring_alloc(struct kvm_dirty_ring *ring,
> +			 struct kvm_dirty_ring_indices *indices,
> +			 int index, u32 size);
> +struct kvm_dirty_ring *kvm_dirty_ring_get(struct kvm *kvm);
> +
> +/*
> + * called with kvm->slots_lock held, returns the number of
> + * processed pages.
> + */
> +int kvm_dirty_ring_reset(struct kvm *kvm, struct kvm_dirty_ring *ring);
> +
> +/*
> + * Push a dirty gfn into the ring.  The ring must never be full here;
> + * the vcpu exits to userspace when the ring becomes soft-full instead.
> + */
> +void kvm_dirty_ring_push(struct kvm_dirty_ring *ring, u32 slot, u64 offset);
> +
> +/* for use in vm_operations_struct */
> +struct page *kvm_dirty_ring_get_page(struct kvm_dirty_ring *ring, u32 offset);
> +
> +void kvm_dirty_ring_free(struct kvm_dirty_ring *ring);
> +bool kvm_dirty_ring_soft_full(struct kvm_dirty_ring *ring);
> +
> +#endif
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index cbd633ece959..c96161c6a0c9 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -34,6 +34,7 @@
>  #include <linux/kvm_types.h>
>  
>  #include <asm/kvm_host.h>
> +#include <linux/kvm_dirty_ring.h>
>  
>  #ifndef KVM_MAX_VCPU_ID
>  #define KVM_MAX_VCPU_ID KVM_MAX_VCPUS
> @@ -321,6 +322,7 @@ struct kvm_vcpu {
>  	bool ready;
>  	struct kvm_vcpu_arch arch;
>  	struct dentry *debugfs_dentry;
> +	struct kvm_dirty_ring dirty_ring;
>  };
>  
>  static inline int kvm_vcpu_exiting_guest_mode(struct kvm_vcpu *vcpu)
> @@ -502,6 +504,7 @@ struct kvm {
>  	struct srcu_struct srcu;
>  	struct srcu_struct irq_srcu;
>  	pid_t userspace_pid;
> +	u32 dirty_ring_size;
>  };
>  
>  #define kvm_err(fmt, ...) \
> @@ -831,6 +834,8 @@ void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm *kvm,
>  					gfn_t gfn_offset,
>  					unsigned long mask);
>  
> +void kvm_reset_dirty_gfn(struct kvm *kvm, u32 slot, u64 offset, u64 mask);
> +
>  int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm,
>  				struct kvm_dirty_log *log);
>  int kvm_vm_ioctl_clear_dirty_log(struct kvm *kvm,
> @@ -1409,4 +1414,25 @@ int kvm_vm_create_worker_thread(struct kvm *kvm, kvm_vm_thread_fn_t thread_fn,
>  				uintptr_t data, const char *name,
>  				struct task_struct **thread_ptr);
>  
> +/*
> + * This defines how many reserved entries we want to keep before we
> + * kick the vcpu to the userspace to avoid dirty ring full.  This
> + * value can be tuned to higher if e.g. PML is enabled on the host.
> + */
> +#define  KVM_DIRTY_RING_RSVD_ENTRIES  64
> +
> +/* Max number of entries allowed for each kvm dirty ring */
> +#define  KVM_DIRTY_RING_MAX_ENTRIES  65536
> +
> +/*
> + * Arch needs to define these macro after implementing the dirty ring
> + * feature.  KVM_DIRTY_LOG_PAGE_OFFSET should be defined as the
> + * starting page offset of the dirty ring structures, while
> + * KVM_DIRTY_RING_VERSION should be defined as >=1.  By default, this
> + * feature is off on all archs.
> + */
> +#ifndef KVM_DIRTY_LOG_PAGE_OFFSET
> +#define KVM_DIRTY_LOG_PAGE_OFFSET 0
> +#endif
> +
>  #endif
> diff --git a/include/trace/events/kvm.h b/include/trace/events/kvm.h
> index 2c735a3e6613..3d850997940c 100644
> --- a/include/trace/events/kvm.h
> +++ b/include/trace/events/kvm.h
> @@ -399,6 +399,84 @@ TRACE_EVENT(kvm_halt_poll_ns,
>  #define trace_kvm_halt_poll_ns_shrink(vcpu_id, new, old) \
>  	trace_kvm_halt_poll_ns(false, vcpu_id, new, old)
>  
> +TRACE_EVENT(kvm_dirty_ring_push,
> +	TP_PROTO(struct kvm_dirty_ring *ring, u32 slot, u64 offset),
> +	TP_ARGS(ring, slot, offset),
> +
> +	TP_STRUCT__entry(
> +		__field(int, index)
> +		__field(u32, dirty_index)
> +		__field(u32, reset_index)
> +		__field(u32, slot)
> +		__field(u64, offset)
> +	),
> +
> +	TP_fast_assign(
> +		__entry->index          = ring->index;
> +		__entry->dirty_index    = ring->dirty_index;
> +		__entry->reset_index    = ring->reset_index;
> +		__entry->slot           = slot;
> +		__entry->offset         = offset;
> +	),
> +
> +	TP_printk("ring %d: dirty 0x%x reset 0x%x "
> +		  "slot %u offset 0x%llx (used %u)",
> +		  __entry->index, __entry->dirty_index,
> +		  __entry->reset_index,  __entry->slot, __entry->offset,
> +		  __entry->dirty_index - __entry->reset_index)
> +);
> +
> +TRACE_EVENT(kvm_dirty_ring_reset,
> +	TP_PROTO(struct kvm_dirty_ring *ring),
> +	TP_ARGS(ring),
> +
> +	TP_STRUCT__entry(
> +		__field(int, index)
> +		__field(u32, dirty_index)
> +		__field(u32, reset_index)
> +	),
> +
> +	TP_fast_assign(
> +		__entry->index          = ring->index;
> +		__entry->dirty_index    = ring->dirty_index;
> +		__entry->reset_index    = ring->reset_index;
> +	),
> +
> +	TP_printk("ring %d: dirty 0x%x reset 0x%x (used %u)",
> +		  __entry->index, __entry->dirty_index, __entry->reset_index,
> +		  __entry->dirty_index - __entry->reset_index)
> +);
> +
> +TRACE_EVENT(kvm_dirty_ring_waitqueue,
> +	TP_PROTO(bool enter),
> +	TP_ARGS(enter),
> +
> +	TP_STRUCT__entry(
> +	    __field(bool, enter)
> +	),
> +
> +	TP_fast_assign(
> +	    __entry->enter = enter;
> +	),
> +
> +	TP_printk("%s", __entry->enter ? "wait" : "awake")
> +);
> +
> +TRACE_EVENT(kvm_dirty_ring_exit,
> +	TP_PROTO(struct kvm_vcpu *vcpu),
> +	TP_ARGS(vcpu),
> +
> +	TP_STRUCT__entry(
> +	    __field(int, vcpu_id)
> +	),
> +
> +	TP_fast_assign(
> +	    __entry->vcpu_id = vcpu->vcpu_id;
> +	),
> +
> +	TP_printk("vcpu %d", __entry->vcpu_id)
> +);
> +
>  #endif /* _TRACE_KVM_MAIN_H */
>  
>  /* This part must be outside protection */
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index f0a16b4adbbd..df4a1700ff1e 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -236,6 +236,7 @@ struct kvm_hyperv_exit {
>  #define KVM_EXIT_IOAPIC_EOI       26
>  #define KVM_EXIT_HYPERV           27
>  #define KVM_EXIT_ARM_NISV         28
> +#define KVM_EXIT_DIRTY_RING_FULL  29
>  
>  /* For KVM_EXIT_INTERNAL_ERROR */
>  /* Emulate instruction failed. */
> @@ -247,6 +248,13 @@ struct kvm_hyperv_exit {
>  /* Encounter unexpected vm-exit reason */
>  #define KVM_INTERNAL_ERROR_UNEXPECTED_EXIT_REASON	4
>  
> +struct kvm_dirty_ring_indices {
> +	__u32 avail_index; /* set by kernel */
> +	__u32 padding1;
> +	__u32 fetch_index; /* set by userspace */
> +	__u32 padding2;
> +};
> +
>  /* for KVM_RUN, returned by mmap(vcpu_fd, offset=0) */
>  struct kvm_run {
>  	/* in */
> @@ -421,6 +429,8 @@ struct kvm_run {
>  		struct kvm_sync_regs regs;
>  		char padding[SYNC_REGS_SIZE_BYTES];
>  	} s;
> +
> +	struct kvm_dirty_ring_indices vcpu_ring_indices;
>  };
>  
>  /* for KVM_REGISTER_COALESCED_MMIO / KVM_UNREGISTER_COALESCED_MMIO */
> @@ -1009,6 +1019,7 @@ struct kvm_ppc_resize_hpt {
>  #define KVM_CAP_PPC_GUEST_DEBUG_SSTEP 176
>  #define KVM_CAP_ARM_NISV_TO_USER 177
>  #define KVM_CAP_ARM_INJECT_EXT_DABT 178
> +#define KVM_CAP_DIRTY_LOG_RING 179
>  
>  #ifdef KVM_CAP_IRQ_ROUTING
>  
> @@ -1473,6 +1484,9 @@ struct kvm_enc_region {
>  /* Available with KVM_CAP_ARM_SVE */
>  #define KVM_ARM_VCPU_FINALIZE	  _IOW(KVMIO,  0xc2, int)
>  
> +/* Available with KVM_CAP_DIRTY_LOG_RING */
> +#define KVM_RESET_DIRTY_RINGS     _IO(KVMIO, 0xc3)
> +
>  /* Secure Encrypted Virtualization command */
>  enum sev_cmd_id {
>  	/* Guest initialization commands */
> @@ -1623,4 +1637,23 @@ struct kvm_hyperv_eventfd {
>  #define KVM_HYPERV_CONN_ID_MASK		0x00ffffff
>  #define KVM_HYPERV_EVENTFD_DEASSIGN	(1 << 0)
>  
> +/*
> + * The following are the requirements for supporting dirty log ring
> + * (by enabling KVM_DIRTY_LOG_PAGE_OFFSET).
> + *
> + * 1. Memory accesses by KVM should call kvm_vcpu_write_* instead
> + *    of kvm_write_* so that the global dirty ring is not filled up
> + *    too quickly.
> + * 2. kvm_arch_mmu_enable_log_dirty_pt_masked should be defined for
> + *    enabling dirty logging.
> + * 3. There should not be a separate step to synchronize hardware
> + *    dirty bitmap with KVM's.


Are these requirements on the architecture? Then you want to move
this out of UAPI and keep only things relevant to userspace there.

> + */
> +
> +struct kvm_dirty_gfn {
> +	__u32 pad;
> +	__u32 slot;
> +	__u64 offset;
> +};
> +

Pls add comments about how kvm_dirty_gfn must be mmapped.


>  #endif /* __LINUX_KVM_H */
> diff --git a/virt/kvm/dirty_ring.c b/virt/kvm/dirty_ring.c
> new file mode 100644
> index 000000000000..67ec5bbc21c0
> --- /dev/null
> +++ b/virt/kvm/dirty_ring.c
> @@ -0,0 +1,162 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * KVM dirty ring implementation
> + *
> + * Copyright 2019 Red Hat, Inc.
> + */
> +#include <linux/kvm_host.h>
> +#include <linux/kvm.h>
> +#include <linux/vmalloc.h>
> +#include <linux/kvm_dirty_ring.h>
> +#include <trace/events/kvm.h>
> +
> +int __weak kvm_cpu_dirty_log_size(void)
> +{
> +	return 0;
> +}
> +
> +u32 kvm_dirty_ring_get_rsvd_entries(void)
> +{
> +	return KVM_DIRTY_RING_RSVD_ENTRIES + kvm_cpu_dirty_log_size();
> +}
> +
> +static u32 kvm_dirty_ring_used(struct kvm_dirty_ring *ring)
> +{
> +	return READ_ONCE(ring->dirty_index) - READ_ONCE(ring->reset_index);
> +}
> +
> +bool kvm_dirty_ring_soft_full(struct kvm_dirty_ring *ring)
> +{
> +	return kvm_dirty_ring_used(ring) >= ring->soft_limit;
> +}
> +
> +bool kvm_dirty_ring_full(struct kvm_dirty_ring *ring)
> +{
> +	return kvm_dirty_ring_used(ring) >= ring->size;
> +}
> +
> +struct kvm_dirty_ring *kvm_dirty_ring_get(struct kvm *kvm)
> +{
> +	struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
> +
> +	WARN_ON_ONCE(vcpu->kvm != kvm);
> +
> +	return &vcpu->dirty_ring;
> +}
> +
> +int kvm_dirty_ring_alloc(struct kvm_dirty_ring *ring,
> +			 struct kvm_dirty_ring_indices *indices,
> +			 int index, u32 size)
> +{
> +	ring->dirty_gfns = vmalloc(size);
> +	if (!ring->dirty_gfns)
> +		return -ENOMEM;
> +	memset(ring->dirty_gfns, 0, size);
> +
> +	ring->size = size / sizeof(struct kvm_dirty_gfn);
> +	ring->soft_limit = ring->size - kvm_dirty_ring_get_rsvd_entries();
> +	ring->dirty_index = 0;
> +	ring->reset_index = 0;
> +	ring->index = index;
> +	ring->indices = indices;
> +
> +	return 0;
> +}
> +
> +int kvm_dirty_ring_reset(struct kvm *kvm, struct kvm_dirty_ring *ring)
> +{
> +	u32 cur_slot, next_slot;
> +	u64 cur_offset, next_offset;
> +	unsigned long mask;
> +	u32 fetch;
> +	int count = 0;
> +	struct kvm_dirty_gfn *entry;
> +	struct kvm_dirty_ring_indices *indices = ring->indices;
> +	bool first_round = true;
> +
> +	fetch = READ_ONCE(indices->fetch_index);
> +
> +	/*
> +	 * Note that fetch_index is written by userspace, which
> +	 * should not be trusted.  If the check below fails, it is
> +	 * probably because userspace has written a bogus fetch_index.
> +	 */
> +	if (fetch - ring->reset_index > ring->size)
> +		return -EINVAL;
> +
> +	if (fetch == ring->reset_index)
> +		return 0;
> +
> +	/* This is only needed to make compilers happy */
> +	cur_slot = cur_offset = mask = 0;
> +	while (ring->reset_index != fetch) {
> +		entry = &ring->dirty_gfns[ring->reset_index & (ring->size - 1)];
> +		next_slot = READ_ONCE(entry->slot);
> +		next_offset = READ_ONCE(entry->offset);
> +		ring->reset_index++;
> +		count++;
> +		/*
> +		 * Try to coalesce the reset operations when the guest is
> +		 * scanning pages in the same slot.
> +		 */
> +		if (!first_round && next_slot == cur_slot) {
> +			s64 delta = next_offset - cur_offset;
> +
> +			if (delta >= 0 && delta < BITS_PER_LONG) {
> +				mask |= 1ull << delta;
> +				continue;
> +			}
> +
> +			/* Backwards visit, careful about overflows!  */
> +			if (delta > -BITS_PER_LONG && delta < 0 &&
> +			    (mask << -delta >> -delta) == mask) {
> +				cur_offset = next_offset;
> +				mask = (mask << -delta) | 1;
> +				continue;
> +			}
> +		}
> +		kvm_reset_dirty_gfn(kvm, cur_slot, cur_offset, mask);
> +		cur_slot = next_slot;
> +		cur_offset = next_offset;
> +		mask = 1;
> +		first_round = false;
> +	}
> +	kvm_reset_dirty_gfn(kvm, cur_slot, cur_offset, mask);
> +
> +	trace_kvm_dirty_ring_reset(ring);
> +
> +	return count;
> +}
> +
> +void kvm_dirty_ring_push(struct kvm_dirty_ring *ring, u32 slot, u64 offset)
> +{
> +	struct kvm_dirty_gfn *entry;
> +	struct kvm_dirty_ring_indices *indices = ring->indices;
> +
> +	/* It should never get full */
> +	WARN_ON_ONCE(kvm_dirty_ring_full(ring));
> +
> +	entry = &ring->dirty_gfns[ring->dirty_index & (ring->size - 1)];
> +	entry->slot = slot;
> +	entry->offset = offset;
> +	/*
> +	 * Make sure the data is filled in before we publish this to
> +	 * the userspace program.  There's no paired kernel-side reader.
> +	 */
> +	smp_wmb();
> +	ring->dirty_index++;
> +	WRITE_ONCE(indices->avail_index, ring->dirty_index);
> +
> +	trace_kvm_dirty_ring_push(ring, slot, offset);
> +}
> +
> +struct page *kvm_dirty_ring_get_page(struct kvm_dirty_ring *ring, u32 offset)
> +{
> +	return vmalloc_to_page((void *)ring->dirty_gfns + offset * PAGE_SIZE);
> +}
> +
> +void kvm_dirty_ring_free(struct kvm_dirty_ring *ring)
> +{
> +	vfree(ring->dirty_gfns);
> +	ring->dirty_gfns = NULL;
> +}
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 5bbd8b8730fa..5e36792e15ae 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -64,6 +64,8 @@
>  #define CREATE_TRACE_POINTS
>  #include <trace/events/kvm.h>
>  
> +#include <linux/kvm_dirty_ring.h>
> +
>  /* Worst case buffer size needed for holding an integer. */
>  #define ITOA_MAX_LEN 12
>  
> @@ -357,11 +359,22 @@ int kvm_vcpu_init(struct kvm_vcpu *vcpu, struct kvm *kvm, unsigned id)
>  	vcpu->preempted = false;
>  	vcpu->ready = false;
>  
> +	if (kvm->dirty_ring_size) {
> +		r = kvm_dirty_ring_alloc(&vcpu->dirty_ring,
> +					 &vcpu->run->vcpu_ring_indices,
> +					 id, kvm->dirty_ring_size);
> +		if (r)
> +			goto fail_free_run;
> +	}
> +
>  	r = kvm_arch_vcpu_init(vcpu);
>  	if (r < 0)
> -		goto fail_free_run;
> +		goto fail_free_ring;
>  	return 0;
>  
> +fail_free_ring:
> +	if (kvm->dirty_ring_size)
> +		kvm_dirty_ring_free(&vcpu->dirty_ring);
>  fail_free_run:
>  	free_page((unsigned long)vcpu->run);
>  fail:
> @@ -379,6 +392,8 @@ void kvm_vcpu_uninit(struct kvm_vcpu *vcpu)
>  	put_pid(rcu_dereference_protected(vcpu->pid, 1));
>  	kvm_arch_vcpu_uninit(vcpu);
>  	free_page((unsigned long)vcpu->run);
> +	if (vcpu->kvm->dirty_ring_size)
> +		kvm_dirty_ring_free(&vcpu->dirty_ring);
>  }
>  EXPORT_SYMBOL_GPL(kvm_vcpu_uninit);
>  
> @@ -2284,8 +2299,13 @@ static void mark_page_dirty_in_slot(struct kvm *kvm,
>  {
>  	if (memslot && memslot->dirty_bitmap) {
>  		unsigned long rel_gfn = gfn - memslot->base_gfn;
> +		u32 slot = (memslot->as_id << 16) | memslot->id;
>  
> -		set_bit_le(rel_gfn, memslot->dirty_bitmap);
> +		if (kvm->dirty_ring_size)
> +			kvm_dirty_ring_push(kvm_dirty_ring_get(kvm),
> +					    slot, rel_gfn);
> +		else
> +			set_bit_le(rel_gfn, memslot->dirty_bitmap);
>  	}
>  }
>  
> @@ -2632,6 +2652,16 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me, bool yield_to_kernel_mode)
>  }
>  EXPORT_SYMBOL_GPL(kvm_vcpu_on_spin);
>  
> +static bool kvm_page_in_dirty_ring(struct kvm *kvm, unsigned long pgoff)
> +{
> +	if (!KVM_DIRTY_LOG_PAGE_OFFSET)
> +		return false;
> +
> +	return (pgoff >= KVM_DIRTY_LOG_PAGE_OFFSET) &&
> +	    (pgoff < KVM_DIRTY_LOG_PAGE_OFFSET +
> +	     kvm->dirty_ring_size / PAGE_SIZE);
> +}
> +
>  static vm_fault_t kvm_vcpu_fault(struct vm_fault *vmf)
>  {
>  	struct kvm_vcpu *vcpu = vmf->vma->vm_file->private_data;
> @@ -2647,6 +2677,10 @@ static vm_fault_t kvm_vcpu_fault(struct vm_fault *vmf)
>  	else if (vmf->pgoff == KVM_COALESCED_MMIO_PAGE_OFFSET)
>  		page = virt_to_page(vcpu->kvm->coalesced_mmio_ring);
>  #endif
> +	else if (kvm_page_in_dirty_ring(vcpu->kvm, vmf->pgoff))
> +		page = kvm_dirty_ring_get_page(
> +		    &vcpu->dirty_ring,
> +		    vmf->pgoff - KVM_DIRTY_LOG_PAGE_OFFSET);
>  	else
>  		return kvm_arch_vcpu_fault(vcpu, vmf);
>  	get_page(page);
> @@ -2660,6 +2694,15 @@ static const struct vm_operations_struct kvm_vcpu_vm_ops = {
>  
>  static int kvm_vcpu_mmap(struct file *file, struct vm_area_struct *vma)
>  {
> +	struct kvm_vcpu *vcpu = file->private_data;
> +	unsigned long pages = (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
> +
> +	/* Fail any attempt to map a page within the dirty ring as writable */
> +	if ((kvm_page_in_dirty_ring(vcpu->kvm, vma->vm_pgoff) ||
> +	     kvm_page_in_dirty_ring(vcpu->kvm, vma->vm_pgoff + pages - 1)) &&
> +	    vma->vm_flags & VM_WRITE)
> +		return -EINVAL;
> +
>  	vma->vm_ops = &kvm_vcpu_vm_ops;
>  	return 0;
>  }
> @@ -3242,12 +3285,97 @@ static long kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
>  #endif
>  	case KVM_CAP_NR_MEMSLOTS:
>  		return KVM_USER_MEM_SLOTS;
> +	case KVM_CAP_DIRTY_LOG_RING:
> +#ifdef CONFIG_X86
> +		return KVM_DIRTY_RING_MAX_ENTRIES;
> +#else
> +		return 0;
> +#endif
>  	default:
>  		break;
>  	}
>  	return kvm_vm_ioctl_check_extension(kvm, arg);
>  }
>  
> +void kvm_reset_dirty_gfn(struct kvm *kvm, u32 slot, u64 offset, u64 mask)
> +{
> +	struct kvm_memory_slot *memslot;
> +	int as_id, id;
> +
> +	as_id = slot >> 16;
> +	id = (u16)slot;
> +	if (as_id >= KVM_ADDRESS_SPACE_NUM || id >= KVM_USER_MEM_SLOTS)
> +		return;
> +
> +	memslot = id_to_memslot(__kvm_memslots(kvm, as_id), id);
> +	if (offset >= memslot->npages)
> +		return;
> +
> +	spin_lock(&kvm->mmu_lock);
> +	kvm_arch_mmu_enable_log_dirty_pt_masked(kvm, memslot, offset, mask);
> +	spin_unlock(&kvm->mmu_lock);
> +}
> +
> +static int kvm_vm_ioctl_enable_dirty_log_ring(struct kvm *kvm, u32 size)
> +{
> +	int r;
> +
> +	if (!KVM_DIRTY_LOG_PAGE_OFFSET)
> +		return -EINVAL;
> +
> +	/* The size should be a power of 2 */
> +	if (!size || (size & (size - 1)))
> +		return -EINVAL;
> +
> +	/* The size must cover the reserved entries and be at least one page */
> +	if (size < kvm_dirty_ring_get_rsvd_entries() *
> +	    sizeof(struct kvm_dirty_gfn) || size < PAGE_SIZE)
> +		return -EINVAL;
> +
> +	if (size > KVM_DIRTY_RING_MAX_ENTRIES *
> +	    sizeof(struct kvm_dirty_gfn))
> +		return -E2BIG;
> +
> +	/* We only allow it to be set once */
> +	if (kvm->dirty_ring_size)
> +		return -EINVAL;
> +
> +	mutex_lock(&kvm->lock);
> +
> +	if (kvm->created_vcpus) {
> +		/* We don't allow changing this value after vcpus are created */
> +		r = -EINVAL;
> +	} else {
> +		kvm->dirty_ring_size = size;
> +		r = 0;
> +	}
> +
> +	mutex_unlock(&kvm->lock);
> +	return r;
> +}
> +
> +static int kvm_vm_ioctl_reset_dirty_pages(struct kvm *kvm)
> +{
> +	int i;
> +	struct kvm_vcpu *vcpu;
> +	int cleared = 0;
> +
> +	if (!kvm->dirty_ring_size)
> +		return -EINVAL;
> +
> +	mutex_lock(&kvm->slots_lock);
> +
> +	kvm_for_each_vcpu(i, vcpu, kvm)
> +		cleared += kvm_dirty_ring_reset(vcpu->kvm, &vcpu->dirty_ring);
> +
> +	mutex_unlock(&kvm->slots_lock);
> +
> +	if (cleared)
> +		kvm_flush_remote_tlbs(kvm);
> +
> +	return cleared;
> +}
> +
>  int __attribute__((weak)) kvm_vm_ioctl_enable_cap(struct kvm *kvm,
>  						  struct kvm_enable_cap *cap)
>  {
> @@ -3265,6 +3393,8 @@ static int kvm_vm_ioctl_enable_cap_generic(struct kvm *kvm,
>  		kvm->manual_dirty_log_protect = cap->args[0];
>  		return 0;
>  #endif
> +	case KVM_CAP_DIRTY_LOG_RING:
> +		return kvm_vm_ioctl_enable_dirty_log_ring(kvm, cap->args[0]);
>  	default:
>  		return kvm_vm_ioctl_enable_cap(kvm, cap);
>  	}
> @@ -3452,6 +3582,9 @@ static long kvm_vm_ioctl(struct file *filp,
>  	case KVM_CHECK_EXTENSION:
>  		r = kvm_vm_ioctl_check_extension_generic(kvm, arg);
>  		break;
> +	case KVM_RESET_DIRTY_RINGS:
> +		r = kvm_vm_ioctl_reset_dirty_pages(kvm);
> +		break;
>  	default:
>  		r = kvm_arch_vm_ioctl(filp, ioctl, arg);
>  	}
> -- 
> 2.24.1


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v3 12/21] KVM: X86: Implement ring-based dirty memory tracking
  2020-01-14 20:01         ` Peter Xu
@ 2020-01-15  6:50           ` Michael S. Tsirkin
  2020-01-15 15:20             ` Peter Xu
  0 siblings, 1 reply; 82+ messages in thread
From: Michael S. Tsirkin @ 2020-01-15  6:50 UTC (permalink / raw)
  To: Peter Xu
  Cc: kvm, linux-kernel, Christophe de Dinechin, Paolo Bonzini,
	Sean Christopherson, Yan Zhao, Alex Williamson, Jason Wang,
	Kevin Kevin, Vitaly Kuznetsov, Dr . David Alan Gilbert, Lei Cao,
	Andrew Jones

On Tue, Jan 14, 2020 at 03:01:34PM -0500, Peter Xu wrote:
> On Thu, Jan 09, 2020 at 02:35:46PM -0500, Michael S. Tsirkin wrote:
> >   ``void flush_dcache_page(struct page *page)``
> > 
> >         Any time the kernel writes to a page cache page, _OR_
> >         the kernel is about to read from a page cache page and
> >         user space shared/writable mappings of this page potentially
> >         exist, this routine is called.
> > 
> > 
> > > Also, I believe this is a similar question to the one Jason asked
> > > in V2.  Sorry, I should have mentioned this earlier, but I didn't
> > > address that in
> > > this series because if we need to do so we probably need to do it
> > > kvm-wise, rather than only in this series.
> > 
> > You need to document these things.
> > 
> > >  I feel like it's missing
> > > probably only because all existing KVM supported archs do not have
> > > virtual-tagged caches as you mentioned.
> > 
> > But is that a fact? ARM has such a variety of CPUs,
> > I can't really tell. Did you research this to make sure?
> > 
> > > If so, I would prefer if you
> > > can allow me to ignore that issue until KVM starts to support such an
> > > arch.
> > 
> > Document limitations pls.  Don't ignore them.
> 
> Hi, Michael,
> 
> I failed to find a good place to document flush_dcache_page()
> for KVM.  Could you give me a suggestion?

Maybe where the field is introduced. I posted the suggestions to the
relevant patch.

> And I don't know whether there are any ARM hosts that require
> flush_dcache_page().  I think not, because again I haven't seen any
> caller of flush_dcache_page() in KVM code yet.  Otherwise I think we
> should at least call it before the kernel reads kvm_run, or after
> publishing data to kvm_run.

But is kvm_run ever accessed while the VCPU is running on another CPU?
I always assumed no, but maybe I'm missing something?

>  However I'm also CCing Drew on this.
> 
> Thanks,
> 
> -- 
> Peter Xu


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v3 12/21] KVM: X86: Implement ring-based dirty memory tracking
  2020-01-15  6:50           ` Michael S. Tsirkin
@ 2020-01-15 15:20             ` Peter Xu
  0 siblings, 0 replies; 82+ messages in thread
From: Peter Xu @ 2020-01-15 15:20 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: kvm, linux-kernel, Christophe de Dinechin, Paolo Bonzini,
	Sean Christopherson, Yan Zhao, Alex Williamson, Jason Wang,
	Kevin Kevin, Vitaly Kuznetsov, Dr . David Alan Gilbert, Lei Cao,
	Andrew Jones

On Wed, Jan 15, 2020 at 01:50:08AM -0500, Michael S. Tsirkin wrote:
> On Tue, Jan 14, 2020 at 03:01:34PM -0500, Peter Xu wrote:
> > On Thu, Jan 09, 2020 at 02:35:46PM -0500, Michael S. Tsirkin wrote:
> > >   ``void flush_dcache_page(struct page *page)``
> > > 
> > >         Any time the kernel writes to a page cache page, _OR_
> > >         the kernel is about to read from a page cache page and
> > >         user space shared/writable mappings of this page potentially
> > >         exist, this routine is called.

[1]

> > > 
> > > 
> > > > Also, I believe this is a similar question to the one Jason asked
> > > > in V2.  Sorry, I should have mentioned this earlier, but I didn't
> > > > address that in
> > > > this series because if we need to do so we probably need to do it
> > > > kvm-wise, rather than only in this series.
> > > 
> > > You need to document these things.
> > > 
> > > >  I feel like it's missing
> > > > probably only because all existing KVM supported archs do not have
> > > > virtual-tagged caches as you mentioned.
> > > 
> > > But is that a fact? ARM has such a variety of CPUs,
> > > I can't really tell. Did you research this to make sure?
> > > 
> > > > If so, I would prefer if you
> > > > can allow me to ignore that issue until KVM starts to support such an
> > > > arch.
> > > 
> > > Document limitations pls.  Don't ignore them.
> > 
> > Hi, Michael,
> > 
> > I failed to find a good place to document flush_dcache_page()
> > for KVM.  Could you give me a suggestion?
> 
> Maybe where the field is introduced. I posted the suggestions to the
> relevant patch.

(will reply there)

> 
> > And I don't know whether there are any ARM hosts that require
> > flush_dcache_page().  I think not, because again I haven't seen any
> > caller of flush_dcache_page() in KVM code yet.  Otherwise I think we
> > should at least call it before the kernel reads kvm_run, or after
> > publishing data to kvm_run.
> 
> But is kvm_run ever accessed while the VCPU is running on another CPU?
> I always assumed no, but maybe I'm missing something?

IMHO we need to call it even if it's running on the same CPU - please
refer to [1] above; there's no restriction on which CPU the code is
running on.  I think that makes sense, especially on systems with
virtually-tagged caches, because even if the memory accesses happen
on the same CPU, the virtual addresses used to access the same page
can still differ between the kernel and userspace mappings.

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v3 12/21] KVM: X86: Implement ring-based dirty memory tracking
  2020-01-15  6:47   ` Michael S. Tsirkin
@ 2020-01-15 15:27     ` Peter Xu
  0 siblings, 0 replies; 82+ messages in thread
From: Peter Xu @ 2020-01-15 15:27 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: kvm, linux-kernel, Christophe de Dinechin, Paolo Bonzini,
	Sean Christopherson, Yan Zhao, Alex Williamson, Jason Wang,
	Kevin Kevin, Vitaly Kuznetsov, Dr . David Alan Gilbert, Lei Cao

On Wed, Jan 15, 2020 at 01:47:15AM -0500, Michael S. Tsirkin wrote:
> > diff --git a/include/linux/kvm_dirty_ring.h b/include/linux/kvm_dirty_ring.h
> > new file mode 100644
> > index 000000000000..d6fe9e1b7617
> > --- /dev/null
> > +++ b/include/linux/kvm_dirty_ring.h
> > @@ -0,0 +1,55 @@
> > +#ifndef KVM_DIRTY_RING_H
> > +#define KVM_DIRTY_RING_H
> > +
> > +/**
> > + * kvm_dirty_ring: KVM internal dirty ring structure
> > + *
> > + * @dirty_index: free running counter that points to the next slot in
> > + *               dirty_ring->dirty_gfns, where a new dirty page should go
> > + * @reset_index: free running counter that points to the next dirty page
> > + *               in dirty_ring->dirty_gfns for which dirty trap needs to
> > + *               be reenabled
> > + * @size:        size of the compact list, dirty_ring->dirty_gfns
> > + * @soft_limit:  when the number of dirty pages in the list reaches this
> > + *               limit, vcpu that owns this ring should exit to userspace
> > + *               to allow userspace to harvest all the dirty pages
> > + * @dirty_gfns:  the array to keep the dirty gfns
> > + * @indices:     the pointer to the @kvm_dirty_ring_indices structure
> > + *               of this specific ring
> > + * @index:       index of this dirty ring
> > + */
> > +struct kvm_dirty_ring {
> > +	u32 dirty_index;
> > +	u32 reset_index;
> > +	u32 size;
> > +	u32 soft_limit;
> > +	struct kvm_dirty_gfn *dirty_gfns;
> 
> Here would be a good place to document that accessing a
> shared page like this is only safe if the architecture is physically
> tagged.

Right; more important is where to document this for kvm_run, and for
any other shared mappings across the whole of KVM that I'm not yet
aware of.

[...]

> > +/*
> > + * The following are the requirements for supporting dirty log ring
> > + * (by enabling KVM_DIRTY_LOG_PAGE_OFFSET).
> > + *
> > + * 1. Memory accesses by KVM should call kvm_vcpu_write_* instead
> > + *    of kvm_write_* so that the global dirty ring is not filled up
> > + *    too quickly.
> > + * 2. kvm_arch_mmu_enable_log_dirty_pt_masked should be defined for
> > + *    enabling dirty logging.
> > + * 3. There should not be a separate step to synchronize hardware
> > + *    dirty bitmap with KVM's.
> 
> 
> Are these requirements on the architecture? Then you want to move
> this out of UAPI and keep only things relevant to userspace there.

Good point, I removed it, and instead of this...

> 
> > + */
> > +
> > +struct kvm_dirty_gfn {
> > +	__u32 pad;
> > +	__u32 slot;
> > +	__u64 offset;
> > +};
> > +
> 
> Pls add comments about how kvm_dirty_gfn must be mmapped.

... I added this:

/*
 * KVM dirty rings should be mapped at KVM_DIRTY_LOG_PAGE_OFFSET of
 * per-vcpu mmaped regions as an array of struct kvm_dirty_gfn.  The
 * size of the gfn buffer is decided by the first argument when
 * enabling KVM_CAP_DIRTY_LOG_RING.
 */
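
And for completeness, the userspace side would then look roughly like
this (a sketch with error handling omitted; ring_bytes stands for the
size that was passed when enabling KVM_CAP_DIRTY_LOG_RING):

	/* Map the ring read-only: writable mappings of the ring pages
	 * are rejected by kvm_vcpu_mmap() in this series. */
	struct kvm_dirty_gfn *dirty_gfns =
		mmap(NULL, ring_bytes, PROT_READ, MAP_SHARED, vcpu_fd,
		     KVM_DIRTY_LOG_PAGE_OFFSET * getpagesize());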

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v3 12/21] KVM: X86: Implement ring-based dirty memory tracking
  2020-01-09 14:57 ` [PATCH v3 12/21] KVM: X86: Implement ring-based dirty memory tracking Peter Xu
                     ` (3 preceding siblings ...)
  2020-01-15  6:47   ` Michael S. Tsirkin
@ 2020-01-16  8:38   ` Michael S. Tsirkin
  2020-01-16 16:27     ` Peter Xu
  4 siblings, 1 reply; 82+ messages in thread
From: Michael S. Tsirkin @ 2020-01-16  8:38 UTC (permalink / raw)
  To: Peter Xu
  Cc: kvm, linux-kernel, Christophe de Dinechin, Paolo Bonzini,
	Sean Christopherson, Yan Zhao, Alex Williamson, Jason Wang,
	Kevin Kevin, Vitaly Kuznetsov, Dr . David Alan Gilbert, Lei Cao

On Thu, Jan 09, 2020 at 09:57:20AM -0500, Peter Xu wrote:
> +	/* Fail any attempt to map a page within the dirty ring as writable */
> +	if ((kvm_page_in_dirty_ring(vcpu->kvm, vma->vm_pgoff) ||
> +	     kvm_page_in_dirty_ring(vcpu->kvm, vma->vm_pgoff + pages - 1)) &&
> +	    vma->vm_flags & VM_WRITE)
> +		return -EINVAL;

Worth thinking about other flags. Do we want to force VM_SHARED?
Disable VM_EXEC?


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v3 12/21] KVM: X86: Implement ring-based dirty memory tracking
  2020-01-16  8:38   ` Michael S. Tsirkin
@ 2020-01-16 16:27     ` Peter Xu
  2020-01-17  9:50       ` Michael S. Tsirkin
  0 siblings, 1 reply; 82+ messages in thread
From: Peter Xu @ 2020-01-16 16:27 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: kvm, linux-kernel, Christophe de Dinechin, Paolo Bonzini,
	Sean Christopherson, Yan Zhao, Alex Williamson, Jason Wang,
	Kevin Kevin, Vitaly Kuznetsov, Dr . David Alan Gilbert, Lei Cao

On Thu, Jan 16, 2020 at 03:38:21AM -0500, Michael S. Tsirkin wrote:
> On Thu, Jan 09, 2020 at 09:57:20AM -0500, Peter Xu wrote:
> > +	/* Fail any attempt to map a page within the dirty ring as writable */
> > +	if ((kvm_page_in_dirty_ring(vcpu->kvm, vma->vm_pgoff) ||
> > +	     kvm_page_in_dirty_ring(vcpu->kvm, vma->vm_pgoff + pages - 1)) &&
> > +	    vma->vm_flags & VM_WRITE)
> > +		return -EINVAL;
> 
> Worth thinking about other flags. Do we want to force VM_SHARED?
> Disable VM_EXEC?

Makes sense to me.  I think it's worth a standalone patch, since the
checks should apply to all the per-vcpu mmaped regions rather than
only to the dirty ring buffers.

(Should include KVM_PIO_PAGE_OFFSET, KVM_COALESCED_MMIO_PAGE_OFFSET,
 KVM_S390_SIE_PAGE_OFFSET, kvm_run, and this new one)
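
Roughly like the following, I'd guess (an untested sketch of the shape
of it, somewhere early in kvm_vcpu_mmap()):

	/* None of the per-vcpu pages are sensibly executable. */
	if (vma->vm_flags & VM_EXEC)
		return -EINVAL;

	/* Force MAP_SHARED, so that a private (COW) mapping can't
	 * leave userspace looking at stale copies of shared pages. */
	if (!(vma->vm_flags & VM_SHARED))
		return -EINVAL;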

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v3 12/21] KVM: X86: Implement ring-based dirty memory tracking
  2020-01-16 16:27     ` Peter Xu
@ 2020-01-17  9:50       ` Michael S. Tsirkin
  2020-01-20  6:48         ` Peter Xu
  0 siblings, 1 reply; 82+ messages in thread
From: Michael S. Tsirkin @ 2020-01-17  9:50 UTC (permalink / raw)
  To: Peter Xu
  Cc: kvm, linux-kernel, Christophe de Dinechin, Paolo Bonzini,
	Sean Christopherson, Yan Zhao, Alex Williamson, Jason Wang,
	Kevin Kevin, Vitaly Kuznetsov, Dr . David Alan Gilbert, Lei Cao

On Thu, Jan 16, 2020 at 11:27:03AM -0500, Peter Xu wrote:
> On Thu, Jan 16, 2020 at 03:38:21AM -0500, Michael S. Tsirkin wrote:
> > On Thu, Jan 09, 2020 at 09:57:20AM -0500, Peter Xu wrote:
> > > +	/* Fail any attempt to map a page within the dirty ring as writable */
> > > +	if ((kvm_page_in_dirty_ring(vcpu->kvm, vma->vm_pgoff) ||
> > > +	     kvm_page_in_dirty_ring(vcpu->kvm, vma->vm_pgoff + pages - 1)) &&
> > > +	    vma->vm_flags & VM_WRITE)
> > > +		return -EINVAL;
> > 
> > Worth thinking about other flags. Do we want to force VM_SHARED?
> > Disable VM_EXEC?
> 
> Makes sense to me.  I think it warrants a standalone patch since they
> should apply to the whole set of per-vcpu mmaped regions rather than
> only to the dirty ring buffers.
> 
> (Should include KVM_PIO_PAGE_OFFSET, KVM_COALESCED_MMIO_PAGE_OFFSET,
>  KVM_S390_SIE_PAGE_OFFSET, kvm_run, and this new one)
> 
> Thanks,


I don't think we can change UAPI for existing ones.
Userspace might be setting these by mistake.

> -- 
> Peter Xu


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v3 09/21] KVM: X86: Don't track dirty for KVM_SET_[TSS_ADDR|IDENTITY_MAP_ADDR]
  2020-01-09 14:57 ` [PATCH v3 09/21] KVM: X86: Don't track dirty for KVM_SET_[TSS_ADDR|IDENTITY_MAP_ADDR] Peter Xu
@ 2020-01-19  9:01   ` Paolo Bonzini
  2020-01-20  6:45     ` Peter Xu
  2020-01-21 15:56   ` Sean Christopherson
  1 sibling, 1 reply; 82+ messages in thread
From: Paolo Bonzini @ 2020-01-19  9:01 UTC (permalink / raw)
  To: Peter Xu, kvm, linux-kernel
  Cc: Christophe de Dinechin, Michael S . Tsirkin, Sean Christopherson,
	Yan Zhao, Alex Williamson, Jason Wang, Kevin Kevin,
	Vitaly Kuznetsov, Dr . David Alan Gilbert

On 09/01/20 15:57, Peter Xu wrote:
> -int __x86_set_memory_region(struct kvm *kvm, int id, gpa_t gpa, u32 size)
> +/*
> + * If `uaddr' is specified, `*uaddr' will be returned with the
> + * userspace address that was just allocated.  `uaddr' is only
> + * meaningful if the function returns zero, and `uaddr' will only be
> + * valid when with either the slots_lock or with the SRCU read lock
> + * held.  After we release the lock, the returned `uaddr' will be invalid.
> + */

In practice the address is still protected by the refcount, isn't it?
Only destroying the VM could invalidate it.

Paolo


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v3 12/21] KVM: X86: Implement ring-based dirty memory tracking
  2020-01-09 19:15     ` Peter Xu
  2020-01-09 19:35       ` Michael S. Tsirkin
@ 2020-01-19  9:09       ` Paolo Bonzini
  2020-01-19 10:12         ` Michael S. Tsirkin
  1 sibling, 1 reply; 82+ messages in thread
From: Paolo Bonzini @ 2020-01-19  9:09 UTC (permalink / raw)
  To: Peter Xu, Michael S. Tsirkin
  Cc: kvm, linux-kernel, Christophe de Dinechin, Sean Christopherson,
	Yan Zhao, Alex Williamson, Jason Wang, Kevin Kevin,
	Vitaly Kuznetsov, Dr . David Alan Gilbert, Lei Cao

On 09/01/20 20:15, Peter Xu wrote:
> Regarding dropping the indices: I feel like it can be done, though we
> probably need two extra bits for each GFN entry, for example:
> 
>   - Bit 0 of the GFN address to show whether this is a valid publish
>     of dirty gfn
> 
>   - Bit 1 of the GFN address to show whether this is collected by the
>     user

We can use bits 62 and 63 of the GFN.

I think this can be done in a secure way.  Later in the thread you say:

> We simply check fetch_index (sorry I
> meant this when I said reset_index, anyway it's the only index that we
> expose to userspace) to make sure:
> 
>   reset_index <= fetch_index <= dirty_index

So this means that KVM_RESET_DIRTY_RINGS should only test the "collected
by user" flag on dirty ring entries between reset_index and dirty_index.

Also I would make it

   00b (invalid GFN) ->
     01b (valid gfn published by kernel, which is dirty) ->
       1*b (gfn dirty page collected by userspace) ->
         00b (gfn reset by kernel, so goes back to invalid gfn)
That is, 10b and 11b are equivalent.  The kernel doesn't read that bit if
userspace has collected the page.
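
(A sketch of how the two bits could encode these states; the names are
invented for illustration, they are not from the series:)

        #define DIRTY_GFN_F_DIRTY      (1ULL << 62)  /* published by kernel */
        #define DIRTY_GFN_F_COLLECTED  (1ULL << 63)  /* seen by userspace */

        /*
         * 00b: unused/invalid entry
         * 01b: dirty gfn published by the kernel
         * 1*b: collected by userspace (bit 62 becomes a don't-care)
         * A reset by the kernel takes the entry back to 00b.
         */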

Paolo


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v3 00/21] KVM: Dirty ring interface
  2020-01-09 14:57 [PATCH v3 00/21] KVM: Dirty ring interface Peter Xu
                   ` (22 preceding siblings ...)
  2020-01-09 16:47 ` Alex Williamson
@ 2020-01-19  9:11 ` Paolo Bonzini
  23 siblings, 0 replies; 82+ messages in thread
From: Paolo Bonzini @ 2020-01-19  9:11 UTC (permalink / raw)
  To: Peter Xu, kvm, linux-kernel
  Cc: Christophe de Dinechin, Michael S . Tsirkin, Sean Christopherson,
	Yan Zhao, Alex Williamson, Jason Wang, Kevin Kevin,
	Vitaly Kuznetsov, Dr . David Alan Gilbert

On 09/01/20 15:57, Peter Xu wrote:
> Branch is here: https://github.com/xzpeter/linux/tree/kvm-dirty-ring
> (based on kvm/queue)
> 
> Please refer to either the previous cover letters, or documentation
> update in patch 12 for the big picture.  Previous posts:
> 
> V1: https://lore.kernel.org/kvm/20191129213505.18472-1-peterx@redhat.com
> V2: https://lore.kernel.org/kvm/20191221014938.58831-1-peterx@redhat.com
> 
> The major change in V3 is that we dropped the whole waitqueue and the
> global lock. With that, we have clean per-vcpu ring and no default
> ring any more.  The two kvmgt refactoring patches were also included
> to show the dependency of the works.
> 
> Patchset layout:
> 
> Patch 1-2:         Picked up from kvmgt refactoring
> Patch 3-6:         Small patches that are not directly related,
>                    (So can be acked/nacked/picked as standalone)
> Patch 7-11:        Prepares for the dirty ring interface
> Patch 12:          Major implementation
> Patch 13-14:       Quick follow-ups for patch 8
> Patch 15-21:       Test cases
> 
> V3 changelog:
> 
> - fail userspace writable maps on dirty ring ranges [Jason]
> - commit message fixups [Paolo]
> - change __x86_set_memory_region to return hva [Paolo]
> - cacheline align for indices [Paolo, Jason]
> - drop waitqueue, global lock, etc., include kvmgt rework patchset
> - take lock for __x86_set_memory_region() (otherwise it triggers a
>   lockdep in latest kvm/queue) [Paolo]
> - check KVM_DIRTY_LOG_PAGE_OFFSET in kvm_vm_ioctl_enable_dirty_log_ring
> - one more patch to drop x86_set_memory_region [Paolo]
> - one more patch to remove extra srcu usage in init_rmode_identity_map()
> - add some r-bs for Paolo
> 
> Please review, thanks.
> 
> Paolo Bonzini (1):
>   KVM: Move running VCPU from ARM to common code
> 
> Peter Xu (18):
>   KVM: Remove kvm_read_guest_atomic()
>   KVM: Add build-time error check on kvm_run size
>   KVM: X86: Change parameter for fast_page_fault tracepoint
>   KVM: X86: Don't take srcu lock in init_rmode_identity_map()
>   KVM: Cache as_id in kvm_memory_slot
>   KVM: X86: Drop x86_set_memory_region()
>   KVM: X86: Don't track dirty for KVM_SET_[TSS_ADDR|IDENTITY_MAP_ADDR]
>   KVM: Pass in kvm pointer into mark_page_dirty_in_slot()
>   KVM: X86: Implement ring-based dirty memory tracking
>   KVM: Make dirty ring exclusive to dirty bitmap log
>   KVM: Don't allocate dirty bitmap if dirty ring is enabled
>   KVM: selftests: Always clear dirty bitmap after iteration
>   KVM: selftests: Sync uapi/linux/kvm.h to tools/
>   KVM: selftests: Use a single binary for dirty/clear log test
>   KVM: selftests: Introduce after_vcpu_run hook for dirty log test
>   KVM: selftests: Add dirty ring buffer test
>   KVM: selftests: Let dirty_log_test async for dirty ring test
>   KVM: selftests: Add "-c" parameter to dirty log test
> 
> Yan Zhao (2):
>   vfio: introduce vfio_iova_rw to read/write a range of IOVAs
>   drm/i915/gvt: subsitute kvm_read/write_guest with vfio_iova_rw
> 
>  Documentation/virt/kvm/api.txt                |  96 ++++
>  arch/arm/include/asm/kvm_host.h               |   2 -
>  arch/arm64/include/asm/kvm_host.h             |   2 -
>  arch/x86/include/asm/kvm_host.h               |   7 +-
>  arch/x86/include/uapi/asm/kvm.h               |   1 +
>  arch/x86/kvm/Makefile                         |   3 +-
>  arch/x86/kvm/mmu/mmu.c                        |   6 +
>  arch/x86/kvm/mmutrace.h                       |   9 +-
>  arch/x86/kvm/svm.c                            |   3 +-
>  arch/x86/kvm/vmx/vmx.c                        |  86 ++--
>  arch/x86/kvm/x86.c                            |  43 +-
>  drivers/gpu/drm/i915/gvt/kvmgt.c              |  25 +-
>  drivers/vfio/vfio.c                           |  45 ++
>  drivers/vfio/vfio_iommu_type1.c               |  81 ++++
>  include/linux/kvm_dirty_ring.h                |  55 +++
>  include/linux/kvm_host.h                      |  37 +-
>  include/linux/vfio.h                          |   5 +
>  include/trace/events/kvm.h                    |  78 ++++
>  include/uapi/linux/kvm.h                      |  33 ++
>  tools/include/uapi/linux/kvm.h                |  38 ++
>  tools/testing/selftests/kvm/Makefile          |   2 -
>  .../selftests/kvm/clear_dirty_log_test.c      |   2 -
>  tools/testing/selftests/kvm/dirty_log_test.c  | 420 ++++++++++++++++--
>  .../testing/selftests/kvm/include/kvm_util.h  |   4 +
>  tools/testing/selftests/kvm/lib/kvm_util.c    |  72 +++
>  .../selftests/kvm/lib/kvm_util_internal.h     |   3 +
>  virt/kvm/arm/arch_timer.c                     |   2 +-
>  virt/kvm/arm/arm.c                            |  29 --
>  virt/kvm/arm/perf.c                           |   6 +-
>  virt/kvm/arm/vgic/vgic-mmio.c                 |  15 +-
>  virt/kvm/dirty_ring.c                         | 162 +++++++
>  virt/kvm/kvm_main.c                           | 215 +++++++--
>  32 files changed, 1379 insertions(+), 208 deletions(-)
>  create mode 100644 include/linux/kvm_dirty_ring.h
>  delete mode 100644 tools/testing/selftests/kvm/clear_dirty_log_test.c
>  create mode 100644 virt/kvm/dirty_ring.c
> 

Queued patches 3-6, 8-9, 11; thanks!

Paolo


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v3 12/21] KVM: X86: Implement ring-based dirty memory tracking
  2020-01-19  9:09       ` Paolo Bonzini
@ 2020-01-19 10:12         ` Michael S. Tsirkin
  2020-01-20  7:29           ` Peter Xu
  0 siblings, 1 reply; 82+ messages in thread
From: Michael S. Tsirkin @ 2020-01-19 10:12 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Peter Xu, kvm, linux-kernel, Christophe de Dinechin,
	Sean Christopherson, Yan Zhao, Alex Williamson, Jason Wang,
	Kevin Kevin, Vitaly Kuznetsov, Dr . David Alan Gilbert, Lei Cao

On Sun, Jan 19, 2020 at 10:09:53AM +0100, Paolo Bonzini wrote:
> On 09/01/20 20:15, Peter Xu wrote:
> > Regarding dropping the indices: I feel like it can be done, though we
> > probably need two extra bits for each GFN entry, for example:
> > 
> >   - Bit 0 of the GFN address to show whether this is a valid publish
> >     of dirty gfn
> > 
> >   - Bit 1 of the GFN address to show whether this is collected by the
> >     user
> 
> We can use bits 62 and 63 of the GFN.

If we are short on bits we can just use 1 bit. E.g. set if
userspace has collected the GFN.

> I think this can be done in a secure way.  Later in the thread you say:
> 
> > We simply check fetch_index (sorry I
> > meant this when I said reset_index, anyway it's the only index that we
> > expose to userspace) to make sure:
> > 
> >   reset_index <= fetch_index <= dirty_index
> 
> So this means that KVM_RESET_DIRTY_RINGS should only test the "collected
> by user" flag on dirty ring entries between reset_index and dirty_index.
> 
> Also I would make it
> 
>    00b (invalid GFN) ->
>      01b (valid gfn published by kernel, which is dirty) ->
>        1*b (gfn dirty page collected by userspace) ->
>          00b (gfn reset by kernel, so goes back to invalid gfn)
> That is, 10b and 11b are equivalent.  The kernel doesn't read that bit if
> userspace has collected the page.
> 
> Paolo


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v3 09/21] KVM: X86: Don't track dirty for KVM_SET_[TSS_ADDR|IDENTITY_MAP_ADDR]
  2020-01-19  9:01   ` Paolo Bonzini
@ 2020-01-20  6:45     ` Peter Xu
  0 siblings, 0 replies; 82+ messages in thread
From: Peter Xu @ 2020-01-20  6:45 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: kvm, linux-kernel, Christophe de Dinechin, Michael S . Tsirkin,
	Sean Christopherson, Yan Zhao, Alex Williamson, Jason Wang,
	Kevin Kevin, Vitaly Kuznetsov, Dr . David Alan Gilbert

On Sun, Jan 19, 2020 at 10:01:50AM +0100, Paolo Bonzini wrote:
> On 09/01/20 15:57, Peter Xu wrote:
> > -int __x86_set_memory_region(struct kvm *kvm, int id, gpa_t gpa, u32 size)
> > +/*
> > + * If `uaddr' is specified, `*uaddr' will be returned with the
> > + * userspace address that was just allocated.  `uaddr' is only
> > + * meaningful if the function returns zero, and `uaddr' will only be
> > + * valid when with either the slots_lock or with the SRCU read lock
> > + * held.  After we release the lock, the returned `uaddr' will be invalid.
> > + */
> 
> In practice the address is still protected by the refcount, isn't it?
> Only destroying the VM could invalidate it.

Yes I think so.  I wanted to make it clear that uaddr is temporary;
however, "will be invalid" could be too strong...  Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v3 12/21] KVM: X86: Implement ring-based dirty memory tracking
  2020-01-17  9:50       ` Michael S. Tsirkin
@ 2020-01-20  6:48         ` Peter Xu
  0 siblings, 0 replies; 82+ messages in thread
From: Peter Xu @ 2020-01-20  6:48 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: kvm, linux-kernel, Christophe de Dinechin, Paolo Bonzini,
	Sean Christopherson, Yan Zhao, Alex Williamson, Jason Wang,
	Kevin Kevin, Vitaly Kuznetsov, Dr . David Alan Gilbert, Lei Cao

On Fri, Jan 17, 2020 at 04:50:48AM -0500, Michael S. Tsirkin wrote:
> On Thu, Jan 16, 2020 at 11:27:03AM -0500, Peter Xu wrote:
> > On Thu, Jan 16, 2020 at 03:38:21AM -0500, Michael S. Tsirkin wrote:
> > > On Thu, Jan 09, 2020 at 09:57:20AM -0500, Peter Xu wrote:
> > > > +	/* If to map any writable page within dirty ring, fail it */
> > > > +	if ((kvm_page_in_dirty_ring(vcpu->kvm, vma->vm_pgoff) ||
> > > > +	     kvm_page_in_dirty_ring(vcpu->kvm, vma->vm_pgoff + pages - 1)) &&
> > > > +	    vma->vm_flags & VM_WRITE)
> > > > +		return -EINVAL;
> > > 
> > > Worth thinking about other flags. Do we want to force VM_SHARED?
> > > Disable VM_EXEC?
> > 
> > Makes sense to me.  I think it warrants a standalone patch since they
> > should apply to the whole set of per-vcpu mmaped regions rather than
> > only to the dirty ring buffers.
> > 
> > (Should include KVM_PIO_PAGE_OFFSET, KVM_COALESCED_MMIO_PAGE_OFFSET,
> >  KVM_S390_SIE_PAGE_OFFSET, kvm_run, and this new one)
> > 
> > Thanks,
> 
> 
> I don't think we can change UAPI for existing ones.
> Userspace might be setting these by mistake.

Right (especially for VM_EXEC)... I'll only check that for the new
pages then.  Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v3 12/21] KVM: X86: Implement ring-based dirty memory tracking
  2020-01-19 10:12         ` Michael S. Tsirkin
@ 2020-01-20  7:29           ` Peter Xu
  2020-01-20  7:47             ` Michael S. Tsirkin
  2020-01-21 10:24             ` Paolo Bonzini
  0 siblings, 2 replies; 82+ messages in thread
From: Peter Xu @ 2020-01-20  7:29 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Paolo Bonzini, kvm, linux-kernel, Christophe de Dinechin,
	Sean Christopherson, Yan Zhao, Alex Williamson, Jason Wang,
	Kevin Kevin, Vitaly Kuznetsov, Dr . David Alan Gilbert, Lei Cao

On Sun, Jan 19, 2020 at 05:12:35AM -0500, Michael S. Tsirkin wrote:
> On Sun, Jan 19, 2020 at 10:09:53AM +0100, Paolo Bonzini wrote:
> > On 09/01/20 20:15, Peter Xu wrote:
> > > Regarding dropping the indices: I feel like it can be done, though we
> > > probably need two extra bits for each GFN entry, for example:
> > > 
> > >   - Bit 0 of the GFN address to show whether this is a valid publish
> > >     of dirty gfn
> > > 
> > >   - Bit 1 of the GFN address to show whether this is collected by the
> > >     user
> > 
> > We can use bits 62 and 63 of the GFN.
> 
> If we are short on bits we can just use 1 bit. E.g. set if
> userspace has collected the GFN.

I'm still unsure whether we can use only one bit for this.  Say,
otherwise how does userspace know the entry is valid?  For
example, the entry with all zeros ({.slot = 0, gfn = 0}) could be
recognized as a valid dirty page on slot 0 gfn 0, even if it's
actually an unused entry.

> 
> > I think this can be done in a secure way.  Later in the thread you say:
> > 
> > > We simply check fetch_index (sorry I
> > > meant this when I said reset_index, anyway it's the only index that we
> > > expose to userspace) to make sure:
> > > 
> > >   reset_index <= fetch_index <= dirty_index
> > 
> > So this means that KVM_RESET_DIRTY_RINGS should only test the "collected
> > by user" flag on dirty ring entries between reset_index and dirty_index.
> > 
> > Also I would make it
> > 
> >    00b (invalid GFN) ->
> >      01b (valid gfn published by kernel, which is dirty) ->
> >        1*b (gfn dirty page collected by userspace) ->
> >          00b (gfn reset by kernel, so goes back to invalid gfn)
> > That is 10b and 11b are equivalent.  The kernel doesn't read that bit if
> > userspace has collected the page.

Yes "1*b" is good too (IMHO as long as we can define three states for
an entry).  However do you want me to change to that?  Note that I
still think we need to read the rest of the field (in this case,
"slot" and "gfn") besides the two bits to do re-protect.  Should we
trust that unconditionally if writable?
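
(For context, a rough sketch of the reset side this implies; the entry
layout and the helpers below are hypothetical, only the flow matters:)

        /* Kernel side of KVM_RESET_DIRTY_RINGS, illustrative only */
        while (reset_index != dirty_index) {
                struct dirty_gfn_entry *e = &ring[reset_index % size];
                u64 val = READ_ONCE(e->gfn);

                if (!(val & DIRTY_GFN_F_COLLECTED))
                        break;  /* userspace hasn't collected it yet */
                /* slot/gfn live in a writable page: validate, don't trust */
                if (slot_gfn_valid(kvm, e->slot, val & GFN_MASK))
                        reprotect_gfn(kvm, e->slot, val & GFN_MASK);
                WRITE_ONCE(e->gfn, 0);  /* back to invalid, i.e. 00b */
                reset_index++;
        }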

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v3 12/21] KVM: X86: Implement ring-based dirty memory tracking
  2020-01-20  7:29           ` Peter Xu
@ 2020-01-20  7:47             ` Michael S. Tsirkin
  2020-01-21  8:29               ` Peter Xu
  2020-01-21 10:24             ` Paolo Bonzini
  1 sibling, 1 reply; 82+ messages in thread
From: Michael S. Tsirkin @ 2020-01-20  7:47 UTC (permalink / raw)
  To: Peter Xu
  Cc: Paolo Bonzini, kvm, linux-kernel, Christophe de Dinechin,
	Sean Christopherson, Yan Zhao, Alex Williamson, Jason Wang,
	Kevin Kevin, Vitaly Kuznetsov, Dr . David Alan Gilbert, Lei Cao

On Mon, Jan 20, 2020 at 03:29:15PM +0800, Peter Xu wrote:
> On Sun, Jan 19, 2020 at 05:12:35AM -0500, Michael S. Tsirkin wrote:
> > On Sun, Jan 19, 2020 at 10:09:53AM +0100, Paolo Bonzini wrote:
> > > On 09/01/20 20:15, Peter Xu wrote:
> > > > Regarding dropping the indices: I feel like it can be done, though we
> > > > probably need two extra bits for each GFN entry, for example:
> > > > 
> > > >   - Bit 0 of the GFN address to show whether this is a valid publish
> > > >     of dirty gfn
> > > > 
> > > >   - Bit 1 of the GFN address to show whether this is collected by the
> > > >     user
> > > 
> > > We can use bits 62 and 63 of the GFN.
> > 
> > If we are short on bits we can just use 1 bit. E.g. set if
> > userspace has collected the GFN.
> 
> I'm still unsure whether we can use only one bit for this.  Say,
> otherwise how does userspace know the entry is valid?  For
> example, the entry with all zeros ({.slot = 0, gfn = 0}) could be
> recognized as a valid dirty page on slot 0 gfn 0, even if it's
> actually an unused entry.

So I guess the reverse: valid entry has bit set, userspace sets it to
0 when it collects it?
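
(Roughly this on the userspace side, then; an illustrative sketch, with
the flag and mask names invented for the example:)

        /* Hypothetical userspace harvest loop */
        for (;;) {
                u64 val = READ_ONCE(ring[fetch_index % size].gfn);

                if (!(val & DIRTY_GFN_F_VALID))
                        break;                  /* nothing more published */
                collect_page(val & GFN_MASK);   /* track page for migration */
                WRITE_ONCE(ring[fetch_index % size].gfn,
                           val & ~DIRTY_GFN_F_VALID);
                fetch_index++;
        }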


> > 
> > > I think this can be done in a secure way.  Later in the thread you say:
> > > 
> > > > We simply check fetch_index (sorry I
> > > > meant this when I said reset_index, anyway it's the only index that we
> > > > expose to userspace) to make sure:
> > > > 
> > > >   reset_index <= fetch_index <= dirty_index
> > > 
> > > So this means that KVM_RESET_DIRTY_RINGS should only test the "collected
> > > by user" flag on dirty ring entries between reset_index and dirty_index.
> > > 
> > > Also I would make it
> > > 
> > >    00b (invalid GFN) ->
> > >      01b (valid gfn published by kernel, which is dirty) ->
> > >        1*b (gfn dirty page collected by userspace) ->
> > >          00b (gfn reset by kernel, so goes back to invalid gfn)
> > > That is, 10b and 11b are equivalent.  The kernel doesn't read that bit if
> > > userspace has collected the page.
> 
> Yes "1*b" is good too (IMHO as long as we can define three states for
> an entry).  However do you want me to change to that?  Note that I
> still think we need to read the rest of the field (in this case,
> "slot" and "gfn") besides the two bits to do re-protect.  Should we
> trust that unconditionally if writable?
> 
> Thanks,
> 
> -- 
> Peter Xu


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v3 12/21] KVM: X86: Implement ring-based dirty memory tracking
  2020-01-20  7:47             ` Michael S. Tsirkin
@ 2020-01-21  8:29               ` Peter Xu
  2020-01-21 10:25                 ` Paolo Bonzini
  0 siblings, 1 reply; 82+ messages in thread
From: Peter Xu @ 2020-01-21  8:29 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Paolo Bonzini, kvm, linux-kernel, Christophe de Dinechin,
	Sean Christopherson, Yan Zhao, Alex Williamson, Jason Wang,
	Kevin Kevin, Vitaly Kuznetsov, Dr . David Alan Gilbert, Lei Cao

On Mon, Jan 20, 2020 at 02:47:46AM -0500, Michael S. Tsirkin wrote:
> On Mon, Jan 20, 2020 at 03:29:15PM +0800, Peter Xu wrote:
> > On Sun, Jan 19, 2020 at 05:12:35AM -0500, Michael S. Tsirkin wrote:
> > > On Sun, Jan 19, 2020 at 10:09:53AM +0100, Paolo Bonzini wrote:
> > > > On 09/01/20 20:15, Peter Xu wrote:
> > > > > Regarding dropping the indices: I feel like it can be done, though we
> > > > > probably need two extra bits for each GFN entry, for example:
> > > > > 
> > > > >   - Bit 0 of the GFN address to show whether this is a valid publish
> > > > >     of dirty gfn
> > > > > 
> > > > >   - Bit 1 of the GFN address to show whether this is collected by the
> > > > >     user
> > > > 
> > > > We can use bits 62 and 63 of the GFN.
> > > 
> > > If we are short on bits we can just use 1 bit. E.g. set if
> > > userspace has collected the GFN.
> > 
> > I'm still unsure whether we can use only one bit for this.  Say,
> > otherwise how does userspace know the entry is valid?  For
> > example, the entry with all zeros ({.slot = 0, gfn = 0}) could be
> > recognized as a valid dirty page on slot 0 gfn 0, even if it's
> > actually an unused entry.
> 
> So I guess the reverse: valid entry has bit set, userspace sets it to
> 0 when it collects it?

Right, this seems to work.

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v3 12/21] KVM: X86: Implement ring-based dirty memory tracking
  2020-01-20  7:29           ` Peter Xu
  2020-01-20  7:47             ` Michael S. Tsirkin
@ 2020-01-21 10:24             ` Paolo Bonzini
  1 sibling, 0 replies; 82+ messages in thread
From: Paolo Bonzini @ 2020-01-21 10:24 UTC (permalink / raw)
  To: Peter Xu, Michael S. Tsirkin
  Cc: kvm, linux-kernel, Christophe de Dinechin, Sean Christopherson,
	Yan Zhao, Alex Williamson, Jason Wang, Kevin Kevin,
	Vitaly Kuznetsov, Dr . David Alan Gilbert, Lei Cao

On 20/01/20 08:29, Peter Xu wrote:
>>>
>>>    00b (invalid GFN) ->
>>>      01b (valid gfn published by kernel, which is dirty) ->
>>>        1*b (gfn dirty page collected by userspace) ->
>>>          00b (gfn reset by kernel, so goes back to invalid gfn)
>>> That is, 10b and 11b are equivalent.  The kernel doesn't read that bit if
>>> userspace has collected the page.
> Yes "1*b" is good too (IMHO as long as we can define three states for
> an entry).  However do you want me to change to that?  Note that I
> still think we need to read the rest of the field (in this case,
> "slot" and "gfn") besides the two bits to do re-protect.  Should we
> trust that unconditionally if writable?

I think that userspace would only hurt itself if it does so.  As long as
the kernel has a trusted copy of the indices, it's okay.

We have plenty of bits: x86 limits GFNs to 40 bits (52-bit maximum
physical address).  And even on other architectures, GFNs are limited
to the address space size minus the page shift (64 - 12 = 52 bits).
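
(Concretely, something like this would leave ample room; the macro name
is illustrative:)

        /* Any GFN fits in the low 52 bits of a 64-bit entry... */
        #define GFN_MASK        ((1ULL << 52) - 1)
        /* ...which leaves bits 52-63 free for status flags. */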

Paolo


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v3 12/21] KVM: X86: Implement ring-based dirty memory tracking
  2020-01-21  8:29               ` Peter Xu
@ 2020-01-21 10:25                 ` Paolo Bonzini
  0 siblings, 0 replies; 82+ messages in thread
From: Paolo Bonzini @ 2020-01-21 10:25 UTC (permalink / raw)
  To: Peter Xu, Michael S. Tsirkin
  Cc: kvm, linux-kernel, Christophe de Dinechin, Sean Christopherson,
	Yan Zhao, Alex Williamson, Jason Wang, Kevin Kevin,
	Vitaly Kuznetsov, Dr . David Alan Gilbert, Lei Cao

On 21/01/20 09:29, Peter Xu wrote:
>>>> If we are short on bits we can just use 1 bit. E.g. set if
>>>> userspace has collected the GFN.
>>> I'm still unsure whether we can use only one bit for this.  Say,
>>> otherwise how does the userspace knows the entry is valid?  For
>>> example, the entry with all zeros ({.slot = 0, gfn = 0}) could be
>>> recognized as a valid dirty page on slot 0 gfn 0, even if it's
>>> actually an unused entry.
>> So I guess the reverse: valid entry has bit set, userspace sets it to
>> 0 when it collects it?
> Right, this seems to work.

Yes, that's okay too.

Paolo


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v3 09/21] KVM: X86: Don't track dirty for KVM_SET_[TSS_ADDR|IDENTITY_MAP_ADDR]
  2020-01-09 14:57 ` [PATCH v3 09/21] KVM: X86: Don't track dirty for KVM_SET_[TSS_ADDR|IDENTITY_MAP_ADDR] Peter Xu
  2020-01-19  9:01   ` Paolo Bonzini
@ 2020-01-21 15:56   ` Sean Christopherson
  2020-01-21 16:14     ` Paolo Bonzini
  2020-01-28  5:50     ` Peter Xu
  1 sibling, 2 replies; 82+ messages in thread
From: Sean Christopherson @ 2020-01-21 15:56 UTC (permalink / raw)
  To: Peter Xu
  Cc: kvm, linux-kernel, Christophe de Dinechin, Michael S . Tsirkin,
	Paolo Bonzini, Yan Zhao, Alex Williamson, Jason Wang,
	Kevin Kevin, Vitaly Kuznetsov, Dr . David Alan Gilbert

On Thu, Jan 09, 2020 at 09:57:17AM -0500, Peter Xu wrote:
> Originally, we have three code paths that can dirty a page without
> vcpu context for X86:
> 
>   - init_rmode_identity_map
>   - init_rmode_tss
>   - kvmgt_rw_gpa
> 
> init_rmode_identity_map and init_rmode_tss will be setup on
> destination VM no matter what (and the guest cannot even see them), so
> it does not make sense to track them at all.
> 
> To do this, allow __x86_set_memory_region() to return the userspace
> address that just allocated to the caller.  Then in both of the
> functions we directly write to the userspace address instead of
> calling kvm_write_*() APIs.  We need to make sure that we have the
> slots_lock held when accessing the userspace address.
> 
> Another trivial change is that we don't need to explicitly clear the
> identity page table root in init_rmode_identity_map() because no
> matter what we'll write to the whole page with 4M huge page entries.
> 
> Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  arch/x86/include/asm/kvm_host.h |  3 +-
>  arch/x86/kvm/svm.c              |  3 +-
>  arch/x86/kvm/vmx/vmx.c          | 68 ++++++++++++++++-----------------
>  arch/x86/kvm/x86.c              | 18 +++++++--
>  4 files changed, 51 insertions(+), 41 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index eb6673c7d2e3..f536d139b3d2 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1618,7 +1618,8 @@ void __kvm_request_immediate_exit(struct kvm_vcpu *vcpu);
>  
>  int kvm_is_in_guest(void);
>  
> -int __x86_set_memory_region(struct kvm *kvm, int id, gpa_t gpa, u32 size);
> +int __x86_set_memory_region(struct kvm *kvm, int id, gpa_t gpa, u32 size,
> +			    unsigned long *uaddr);

No need for a new param, just return a "void __user *" (or "void *" if the
__user part requires lots of casting) and use ERR_PTR() to encode errors in
the return value.  I.e. return the userspace address.

The refactoring to return the address should be done in a separate patch as
prep work for the move to __copy_to_user().
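
(E.g. the caller side would then look roughly like this; a sketch of
the suggestion, not the actual patch:)

        void __user *uaddr;

        uaddr = __x86_set_memory_region(kvm, TSS_PRIVATE_MEMSLOT, addr,
                                        PAGE_SIZE * 3);
        if (IS_ERR(uaddr))
                return PTR_ERR(uaddr);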

>  bool kvm_vcpu_is_reset_bsp(struct kvm_vcpu *vcpu);
>  bool kvm_vcpu_is_bsp(struct kvm_vcpu *vcpu);
>  
> diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
> index 8f1b715dfde8..03a344ce7b66 100644
> --- a/arch/x86/kvm/svm.c
> +++ b/arch/x86/kvm/svm.c
> @@ -1698,7 +1698,8 @@ static int avic_init_access_page(struct kvm_vcpu *vcpu)
>  	ret = __x86_set_memory_region(kvm,
>  				      APIC_ACCESS_PAGE_PRIVATE_MEMSLOT,
>  				      APIC_DEFAULT_PHYS_BASE,
> -				      PAGE_SIZE);
> +				      PAGE_SIZE,
> +				      NULL);
>  	if (ret)
>  		goto out;
>  
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 7e3d370209e0..62175a246bcc 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -3441,34 +3441,28 @@ static bool guest_state_valid(struct kvm_vcpu *vcpu)
>  	return true;
>  }
>  
> -static int init_rmode_tss(struct kvm *kvm)
> +static int init_rmode_tss(struct kvm *kvm, unsigned long *uaddr)

uaddr is not a pointer to an unsigned long, it's a pointer to a TSS.  Given
that it's dereferenced as a "void __user *", it's probably best passed as
exactly that.

This code also needs to be tested by doing unrestricted_guest=0 when
loading kvm_intel, because it's obviously broken.  __x86_set_memory_region()
takes an "unsigned long *", interpreted as a "pointer to a usersepace
address", i.e. a "void __user **".  But the callers are treating the param
as a "unsigned long in userpace", e.g. init_rmode_identity_map() declares
uaddr as an "unsigned long *", when really it should be declaring a
straight "unsigned long" and passing "&uaddr".  The only thing that saves
KVM from dereferencing a bad pointer in __x86_set_memory_region() is that
uaddr is initialized to NULL.

>  {
> -	gfn_t fn;
> +	const void *zero_page = (const void *) __va(page_to_phys(ZERO_PAGE(0)));
>  	u16 data = 0;
>  	int idx, r;
>  
> -	idx = srcu_read_lock(&kvm->srcu);
> -	fn = to_kvm_vmx(kvm)->tss_addr >> PAGE_SHIFT;
> -	r = kvm_clear_guest_page(kvm, fn, 0, PAGE_SIZE);
> -	if (r < 0)
> -		goto out;
> +	for (idx = 0; idx < 3; idx++) {
> +		r = __copy_to_user((void __user *)uaddr + PAGE_SIZE * idx,
> +				   zero_page, PAGE_SIZE);
> +		if (r)
> +			return -EFAULT;
> +	}
> +
>  	data = TSS_BASE_SIZE + TSS_REDIRECTION_SIZE;
> -	r = kvm_write_guest_page(kvm, fn++, &data,
> -			TSS_IOPB_BASE_OFFSET, sizeof(u16));
> -	if (r < 0)
> -		goto out;
> -	r = kvm_clear_guest_page(kvm, fn++, 0, PAGE_SIZE);
> -	if (r < 0)
> -		goto out;
> -	r = kvm_clear_guest_page(kvm, fn, 0, PAGE_SIZE);
> -	if (r < 0)
> -		goto out;
> +	r = __copy_to_user((void __user *)uaddr + TSS_IOPB_BASE_OFFSET,
> +			   &data, sizeof(data));
> +	if (r)
> +		return -EFAULT;
> +
>  	data = ~0;
> -	r = kvm_write_guest_page(kvm, fn, &data,
> -				 RMODE_TSS_SIZE - 2 * PAGE_SIZE - 1,
> -				 sizeof(u8));
> -out:
> -	srcu_read_unlock(&kvm->srcu, idx);
> +	r = __copy_to_user((void __user *)uaddr - 1, &data, sizeof(data));
> +
>  	return r;

Why not "return __copy_to_user();"?

>  }
>  
> @@ -3478,6 +3472,7 @@ static int init_rmode_identity_map(struct kvm *kvm)
>  	int i, r = 0;
>  	kvm_pfn_t identity_map_pfn;
>  	u32 tmp;
> +	unsigned long *uaddr = NULL;

Again, not a pointer to an unsigned long.

>  	/* Protect kvm_vmx->ept_identity_pagetable_done. */
>  	mutex_lock(&kvm->slots_lock);
> @@ -3490,21 +3485,21 @@ static int init_rmode_identity_map(struct kvm *kvm)
>  	identity_map_pfn = kvm_vmx->ept_identity_map_addr >> PAGE_SHIFT;
>  
>  	r = __x86_set_memory_region(kvm, IDENTITY_PAGETABLE_PRIVATE_MEMSLOT,
> -				    kvm_vmx->ept_identity_map_addr, PAGE_SIZE);
> +				    kvm_vmx->ept_identity_map_addr, PAGE_SIZE,
> +				    uaddr);
>  	if (r < 0)
>  		goto out;
>  
> -	r = kvm_clear_guest_page(kvm, identity_map_pfn, 0, PAGE_SIZE);
> -	if (r < 0)
> -		goto out;
>  	/* Set up identity-mapping pagetable for EPT in real mode */
>  	for (i = 0; i < PT32_ENT_PER_PAGE; i++) {
>  		tmp = (i << 22) + (_PAGE_PRESENT | _PAGE_RW | _PAGE_USER |
>  			_PAGE_ACCESSED | _PAGE_DIRTY | _PAGE_PSE);
> -		r = kvm_write_guest_page(kvm, identity_map_pfn,
> -				&tmp, i * sizeof(tmp), sizeof(tmp));
> -		if (r < 0)
> +		r = __copy_to_user((void __user *)uaddr + i * sizeof(tmp),
> +				   &tmp, sizeof(tmp));
> +		if (r) {
> +			r = -EFAULT;
>  			goto out;
> +		}
>  	}
>  	kvm_vmx->ept_identity_pagetable_done = true;
>  
> @@ -3537,7 +3532,7 @@ static int alloc_apic_access_page(struct kvm *kvm)
>  	if (kvm->arch.apic_access_page_done)
>  		goto out;
>  	r = __x86_set_memory_region(kvm, APIC_ACCESS_PAGE_PRIVATE_MEMSLOT,
> -				    APIC_DEFAULT_PHYS_BASE, PAGE_SIZE);
> +				    APIC_DEFAULT_PHYS_BASE, PAGE_SIZE, NULL);
>  	if (r)
>  		goto out;
>  
> @@ -4478,19 +4473,22 @@ static int vmx_interrupt_allowed(struct kvm_vcpu *vcpu)
>  static int vmx_set_tss_addr(struct kvm *kvm, unsigned int addr)
>  {
>  	int ret;
> +	unsigned long *uaddr = NULL;
>  
>  	if (enable_unrestricted_guest)
>  		return 0;
>  
>  	mutex_lock(&kvm->slots_lock);
>  	ret = __x86_set_memory_region(kvm, TSS_PRIVATE_MEMSLOT, addr,
> -				      PAGE_SIZE * 3);
> -	mutex_unlock(&kvm->slots_lock);
> -
> +				      PAGE_SIZE * 3, uaddr);
>  	if (ret)
> -		return ret;
> +		goto out;
> +
>  	to_kvm_vmx(kvm)->tss_addr = addr;
> -	return init_rmode_tss(kvm);
> +	ret = init_rmode_tss(kvm, uaddr);
> +out:
> +	mutex_unlock(&kvm->slots_lock);

Unnecessary, see below.

> +	return ret;
>  }
>  
>  static int vmx_set_identity_map_addr(struct kvm *kvm, u64 ident_addr)
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index c4d3972dcd14..ff97782b3919 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -9584,7 +9584,15 @@ void kvm_arch_sync_events(struct kvm *kvm)
>  	kvm_free_pit(kvm);
>  }
>  
> -int __x86_set_memory_region(struct kvm *kvm, int id, gpa_t gpa, u32 size)
> +/*
> + * If `uaddr' is specified, `*uaddr' will be returned with the
> + * userspace address that was just allocated.  `uaddr' is only
> + * meaningful if the function returns zero, and `uaddr' will only be
> + * valid when with either the slots_lock or with the SRCU read lock
> + * held.  After we release the lock, the returned `uaddr' will be invalid.

This is all incorrect.  Neither of those locks has any bearing on the
validity of the hva.  slots_lock does as the name suggests and prevents
concurrent writes to the memslots.  The SRCU lock ensures the implicit
memslots lookup in kvm_clear_guest_page() won't result in a use-after-free
due to dereferencing old memslots.

Neither of those has anything to do with the userspace address, they're
both fully tied to KVM's gfn->hva lookup.  As Paolo pointed out, KVM's
mapping is instead tied to the lifecycle of the VM.  Note, even *that* has
no bearing on the validity of the mapping or address as KVM only increments
mm_count, not mm_users, i.e. guarantees the mm struct itself won't be freed
but doesn't ensure the vmas or associated page tables are valid.

Which is the entire point of using __copy_{to,from}_user(), as they
gracefully handle the scenario where the process has no valid mapping
and/or translation for the address.

> + */
> +int __x86_set_memory_region(struct kvm *kvm, int id, gpa_t gpa, u32 size,
> +			    unsigned long *uaddr)
>  {
>  	int i, r;
>  	unsigned long hva;

Note, hva is a straight "unsigned long".

> @@ -9608,6 +9616,8 @@ int __x86_set_memory_region(struct kvm *kvm, int id, gpa_t gpa, u32 size)
>  			      MAP_SHARED | MAP_ANONYMOUS, 0);
>  		if (IS_ERR((void *)hva))
>  			return PTR_ERR((void *)hva);
> +		if (uaddr)
> +			*uaddr = hva;
>  	} else {
>  		if (!slot->npages)
>  			return 0;

@uaddr should be set to zero here.  Actually returning the address as a void *
will force this case to be handled correctly.

> @@ -9651,10 +9661,10 @@ void kvm_arch_destroy_vm(struct kvm *kvm)
>  		 */
>  		mutex_lock(&kvm->slots_lock);
>  		__x86_set_memory_region(kvm, APIC_ACCESS_PAGE_PRIVATE_MEMSLOT,
> -					0, 0);
> +					0, 0, NULL);
>  		__x86_set_memory_region(kvm, IDENTITY_PAGETABLE_PRIVATE_MEMSLOT,
> -					0, 0);
> -		__x86_set_memory_region(kvm, TSS_PRIVATE_MEMSLOT, 0, 0);
> +					0, 0, NULL);
> +		__x86_set_memory_region(kvm, TSS_PRIVATE_MEMSLOT, 0, 0, NULL);
>  		mutex_unlock(&kvm->slots_lock);
>  	}
>  	if (kvm_x86_ops->vm_destroy)
> -- 
> 2.24.1
> 

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v3 09/21] KVM: X86: Don't track dirty for KVM_SET_[TSS_ADDR|IDENTITY_MAP_ADDR]
  2020-01-21 15:56   ` Sean Christopherson
@ 2020-01-21 16:14     ` Paolo Bonzini
  2020-01-28  5:50     ` Peter Xu
  1 sibling, 0 replies; 82+ messages in thread
From: Paolo Bonzini @ 2020-01-21 16:14 UTC (permalink / raw)
  To: Sean Christopherson, Peter Xu
  Cc: kvm, linux-kernel, Christophe de Dinechin, Michael S . Tsirkin,
	Yan Zhao, Alex Williamson, Jason Wang, Kevin Kevin,
	Vitaly Kuznetsov, Dr . David Alan Gilbert

On 21/01/20 16:56, Sean Christopherson wrote:
> This code also needs to be tested by doing unrestricted_guest=0 when
> loading kvm_intel, because it's obviously broken.

... as I had just found out after starting tests on kvm/queue.  Unqueued
this patch.

Paolo

> __x86_set_memory_region()
> takes an "unsigned long *", interpreted as a "pointer to a userspace
> address", i.e. a "void __user **".  But the callers are treating the param
> as an "unsigned long in userspace", e.g. init_rmode_identity_map() declares
> uaddr as an "unsigned long *", when really it should be declaring a
> straight "unsigned long" and passing "&uaddr".  The only thing that saves
> KVM from dereferencing a bad pointer in __x86_set_memory_region() is that
> uaddr is initialized to NULL.


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v3 09/21] KVM: X86: Don't track dirty for KVM_SET_[TSS_ADDR|IDENTITY_MAP_ADDR]
  2020-01-21 15:56   ` Sean Christopherson
  2020-01-21 16:14     ` Paolo Bonzini
@ 2020-01-28  5:50     ` Peter Xu
  2020-01-28 18:24       ` Sean Christopherson
  1 sibling, 1 reply; 82+ messages in thread
From: Peter Xu @ 2020-01-28  5:50 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: kvm, linux-kernel, Christophe de Dinechin, Michael S . Tsirkin,
	Paolo Bonzini, Yan Zhao, Alex Williamson, Jason Wang,
	Kevin Kevin, Vitaly Kuznetsov, Dr . David Alan Gilbert

On Tue, Jan 21, 2020 at 07:56:57AM -0800, Sean Christopherson wrote:
> On Thu, Jan 09, 2020 at 09:57:17AM -0500, Peter Xu wrote:
> > Originally, we have three code paths that can dirty a page without
> > vcpu context for X86:
> > 
> >   - init_rmode_identity_map
> >   - init_rmode_tss
> >   - kvmgt_rw_gpa
> > 
> > init_rmode_identity_map and init_rmode_tss will be setup on
> > destination VM no matter what (and the guest cannot even see them), so
> > it does not make sense to track them at all.
> > 
> > To do this, allow __x86_set_memory_region() to return the userspace
> > address that just allocated to the caller.  Then in both of the
> > functions we directly write to the userspace address instead of
> > calling kvm_write_*() APIs.  We need to make sure that we have the
> > slots_lock held when accessing the userspace address.
> > 
> > Another trivial change is that we don't need to explicitly clear the
> > identity page table root in init_rmode_identity_map() because no
> > matter what we'll write to the whole page with 4M huge page entries.
> > 
> > Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> >  arch/x86/include/asm/kvm_host.h |  3 +-
> >  arch/x86/kvm/svm.c              |  3 +-
> >  arch/x86/kvm/vmx/vmx.c          | 68 ++++++++++++++++-----------------
> >  arch/x86/kvm/x86.c              | 18 +++++++--
> >  4 files changed, 51 insertions(+), 41 deletions(-)
> > 
> > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > index eb6673c7d2e3..f536d139b3d2 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -1618,7 +1618,8 @@ void __kvm_request_immediate_exit(struct kvm_vcpu *vcpu);
> >  
> >  int kvm_is_in_guest(void);
> >  
> > -int __x86_set_memory_region(struct kvm *kvm, int id, gpa_t gpa, u32 size);
> > +int __x86_set_memory_region(struct kvm *kvm, int id, gpa_t gpa, u32 size,
> > +			    unsigned long *uaddr);
> 
> No need for a new param, just return a "void __user *" (or "void *" if the
> __user part requires lots of casting) and use ERR_PTR() to encode errors in
> the return value.  I.e. return the userspace address.
> 
> The refactoring to return the address should be done in a separate patch as
> prep work for the move to __copy_to_user().

Yes this sounds cleaner, will do.

> 
> >  bool kvm_vcpu_is_reset_bsp(struct kvm_vcpu *vcpu);
> >  bool kvm_vcpu_is_bsp(struct kvm_vcpu *vcpu);
> >  
> > diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
> > index 8f1b715dfde8..03a344ce7b66 100644
> > --- a/arch/x86/kvm/svm.c
> > +++ b/arch/x86/kvm/svm.c
> > @@ -1698,7 +1698,8 @@ static int avic_init_access_page(struct kvm_vcpu *vcpu)
> >  	ret = __x86_set_memory_region(kvm,
> >  				      APIC_ACCESS_PAGE_PRIVATE_MEMSLOT,
> >  				      APIC_DEFAULT_PHYS_BASE,
> > -				      PAGE_SIZE);
> > +				      PAGE_SIZE,
> > +				      NULL);
> >  	if (ret)
> >  		goto out;
> >  
> > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> > index 7e3d370209e0..62175a246bcc 100644
> > --- a/arch/x86/kvm/vmx/vmx.c
> > +++ b/arch/x86/kvm/vmx/vmx.c
> > @@ -3441,34 +3441,28 @@ static bool guest_state_valid(struct kvm_vcpu *vcpu)
> >  	return true;
> >  }
> >  
> > -static int init_rmode_tss(struct kvm *kvm)
> > +static int init_rmode_tss(struct kvm *kvm, unsigned long *uaddr)
> 
> uaddr is not a pointer to an unsigned long, it's a pointer to a TSS.  Given
> that it's dereferenced as a "void __user *", it's probably best passed as
> exactly that.
> 
> This code also needs to be tested by doing unrestricted_guest=0 when
> loading kvm_intel, because it's obviously broken.  __x86_set_memory_region()
> takes an "unsigned long *", interpreted as a "pointer to a userspace
> address", i.e. a "void __user **".  But the callers are treating the param
> as an "unsigned long in userspace", e.g. init_rmode_identity_map() declares
> uaddr as an "unsigned long *", when really it should be declaring a
> straight "unsigned long" and passing "&uaddr".  The only thing that saves
> KVM from dereferencing a bad pointer in __x86_set_memory_region() is that
> uaddr is initialized to NULL.

Yes it's broken.  Thanks very much for figuring it out.  I'll test
unrestricted_guest=N.
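
(I.e. the minimal fix on the caller side, keeping the current
signature, is to declare a plain unsigned long and pass its address:)

        unsigned long uaddr = 0;

        ret = __x86_set_memory_region(kvm, TSS_PRIVATE_MEMSLOT, addr,
                                      PAGE_SIZE * 3, &uaddr);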

> 
> >  {
> > -	gfn_t fn;
> > +	const void *zero_page = (const void *) __va(page_to_phys(ZERO_PAGE(0)));
> >  	u16 data = 0;
> >  	int idx, r;
> >  
> > -	idx = srcu_read_lock(&kvm->srcu);
> > -	fn = to_kvm_vmx(kvm)->tss_addr >> PAGE_SHIFT;
> > -	r = kvm_clear_guest_page(kvm, fn, 0, PAGE_SIZE);
> > -	if (r < 0)
> > -		goto out;
> > +	for (idx = 0; idx < 3; idx++) {
> > +		r = __copy_to_user((void __user *)uaddr + PAGE_SIZE * idx,
> > +				   zero_page, PAGE_SIZE);
> > +		if (r)
> > +			return -EFAULT;
> > +	}
> > +
> >  	data = TSS_BASE_SIZE + TSS_REDIRECTION_SIZE;
> > -	r = kvm_write_guest_page(kvm, fn++, &data,
> > -			TSS_IOPB_BASE_OFFSET, sizeof(u16));
> > -	if (r < 0)
> > -		goto out;
> > -	r = kvm_clear_guest_page(kvm, fn++, 0, PAGE_SIZE);
> > -	if (r < 0)
> > -		goto out;
> > -	r = kvm_clear_guest_page(kvm, fn, 0, PAGE_SIZE);
> > -	if (r < 0)
> > -		goto out;
> > +	r = __copy_to_user((void __user *)uaddr + TSS_IOPB_BASE_OFFSET,
> > +			   &data, sizeof(data));
> > +	if (r)
> > +		return -EFAULT;
> > +
> >  	data = ~0;
> > -	r = kvm_write_guest_page(kvm, fn, &data,
> > -				 RMODE_TSS_SIZE - 2 * PAGE_SIZE - 1,
> > -				 sizeof(u8));
> > -out:
> > -	srcu_read_unlock(&kvm->srcu, idx);
> > +	r = __copy_to_user((void __user *)uaddr - 1, &data, sizeof(data));
> > +
> >  	return r;
> 
> Why not "return __copy_to_user();"?

Sure.

> 
> >  }
> >  
> > @@ -3478,6 +3472,7 @@ static int init_rmode_identity_map(struct kvm *kvm)
> >  	int i, r = 0;
> >  	kvm_pfn_t identity_map_pfn;
> >  	u32 tmp;
> > +	unsigned long *uaddr = NULL;
> 
> Again, not a pointer to an unsigned long.
> 
> >  	/* Protect kvm_vmx->ept_identity_pagetable_done. */
> >  	mutex_lock(&kvm->slots_lock);
> > @@ -3490,21 +3485,21 @@ static int init_rmode_identity_map(struct kvm *kvm)
> >  	identity_map_pfn = kvm_vmx->ept_identity_map_addr >> PAGE_SHIFT;
> >  
> >  	r = __x86_set_memory_region(kvm, IDENTITY_PAGETABLE_PRIVATE_MEMSLOT,
> > -				    kvm_vmx->ept_identity_map_addr, PAGE_SIZE);
> > +				    kvm_vmx->ept_identity_map_addr, PAGE_SIZE,
> > +				    uaddr);
> >  	if (r < 0)
> >  		goto out;
> >  
> > -	r = kvm_clear_guest_page(kvm, identity_map_pfn, 0, PAGE_SIZE);
> > -	if (r < 0)
> > -		goto out;
> >  	/* Set up identity-mapping pagetable for EPT in real mode */
> >  	for (i = 0; i < PT32_ENT_PER_PAGE; i++) {
> >  		tmp = (i << 22) + (_PAGE_PRESENT | _PAGE_RW | _PAGE_USER |
> >  			_PAGE_ACCESSED | _PAGE_DIRTY | _PAGE_PSE);
> > -		r = kvm_write_guest_page(kvm, identity_map_pfn,
> > -				&tmp, i * sizeof(tmp), sizeof(tmp));
> > -		if (r < 0)
> > +		r = __copy_to_user((void __user *)uaddr + i * sizeof(tmp),
> > +				   &tmp, sizeof(tmp));
> > +		if (r) {
> > +			r = -EFAULT;
> >  			goto out;
> > +		}
> >  	}
> >  	kvm_vmx->ept_identity_pagetable_done = true;
> >  
> > @@ -3537,7 +3532,7 @@ static int alloc_apic_access_page(struct kvm *kvm)
> >  	if (kvm->arch.apic_access_page_done)
> >  		goto out;
> >  	r = __x86_set_memory_region(kvm, APIC_ACCESS_PAGE_PRIVATE_MEMSLOT,
> > -				    APIC_DEFAULT_PHYS_BASE, PAGE_SIZE);
> > +				    APIC_DEFAULT_PHYS_BASE, PAGE_SIZE, NULL);
> >  	if (r)
> >  		goto out;
> >  
> > @@ -4478,19 +4473,22 @@ static int vmx_interrupt_allowed(struct kvm_vcpu *vcpu)
> >  static int vmx_set_tss_addr(struct kvm *kvm, unsigned int addr)
> >  {
> >  	int ret;
> > +	unsigned long *uaddr = NULL;
> >  
> >  	if (enable_unrestricted_guest)
> >  		return 0;
> >  
> >  	mutex_lock(&kvm->slots_lock);
> >  	ret = __x86_set_memory_region(kvm, TSS_PRIVATE_MEMSLOT, addr,
> > -				      PAGE_SIZE * 3);
> > -	mutex_unlock(&kvm->slots_lock);
> > -
> > +				      PAGE_SIZE * 3, uaddr);
> >  	if (ret)
> > -		return ret;
> > +		goto out;
> > +
> >  	to_kvm_vmx(kvm)->tss_addr = addr;
> > -	return init_rmode_tss(kvm);
> > +	ret = init_rmode_tss(kvm, uaddr);
> > +out:
> > +	mutex_unlock(&kvm->slots_lock);
> 
> Unnecessary, see below.

Do you mean that we don't even need the lock?

I feel like this could at least trip lockdep.  More below.

[1]

> 
> > +	return ret;
> >  }
> >  
> >  static int vmx_set_identity_map_addr(struct kvm *kvm, u64 ident_addr)
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index c4d3972dcd14..ff97782b3919 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -9584,7 +9584,15 @@ void kvm_arch_sync_events(struct kvm *kvm)
> >  	kvm_free_pit(kvm);
> >  }
> >  
> > -int __x86_set_memory_region(struct kvm *kvm, int id, gpa_t gpa, u32 size)
> > +/*
> > + * If `uaddr' is specified, `*uaddr' will be returned with the
> > + * userspace address that was just allocated.  `uaddr' is only
> > + * meaningful if the function returns zero, and `uaddr' will only be
> > + * valid when with either the slots_lock or with the SRCU read lock
> > + * held.  After we release the lock, the returned `uaddr' will be invalid.
> 
> This is all incorrect.  Neither of those locks has any bearing on the
> validity of the hva.  slots_lock does as the name suggests and prevents
> concurrent writes to the memslots.  The SRCU lock ensures the implicit
> memslots lookup in kvm_clear_guest_page() won't result in a use-after-free
> due to dereferencing old memslots.
> 
> Neither of those has anything to do with the userspace address, they're
> both fully tied to KVM's gfn->hva lookup.  As Paolo pointed out, KVM's
> mapping is instead tied to the lifecycle of the VM.  Note, even *that* has
> no bearing on the validity of the mapping or address as KVM only increments
> mm_count, not mm_users, i.e. guarantees the mm struct itself won't be freed
> but doesn't ensure the vmas or associated page tables are valid.
> 
> Which is the entire point of using __copy_{to,from}_user(), as they
> gracefully handle the scenario where the process has no valid mapping
> and/or translation for the address.

Sorry I don't understand.

I do think either the slots_lock or SRCU would protect at least the
existing kvm.memslots, and if so at least the previous vm_mmap()
return value should still be valid.  I agree that __copy_to_user()
will protect us from many cases from process mm pov (which allows page
faults inside), but again if the kvm.memslots is changed underneath us
then it's another story, IMHO, and that's why we need either the lock
or SRCU.

Or are you assuming that (1) __x86_set_memory_region() is only for the
3 private kvm memslots, and (2) currently the kvm private memory slots
will never change after VM is created and before VM is destroyed?  If
so, I agree with you.  However, I don't see why we need to restrict
__x86_set_memory_region() with that assumption; after all, taking a
lock is not expensive in this slow path.  Even if so, we'd better
comment above __x86_set_memory_region() about this, so we know that we
should not use __x86_set_memory_region() for future kvm internal
memslots that are prone to change during VM's lifecycle (while
currently it seems to be a very general interface).

Please let me know if I misunderstood your point.

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v3 09/21] KVM: X86: Don't track dirty for KVM_SET_[TSS_ADDR|IDENTITY_MAP_ADDR]
  2020-01-28  5:50     ` Peter Xu
@ 2020-01-28 18:24       ` Sean Christopherson
  2020-01-31 15:08         ` Peter Xu
  0 siblings, 1 reply; 82+ messages in thread
From: Sean Christopherson @ 2020-01-28 18:24 UTC (permalink / raw)
  To: Peter Xu
  Cc: kvm, linux-kernel, Christophe de Dinechin, Michael S . Tsirkin,
	Paolo Bonzini, Yan Zhao, Alex Williamson, Jason Wang,
	Kevin Kevin, Vitaly Kuznetsov, Dr . David Alan Gilbert

On Tue, Jan 28, 2020 at 01:50:05PM +0800, Peter Xu wrote:
> On Tue, Jan 21, 2020 at 07:56:57AM -0800, Sean Christopherson wrote:
> > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > > index c4d3972dcd14..ff97782b3919 100644
> > > --- a/arch/x86/kvm/x86.c
> > > +++ b/arch/x86/kvm/x86.c
> > > @@ -9584,7 +9584,15 @@ void kvm_arch_sync_events(struct kvm *kvm)
> > >  	kvm_free_pit(kvm);
> > >  }
> > >  
> > > -int __x86_set_memory_region(struct kvm *kvm, int id, gpa_t gpa, u32 size)
> > > +/*
> > > + * If `uaddr' is specified, `*uaddr' will be returned with the
> > > + * userspace address that was just allocated.  `uaddr' is only
> > > + * meaningful if the function returns zero, and `uaddr' will only be
> > > + * valid when with either the slots_lock or with the SRCU read lock
> > > + * held.  After we release the lock, the returned `uaddr' will be invalid.
> > 
> > This is all incorrect.  Neither of those locks has any bearing on the
> > validity of the hva.  slots_lock does as the name suggests and prevents
> > concurrent writes to the memslots.  The SRCU lock ensures the implicit
> > memslots lookup in kvm_clear_guest_page() won't result in a use-after-free
> > due to dereferencing old memslots.
> > 
> > Neither of those has anything to do with the userspace address, they're
> > both fully tied to KVM's gfn->hva lookup.  As Paolo pointed out, KVM's
> > mapping is instead tied to the lifecycle of the VM.  Note, even *that* has
> > no bearing on the validity of the mapping or address as KVM only increments
> > mm_count, not mm_users, i.e. guarantees the mm struct itself won't be freed
> > but doesn't ensure the vmas or associated page tables are valid.
> > 
> > Which is the entire point of using __copy_{to,from}_user(), as they
> > gracefully handle the scenario where the process has no valid mapping
> > and/or translation for the address.
> 
> Sorry I don't understand.
> 
> I do think either the slots_lock or SRCU would protect at least the
> existing kvm.memslots, and if so at least the previous vm_mmap()
> return value should still be valid.

Nope.  kvm->slots_lock only protects gfn->hva lookups, e.g. userspace can
munmap() the range at any time.

> I agree that __copy_to_user() will protect us in many cases from the
> process mm pov (it allows page faults inside), but again if the
> kvm.memslots is changed underneath us then it's another story, IMHO,
> and that's why we need either the lock or SRCU.

No, again, slots_lock and SRCU only protect gfn->hva lookups.

> Or are you assuming that (1) __x86_set_memory_region() is only for the
> 3 private kvm memslots, 

It's not an assumption, the entire purpose of __x86_set_memory_region()
is to provide support for private KVM memslots.

> and (2) currently the kvm private memory slots will never change after VM
> is created and before VM is destroyed?

No, I'm not assuming the private memslots are constant; e.g. the flow in
question, vmx_set_tss_addr(), is directly tied to an unprotected ioctl().

KVM's sole responsibility in vmx_set_tss_addr() is to not crash the kernel.
Userspace is responsible for ensuring it doesn't break its guests, e.g.
that multiple calls to KVM_SET_TSS_ADDR are properly serialized.

In the existing code, KVM ensures it doesn't crash by holding the SRCU lock
for the duration of init_rmode_tss() so that the gfn->hva lookups in
kvm_clear_guest_page() don't dereference a stale memslots array.  In no way
does that ensure the validity of the resulting hva, e.g. multiple calls to
KVM_SET_TSS_ADDR would race to set vmx->tss_addr and so init_rmode_tss()
could be operating on a stale gpa.

Putting the onus on KVM to ensure atomicity is pointless because concurrent
calls to KVM_SET_TSS_ADDR would still race, i.e. the end value of
vmx->tss_addr would be non-deterministic.  The integrity of the underlying
TSS would be guaranteed, but that guarantee isn't part of KVM's ABI.

> If so, I agree with you.  However, I don't see why we need to restrict
> __x86_set_memory_region() with that assumption; after all, taking a
> lock is not expensive in this slow path.

In what way would not holding slots_lock in vmx_set_tss_addr() restrict
__x86_set_memory_region()?  Literally every other usage of
__x86_set_memory_region() holds slots_lock for the duration of creating
the private memslot, because in those flows, KVM *is* responsible for
ensuring correct ordering.

> Even if so, we'd better comment above __x86_set_memory_region() about this,
> so we know that we should not use __x86_set_memory_region() for future kvm
> internal memslots that are prone to change during VM's lifecycle (while
> currently it seems to be a very general interface).

There is no such restriction.  Obviously such a flow would need to ensure
correctness, but hopefully that goes without saying.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v3 09/21] KVM: X86: Don't track dirty for KVM_SET_[TSS_ADDR|IDENTITY_MAP_ADDR]
  2020-01-28 18:24       ` Sean Christopherson
@ 2020-01-31 15:08         ` Peter Xu
  2020-01-31 19:33           ` Sean Christopherson
  0 siblings, 1 reply; 82+ messages in thread
From: Peter Xu @ 2020-01-31 15:08 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: kvm, linux-kernel, Christophe de Dinechin, Michael S . Tsirkin,
	Paolo Bonzini, Yan Zhao, Alex Williamson, Jason Wang,
	Kevin Kevin, Vitaly Kuznetsov, Dr . David Alan Gilbert

On Tue, Jan 28, 2020 at 10:24:03AM -0800, Sean Christopherson wrote:
> On Tue, Jan 28, 2020 at 01:50:05PM +0800, Peter Xu wrote:
> > On Tue, Jan 21, 2020 at 07:56:57AM -0800, Sean Christopherson wrote:
> > > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > > > index c4d3972dcd14..ff97782b3919 100644
> > > > --- a/arch/x86/kvm/x86.c
> > > > +++ b/arch/x86/kvm/x86.c
> > > > @@ -9584,7 +9584,15 @@ void kvm_arch_sync_events(struct kvm *kvm)
> > > >  	kvm_free_pit(kvm);
> > > >  }
> > > >  
> > > > -int __x86_set_memory_region(struct kvm *kvm, int id, gpa_t gpa, u32 size)
> > > > +/*
> > > > + * If `uaddr' is specified, `*uaddr' will be returned with the
> > > > + * userspace address that was just allocated.  `uaddr' is only
> > > > + * meaningful if the function returns zero, and `uaddr' will only be
> > > > + * valid when with either the slots_lock or with the SRCU read lock
> > > > + * held.  After we release the lock, the returned `uaddr' will be invalid.
> > > 
> > > This is all incorrect.  Neither of those locks has any bearing on the
> > > validity of the hva.  slots_lock does as the name suggests and prevents
> > > concurrent writes to the memslots.  The SRCU lock ensures the implicit
> > > memslots lookup in kvm_clear_guest_page() won't result in a use-after-free
> > > due to dereferencing old memslots.
> > > 
> > > Neither of those has anything to do with the userspace address, they're
> > > both fully tied to KVM's gfn->hva lookup.  As Paolo pointed out, KVM's
> > > mapping is instead tied to the lifecycle of the VM.  Note, even *that* has
> > > no bearing on the validity of the mapping or address as KVM only increments
> > > mm_count, not mm_users, i.e. guarantees the mm struct itself won't be freed
> > > but doesn't ensure the vmas or associated page tables are valid.
> > > 
> > > Which is the entire point of using __copy_{to,from}_user(), as they
> > > gracefully handle the scenario where the process has no valid mapping
> > > and/or translation for the address.
> > 
> > Sorry I don't understand.
> > 
> > I do think either the slots_lock or SRCU would protect at least the
> > existing kvm.memslots, and if so at least the previous vm_mmap()
> > return value should still be valid.
> 
> Nope.  kvm->slots_lock only protects gfn->hva lookups, e.g. userspace can
> munmap() the range at any time.

Do we need to consider that?  If userspace does this then it'll only
corrupt itself, and IMHO a private memory slot is nothing special
here compared to the user memory slots.  For example, userspace
can unmap any region after the KVM_SET_USER_MEMORY_REGION ioctl even if
the region was filled into the userspace_addr of a
kvm_userspace_memory_region, so the cached userspace_addr can become
invalid and kvm_write_guest_page() can fail for the same
reason.  IMHO KVM only needs to make sure it handles the failure path;
then it's perfectly fine.
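
To be concrete, a sketch of the failure handling I mean, assuming
`uaddr' is a cached userspace address that may have been unmapped
underneath us:

	/* If userspace munmap()ed the range, the copy simply faults... */
	if (__copy_to_user((void __user *)uaddr, data, len))
		return -EFAULT;	/* ...and KVM just reports the failure. */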

> 
> > I agree that __copy_to_user() will protect us from many cases from process
> > mm pov (which allows page faults inside), but again if the kvm.memslots is
> > changed underneath us then it's another story, IMHO, and that's why we need
> > either the lock or SRCU.
> 
> No, again, slots_lock and SRCU only protect gfn->hva lookups.

Yes, then could you further explain why you think we don't need the
slot lock?

> 
> > Or are you assuming that (1) __x86_set_memory_region() is only for the
> > 3 private kvm memslots, 
> 
> It's not an assumption, the entire purpose of __x86_set_memory_region()
> is to provide support for private KVM memslots.
> 
> > and (2) currently the kvm private memory slots will never change after VM
> > is created and before VM is destroyed?
> 
> No, I'm not assuming the private memslots are constant, e.g. the flow in
> question, vmx_set_tss_addr() is directly tied to an unprotected ioctl().

Why is it unprotected?  Now vmx_set_tss_addr() is protected by the slots
lock so concurrent operation is safe; it will also return -EEXIST if
called more than once.

[1]

> 
> KVM's sole responsibility for vmx_set_tss_addr() is to not crash the kernel.
> Userspace is responsible for ensuring it doesn't break its guests, e.g.
> that multiple calls to KVM_SET_TSS_ADDR are properly serialized.
> 
> In the existing code, KVM ensures it doesn't crash by holding the SRCU lock
> for the duration of init_rmode_tss() so that the gfn->hva lookups in
> kvm_clear_guest_page() don't dereference a stale memslots array.

Here in the current master branch we have both the RCU lock and the
slot lock held; that's why I think we can safely remove the RCU lock
as long as we're still holding the slots lock.  We can't do the
reverse because otherwise multiple KVM_SET_TSS_ADDR could race.

> In no way
> does that ensure the validity of the resulting hva,

Yes, but as I mentioned, I don't think it's an issue to be considered
by KVM, otherwise we should have the same issue all over the place
when we fetch the cached userspace_addr from any user slots.

> e.g. multiple calls to
> KVM_SET_TSS_ADDR would race to set vmx->tss_addr and so init_rmode_tss()
> could be operating on a stale gpa.

Please refer to [1].

I just want to double-confirm what we're discussing now.  Are you
sure you're suggesting that we should remove the slot lock in
init_rmode_tss()?  I ask because you discussed quite a bit how the
slot lock should protect GPA->HVA, about concurrency and so on, so
I'm even more confused...

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v3 09/21] KVM: X86: Don't track dirty for KVM_SET_[TSS_ADDR|IDENTITY_MAP_ADDR]
  2020-01-31 15:08         ` Peter Xu
@ 2020-01-31 19:33           ` Sean Christopherson
  2020-01-31 20:28             ` Peter Xu
  0 siblings, 1 reply; 82+ messages in thread
From: Sean Christopherson @ 2020-01-31 19:33 UTC (permalink / raw)
  To: Peter Xu
  Cc: kvm, linux-kernel, Christophe de Dinechin, Michael S . Tsirkin,
	Paolo Bonzini, Yan Zhao, Alex Williamson, Jason Wang,
	Kevin Kevin, Vitaly Kuznetsov, Dr . David Alan Gilbert

On Fri, Jan 31, 2020 at 10:08:32AM -0500, Peter Xu wrote:
> On Tue, Jan 28, 2020 at 10:24:03AM -0800, Sean Christopherson wrote:
> > On Tue, Jan 28, 2020 at 01:50:05PM +0800, Peter Xu wrote:
> > > On Tue, Jan 21, 2020 at 07:56:57AM -0800, Sean Christopherson wrote:
> > > > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > > > > index c4d3972dcd14..ff97782b3919 100644
> > > > > --- a/arch/x86/kvm/x86.c
> > > > > +++ b/arch/x86/kvm/x86.c
> > > > > @@ -9584,7 +9584,15 @@ void kvm_arch_sync_events(struct kvm *kvm)
> > > > >  	kvm_free_pit(kvm);
> > > > >  }
> > > > >  
> > > > > -int __x86_set_memory_region(struct kvm *kvm, int id, gpa_t gpa, u32 size)
> > > > > +/*
> > > > > + * If `uaddr' is specified, `*uaddr' will be returned with the
> > > > > + * userspace address that was just allocated.  `uaddr' is only
> > > > > + * meaningful if the function returns zero, and `uaddr' will only be
> > > > > + * valid when with either the slots_lock or with the SRCU read lock
> > > > > + * held.  After we release the lock, the returned `uaddr' will be invalid.
> > > > 
> > > > This is all incorrect.  Neither of those locks has any bearing on the
> > > > validity of the hva.  slots_lock does as the name suggests and prevents
> > > > concurrent writes to the memslots.  The SRCU lock ensures the implicit
> > > > memslots lookup in kvm_clear_guest_page() won't result in a use-after-free
> > > > due to dereferencing old memslots.
> > > > 
> > > > Neither of those has anything to do with the userspace address, they're
> > > > both fully tied to KVM's gfn->hva lookup.  As Paolo pointed out, KVM's
> > > > mapping is instead tied to the lifecycle of the VM.  Note, even *that* has
> > > > no bearing on the validity of the mapping or address as KVM only increments
> > > > mm_count, not mm_users, i.e. guarantees the mm struct itself won't be freed
> > > > but doesn't ensure the vmas or associated page tables are valid.
> > > > 
> > > > Which is the entire point of using __copy_{to,from}_user(), as they
> > > > gracefully handle the scenario where the process has no valid mapping
> > > > and/or translation for the address.
> > > 
> > > Sorry I don't understand.
> > > 
> > > I do think either the slots_lock or SRCU would protect at least the
> > > existing kvm.memslots, and if so at least the previous vm_mmap()
> > > return value should still be valid.
> > 
> > Nope.  kvm->slots_lock only protects gfn->hva lookups, e.g. userspace can
> > munmap() the range at any time.
> 
> Do we need to consider that?  If userspace does this then it'll only
> corrupt itself, and IMHO a private memory slot is nothing special
> here compared to the user memory slots.  For example, userspace
> can unmap any region after the KVM_SET_USER_MEMORY_REGION ioctl even if
> the region was filled into the userspace_addr of a
> kvm_userspace_memory_region, so the cached userspace_addr can become
> invalid and kvm_write_guest_page() can fail for the same
> reason.  IMHO KVM only needs to make sure it handles the failure path;
> then it's perfectly fine.

Yes?  No?  My point is that your original comment's assertion that "'uaddr'
will only be valid when with either the slots_lock or with the SRCU read
lock held." is wrong and misleading.

> > > I agree that __copy_to_user() will protect us from many cases from process
> > > mm pov (which allows page faults inside), but again if the kvm.memslots is
> > > changed underneath us then it's another story, IMHO, and that's why we need
> > > either the lock or SRCU.
> > 
> > No, again, slots_lock and SRCU only protect gfn->hva lookups.
> 
> Yes, then could you further explain why you think we don't need the
> slot lock?

For the same reason we don't take mmap_sem, it gains us nothing, i.e. KVM
still has to use copy_{to,from}_user().

In the proposed __x86_set_memory_region() refactor, vmx_set_tss_addr()
would be provided the hva of the memory region.  Since slots_lock and SRCU
only protect gfn->hva, why would KVM take slots_lock since it already has
the hva?

> > > Or are you assuming that (1) __x86_set_memory_region() is only for the
> > > 3 private kvm memslots, 
> > 
> > It's not an assumption, the entire purpose of __x86_set_memory_region()
> > is to provide support for private KVM memslots.
> > 
> > > and (2) currently the kvm private memory slots will never change after VM
> > > is created and before VM is destroyed?
> > 
> > No, I'm not assuming the private memslots are constant, e.g. the flow in
> > question, vmx_set_tss_addr() is directly tied to an unprotected ioctl().
> 
> Why is it unprotected?

Because it doesn't need to be protected.

> Now vmx_set_tss_addr() is protected by the slots lock so concurrent operation
> is safe; it will also return -EEXIST if called more than once.

Returning -EEXIST is an ABI change, e.g. userspace can currently call
KVM_SET_TSS_ADDR any number of times, it just needs to ensure proper
serialization between calls.

If you want to change the ABI, then submit a patch to do exactly that.
But don't bury an ABI change under the pretense that it's a bug fix.

> [1]
> 
> > 
> > KVM's sole responsibility for vmx_set_tss_addr() is to not crash the kernel.
> > Userspace is responsible for ensuring it doesn't break its guests, e.g.
> > that multiple calls to KVM_SET_TSS_ADDR are properly serialized.
> > 
> > In the existing code, KVM ensures it doesn't crash by holding the SRCU lock
> > for the duration of init_rmode_tss() so that the gfn->hva lookups in
> > kvm_clear_guest_page() don't dereference a stale memslots array.
> 
> Here in the current master branch we have both the RCU lock and the
> slot lock held; that's why I think we can safely remove the RCU lock
> as long as we're still holding the slots lock.  We can't do the
> reverse because otherwise multiple KVM_SET_TSS_ADDR could race.

Your wording is all messed up.  "we have both the RCU lock and the slot
lock held" is wrong.  KVM holds slot_lock around __x86_set_memory_region(),
because changing the memslots must be mutually exclusive.  It then *drops*
slots_lock because it's done writing the memslots and grabs the SRCU lock
in order to protect the gfn->hva lookups done by init_rmode_tss().  It
*intentionally* drops slots_lock because writing init_rmode_tss() does not
need to be a mutually exclusive operation, per KVM's existing ABI.

If KVM held both slots_lock and SRCU then __x86_set_memory_region() would
deadlock on synchronize_srcu().
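
A minimal sketch of that deadlock, assuming the memslot update path
still calls synchronize_srcu_expedited(&kvm->srcu) as it does today:

	idx = srcu_read_lock(&kvm->srcu);
	/*
	 * install_new_memslots() waits for all SRCU readers to finish,
	 * which can never happen while this CPU sits inside the same
	 * SRCU read-side critical section.
	 */
	__x86_set_memory_region(kvm, TSS_PRIVATE_MEMSLOT, addr, PAGE_SIZE * 3);
	srcu_read_unlock(&kvm->srcu, idx);	/* never reached */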

> > In no way
> > does that ensure the validity of the resulting hva,
> 
> Yes, but as I mentioned, I don't think it's an issue to be considered
> by KVM, otherwise we should have the same issue all over the place
> when we fetch the cached userspace_addr from any user slots.

Huh?  Of course it's an issue that needs to be considered by KVM, e.g.
kvm_{read,write}_guest_cached() aren't using __copy_{to,}from_user() for
giggles.

> > e.g. multiple calls to
> > KVM_SET_TSS_ADDR would race to set vmx->tss_addr and so init_rmode_tss()
> > could be operating on a stale gpa.
> 
> Please refer to [1].
> 
> I just want to double-confirm what we're discussing now.  Are you
> sure you're suggesting that we should remove the slot lock in
> init_rmode_tss()?  I ask because you discussed quite a bit how the
> slot lock should protect GPA->HVA, about concurrency and so on, so
> I'm even more confused...

Yes, if init_rmode_tss() is provided the hva then it does not need to
grab srcu_read_lock(&kvm->srcu) because it can directly call
__copy_{to,from}_user() instead of bouncing through the KVM helpers that
translate a gfn to hva.

The code can look like this.  That being said, I've completely lost track
of why __x86_set_memory_region() needs to provide the hva, i.e. have no
idea if we *should* do this, or whether it would be better to keep the current
code, which would be slower, but less custom.

static int init_rmode_tss(void __user *hva)
{
	const void *zero_page = (const void *)__va(page_to_phys(ZERO_PAGE(0)));
	u16 data = TSS_BASE_SIZE + TSS_REDIRECTION_SIZE;
	int r;

	/* Zero the first page and set the IOPB base in the TSS header. */
	r = __copy_to_user(hva, zero_page, PAGE_SIZE);
	if (r)
		return -EFAULT;

	r = __copy_to_user(hva + TSS_IOPB_BASE_OFFSET, &data, sizeof(u16));
	if (r)
		return -EFAULT;

	/* Zero the second and third pages. */
	hva += PAGE_SIZE;
	r = __copy_to_user(hva, zero_page, PAGE_SIZE);
	if (r)
		return -EFAULT;

	hva += PAGE_SIZE;
	r = __copy_to_user(hva, zero_page, PAGE_SIZE);
	if (r)
		return -EFAULT;

	/* Write the trailing byte of the IOPB, as the existing code does. */
	data = ~0;
	hva += RMODE_TSS_SIZE - 2 * PAGE_SIZE - 1;
	r = __copy_to_user(hva, &data, sizeof(u8));
	if (r)
		return -EFAULT;

	return 0;
}

static int vmx_set_tss_addr(struct kvm *kvm, unsigned int addr)
{
	void __user *hva;

	if (enable_unrestricted_guest)
		return 0;

	mutex_lock(&kvm->slots_lock);
	hva = __x86_set_memory_region(kvm, TSS_PRIVATE_MEMSLOT, addr,
				      PAGE_SIZE * 3);
	mutex_unlock(&kvm->slots_lock);

	if (IS_ERR(hva))
		return PTR_ERR(hva);

	to_kvm_vmx(kvm)->tss_addr = addr;
	return init_rmode_tss(hva);
}

Yes, userspace can corrupt its VM by invoking KVM_SET_TSS_ADDR multiple
times without serializing the calls, but that's already true today.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v3 09/21] KVM: X86: Don't track dirty for KVM_SET_[TSS_ADDR|IDENTITY_MAP_ADDR]
  2020-01-31 19:33           ` Sean Christopherson
@ 2020-01-31 20:28             ` Peter Xu
  2020-01-31 20:36               ` Sean Christopherson
  0 siblings, 1 reply; 82+ messages in thread
From: Peter Xu @ 2020-01-31 20:28 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: kvm, linux-kernel, Christophe de Dinechin, Michael S . Tsirkin,
	Paolo Bonzini, Yan Zhao, Alex Williamson, Jason Wang,
	Kevin Kevin, Vitaly Kuznetsov, Dr . David Alan Gilbert

On Fri, Jan 31, 2020 at 11:33:01AM -0800, Sean Christopherson wrote:
> On Fri, Jan 31, 2020 at 10:08:32AM -0500, Peter Xu wrote:
> > On Tue, Jan 28, 2020 at 10:24:03AM -0800, Sean Christopherson wrote:
> > > On Tue, Jan 28, 2020 at 01:50:05PM +0800, Peter Xu wrote:
> > > > On Tue, Jan 21, 2020 at 07:56:57AM -0800, Sean Christopherson wrote:
> > > > > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > > > > > index c4d3972dcd14..ff97782b3919 100644
> > > > > > --- a/arch/x86/kvm/x86.c
> > > > > > +++ b/arch/x86/kvm/x86.c
> > > > > > @@ -9584,7 +9584,15 @@ void kvm_arch_sync_events(struct kvm *kvm)
> > > > > >  	kvm_free_pit(kvm);
> > > > > >  }
> > > > > >  
> > > > > > -int __x86_set_memory_region(struct kvm *kvm, int id, gpa_t gpa, u32 size)
> > > > > > +/*
> > > > > > + * If `uaddr' is specified, `*uaddr' will be returned with the
> > > > > > + * userspace address that was just allocated.  `uaddr' is only
> > > > > > + * meaningful if the function returns zero, and `uaddr' will only be
> > > > > > + * valid when with either the slots_lock or with the SRCU read lock
> > > > > > + * held.  After we release the lock, the returned `uaddr' will be invalid.
> > > > > 
> > > > > This is all incorrect.  Neither of those locks has any bearing on the
> > > > > validity of the hva.  slots_lock does as the name suggests and prevents
> > > > > concurrent writes to the memslots.  The SRCU lock ensures the implicit
> > > > > memslots lookup in kvm_clear_guest_page() won't result in a use-after-free
> > > > > due to dereferencing old memslots.
> > > > > 
> > > > > Neither of those has anything to do with the userspace address, they're
> > > > > both fully tied to KVM's gfn->hva lookup.  As Paolo pointed out, KVM's
> > > > > mapping is instead tied to the lifecycle of the VM.  Note, even *that* has
> > > > > no bearing on the validity of the mapping or address as KVM only increments
> > > > > mm_count, not mm_users, i.e. guarantees the mm struct itself won't be freed
> > > > > but doesn't ensure the vmas or associated page tables are valid.
> > > > > 
> > > > > Which is the entire point of using __copy_{to,from}_user(), as they
> > > > > gracefully handle the scenario where the process has no valid mapping
> > > > > and/or translation for the address.
> > > > 
> > > > Sorry I don't understand.
> > > > 
> > > > I do think either the slots_lock or SRCU would protect at least the
> > > > existing kvm.memslots, and if so at least the previous vm_mmap()
> > > > return value should still be valid.
> > > 
> > > Nope.  kvm->slots_lock only protects gfn->hva lookups, e.g. userspace can
> > > munmap() the range at any time.
> > 
> > Do we need to consider that?  If userspace does this then it'll only
> > corrupt itself, and IMHO a private memory slot is nothing special
> > here compared to the user memory slots.  For example, userspace
> > can unmap any region after the KVM_SET_USER_MEMORY_REGION ioctl even if
> > the region was filled into the userspace_addr of a
> > kvm_userspace_memory_region, so the cached userspace_addr can become
> > invalid and kvm_write_guest_page() can fail for the same
> > reason.  IMHO KVM only needs to make sure it handles the failure path;
> > then it's perfectly fine.
> 
> Yes?  No?  My point is that your original comment's assertion that "'uaddr'
> will only be valid when with either the slots_lock or with the SRCU read
> lock held." is wrong and misleading.

Yes I'll fix that.

> 
> > > > I agree that __copy_to_user() will protect us from many cases from process
> > > > mm pov (which allows page faults inside), but again if the kvm.memslots is
> > > > changed underneath us then it's another story, IMHO, and that's why we need
> > > > either the lock or SRCU.
> > > 
> > > No, again, slots_lock and SRCU only protect gfn->hva lookups.
> > 
> > Yes, then could you further explain why you think we don't need the
> > slot lock?
> 
> For the same reason we don't take mmap_sem, it gains us nothing, i.e. KVM
> still has to use copy_{to,from}_user().
> 
> In the proposed __x86_set_memory_region() refactor, vmx_set_tss_addr()
> would be provided the hva of the memory region.  Since slots_lock and SRCU
> only protect gfn->hva, why would KVM take slots_lock since it already has
> the hva?

OK, so you're suggesting unlocking earlier so the lock doesn't cover
init_rmode_tss(), rather than dropping the lock entirely...  Yes, it looks
good to me.  I think that's the major confusion I had.

> 
> > > > Or are you assuming that (1) __x86_set_memory_region() is only for the
> > > > 3 private kvm memslots, 
> > > 
> > > It's not an assumption, the entire purpose of __x86_set_memory_region()
> > > is to provide support for private KVM memslots.
> > > 
> > > > and (2) currently the kvm private memory slots will never change after VM
> > > > is created and before VM is destroyed?
> > > 
> > > No, I'm not assuming the private memslots are constant, e.g. the flow in
> > > question, vmx_set_tss_addr() is directly tied to an unprotected ioctl().
> > 
> > Why is it unprotected?
> 
> Because it doesn't need to be protected.
> 
> > Now vmx_set_tss_addr() is protected by the slots lock so concurrent operation
> > is safe; it will also return -EEXIST if called more than once.
> 
> Returning -EEXIST is an ABI change, e.g. userspace can currently call
> KVM_SET_TSS_ADDR any number of times, it just needs to ensure proper
> serialization between calls.
> 
> If you want to change the ABI, then submit a patch to do exactly that.
> But don't bury an ABI change under the pretense that it's a bug fix.

Could you explain what you mean by "ABI change"?

I was talking about the original code, not after applying the
patchset.  To be explicit, I mean [a] below:

int __x86_set_memory_region(struct kvm *kvm, int id, gpa_t gpa, u32 size,
			    unsigned long *uaddr)
{
	int i, r;
	unsigned long hva;
	struct kvm_memslots *slots = kvm_memslots(kvm);
	struct kvm_memory_slot *slot, old;

	/* Called with kvm->slots_lock held.  */
	if (WARN_ON(id >= KVM_MEM_SLOTS_NUM))
		return -EINVAL;

	slot = id_to_memslot(slots, id);
	if (size) {
		if (slot->npages)
			return -EEXIST;  <------------------------ [a]
        }
        ...
}

> 
> > [1]
> > 
> > > 
> > > KVM's sole responsibility for vmx_set_tss_addr() is to not crash the kernel.
> > > Userspace is responsible for ensuring it doesn't break its guests, e.g.
> > > that multiple calls to KVM_SET_TSS_ADDR are properly serialized.
> > > 
> > > In the existing code, KVM ensures it doesn't crash by holding the SRCU lock
> > > for the duration of init_rmode_tss() so that the gfn->hva lookups in
> > > kvm_clear_guest_page() don't dereference a stale memslots array.
> > 
> > Here in the current master branch we have both the RCU lock and the
> > slot lock held; that's why I think we can safely remove the RCU lock
> > as long as we're still holding the slots lock.  We can't do the
> > reverse because otherwise multiple KVM_SET_TSS_ADDR could race.
> 
> Your wording is all messed up.  "we have both the RCU lock and the slot
> lock held" is wrong.

I did mess up with 2a5755bb21ee2.  We didn't take both locks here,
sorry.

> KVM holds slots_lock around __x86_set_memory_region(),
> because changing the memslots must be mutually exclusive.  It then *drops*
> slots_lock because it's done writing the memslots and grabs the SRCU lock
> in order to protect the gfn->hva lookups done by init_rmode_tss().  It
> *intentionally* drops slots_lock because the writes done by init_rmode_tss()
> do not need to be mutually exclusive, per KVM's existing ABI.
> 
> If KVM held both slots_lock and SRCU then __x86_set_memory_region() would
> deadlock on synchronize_srcu().
> 
> > > In no way
> > > does that ensure the validity of the resulting hva,
> > 
> > Yes, but as I mentioned, I don't think it's an issue to be considered
> > by KVM, otherwise we should have the same issue all over the place
> > when we fetch the cached userspace_addr from any user slots.
> 
> Huh?  Of course it's an issue that needs to be considered by KVM, e.g.
> kvm_{read,write}_guest_cached() aren't using __copy_{to,}from_user() for
> giggles.

The cache is for the GPA->HVA translation (struct gfn_to_hva_cache);
we still use __copy_{to,}from_user() on the HVAs, no?
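
For concreteness, a sketch of the cached access pattern I'm referring
to (API as in virt/kvm/kvm_main.c; error handling elided):

	struct gfn_to_hva_cache ghc;

	/* Cache the gpa->hva translation once... */
	kvm_gfn_to_hva_cache_init(kvm, &ghc, gpa, len);
	/* ...then the write itself is a __copy_to_user() on the cached hva. */
	kvm_write_guest_cached(kvm, &ghc, data, len);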

> 
> > > e.g. multiple calls to
> > > KVM_SET_TSS_ADDR would race to set vmx->tss_addr and so init_rmode_tss()
> > > could be operating on a stale gpa.
> > 
> > Please refer to [1].
> > 
> > I just want to double-confirm what we're discussing now.  Are you
> > sure you're suggesting that we should remove the slot lock in
> > init_rmode_tss()?  I ask because you discussed quite a bit how the
> > slot lock should protect GPA->HVA, about concurrency and so on, so
> > I'm even more confused...
> 
> Yes, if init_rmode_tss() is provided the hva then it does not need to
> grab srcu_read_lock(&kvm->srcu) because it can directly call
> __copy_{to,from}_user() instead of bouncing through the KVM helpers that
> translate a gfn to hva.
> 
> The code can look like this.  That being said, I've completely lost track
> of why __x86_set_memory_region() needs to provide the hva, i.e. have no
> idea if we *should* do this, or whether it would be better to keep the current
> code, which would be slower, but less custom.
> 
> static int init_rmode_tss(void __user *hva)
> {
> 	const void *zero_page = (const void *)__va(page_to_phys(ZERO_PAGE(0)));
> 	u16 data = TSS_BASE_SIZE + TSS_REDIRECTION_SIZE;
> 	int r;
> 
> 	/* Zero the first page and set the IOPB base in the TSS header. */
> 	r = __copy_to_user(hva, zero_page, PAGE_SIZE);
> 	if (r)
> 		return -EFAULT;
> 
> 	r = __copy_to_user(hva + TSS_IOPB_BASE_OFFSET, &data, sizeof(u16));
> 	if (r)
> 		return -EFAULT;
> 
> 	/* Zero the second and third pages. */
> 	hva += PAGE_SIZE;
> 	r = __copy_to_user(hva, zero_page, PAGE_SIZE);
> 	if (r)
> 		return -EFAULT;
> 
> 	hva += PAGE_SIZE;
> 	r = __copy_to_user(hva, zero_page, PAGE_SIZE);
> 	if (r)
> 		return -EFAULT;
> 
> 	/* Write the trailing byte of the IOPB, as the existing code does. */
> 	data = ~0;
> 	hva += RMODE_TSS_SIZE - 2 * PAGE_SIZE - 1;
> 	r = __copy_to_user(hva, &data, sizeof(u8));
> 	if (r)
> 		return -EFAULT;
> 
> 	return 0;
> }
> 
> static int vmx_set_tss_addr(struct kvm *kvm, unsigned int addr)
> {
> 	void __user *hva;
> 
> 	if (enable_unrestricted_guest)
> 		return 0;
> 
> 	mutex_lock(&kvm->slots_lock);
> 	hva = __x86_set_memory_region(kvm, TSS_PRIVATE_MEMSLOT, addr,
> 				      PAGE_SIZE * 3);
> 	mutex_unlock(&kvm->slots_lock);
> 
> 	if (IS_ERR(hva))
> 		return PTR_ERR(hva);
> 
> 	to_kvm_vmx(kvm)->tss_addr = addr;
> 	return init_rmode_tss(hva);
> }
> 
> Yes, userspace can corrupt its VM by invoking KVM_SET_TSS_ADDR multiple
> times without serializing the calls, but that's already true today.

But I still don't see why we have any problem here.  Only the first
thread will take the slots_lock here and succeed with this ioctl.  The
remaining threads will fail with -EEXIST, no?

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v3 09/21] KVM: X86: Don't track dirty for KVM_SET_[TSS_ADDR|IDENTITY_MAP_ADDR]
  2020-01-31 20:28             ` Peter Xu
@ 2020-01-31 20:36               ` Sean Christopherson
  2020-01-31 20:55                 ` Peter Xu
  0 siblings, 1 reply; 82+ messages in thread
From: Sean Christopherson @ 2020-01-31 20:36 UTC (permalink / raw)
  To: Peter Xu
  Cc: kvm, linux-kernel, Christophe de Dinechin, Michael S . Tsirkin,
	Paolo Bonzini, Yan Zhao, Alex Williamson, Jason Wang,
	Kevin Kevin, Vitaly Kuznetsov, Dr . David Alan Gilbert

On Fri, Jan 31, 2020 at 03:28:24PM -0500, Peter Xu wrote:
> On Fri, Jan 31, 2020 at 11:33:01AM -0800, Sean Christopherson wrote:
> > For the same reason we don't take mmap_sem, it gains us nothing, i.e. KVM
> > still has to use copy_{to,from}_user().
> > 
> > In the proposed __x86_set_memory_region() refactor, vmx_set_tss_addr()
> > would be provided the hva of the memory region.  Since slots_lock and SRCU
> > only protect gfn->hva, why would KVM take slots_lock since it already has
> > the hva?
> 
> > OK, so you're suggesting unlocking earlier so the lock doesn't cover
> > init_rmode_tss(), rather than dropping the lock entirely...  Yes, it looks
> > good to me.  I think that's the major confusion I had.

Ya.  And I missed where the -EEXIST was coming from.  I think we're on the
same page.

> > Returning -EEXIST is an ABI change, e.g. userspace can currently call
> > KVM_SET_TSS_ADDR any number of times, it just needs to ensure proper
> > serialization between calls.
> > 
> > If you want to change the ABI, then submit a patch to do exactly that.
> > But don't bury an ABI change under the pretense that it's a bug fix.
> 
> Could you explain what you mean by "ABI change"?
> 
> I was talking about the original code, not after applying the
> patchset.  To be explicit, I mean [a] below:
> 
> int __x86_set_memory_region(struct kvm *kvm, int id, gpa_t gpa, u32 size,
> 			    unsigned long *uaddr)
> {
> 	int i, r;
> 	unsigned long hva;
> 	struct kvm_memslots *slots = kvm_memslots(kvm);
> 	struct kvm_memory_slot *slot, old;
> 
> 	/* Called with kvm->slots_lock held.  */
> 	if (WARN_ON(id >= KVM_MEM_SLOTS_NUM))
> 		return -EINVAL;
> 
> 	slot = id_to_memslot(slots, id);
> 	if (size) {
> 		if (slot->npages)
> 			return -EEXIST;  <------------------------ [a]
>         }
>         ...
> }

Doh, I completely forgot that the second __x86_set_memory_region() would
fail.  Sorry :-(

> > > Yes, but as I mentioned, I don't think it's an issue to be considered
> > > by KVM, otherwise we should have the same issue all over the place
> > > when we fetch the cached userspace_addr from any user slots.
> > 
> > Huh?  Of course it's an issue that needs to be considered by KVM, e.g.
> > kvm_{read,write}_guest_cached() aren't using __copy_{to,}from_user() for
> > giggles.
> 
> The cache is for the GPA->HVA translation (struct gfn_to_hva_cache),
> we still use __copy_{to,}from_user() upon the HVAs, no?

I'm still lost on this one.  I'm pretty sure I'm incorrectly interpreting:
  
  I don't think it's an issue to be considered by KVM, otherwise we should
  have the same issue all over the place when we fetch the cached
  userspace_addr from any user slots.

What is the issue to which you are referring?

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v3 09/21] KVM: X86: Don't track dirty for KVM_SET_[TSS_ADDR|IDENTITY_MAP_ADDR]
  2020-01-31 20:36               ` Sean Christopherson
@ 2020-01-31 20:55                 ` Peter Xu
  2020-01-31 21:29                   ` Sean Christopherson
  0 siblings, 1 reply; 82+ messages in thread
From: Peter Xu @ 2020-01-31 20:55 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: kvm, linux-kernel, Christophe de Dinechin, Michael S . Tsirkin,
	Paolo Bonzini, Yan Zhao, Alex Williamson, Jason Wang,
	Kevin Kevin, Vitaly Kuznetsov, Dr . David Alan Gilbert

On Fri, Jan 31, 2020 at 12:36:22PM -0800, Sean Christopherson wrote:
> On Fri, Jan 31, 2020 at 03:28:24PM -0500, Peter Xu wrote:
> > On Fri, Jan 31, 2020 at 11:33:01AM -0800, Sean Christopherson wrote:
> > > For the same reason we don't take mmap_sem, it gains us nothing, i.e. KVM
> > > still has to use copy_{to,from}_user().
> > > 
> > > In the proposed __x86_set_memory_region() refactor, vmx_set_tss_addr()
> > > would be provided the hva of the memory region.  Since slots_lock and SRCU
> > > only protect gfn->hva, why would KVM take slots_lock since it already has
> > > the hva?
> > 
> > OK, so you're suggesting unlocking earlier so the lock doesn't cover
> > init_rmode_tss(), rather than dropping the lock entirely...  Yes, it looks
> > good to me.  I think that's the major confusion I had.
> 
> Ya.  And I missed where the -EEXIST was coming from.  I think we're on the
> same page.

Good to know.  Btw, I would still prefer to keep the lock held until
after the __copy_to_user()s, because "HVA is valid without lock" is
only true for these private memslots.  After all this is a super slow
path, so I wouldn't mind taking the lock a bit longer.  Otherwise,
if you really prefer the unlock() to be earlier, I can add a comment
above the unlock:

  /*
   * We can unlock before using the HVA only because this KVM private
   * memory slot will never change until the end of VM lifecycle.
   */

> 
> > > Returning -EEXIST is an ABI change, e.g. userspace can currently call
> > > KVM_SET_TSS_ADDR any number of times, it just needs to ensure proper
> > > serialization between calls.
> > > 
> > > If you want to change the ABI, then submit a patch to do exactly that.
> > > But don't bury an ABI change under the pretense that it's a bug fix.
> > 
> > Could you explain what you mean by "ABI change"?
> > 
> > I was talking about the original code, not after applying the
> > patchset.  To be explicit, I mean [a] below:
> > 
> > int __x86_set_memory_region(struct kvm *kvm, int id, gpa_t gpa, u32 size,
> > 			    unsigned long *uaddr)
> > {
> > 	int i, r;
> > 	unsigned long hva;
> > 	struct kvm_memslots *slots = kvm_memslots(kvm);
> > 	struct kvm_memory_slot *slot, old;
> > 
> > 	/* Called with kvm->slots_lock held.  */
> > 	if (WARN_ON(id >= KVM_MEM_SLOTS_NUM))
> > 		return -EINVAL;
> > 
> > 	slot = id_to_memslot(slots, id);
> > 	if (size) {
> > 		if (slot->npages)
> > 			return -EEXIST;  <------------------------ [a]
> >         }
> >         ...
> > }
> 
> Doh, I completely forgot that the second __x86_set_memory_region() would
> fail.  Sorry :-(
> 
> > > > Yes, but as I mentioned, I don't think it's an issue to be considered
> > > > by KVM, otherwise we should have the same issue all over the place
> > > > when we fetch the cached userspace_addr from any user slots.
> > > 
> > > Huh?  Of course it's an issue that needs to be considered by KVM, e.g.
> > > kvm_{read,write}_guest_cached() aren't using __copy_{to,}from_user() for
> > > giggles.
> > 
> > The cache is for the GPA->HVA translation (struct gfn_to_hva_cache);
> > we still use __copy_{to,}from_user() on the HVAs, no?
> 
> I'm still lost on this one.  I'm pretty sure I'm incorrectly interpreting:
>   
>   I don't think it's an issue to be considered by KVM, otherwise we should
>   have the same issue all over the place when we fetch the cached
>   userspace_addr from any user slots.
> 
> What is the issue to which you are referring?

The issue I was referring to is "the HVA can be unmapped by userspace
without KVM noticing".  I think we're actually on the same page here
too; my follow-up is purely a question about where you say
"kvm_{read,write}_guest_cached() aren't using __copy_{to,}from_user()"
above, because that's against my understanding.

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v3 09/21] KVM: X86: Don't track dirty for KVM_SET_[TSS_ADDR|IDENTITY_MAP_ADDR]
  2020-01-31 20:55                 ` Peter Xu
@ 2020-01-31 21:29                   ` Sean Christopherson
  2020-01-31 22:16                     ` Peter Xu
  0 siblings, 1 reply; 82+ messages in thread
From: Sean Christopherson @ 2020-01-31 21:29 UTC (permalink / raw)
  To: Peter Xu
  Cc: kvm, linux-kernel, Christophe de Dinechin, Michael S . Tsirkin,
	Paolo Bonzini, Yan Zhao, Alex Williamson, Jason Wang,
	Kevin Kevin, Vitaly Kuznetsov, Dr . David Alan Gilbert

On Fri, Jan 31, 2020 at 03:55:50PM -0500, Peter Xu wrote:
> On Fri, Jan 31, 2020 at 12:36:22PM -0800, Sean Christopherson wrote:
> > On Fri, Jan 31, 2020 at 03:28:24PM -0500, Peter Xu wrote:
> > > On Fri, Jan 31, 2020 at 11:33:01AM -0800, Sean Christopherson wrote:
> > > > For the same reason we don't take mmap_sem, it gains us nothing, i.e. KVM
> > > > still has to use copy_{to,from}_user().
> > > > 
> > > > In the proposed __x86_set_memory_region() refactor, vmx_set_tss_addr()
> > > > would be provided the hva of the memory region.  Since slots_lock and SRCU
> > > > only protect gfn->hva, why would KVM take slots_lock since it already has
> > > > the hva?
> > > 
> > > OK, so you're suggesting unlocking earlier so the lock doesn't cover
> > > init_rmode_tss(), rather than dropping the lock entirely...  Yes, it looks
> > > good to me.  I think that's the major confusion I had.
> > 
> > Ya.  And I missed where the -EEXIST was coming from.  I think we're on the
> > same page.
> 
> Good to know.  Btw, I would still prefer to keep the lock held until
> after the __copy_to_user()s, because "HVA is valid without lock" is
> only true for these private memslots.

No.  From KVM's perspective, the HVA is *never* valid.  Even if you rewrote
this statement to say "the gfn->hva translation is valid without lock" it
would still be incorrect. 

KVM is *always* using HVAs without holding a lock, e.g. every time it enters
the guest it is dereferencing a memslot because the translations stored in
the TLB are effectively gfn->hva->hpa.  Obviously KVM ensures that it won't
dereference a memslot that has been deleted/moved, but it's a lot more
subtle than simply holding a lock.

> After all this is a super slow path, so I wouldn't mind taking the
> lock a bit longer.

Holding the lock doesn't affect this super slow vmx_set_tss_addr(), it
affects everything else that wants slots_lock.  Now, admittedly it's
extremely unlikely userspace is going to do KVM_SET_USER_MEMORY_REGION in
parallel, but that's not the point and it's not why I'm objecting to
holding the lock.

Holding the lock implies protection that is *not* provided.  You and I know
it's not needed for copy_{to,from}_user(), but look how long it's taken us
to get on the same page.  A future KVM developer comes along, sees this
code, and thinks "oh, I need to hold slots_lock to dereference a gfn", and
propagates the unnecessary locking to some other code.

> Otherwise, if you really prefer the unlock() to
> be earlier, I can add a comment above the unlock:
> 
>   /*
>    * We can unlock before using the HVA only because this KVM private
>    * memory slot will never change until the end of VM lifecycle.
>    */

How about:

	/*
	 * No need to hold slots_lock while filling the TSS, the TSS private
	 * memslot is guaranteed to be valid until the VM is destroyed, i.e.
	 * there is no danger of corrupting guest memory by consuming a stale
	 * gfn->hva lookup.
	 */

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v3 09/21] KVM: X86: Don't track dirty for KVM_SET_[TSS_ADDR|IDENTITY_MAP_ADDR]
  2020-01-31 21:29                   ` Sean Christopherson
@ 2020-01-31 22:16                     ` Peter Xu
  2020-01-31 22:20                       ` Sean Christopherson
  0 siblings, 1 reply; 82+ messages in thread
From: Peter Xu @ 2020-01-31 22:16 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: kvm, linux-kernel, Christophe de Dinechin, Michael S . Tsirkin,
	Paolo Bonzini, Yan Zhao, Alex Williamson, Jason Wang,
	Kevin Kevin, Vitaly Kuznetsov, Dr . David Alan Gilbert

On Fri, Jan 31, 2020 at 01:29:28PM -0800, Sean Christopherson wrote:
> On Fri, Jan 31, 2020 at 03:55:50PM -0500, Peter Xu wrote:
> > On Fri, Jan 31, 2020 at 12:36:22PM -0800, Sean Christopherson wrote:
> > > On Fri, Jan 31, 2020 at 03:28:24PM -0500, Peter Xu wrote:
> > > > On Fri, Jan 31, 2020 at 11:33:01AM -0800, Sean Christopherson wrote:
> > > > > For the same reason we don't take mmap_sem, it gains us nothing, i.e. KVM
> > > > > still has to use copy_{to,from}_user().
> > > > > 
> > > > > In the proposed __x86_set_memory_region() refactor, vmx_set_tss_addr()
> > > > > would be provided the hva of the memory region.  Since slots_lock and SRCU
> > > > > only protect gfn->hva, why would KVM take slots_lock since it already has
> > > > > the hva?
> > > > 
> > > > OK, so you're suggesting unlocking earlier so the lock doesn't cover
> > > > init_rmode_tss(), rather than dropping the lock entirely...  Yes, it looks
> > > > good to me.  I think that's the major confusion I had.
> > > 
> > > Ya.  And I missed where the -EEXIST was coming from.  I think we're on the
> > > same page.
> > 
> > Good to know.  Btw, I would still prefer to keep the lock held until
> > after the __copy_to_user()s, because "HVA is valid without lock" is
> > only true for these private memslots.
> 
> No.  From KVM's perspective, the HVA is *never* valid.  Even if you rewrote
> this statement to say "the gfn->hva translation is valid without lock" it
> would still be incorrect. 
> 
> KVM is *always* using HVAs without holding a lock, e.g. every time it enters
> the guest it is dereferencing a memslot because the translations stored in
> the TLB are effectively gfn->hva->hpa.  Obviously KVM ensures that it won't
> dereference a memslot that has been deleted/moved, but it's a lot more
> subtle than simply holding a lock.
> 
> > After all this is a super slow path, so I wouldn't mind taking the
> > lock a bit longer.
> 
> Holding the lock doesn't affect this super slow vmx_set_tss_addr(), it
> affects everything else that wants slots_lock.  Now, admittedly it's
> extremely unlikely userspace is going to do KVM_SET_USER_MEMORY_REGION in
> parallel, but that's not the point and it's not why I'm objecting to
> holding the lock.
> 
> Holding the lock implies protection that is *not* provided.  You and I know
> it's not needed for copy_{to,from}_user(), but look how long it's taken us
> to get on the same page.  A future KVM developer comes along, sees this
> code, and thinks "oh, I need to hold slots_lock to dereference a gfn", and
> propagates the unnecessary locking to some other code.

At least for a user memory slot, we "need to hold slots_lock to
dereference a gfn" (or srcu), right?

You know I'm suffering from jetlag today; I thought I was still
fine, but now I'm starting to doubt it. :-)

> 
> > Otherwise, if you really prefer the unlock() to
> > be earlier, I can add a comment above the unlock:
> > 
> >   /*
> >    * We can unlock before using the HVA only because this KVM private
> >    * memory slot will never change until the end of VM lifecycle.
> >    */
> 
> How about:
> 
> 	/*
> 	 * No need to hold slots_lock while filling the TSS, the TSS private
> 	 * memslot is guaranteed to be valid until the VM is destroyed, i.e.
> 	 * there is no danger of corrupting guest memory by consuming a stale
> 	 * gfn->hva lookup.
> 	 */

Sure for this.

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v3 09/21] KVM: X86: Don't track dirty for KVM_SET_[TSS_ADDR|IDENTITY_MAP_ADDR]
  2020-01-31 22:16                     ` Peter Xu
@ 2020-01-31 22:20                       ` Sean Christopherson
  0 siblings, 0 replies; 82+ messages in thread
From: Sean Christopherson @ 2020-01-31 22:20 UTC (permalink / raw)
  To: Peter Xu
  Cc: kvm, linux-kernel, Christophe de Dinechin, Michael S . Tsirkin,
	Paolo Bonzini, Yan Zhao, Alex Williamson, Jason Wang,
	Kevin Kevin, Vitaly Kuznetsov, Dr . David Alan Gilbert

On Fri, Jan 31, 2020 at 05:16:37PM -0500, Peter Xu wrote:
> On Fri, Jan 31, 2020 at 01:29:28PM -0800, Sean Christopherson wrote:
> > On Fri, Jan 31, 2020 at 03:55:50PM -0500, Peter Xu wrote:
> > > On Fri, Jan 31, 2020 at 12:36:22PM -0800, Sean Christopherson wrote:
> > > > On Fri, Jan 31, 2020 at 03:28:24PM -0500, Peter Xu wrote:
> > > > > On Fri, Jan 31, 2020 at 11:33:01AM -0800, Sean Christopherson wrote:
> > > > > > For the same reason we don't take mmap_sem, it gains us nothing, i.e. KVM
> > > > > > still has to use copy_{to,from}_user().
> > > > > > 
> > > > > > In the proposed __x86_set_memory_region() refactor, vmx_set_tss_addr()
> > > > > > would be provided the hva of the memory region.  Since slots_lock and SRCU
> > > > > > only protect gfn->hva, why would KVM take slots_lock since it already has
> > > > > > the hva?
> > > > > 
> > > > > OK, so you're suggesting unlocking earlier so the lock doesn't cover
> > > > > init_rmode_tss(), rather than dropping the lock entirely...  Yes, it looks
> > > > > good to me.  I think that's the major confusion I had.
> > > > 
> > > > Ya.  And I missed where the -EEXIST was coming from.  I think we're on the
> > > > same page.
> > > 
> > > Good to know.  Btw, I would still prefer to keep the lock held until
> > > after the __copy_to_user()s, because "HVA is valid without lock" is
> > > only true for these private memslots.
> > 
> > No.  From KVM's perspective, the HVA is *never* valid.  Even if you rewrote
> > this statement to say "the gfn->hva translation is valid without lock" it
> > would still be incorrect. 
> > 
> > KVM is *always* using HVAs without holding a lock, e.g. every time it enters
> > the guest it is dereferencing a memslot because the translations stored in
> > the TLB are effectively gfn->hva->hpa.  Obviously KVM ensures that it won't
> > dereference a memslot that has been deleted/moved, but it's a lot more
> > subtle than simply holding a lock.
> > 
> > > After all this is a super slow path, so I wouldn't mind taking the
> > > lock a bit longer.
> > 
> > Holding the lock doesn't affect this super slow vmx_set_tss_addr(), it
> > affects everything else that wants slots_lock.  Now, admittedly it's
> > extremely unlikely userspace is going to do KVM_SET_USER_MEMORY_REGION in
> > parallel, but that's not the point and it's not why I'm objecting to
> > holding the lock.
> > 
> > Holding the lock implies protection that is *not* provided.  You and I know
> > it's not needed for copy_{to,from}_user(), but look how long it's taken us
> > to get on the same page.  A future KVM developer comes along, sees this
> > code, and thinks "oh, I need to hold slots_lock to dereference a gfn", and
> > propagates the unnecessary locking to some other code.
> 
> At least for a user memory slot, we "need to hold slots_lock to
> dereference a gfn" (or srcu), right?

Gah, that was supposed to be "dereference a hva".  Yes, a gfn->hva lookup
requires slots_lock or SRCU read lock.
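
For the record, a sketch of the standard SRCU-protected lookup pattern
(return-value handling abbreviated):

	int idx = srcu_read_lock(&kvm->srcu);
	unsigned long hva = gfn_to_hva(kvm, gfn);

	if (!kvm_is_error_hva(hva))
		r = __copy_to_user((void __user *)hva, data, len);
	srcu_read_unlock(&kvm->srcu, idx);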

> You know I'm suffering from jetlag today; I thought I was still
> fine, but now I'm starting to doubt it. :-)

Unintentional gaslighting.  Or was it?  :-D

^ permalink raw reply	[flat|nested] 82+ messages in thread

end of thread, other threads:[~2020-01-31 22:20 UTC | newest]

Thread overview: 82+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-01-09 14:57 [PATCH v3 00/21] KVM: Dirty ring interface Peter Xu
2020-01-09 14:57 ` [PATCH v3 01/21] vfio: introduce vfio_iova_rw to read/write a range of IOVAs Peter Xu
2020-01-09 14:57 ` [PATCH v3 02/21] drm/i915/gvt: subsitute kvm_read/write_guest with vfio_iova_rw Peter Xu
2020-01-09 14:57 ` [PATCH v3 03/21] KVM: Remove kvm_read_guest_atomic() Peter Xu
2020-01-09 14:57 ` [PATCH v3 04/21] KVM: Add build-time error check on kvm_run size Peter Xu
2020-01-09 14:57 ` [PATCH v3 05/21] KVM: X86: Change parameter for fast_page_fault tracepoint Peter Xu
2020-01-09 14:57 ` [PATCH v3 06/21] KVM: X86: Don't take srcu lock in init_rmode_identity_map() Peter Xu
2020-01-09 14:57 ` [PATCH v3 07/21] KVM: Cache as_id in kvm_memory_slot Peter Xu
2020-01-09 14:57 ` [PATCH v3 08/21] KVM: X86: Drop x86_set_memory_region() Peter Xu
2020-01-09 14:57 ` [PATCH v3 09/21] KVM: X86: Don't track dirty for KVM_SET_[TSS_ADDR|IDENTITY_MAP_ADDR] Peter Xu
2020-01-19  9:01   ` Paolo Bonzini
2020-01-20  6:45     ` Peter Xu
2020-01-21 15:56   ` Sean Christopherson
2020-01-21 16:14     ` Paolo Bonzini
2020-01-28  5:50     ` Peter Xu
2020-01-28 18:24       ` Sean Christopherson
2020-01-31 15:08         ` Peter Xu
2020-01-31 19:33           ` Sean Christopherson
2020-01-31 20:28             ` Peter Xu
2020-01-31 20:36               ` Sean Christopherson
2020-01-31 20:55                 ` Peter Xu
2020-01-31 21:29                   ` Sean Christopherson
2020-01-31 22:16                     ` Peter Xu
2020-01-31 22:20                       ` Sean Christopherson
2020-01-09 14:57 ` [PATCH v3 10/21] KVM: Pass in kvm pointer into mark_page_dirty_in_slot() Peter Xu
2020-01-09 14:57 ` [PATCH v3 11/21] KVM: Move running VCPU from ARM to common code Peter Xu
2020-01-09 14:57 ` [PATCH v3 12/21] KVM: X86: Implement ring-based dirty memory tracking Peter Xu
2020-01-09 16:29   ` Michael S. Tsirkin
2020-01-09 16:56     ` Alex Williamson
2020-01-09 19:21       ` Peter Xu
2020-01-09 19:36         ` Michael S. Tsirkin
2020-01-09 19:15     ` Peter Xu
2020-01-09 19:35       ` Michael S. Tsirkin
2020-01-09 20:19         ` Peter Xu
2020-01-09 22:18           ` Michael S. Tsirkin
2020-01-10 15:29             ` Peter Xu
2020-01-12  6:24               ` Michael S. Tsirkin
2020-01-14 20:01         ` Peter Xu
2020-01-15  6:50           ` Michael S. Tsirkin
2020-01-15 15:20             ` Peter Xu
2020-01-19  9:09       ` Paolo Bonzini
2020-01-19 10:12         ` Michael S. Tsirkin
2020-01-20  7:29           ` Peter Xu
2020-01-20  7:47             ` Michael S. Tsirkin
2020-01-21  8:29               ` Peter Xu
2020-01-21 10:25                 ` Paolo Bonzini
2020-01-21 10:24             ` Paolo Bonzini
2020-01-11  4:49   ` kbuild test robot
2020-01-11 23:19   ` kbuild test robot
2020-01-15  6:47   ` Michael S. Tsirkin
2020-01-15 15:27     ` Peter Xu
2020-01-16  8:38   ` Michael S. Tsirkin
2020-01-16 16:27     ` Peter Xu
2020-01-17  9:50       ` Michael S. Tsirkin
2020-01-20  6:48         ` Peter Xu
2020-01-09 14:57 ` [PATCH v3 13/21] KVM: Make dirty ring exclusive to dirty bitmap log Peter Xu
2020-01-09 14:57 ` [PATCH v3 14/21] KVM: Don't allocate dirty bitmap if dirty ring is enabled Peter Xu
2020-01-09 16:41   ` Peter Xu
2020-01-09 14:57 ` [PATCH v3 15/21] KVM: selftests: Always clear dirty bitmap after iteration Peter Xu
2020-01-09 14:57 ` [PATCH v3 16/21] KVM: selftests: Sync uapi/linux/kvm.h to tools/ Peter Xu
2020-01-09 14:57 ` [PATCH v3 17/21] KVM: selftests: Use a single binary for dirty/clear log test Peter Xu
2020-01-09 14:57 ` [PATCH v3 18/21] KVM: selftests: Introduce after_vcpu_run hook for dirty " Peter Xu
2020-01-09 14:57 ` [PATCH v3 19/21] KVM: selftests: Add dirty ring buffer test Peter Xu
2020-01-09 14:57 ` [PATCH v3 20/21] KVM: selftests: Let dirty_log_test async for dirty ring test Peter Xu
2020-01-09 14:57 ` [PATCH v3 21/21] KVM: selftests: Add "-c" parameter to dirty log test Peter Xu
2020-01-09 15:59 ` [PATCH v3 00/21] KVM: Dirty ring interface Michael S. Tsirkin
2020-01-09 16:17   ` Peter Xu
2020-01-09 16:40     ` Michael S. Tsirkin
2020-01-09 17:08       ` Peter Xu
2020-01-09 19:08         ` Michael S. Tsirkin
2020-01-09 19:39           ` Peter Xu
2020-01-09 20:42             ` Paolo Bonzini
2020-01-09 22:28             ` Michael S. Tsirkin
2020-01-10 15:10               ` Peter Xu
2020-01-09 16:47 ` Alex Williamson
2020-01-09 17:58   ` Peter Xu
2020-01-09 19:13     ` Michael S. Tsirkin
2020-01-09 19:23       ` Peter Xu
2020-01-09 19:37         ` Michael S. Tsirkin
2020-01-09 20:51       ` Paolo Bonzini
2020-01-09 22:21         ` Michael S. Tsirkin
2020-01-19  9:11 ` Paolo Bonzini
