* [RFC PATCH 0/9] kvm: implement atomic memslot updates
@ 2022-09-09 10:44 Emanuele Giuseppe Esposito
  2022-09-09 10:44 ` [RFC PATCH 1/9] kvm_main.c: move slot check in kvm_set_memory_region Emanuele Giuseppe Esposito
                   ` (9 more replies)
  0 siblings, 10 replies; 58+ messages in thread
From: Emanuele Giuseppe Esposito @ 2022-09-09 10:44 UTC (permalink / raw)
  To: kvm
  Cc: Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, David Hildenbrand, Maxim Levitsky,
	x86, H. Peter Anvin, linux-kernel, Emanuele Giuseppe Esposito

KVM is currently capable of receiving a single memslot update through
the KVM_SET_USER_MEMORY_REGION ioctl.
The problem arises when we want to atomically perform multiple updates,
so that readers of the active memslot list avoid seeing incomplete states.

For example, in RHBZ https://bugzilla.redhat.com/show_bug.cgi?id=1979276
we see how non-atomic updates cause a boot failure, because vCPUs
will see a partial update (old memslot deleted, new one not yet created)
and will crash.

In this series we introduce KVM_SET_USER_MEMORY_REGION_LIST, a new ioctl
that takes a kvm_userspace_memory_region_list, a list of memslot updates,
and performs them atomically.
"atomically" in KVM words just means "apply all modifications to the
inactive memslot list, and then perform a single swap to replace the
active list with the inactive".
It is slightly more complicated that that, since DELETE and MOVE operations
require 2 swaps, but the main idea is the above.
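
As a rough illustration (just a sketch, not part of the series; includes and
error handling omitted), userspace could use the new ioctl to atomically
replace a memslot, using the uapi structures introduced in patch 2. vm_fd,
the OLD_*/NEW_* constants and new_hva are placeholders:

    struct kvm_userspace_memory_region_list *list;
    struct kvm_userspace_memory_region_entry *e;

    list = calloc(1, sizeof(*list) + 2 * sizeof(list->entries[0]));
    list->nent = 2;

    /* DELETE the old slot: memory_size == 0, marked for invalidation. */
    e = &list->entries[0];
    e->slot = OLD_SLOT_ID;
    e->memory_size = 0;
    e->invalidate_slot = 1;

    /* CREATE the replacement slot. */
    e = &list->entries[1];
    e->slot = NEW_SLOT_ID;
    e->guest_phys_addr = NEW_GPA;
    e->memory_size = NEW_SIZE;
    e->userspace_addr = (unsigned long)new_hva;

    if (ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION_LIST, list) < 0)
        err(1, "KVM_SET_USER_MEMORY_REGION_LIST");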

Patches 1-6 are just code movements, in preparation for the following patches.
Patch 7 allows the invalid slot to be present in both the inactive and active
memslot lists.
Patch 8 looks up the existing (old) memslot in the inactive list instead of the
active one, so that multiple memslot updates can be performed without swapping.
Patch 9 implements the ioctl logic.

QEMU userspace logic in preparation for the IOCTL is here:
https://patchew.org/QEMU/20220909081150.709060-1-eesposit@redhat.com/
"[RFC PATCH v2 0/3] accel/kvm: extend kvm memory listener to support"

TODOs and ideas:
- limit the size of the ioctl arguments. Right now it is unbounded
- try to reduce the number of swaps necessary? i.e. every DELETE/MOVE
  currently requires an additional swap
- add selftests
- add documentation

Emanuele Giuseppe Esposito (9):
  kvm_main.c: move slot check in kvm_set_memory_region
  kvm.h: introduce KVM_SET_USER_MEMORY_REGION_LIST ioctl
  kvm_main.c: introduce kvm_internal_memory_region_list
  kvm_main.c: split logic in kvm_set_memslots
  kvm_main.c: split __kvm_set_memory_region logic in kvm_check_mem and
    kvm_prepare_batch
  kvm_main.c: simplify change-specific callbacks
  kvm_main.c: duplicate invalid memslot also in inactive list
  kvm_main.c: find memslots from the inactive memslot list
  kvm_main.c: handle atomic memslot update

 arch/x86/kvm/x86.c       |   3 +-
 include/linux/kvm_host.h |  21 +-
 include/uapi/linux/kvm.h |  21 +-
 virt/kvm/kvm_main.c      | 420 +++++++++++++++++++++++++++++++--------
 4 files changed, 377 insertions(+), 88 deletions(-)

-- 
2.31.1



* [RFC PATCH 1/9] kvm_main.c: move slot check in kvm_set_memory_region
  2022-09-09 10:44 [RFC PATCH 0/9] kvm: implement atomic memslot updates Emanuele Giuseppe Esposito
@ 2022-09-09 10:44 ` Emanuele Giuseppe Esposito
  2022-09-28 16:41   ` Paolo Bonzini
  2022-09-09 10:44 ` [RFC PATCH 2/9] kvm.h: introduce KVM_SET_USER_MEMORY_REGION_LIST ioctl Emanuele Giuseppe Esposito
                   ` (8 subsequent siblings)
  9 siblings, 1 reply; 58+ messages in thread
From: Emanuele Giuseppe Esposito @ 2022-09-09 10:44 UTC (permalink / raw)
  To: kvm
  Cc: Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, David Hildenbrand, Maxim Levitsky,
	x86, H. Peter Anvin, linux-kernel, Emanuele Giuseppe Esposito

Also make kvm_set_memory_region() static, since it is not used outside
kvm_main.c.

Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
---
 include/linux/kvm_host.h |  2 --
 virt/kvm/kvm_main.c      | 11 +++++------
 2 files changed, 5 insertions(+), 8 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 3b40f8d68fbb..1c5b7b2e35dd 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1108,8 +1108,6 @@ enum kvm_mr_change {
 	KVM_MR_FLAGS_ONLY,
 };
 
-int kvm_set_memory_region(struct kvm *kvm,
-			  const struct kvm_userspace_memory_region *mem);
 int __kvm_set_memory_region(struct kvm *kvm,
 			    const struct kvm_userspace_memory_region *mem);
 void kvm_arch_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index da263c370d00..339de0ed4557 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2007,24 +2007,23 @@ int __kvm_set_memory_region(struct kvm *kvm,
 }
 EXPORT_SYMBOL_GPL(__kvm_set_memory_region);
 
-int kvm_set_memory_region(struct kvm *kvm,
-			  const struct kvm_userspace_memory_region *mem)
+static int kvm_set_memory_region(struct kvm *kvm,
+				 const struct kvm_userspace_memory_region *mem)
 {
 	int r;
 
+	if ((u16)mem->slot >= KVM_USER_MEM_SLOTS)
+		return -EINVAL;
+
 	mutex_lock(&kvm->slots_lock);
 	r = __kvm_set_memory_region(kvm, mem);
 	mutex_unlock(&kvm->slots_lock);
 	return r;
 }
-EXPORT_SYMBOL_GPL(kvm_set_memory_region);
 
 static int kvm_vm_ioctl_set_memory_region(struct kvm *kvm,
 					  struct kvm_userspace_memory_region *mem)
 {
-	if ((u16)mem->slot >= KVM_USER_MEM_SLOTS)
-		return -EINVAL;
-
 	return kvm_set_memory_region(kvm, mem);
 }
 
-- 
2.31.1



* [RFC PATCH 2/9] kvm.h: introduce KVM_SET_USER_MEMORY_REGION_LIST ioctl
  2022-09-09 10:44 [RFC PATCH 0/9] kvm: implement atomic memslot updates Emanuele Giuseppe Esposito
  2022-09-09 10:44 ` [RFC PATCH 1/9] kvm_main.c: move slot check in kvm_set_memory_region Emanuele Giuseppe Esposito
@ 2022-09-09 10:44 ` Emanuele Giuseppe Esposito
  2022-09-28 16:42   ` Paolo Bonzini
  2022-09-09 10:45 ` [RFC PATCH 3/9] kvm_main.c: introduce kvm_internal_memory_region_list Emanuele Giuseppe Esposito
                   ` (7 subsequent siblings)
  9 siblings, 1 reply; 58+ messages in thread
From: Emanuele Giuseppe Esposito @ 2022-09-09 10:44 UTC (permalink / raw)
  To: kvm
  Cc: Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, David Hildenbrand, Maxim Levitsky,
	x86, H. Peter Anvin, linux-kernel, Emanuele Giuseppe Esposito

This IOCTL enables atomic update of multiple memslots.
The userspace application provides a kvm_userspace_memory_region_list
containing a list of entries, each representing a modification to be
performed to a memslot.

Requests with invalidate_slot == 1 are pre-processed, because they
are either DELETE or MOVE: in those cases the memslot must first be
replaced with a copy marked as KVM_MEMSLOT_INVALID, and only then
deleted or replaced with the new one.
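
Note that the new entry type is a prefix-compatible extension of
struct kvm_userspace_memory_region: the first five fields are identical,
and invalidate_slot starts exactly where the existing struct ends, which
is what later lets the kernel reuse an entry as a
kvm_userspace_memory_region. A minimal compile-time check of this
assumption (not part of the patch, assumes headers with this change applied):

    #include <stddef.h>
    #include <linux/kvm.h>

    _Static_assert(offsetof(struct kvm_userspace_memory_region_entry, invalidate_slot) ==
                   sizeof(struct kvm_userspace_memory_region),
                   "entry must be a prefix-compatible extension of the old struct");
    _Static_assert(sizeof(struct kvm_userspace_memory_region_entry) == 64,
                   "entry size is part of the ABI");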

Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
---
 include/uapi/linux/kvm.h | 21 ++++++++++++++++++++-
 1 file changed, 20 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index a36e78710382..673496b91a25 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -103,6 +103,24 @@ struct kvm_userspace_memory_region {
 	__u64 userspace_addr; /* start of the userspace allocated memory */
 };
 
+/* for KVM_SET_USER_MEMORY_REGION_LIST */
+struct kvm_userspace_memory_region_entry {
+	__u32 slot;
+	__u32 flags;
+	__u64 guest_phys_addr;
+	__u64 memory_size; /* bytes */
+	__u64 userspace_addr; /* start of the userspace allocated memory */
+	__u8 invalidate_slot;
+	__u8 padding[31];
+};
+
+/* for KVM_SET_USER_MEMORY_REGION_LIST */
+struct kvm_userspace_memory_region_list {
+	__u32 nent;
+	__u32 flags;
+	struct kvm_userspace_memory_region_entry entries[0];
+};
+
 /*
  * The bit 0 ~ bit 15 of kvm_memory_region::flags are visible for userspace,
  * other bits are reserved for kvm internal use which are defined in
@@ -1444,7 +1462,8 @@ struct kvm_vfio_spapr_tce {
 					struct kvm_userspace_memory_region)
 #define KVM_SET_TSS_ADDR          _IO(KVMIO,   0x47)
 #define KVM_SET_IDENTITY_MAP_ADDR _IOW(KVMIO,  0x48, __u64)
-
+#define KVM_SET_USER_MEMORY_REGION_LIST _IOW(KVMIO, 0x49, \
+					struct kvm_userspace_memory_region_list)
 /* enable ucontrol for s390 */
 struct kvm_s390_ucas_mapping {
 	__u64 user_addr;
-- 
2.31.1



* [RFC PATCH 3/9] kvm_main.c: introduce kvm_internal_memory_region_list
  2022-09-09 10:44 [RFC PATCH 0/9] kvm: implement atomic memslot updates Emanuele Giuseppe Esposito
  2022-09-09 10:44 ` [RFC PATCH 1/9] kvm_main.c: move slot check in kvm_set_memory_region Emanuele Giuseppe Esposito
  2022-09-09 10:44 ` [RFC PATCH 2/9] kvm.h: introduce KVM_SET_USER_MEMORY_REGION_LIST ioctl Emanuele Giuseppe Esposito
@ 2022-09-09 10:45 ` Emanuele Giuseppe Esposito
  2022-09-28 16:48   ` Paolo Bonzini
  2022-09-09 10:45 ` [RFC PATCH 4/9] kvm_main.c: split logic in kvm_set_memslots Emanuele Giuseppe Esposito
                   ` (6 subsequent siblings)
  9 siblings, 1 reply; 58+ messages in thread
From: Emanuele Giuseppe Esposito @ 2022-09-09 10:45 UTC (permalink / raw)
  To: kvm
  Cc: Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, David Hildenbrand, Maxim Levitsky,
	x86, H. Peter Anvin, linux-kernel, Emanuele Giuseppe Esposito

For now this struct is only used to pass the new, old and change
variables in a single parameter instead of three.

In the future, it will also be used to carry additional information
and to handle atomic memslot updates.

Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
---
 arch/x86/kvm/x86.c       |  3 ++-
 include/linux/kvm_host.h | 15 +++++++++++-
 virt/kvm/kvm_main.c      | 52 +++++++++++++++++++++++++++-------------
 3 files changed, 51 insertions(+), 19 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 567d13405445..da5a5dd3d4bf 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -12155,13 +12155,14 @@ void __user * __x86_set_memory_region(struct kvm *kvm, int id, gpa_t gpa,
 
 	for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
 		struct kvm_userspace_memory_region m;
+		struct kvm_internal_memory_region_list b = { 0 };
 
 		m.slot = id | (i << 16);
 		m.flags = 0;
 		m.guest_phys_addr = gpa;
 		m.userspace_addr = hva;
 		m.memory_size = size;
-		r = __kvm_set_memory_region(kvm, &m);
+		r = __kvm_set_memory_region(kvm, &m, &b);
 		if (r < 0)
 			return ERR_PTR_USR(r);
 	}
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 1c5b7b2e35dd..69af94472b39 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1108,8 +1108,21 @@ enum kvm_mr_change {
 	KVM_MR_FLAGS_ONLY,
 };
 
+/*
+ * Internally used to atomically update multiple memslots.
+ * Must be always zeroed by the caller.
+ */
+struct kvm_internal_memory_region_list {
+	/* Fields initialized in __kvm_set_memory_region() */
+	struct kvm_memory_slot *old;
+	struct kvm_memory_slot *new;
+	struct kvm_memory_slot *invalid;
+	enum kvm_mr_change change;
+};
+
 int __kvm_set_memory_region(struct kvm *kvm,
-			    const struct kvm_userspace_memory_region *mem);
+			    const struct kvm_userspace_memory_region *mem,
+			    struct kvm_internal_memory_region_list *batch);
 void kvm_arch_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot);
 void kvm_arch_memslots_updated(struct kvm *kvm, u64 gen);
 int kvm_arch_prepare_memory_region(struct kvm *kvm,
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 339de0ed4557..e4fab15d0d4b 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1583,10 +1583,11 @@ static void kvm_swap_active_memslots(struct kvm *kvm, int as_id)
 }
 
 static int kvm_prepare_memory_region(struct kvm *kvm,
-				     const struct kvm_memory_slot *old,
-				     struct kvm_memory_slot *new,
-				     enum kvm_mr_change change)
+				     struct kvm_internal_memory_region_list *batch)
 {
+	struct kvm_memory_slot *old = batch->old;
+	struct kvm_memory_slot *new = batch->new;
+	enum kvm_mr_change change = batch->change;
 	int r;
 
 	/*
@@ -1621,10 +1622,12 @@ static int kvm_prepare_memory_region(struct kvm *kvm,
 }
 
 static void kvm_commit_memory_region(struct kvm *kvm,
-				     struct kvm_memory_slot *old,
-				     const struct kvm_memory_slot *new,
-				     enum kvm_mr_change change)
+				     struct kvm_internal_memory_region_list *batch)
 {
+	struct kvm_memory_slot *old = batch->old;
+	struct kvm_memory_slot *new = batch->new;
+	enum kvm_mr_change change = batch->change;
+
 	/*
 	 * Update the total number of memslot pages before calling the arch
 	 * hook so that architectures can consume the result directly.
@@ -1788,11 +1791,12 @@ static void kvm_update_flags_memslot(struct kvm *kvm,
 }
 
 static int kvm_set_memslot(struct kvm *kvm,
-			   struct kvm_memory_slot *old,
-			   struct kvm_memory_slot *new,
-			   enum kvm_mr_change change)
+			   struct kvm_internal_memory_region_list *batch)
 {
 	struct kvm_memory_slot *invalid_slot;
+	struct kvm_memory_slot *old = batch->old;
+	struct kvm_memory_slot *new = batch->new;
+	enum kvm_mr_change change = batch->change;
 	int r;
 
 	/*
@@ -1830,10 +1834,11 @@ static int kvm_set_memslot(struct kvm *kvm,
 			mutex_unlock(&kvm->slots_arch_lock);
 			return -ENOMEM;
 		}
+		batch->invalid = invalid_slot;
 		kvm_invalidate_memslot(kvm, old, invalid_slot);
 	}
 
-	r = kvm_prepare_memory_region(kvm, old, new, change);
+	r = kvm_prepare_memory_region(kvm, batch);
 	if (r) {
 		/*
 		 * For DELETE/MOVE, revert the above INVALID change.  No
@@ -1877,7 +1882,7 @@ static int kvm_set_memslot(struct kvm *kvm,
 	 * will directly hit the final, active memslot.  Architectures are
 	 * responsible for knowing that new->arch may be stale.
 	 */
-	kvm_commit_memory_region(kvm, old, new, change);
+	kvm_commit_memory_region(kvm, batch);
 
 	return 0;
 }
@@ -1900,11 +1905,14 @@ static bool kvm_check_memslot_overlap(struct kvm_memslots *slots, int id,
  * space.
  *
  * Discontiguous memory is allowed, mostly for framebuffers.
+ * This function takes also care of initializing batch->new/old/invalid/change
+ * fields.
  *
  * Must be called holding kvm->slots_lock for write.
  */
 int __kvm_set_memory_region(struct kvm *kvm,
-			    const struct kvm_userspace_memory_region *mem)
+			    const struct kvm_userspace_memory_region *mem,
+			    struct kvm_internal_memory_region_list *batch)
 {
 	struct kvm_memory_slot *old, *new;
 	struct kvm_memslots *slots;
@@ -1947,6 +1955,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
 	 * and/or destroyed by kvm_set_memslot().
 	 */
 	old = id_to_memslot(slots, id);
+	batch->old = old;
 
 	if (!mem->memory_size) {
 		if (!old || !old->npages)
@@ -1955,7 +1964,9 @@ int __kvm_set_memory_region(struct kvm *kvm,
 		if (WARN_ON_ONCE(kvm->nr_memslot_pages < old->npages))
 			return -EIO;
 
-		return kvm_set_memslot(kvm, old, NULL, KVM_MR_DELETE);
+		batch->change = KVM_MR_DELETE;
+		batch->new = NULL;
+		return kvm_set_memslot(kvm, batch);
 	}
 
 	base_gfn = (mem->guest_phys_addr >> PAGE_SHIFT);
@@ -1963,6 +1974,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
 
 	if (!old || !old->npages) {
 		change = KVM_MR_CREATE;
+		batch->old = NULL;
 
 		/*
 		 * To simplify KVM internals, the total number of pages across
@@ -2000,7 +2012,10 @@ int __kvm_set_memory_region(struct kvm *kvm,
 	new->flags = mem->flags;
 	new->userspace_addr = mem->userspace_addr;
 
-	r = kvm_set_memslot(kvm, old, new, change);
+	batch->new = new;
+	batch->change = change;
+
+	r = kvm_set_memslot(kvm, batch);
 	if (r)
 		kfree(new);
 	return r;
@@ -2008,7 +2023,8 @@ int __kvm_set_memory_region(struct kvm *kvm,
 EXPORT_SYMBOL_GPL(__kvm_set_memory_region);
 
 static int kvm_set_memory_region(struct kvm *kvm,
-				 const struct kvm_userspace_memory_region *mem)
+				 const struct kvm_userspace_memory_region *mem,
+				 struct kvm_internal_memory_region_list *batch)
 {
 	int r;
 
@@ -2016,7 +2032,7 @@ static int kvm_set_memory_region(struct kvm *kvm,
 		return -EINVAL;
 
 	mutex_lock(&kvm->slots_lock);
-	r = __kvm_set_memory_region(kvm, mem);
+	r = __kvm_set_memory_region(kvm, mem, batch);
 	mutex_unlock(&kvm->slots_lock);
 	return r;
 }
@@ -2024,7 +2040,9 @@ static int kvm_set_memory_region(struct kvm *kvm,
 static int kvm_vm_ioctl_set_memory_region(struct kvm *kvm,
 					  struct kvm_userspace_memory_region *mem)
 {
-	return kvm_set_memory_region(kvm, mem);
+	struct kvm_internal_memory_region_list batch = { 0 };
+
+	return kvm_set_memory_region(kvm, mem, &batch);
 }
 
 #ifndef CONFIG_KVM_GENERIC_DIRTYLOG_READ_PROTECT
-- 
2.31.1



* [RFC PATCH 4/9] kvm_main.c: split logic in kvm_set_memslots
  2022-09-09 10:44 [RFC PATCH 0/9] kvm: implement atomic memslot updates Emanuele Giuseppe Esposito
                   ` (2 preceding siblings ...)
  2022-09-09 10:45 ` [RFC PATCH 3/9] kvm_main.c: introduce kvm_internal_memory_region_list Emanuele Giuseppe Esposito
@ 2022-09-09 10:45 ` Emanuele Giuseppe Esposito
  2022-09-28 17:04   ` Paolo Bonzini
  2022-09-09 10:45 ` [RFC PATCH 5/9] kvm_main.c: split __kvm_set_memory_region logic in kvm_check_mem and kvm_prepare_batch Emanuele Giuseppe Esposito
                   ` (5 subsequent siblings)
  9 siblings, 1 reply; 58+ messages in thread
From: Emanuele Giuseppe Esposito @ 2022-09-09 10:45 UTC (permalink / raw)
  To: kvm
  Cc: Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, David Hildenbrand, Maxim Levitsky,
	x86, H. Peter Anvin, linux-kernel, Emanuele Giuseppe Esposito

At this point this is also just a code split, but it will later handle
atomic memslot updates (thus avoiding a swap of the memslot lists on
every update).

No functional change intended.

Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
---
 virt/kvm/kvm_main.c | 37 ++++++++++++++++++++++++++++++++-----
 1 file changed, 32 insertions(+), 5 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index e4fab15d0d4b..17f07546d591 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1790,12 +1790,15 @@ static void kvm_update_flags_memslot(struct kvm *kvm,
 	kvm_activate_memslot(kvm, old, new);
 }
 
-static int kvm_set_memslot(struct kvm *kvm,
-			   struct kvm_internal_memory_region_list *batch)
+/*
+ * Takes kvm->slots_arch_lock, and releases it only if
+ * invalid_slot allocation or kvm_prepare_memory_region failed.
+ */
+static int kvm_prepare_memslot(struct kvm *kvm,
+			       struct kvm_internal_memory_region_list *batch)
 {
 	struct kvm_memory_slot *invalid_slot;
 	struct kvm_memory_slot *old = batch->old;
-	struct kvm_memory_slot *new = batch->new;
 	enum kvm_mr_change change = batch->change;
 	int r;
 
@@ -1829,7 +1832,8 @@ static int kvm_set_memslot(struct kvm *kvm,
 	 * invalidation needs to be reverted.
 	 */
 	if (change == KVM_MR_DELETE || change == KVM_MR_MOVE) {
-		invalid_slot = kzalloc(sizeof(*invalid_slot), GFP_KERNEL_ACCOUNT);
+		invalid_slot = kzalloc(sizeof(*invalid_slot),
+				       GFP_KERNEL_ACCOUNT);
 		if (!invalid_slot) {
 			mutex_unlock(&kvm->slots_arch_lock);
 			return -ENOMEM;
@@ -1847,13 +1851,24 @@ static int kvm_set_memslot(struct kvm *kvm,
 		 * release slots_arch_lock.
 		 */
 		if (change == KVM_MR_DELETE || change == KVM_MR_MOVE) {
+			/* kvm_activate_memslot releases kvm->slots_arch_lock */
 			kvm_activate_memslot(kvm, invalid_slot, old);
 			kfree(invalid_slot);
 		} else {
 			mutex_unlock(&kvm->slots_arch_lock);
 		}
-		return r;
 	}
+	return r;
+}
+
+/* Must be called with kvm->slots_arch_lock held, but releases it. */
+static void kvm_finish_memslot(struct kvm *kvm,
+			       struct kvm_internal_memory_region_list *batch)
+{
+	struct kvm_memory_slot *invalid_slot = batch->invalid;
+	struct kvm_memory_slot *old = batch->old;
+	struct kvm_memory_slot *new = batch->new;
+	enum kvm_mr_change change = batch->change;
 
 	/*
 	 * For DELETE and MOVE, the working slot is now active as the INVALID
@@ -1883,6 +1898,18 @@ static int kvm_set_memslot(struct kvm *kvm,
 	 * responsible for knowing that new->arch may be stale.
 	 */
 	kvm_commit_memory_region(kvm, batch);
+}
+
+static int kvm_set_memslot(struct kvm *kvm,
+			   struct kvm_internal_memory_region_list *batch)
+{
+	int r;
+
+	r = kvm_prepare_memslot(kvm, batch);
+	if (r)
+		return r;
+
+	kvm_finish_memslot(kvm, batch);
 
 	return 0;
 }
-- 
2.31.1



* [RFC PATCH 5/9] kvm_main.c: split __kvm_set_memory_region logic in kvm_check_mem and kvm_prepare_batch
  2022-09-09 10:44 [RFC PATCH 0/9] kvm: implement atomic memslot updates Emanuele Giuseppe Esposito
                   ` (3 preceding siblings ...)
  2022-09-09 10:45 ` [RFC PATCH 4/9] kvm_main.c: split logic in kvm_set_memslots Emanuele Giuseppe Esposito
@ 2022-09-09 10:45 ` Emanuele Giuseppe Esposito
  2022-09-13  2:56   ` Yang, Weijiang
  2022-09-28 17:11   ` Paolo Bonzini
  2022-09-09 10:45 ` [RFC PATCH 6/9] kvm_main.c: simplify change-specific callbacks Emanuele Giuseppe Esposito
                   ` (4 subsequent siblings)
  9 siblings, 2 replies; 58+ messages in thread
From: Emanuele Giuseppe Esposito @ 2022-09-09 10:45 UTC (permalink / raw)
  To: kvm
  Cc: Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, David Hildenbrand, Maxim Levitsky,
	x86, H. Peter Anvin, linux-kernel, Emanuele Giuseppe Esposito

Just a function split. No functional change intended,
except for the fact that kvm_prepare_batch() does not
immediately call kvm_set_memslot() if batch->change is
KVM_MR_DELETE, but delegates that to the caller
(__kvm_set_memory_region).

Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
---
 virt/kvm/kvm_main.c | 120 +++++++++++++++++++++++++++++---------------
 1 file changed, 79 insertions(+), 41 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 17f07546d591..9d917af30593 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1927,19 +1927,9 @@ static bool kvm_check_memslot_overlap(struct kvm_memslots *slots, int id,
 	return false;
 }
 
-/*
- * Allocate some memory and give it an address in the guest physical address
- * space.
- *
- * Discontiguous memory is allowed, mostly for framebuffers.
- * This function takes also care of initializing batch->new/old/invalid/change
- * fields.
- *
- * Must be called holding kvm->slots_lock for write.
- */
-int __kvm_set_memory_region(struct kvm *kvm,
-			    const struct kvm_userspace_memory_region *mem,
-			    struct kvm_internal_memory_region_list *batch)
+static int kvm_prepare_batch(struct kvm *kvm,
+			     const struct kvm_userspace_memory_region *mem,
+			     struct kvm_internal_memory_region_list *batch)
 {
 	struct kvm_memory_slot *old, *new;
 	struct kvm_memslots *slots;
@@ -1947,34 +1937,10 @@ int __kvm_set_memory_region(struct kvm *kvm,
 	unsigned long npages;
 	gfn_t base_gfn;
 	int as_id, id;
-	int r;
-
-	r = check_memory_region_flags(mem);
-	if (r)
-		return r;
 
 	as_id = mem->slot >> 16;
 	id = (u16)mem->slot;
 
-	/* General sanity checks */
-	if ((mem->memory_size & (PAGE_SIZE - 1)) ||
-	    (mem->memory_size != (unsigned long)mem->memory_size))
-		return -EINVAL;
-	if (mem->guest_phys_addr & (PAGE_SIZE - 1))
-		return -EINVAL;
-	/* We can read the guest memory with __xxx_user() later on. */
-	if ((mem->userspace_addr & (PAGE_SIZE - 1)) ||
-	    (mem->userspace_addr != untagged_addr(mem->userspace_addr)) ||
-	     !access_ok((void __user *)(unsigned long)mem->userspace_addr,
-			mem->memory_size))
-		return -EINVAL;
-	if (as_id >= KVM_ADDRESS_SPACE_NUM || id >= KVM_MEM_SLOTS_NUM)
-		return -EINVAL;
-	if (mem->guest_phys_addr + mem->memory_size < mem->guest_phys_addr)
-		return -EINVAL;
-	if ((mem->memory_size >> PAGE_SHIFT) > KVM_MEM_MAX_NR_PAGES)
-		return -EINVAL;
-
 	slots = __kvm_memslots(kvm, as_id);
 
 	/*
@@ -1993,7 +1959,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
 
 		batch->change = KVM_MR_DELETE;
 		batch->new = NULL;
-		return kvm_set_memslot(kvm, batch);
+		return 0;
 	}
 
 	base_gfn = (mem->guest_phys_addr >> PAGE_SHIFT);
@@ -2020,7 +1986,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
 		else if (mem->flags != old->flags)
 			change = KVM_MR_FLAGS_ONLY;
 		else /* Nothing to change. */
-			return 0;
+			return 1;
 	}
 
 	if ((change == KVM_MR_CREATE || change == KVM_MR_MOVE) &&
@@ -2041,12 +2007,81 @@ int __kvm_set_memory_region(struct kvm *kvm,
 
 	batch->new = new;
 	batch->change = change;
+	return 0;
+}
+
+static int kvm_check_mem(const struct kvm_userspace_memory_region *mem)
+{
+	int as_id, id;
+
+	as_id = mem->slot >> 16;
+	id = (u16)mem->slot;
+
+	/* General sanity checks */
+	if ((mem->memory_size & (PAGE_SIZE - 1)) ||
+	    (mem->memory_size != (unsigned long)mem->memory_size))
+		return -EINVAL;
+	if (mem->guest_phys_addr & (PAGE_SIZE - 1))
+		return -EINVAL;
+	/* We can read the guest memory with __xxx_user() later on. */
+	if ((mem->userspace_addr & (PAGE_SIZE - 1)) ||
+	    (mem->userspace_addr != untagged_addr(mem->userspace_addr)) ||
+	     !access_ok((void __user *)(unsigned long)mem->userspace_addr,
+			mem->memory_size))
+		return -EINVAL;
+	if (as_id >= KVM_ADDRESS_SPACE_NUM || id >= KVM_MEM_SLOTS_NUM)
+		return -EINVAL;
+	if (mem->guest_phys_addr + mem->memory_size < mem->guest_phys_addr)
+		return -EINVAL;
+	if ((mem->memory_size >> PAGE_SHIFT) > KVM_MEM_MAX_NR_PAGES)
+		return -EINVAL;
+
+	return 0;
+}
+
+static int kvm_check_memory_region(struct kvm *kvm,
+				const struct kvm_userspace_memory_region *mem,
+				struct kvm_internal_memory_region_list *batch)
+{
+	int r;
+
+	r = check_memory_region_flags(mem);
+	if (r)
+		return r;
 
-	r = kvm_set_memslot(kvm, batch);
+	r = kvm_check_mem(mem);
 	if (r)
-		kfree(new);
+		return r;
+
+	r = kvm_prepare_batch(kvm, mem, batch);
+	if (r && batch->new)
+		kfree(batch->new);
+
 	return r;
 }
+
+/*
+ * Allocate some memory and give it an address in the guest physical address
+ * space.
+ *
+ * Discontiguous memory is allowed, mostly for framebuffers.
+ * This function takes also care of initializing batch->new/old/invalid/change
+ * fields.
+ *
+ * Must be called holding kvm->slots_lock for write.
+ */
+int __kvm_set_memory_region(struct kvm *kvm,
+			    const struct kvm_userspace_memory_region *mem,
+			    struct kvm_internal_memory_region_list *batch)
+{
+	int r;
+
+	r = kvm_check_memory_region(kvm, mem, batch);
+	if (r)
+		return r;
+
+	return kvm_set_memslot(kvm, batch);
+}
 EXPORT_SYMBOL_GPL(__kvm_set_memory_region);
 
 static int kvm_set_memory_region(struct kvm *kvm,
@@ -2061,6 +2096,9 @@ static int kvm_set_memory_region(struct kvm *kvm,
 	mutex_lock(&kvm->slots_lock);
 	r = __kvm_set_memory_region(kvm, mem, batch);
 	mutex_unlock(&kvm->slots_lock);
+	/* r == 1 means empty request, nothing to do but no error */
+	if (r == 1)
+		r = 0;
 	return r;
 }
 
-- 
2.31.1



* [RFC PATCH 6/9] kvm_main.c: simplify change-specific callbacks
  2022-09-09 10:44 [RFC PATCH 0/9] kvm: implement atomic memslot updates Emanuele Giuseppe Esposito
                   ` (4 preceding siblings ...)
  2022-09-09 10:45 ` [RFC PATCH 5/9] kvm_main.c: split __kvm_set_memory_region logic in kvm_check_mem and kvm_prepare_batch Emanuele Giuseppe Esposito
@ 2022-09-09 10:45 ` Emanuele Giuseppe Esposito
  2022-09-09 10:45 ` [RFC PATCH 7/9] kvm_main.c: duplicate invalid memslot also in inactive list Emanuele Giuseppe Esposito
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 58+ messages in thread
From: Emanuele Giuseppe Esposito @ 2022-09-09 10:45 UTC (permalink / raw)
  To: kvm
  Cc: Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, David Hildenbrand, Maxim Levitsky,
	x86, H. Peter Anvin, linux-kernel, Emanuele Giuseppe Esposito

Instead of replacing the memslot in the inactive list and activating it
in each change-specific function (CREATE, MOVE, DELETE, ...), make
kvm_set_memslot() replace the memslot in the current inactive list and
swap the lists; kvm_finish_memslot() then only takes care of updating
the new inactive list (which was the active one).

The pre-swap replacement can be generalized to replace(old, new)
because, even in a DELETE or MOVE operation, old always stays in the
inactive list (the one used by kvm_replace_memslot()).

Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
---
 virt/kvm/kvm_main.c | 44 +++++++++++++++++++++++---------------------
 1 file changed, 23 insertions(+), 21 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 9d917af30593..6b73615891f0 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1747,34 +1747,26 @@ static void kvm_invalidate_memslot(struct kvm *kvm,
 static void kvm_create_memslot(struct kvm *kvm,
 			       struct kvm_memory_slot *new)
 {
-	/* Add the new memslot to the inactive set and activate. */
+	/* Update inactive slot (was active) by adding the new slot */
 	kvm_replace_memslot(kvm, NULL, new);
-	kvm_activate_memslot(kvm, NULL, new);
 }
 
 static void kvm_delete_memslot(struct kvm *kvm,
-			       struct kvm_memory_slot *old,
 			       struct kvm_memory_slot *invalid_slot)
 {
-	/*
-	 * Remove the old memslot (in the inactive memslots) by passing NULL as
-	 * the "new" slot, and for the invalid version in the active slots.
-	 */
-	kvm_replace_memslot(kvm, old, NULL);
-	kvm_activate_memslot(kvm, invalid_slot, NULL);
+	/* Update inactive slot (was active) by removing the invalid slot */
+	kvm_replace_memslot(kvm, invalid_slot, NULL);
 }
 
 static void kvm_move_memslot(struct kvm *kvm,
-			     struct kvm_memory_slot *old,
 			     struct kvm_memory_slot *new,
 			     struct kvm_memory_slot *invalid_slot)
 {
 	/*
-	 * Replace the old memslot in the inactive slots, and then swap slots
-	 * and replace the current INVALID with the new as well.
+	 * Update inactive slot (was active) by removing the invalid slot
+	 * and adding the new one.
 	 */
-	kvm_replace_memslot(kvm, old, new);
-	kvm_activate_memslot(kvm, invalid_slot, new);
+	kvm_replace_memslot(kvm, invalid_slot, new);
 }
 
 static void kvm_update_flags_memslot(struct kvm *kvm,
@@ -1782,12 +1774,10 @@ static void kvm_update_flags_memslot(struct kvm *kvm,
 				     struct kvm_memory_slot *new)
 {
 	/*
-	 * Similar to the MOVE case, but the slot doesn't need to be zapped as
-	 * an intermediate step. Instead, the old memslot is simply replaced
-	 * with a new, updated copy in both memslot sets.
+	 * Update inactive slot (was active) by removing the old slot
+	 * and adding the new one.
 	 */
 	kvm_replace_memslot(kvm, old, new);
-	kvm_activate_memslot(kvm, old, new);
 }
 
 /*
@@ -1880,9 +1870,9 @@ static void kvm_finish_memslot(struct kvm *kvm,
 	if (change == KVM_MR_CREATE)
 		kvm_create_memslot(kvm, new);
 	else if (change == KVM_MR_DELETE)
-		kvm_delete_memslot(kvm, old, invalid_slot);
+		kvm_delete_memslot(kvm, invalid_slot);
 	else if (change == KVM_MR_MOVE)
-		kvm_move_memslot(kvm, old, new, invalid_slot);
+		kvm_move_memslot(kvm, new, invalid_slot);
 	else if (change == KVM_MR_FLAGS_ONLY)
 		kvm_update_flags_memslot(kvm, old, new);
 	else
@@ -1903,12 +1893,24 @@ static void kvm_finish_memslot(struct kvm *kvm,
 static int kvm_set_memslot(struct kvm *kvm,
 			   struct kvm_internal_memory_region_list *batch)
 {
-	int r;
+	int r, as_id;
 
 	r = kvm_prepare_memslot(kvm, batch);
 	if (r)
 		return r;
 
+	/*
+	 * if change is DELETE or MOVE, invalid is in active memslots
+	 * and old in inactive, so replace old with new.
+	 */
+	kvm_replace_memslot(kvm, batch->old, batch->new);
+
+	/* either old or invalid is the same, since invalid is old's copy */
+	as_id = kvm_memslots_get_as_id(batch->old, batch->new);
+
+	/* releases kvm->slots_arch_lock */
+	kvm_swap_active_memslots(kvm, as_id);
+
 	kvm_finish_memslot(kvm, batch);
 
 	return 0;
-- 
2.31.1



* [RFC PATCH 7/9] kvm_main.c: duplicate invalid memslot also in inactive list
  2022-09-09 10:44 [RFC PATCH 0/9] kvm: implement atomic memslot updates Emanuele Giuseppe Esposito
                   ` (5 preceding siblings ...)
  2022-09-09 10:45 ` [RFC PATCH 6/9] kvm_main.c: simplify change-specific callbacks Emanuele Giuseppe Esposito
@ 2022-09-09 10:45 ` Emanuele Giuseppe Esposito
  2022-09-28 17:18   ` Paolo Bonzini
  2022-09-09 10:45 ` [RFC PATCH 8/9] kvm_main.c: find memslots from the inactive memslot list Emanuele Giuseppe Esposito
                   ` (2 subsequent siblings)
  9 siblings, 1 reply; 58+ messages in thread
From: Emanuele Giuseppe Esposito @ 2022-09-09 10:45 UTC (permalink / raw)
  To: kvm
  Cc: Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, David Hildenbrand, Maxim Levitsky,
	x86, H. Peter Anvin, linux-kernel, Emanuele Giuseppe Esposito

In preparation for atomic memslot updates, make sure the invalid
memslot also replaces the old one in the inactive list.

This implies that when we want to insert the new slot for a MOVE,
or simply drop the existing one for a DELETE, we need to remove the
"invalid" slot, not the "old" one.

Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
---
 virt/kvm/kvm_main.c | 11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 6b73615891f0..31e46f9a06fa 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1830,6 +1830,7 @@ static int kvm_prepare_memslot(struct kvm *kvm,
 		}
 		batch->invalid = invalid_slot;
 		kvm_invalidate_memslot(kvm, old, invalid_slot);
+		kvm_replace_memslot(kvm, old, invalid_slot);
 	}
 
 	r = kvm_prepare_memory_region(kvm, batch);
@@ -1900,10 +1901,14 @@ static int kvm_set_memslot(struct kvm *kvm,
 		return r;
 
 	/*
-	 * if change is DELETE or MOVE, invalid is in active memslots
-	 * and old in inactive, so replace old with new.
+	 * if change is DELETE or MOVE, invalid is in both active and inactive
+	 * memslot list. This means that we don't need old anymore, and
+	 * we should replace invalid with new.
 	 */
-	kvm_replace_memslot(kvm, batch->old, batch->new);
+	if (batch->change == KVM_MR_DELETE || batch->change == KVM_MR_MOVE)
+		kvm_replace_memslot(kvm, batch->invalid, batch->new);
+	else
+		kvm_replace_memslot(kvm, batch->old, batch->new);
 
 	/* either old or invalid is the same, since invalid is old's copy */
 	as_id = kvm_memslots_get_as_id(batch->old, batch->new);
-- 
2.31.1



* [RFC PATCH 8/9] kvm_main.c: find memslots from the inactive memslot list
  2022-09-09 10:44 [RFC PATCH 0/9] kvm: implement atomic memslot updates Emanuele Giuseppe Esposito
                   ` (6 preceding siblings ...)
  2022-09-09 10:45 ` [RFC PATCH 7/9] kvm_main.c: duplicate invalid memslot also in inactive list Emanuele Giuseppe Esposito
@ 2022-09-09 10:45 ` Emanuele Giuseppe Esposito
  2022-09-09 10:45 ` [RFC PATCH 9/9] kvm_main.c: handle atomic memslot update Emanuele Giuseppe Esposito
  2022-09-09 14:30 ` [RFC PATCH 0/9] kvm: implement atomic memslot updates Sean Christopherson
  9 siblings, 0 replies; 58+ messages in thread
From: Emanuele Giuseppe Esposito @ 2022-09-09 10:45 UTC (permalink / raw)
  To: kvm
  Cc: Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, David Hildenbrand, Maxim Levitsky,
	x86, H. Peter Anvin, linux-kernel, Emanuele Giuseppe Esposito

Instead of looking at the active list, look at the inactive one.
This causes no harm to the current code, as the active and inactive
lists are identical at this point.

In addition, it provides flexibility for atomic memslot updates,
because in that case we want to perform multiple updates on the
inactive list first, and then perform only a single swap. If we used
the active list, previous updates that were not yet swapped would not
be seen, and the following logic in kvm_prepare_batch() could, for
example, find an old memslot that was already deleted in the inactive
list but not in the active one, and thus wrongly assume that the
incoming request is a MOVE and not a CREATE.

Note that this also causes no harm to the invalid memslot, since we
already insert it as a replacement in both the active and inactive
lists.

Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
---
 virt/kvm/kvm_main.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 31e46f9a06fa..ecd43560281c 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1948,7 +1948,7 @@ static int kvm_prepare_batch(struct kvm *kvm,
 	as_id = mem->slot >> 16;
 	id = (u16)mem->slot;
 
-	slots = __kvm_memslots(kvm, as_id);
+	slots = kvm_get_inactive_memslots(kvm, as_id);
 
 	/*
 	 * Note, the old memslot (and the pointer itself!) may be invalidated
-- 
2.31.1



* [RFC PATCH 9/9] kvm_main.c: handle atomic memslot update
  2022-09-09 10:44 [RFC PATCH 0/9] kvm: implement atomic memslot updates Emanuele Giuseppe Esposito
                   ` (7 preceding siblings ...)
  2022-09-09 10:45 ` [RFC PATCH 8/9] kvm_main.c: find memslots from the inactive memslot list Emanuele Giuseppe Esposito
@ 2022-09-09 10:45 ` Emanuele Giuseppe Esposito
  2022-09-13  2:30   ` Yang, Weijiang
                     ` (2 more replies)
  2022-09-09 14:30 ` [RFC PATCH 0/9] kvm: implement atomic memslot updates Sean Christopherson
  9 siblings, 3 replies; 58+ messages in thread
From: Emanuele Giuseppe Esposito @ 2022-09-09 10:45 UTC (permalink / raw)
  To: kvm
  Cc: Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, David Hildenbrand, Maxim Levitsky,
	x86, H. Peter Anvin, linux-kernel, Emanuele Giuseppe Esposito

When kvm_vm_ioctl_set_memory_region_list() is invoked, we need
to make sure that all memslots are updated in the inactive list
and then swap the lists (preferably only once), so that all
changes become visible at the same time.

The only issue is that DELETE and MOVE need to perform 2 swaps:
first replace the old memslot with the invalid one, and then remove
the invalid one.

Therefore we need to perform two passes over the memslot list provided
by the ioctl: first handle only the DELETE and MOVE requests, replacing
the old slot with an invalid one. This implies a swap for each such
memslot (the swaps could also be grouped into a single one, but for
simplicity we do not do that).
Userspace already marks the DELETE and MOVE requests with the flag
invalidate_slot == 1.

Then scan the list again and update the inactive memslot list.
Once the inactive memslot list is ready, swap it only once per
address space, and conclude the memslot handling.
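
The resulting flow can be condensed as follows (a simplified sketch of
the function added by this patch: locking, allocations and error
unwinding are omitted, swapped[] stands in for the as_to_swap array,
and i, mem, batch and list correspond to the local variables used below):

    /* Pass 1: DELETE/MOVE only -- publish the INVALID copy of the old slot. */
    for (i = 0; i < list->nent; i++) {
        if (!mem[i].invalidate_slot)
            continue;
        kvm_check_memory_region(kvm, (struct kvm_userspace_memory_region *)&mem[i],
                                &batch[i]);
        batch[i].is_move_delete = true;
        kvm_prepare_memslot(kvm, &batch[i]);  /* old slot is now INVALID in both lists */
        batch[i].is_move_delete = false;
    }

    /* Pass 2: apply every request, but only to the inactive memslot list. */
    for (i = 0; i < list->nent; i++) {
        batch[i].batch = true;
        kvm_set_memory_region(kvm, (struct kvm_userspace_memory_region *)&mem[i],
                              &batch[i]);
    }

    /* Pass 3: one swap per address space, then commit each change. */
    for (i = 0; i < list->nent; i++) {
        int as_id = kvm_memslots_get_as_id(batch[i].old, batch[i].new);

        if (!swapped[as_id]) {
            kvm_swap_active_memslots(kvm, as_id);
            swapped[as_id] = true;
        }
        kvm_finish_memslot(kvm, &batch[i]);
    }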

Regarding kvm->slots_arch_lock: it is always taken in
kvm_prepare_memslot(), but depending on batch->change and on which loop
of kvm_vm_ioctl_set_memory_region_list() we are in (first, second,
final), it is released in different places:

- change = DELETE or MOVE: in the first pass, it is released after
creating the invalid memslot and swapping.

- Second loop in kvm_vm_ioctl_set_memory_region_list():
it is released in kvm_set_memslot(), since batch->batch is true.

- Third loop in kvm_vm_ioctl_set_memory_region_list():
it is taken and released for each swap.

- Call from kvm_vm_ioctl_set_memory_region(): as before this patch,
it is acquired in kvm_prepare_memslot() and released in
kvm_swap_active_memslots().

Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
---
 include/linux/kvm_host.h |   4 +
 virt/kvm/kvm_main.c      | 185 ++++++++++++++++++++++++++++++++++++---
 2 files changed, 179 insertions(+), 10 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 69af94472b39..753650145836 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1118,6 +1118,10 @@ struct kvm_internal_memory_region_list {
 	struct kvm_memory_slot *new;
 	struct kvm_memory_slot *invalid;
 	enum kvm_mr_change change;
+
+	/* Fields that need to be set by the caller */
+	bool batch;
+	bool is_move_delete;
 };
 
 int __kvm_set_memory_region(struct kvm *kvm,
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index ecd43560281c..85ba05924f0c 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1782,7 +1782,8 @@ static void kvm_update_flags_memslot(struct kvm *kvm,
 
 /*
  * Takes kvm->slots_arch_lock, and releases it only if
- * invalid_slot allocation or kvm_prepare_memory_region failed.
+ * invalid_slot allocation, kvm_prepare_memory_region failed
+ * or batch->is_move_delete is true.
  */
 static int kvm_prepare_memslot(struct kvm *kvm,
 			       struct kvm_internal_memory_region_list *batch)
@@ -1822,15 +1823,26 @@ static int kvm_prepare_memslot(struct kvm *kvm,
 	 * invalidation needs to be reverted.
 	 */
 	if (change == KVM_MR_DELETE || change == KVM_MR_MOVE) {
-		invalid_slot = kzalloc(sizeof(*invalid_slot),
+		if (!batch->invalid) {
+			invalid_slot = kzalloc(sizeof(*invalid_slot),
 				       GFP_KERNEL_ACCOUNT);
-		if (!invalid_slot) {
+			if (!invalid_slot) {
+				mutex_unlock(&kvm->slots_arch_lock);
+				return -ENOMEM;
+			}
+			batch->invalid = invalid_slot;
+			kvm_invalidate_memslot(kvm, old, invalid_slot);
+			kvm_replace_memslot(kvm, old, invalid_slot);
+		}
+
+		/*
+		 * We are in first pass of kvm_vm_ioctl_set_memory_region_list()
+		 * just take care of making the old slot invalid and visible.
+		 */
+		if (batch->is_move_delete) {
 			mutex_unlock(&kvm->slots_arch_lock);
-			return -ENOMEM;
+			return 0;
 		}
-		batch->invalid = invalid_slot;
-		kvm_invalidate_memslot(kvm, old, invalid_slot);
-		kvm_replace_memslot(kvm, old, invalid_slot);
 	}
 
 	r = kvm_prepare_memory_region(kvm, batch);
@@ -1896,6 +1908,13 @@ static int kvm_set_memslot(struct kvm *kvm,
 {
 	int r, as_id;
 
+	/*
+	 * First, prepare memslot. If DELETE and MOVE, take care of creating
+	 * the invalid slot and use it to replace the old one.
+	 * In order to keep things simple, allow each single update
+	 * to be immediately visibile by swapping the active and inactive
+	 * memory slot arrays.
+	 */
 	r = kvm_prepare_memslot(kvm, batch);
 	if (r)
 		return r;
@@ -1910,6 +1929,12 @@ static int kvm_set_memslot(struct kvm *kvm,
 	else
 		kvm_replace_memslot(kvm, batch->old, batch->new);
 
+	/* Caller has to manually commit changes afterwards */
+	if (batch->batch) {
+		mutex_unlock(&kvm->slots_arch_lock);
+		return r;
+	}
+
 	/* either old or invalid is the same, since invalid is old's copy */
 	as_id = kvm_memslots_get_as_id(batch->old, batch->new);
 
@@ -2083,9 +2108,11 @@ int __kvm_set_memory_region(struct kvm *kvm,
 {
 	int r;
 
-	r = kvm_check_memory_region(kvm, mem, batch);
-	if (r)
-		return r;
+	if (!batch->old && !batch->new && !batch->invalid) {
+		r = kvm_check_memory_region(kvm, mem, batch);
+		if (r)
+			return r;
+	}
 
 	return kvm_set_memslot(kvm, batch);
 }
@@ -2117,6 +2144,133 @@ static int kvm_vm_ioctl_set_memory_region(struct kvm *kvm,
 	return kvm_set_memory_region(kvm, mem, &batch);
 }
 
+static int kvm_vm_ioctl_set_memory_region_list(struct kvm *kvm,
+		       struct kvm_userspace_memory_region_list *list,
+		       struct kvm_userspace_memory_region_entry __user *mem_arg)
+{
+	struct kvm_userspace_memory_region_entry *mem, *m_iter;
+	struct kvm_userspace_memory_region *mem_region;
+	struct kvm_internal_memory_region_list *batch, *b_iter;
+	int i, r = 0;
+	bool *as_to_swap;
+
+	/* TODO: limit the number of mem to a max? */
+
+	if (!list->nent)
+		return r;
+
+	mem = vmemdup_user(mem_arg, array_size(sizeof(*mem), list->nent));
+	if (IS_ERR(mem)) {
+		r = PTR_ERR(mem);
+		goto out;
+	}
+
+	batch = kcalloc(list->nent, sizeof(*batch), GFP_KERNEL_ACCOUNT);
+	if (IS_ERR(batch)) {
+		r = PTR_ERR(batch);
+		goto out2;
+	}
+
+	as_to_swap = kcalloc(KVM_ADDRESS_SPACE_NUM, sizeof(bool),
+			     GFP_KERNEL_ACCOUNT);
+	if (IS_ERR(as_to_swap)) {
+		r = PTR_ERR(as_to_swap);
+		goto out3;
+	}
+
+	m_iter = mem;
+	b_iter = batch;
+	/*
+	 * First pass: handle all DELETE and MOVE requests
+	 * (since they swap active and inactive memslots)
+	 * by switching old memslot in invalid.
+	 */
+	mutex_lock(&kvm->slots_lock);
+	for (i = 0; i < list->nent; i++) {
+
+		if ((u16)m_iter->slot >= KVM_USER_MEM_SLOTS) {
+			r = -EINVAL;
+			goto out4;
+		}
+
+		if (!m_iter->invalidate_slot) {
+			m_iter++;
+			b_iter++;
+			continue;
+		}
+
+		mem_region = (struct kvm_userspace_memory_region *) m_iter;
+		r = kvm_check_memory_region(kvm, mem_region, b_iter);
+		if (r) {
+			mutex_unlock(&kvm->slots_lock);
+			goto out4;
+		}
+
+		WARN_ON(b_iter->change != KVM_MR_DELETE &&
+			b_iter->change != KVM_MR_MOVE);
+
+		b_iter->is_move_delete = true;
+		r = kvm_prepare_memslot(kvm, b_iter);
+		b_iter->is_move_delete = false;
+		if (r < 0) {
+			mutex_unlock(&kvm->slots_lock);
+			goto out4;
+		}
+		m_iter++;
+		b_iter++;
+	}
+	mutex_unlock(&kvm->slots_lock);
+
+
+	m_iter = mem;
+	b_iter = batch;
+	/*
+	 * Second pass: handle all other request types
+	 * but do not swap the memslots array yet.
+	 */
+	for (i = 0; i < list->nent; i++) {
+		b_iter->batch = true;
+		as_to_swap[m_iter->slot >> 16] = true;
+		mem_region = (struct kvm_userspace_memory_region *) m_iter;
+		r = kvm_set_memory_region(kvm, mem_region, b_iter);
+		if (r < 0)
+			goto out4;
+		m_iter++;
+		b_iter++;
+	}
+
+	/*
+	 * Conclude by swapping the memslot lists and updating the inactive set too.
+	 */
+	b_iter = batch;
+	mutex_lock(&kvm->slots_lock);
+	mutex_lock(&kvm->slots_arch_lock);
+	for (i = 0; i < list->nent; i++) {
+		int as_id;
+
+		as_id = kvm_memslots_get_as_id(b_iter->old, b_iter->new);
+		if (as_to_swap[as_id]) {
+			kvm_swap_active_memslots(kvm, as_id);
+			mutex_lock(&kvm->slots_arch_lock);
+			as_to_swap[as_id] = false;
+		}
+
+		kvm_finish_memslot(kvm, b_iter);
+		b_iter++;
+	}
+	mutex_unlock(&kvm->slots_arch_lock);
+	mutex_unlock(&kvm->slots_lock);
+
+out4:
+	kfree(as_to_swap);
+out3:
+	kfree(batch);
+out2:
+	kvfree(mem);
+out:
+	return r;
+}
+
 #ifndef CONFIG_KVM_GENERIC_DIRTYLOG_READ_PROTECT
 /**
  * kvm_get_dirty_log - get a snapshot of dirty pages
@@ -4732,6 +4886,17 @@ static long kvm_vm_ioctl(struct file *filp,
 		r = kvm_vm_ioctl_set_memory_region(kvm, &kvm_userspace_mem);
 		break;
 	}
+	case KVM_SET_USER_MEMORY_REGION_LIST: {
+		struct kvm_userspace_memory_region_list __user *mem_arg = argp;
+		struct kvm_userspace_memory_region_list mem;
+
+		r = -EFAULT;
+		if (copy_from_user(&mem, mem_arg, sizeof(mem)))
+			goto out;
+
+		r = kvm_vm_ioctl_set_memory_region_list(kvm, &mem, mem_arg->entries);
+		break;
+	}
 	case KVM_GET_DIRTY_LOG: {
 		struct kvm_dirty_log log;
 
-- 
2.31.1



* Re: [RFC PATCH 0/9] kvm: implement atomic memslot updates
  2022-09-09 10:44 [RFC PATCH 0/9] kvm: implement atomic memslot updates Emanuele Giuseppe Esposito
                   ` (8 preceding siblings ...)
  2022-09-09 10:45 ` [RFC PATCH 9/9] kvm_main.c: handle atomic memslot update Emanuele Giuseppe Esposito
@ 2022-09-09 14:30 ` Sean Christopherson
  2022-09-18 16:13   ` Emanuele Giuseppe Esposito
  9 siblings, 1 reply; 58+ messages in thread
From: Sean Christopherson @ 2022-09-09 14:30 UTC (permalink / raw)
  To: Emanuele Giuseppe Esposito
  Cc: kvm, Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, David Hildenbrand, Maxim Levitsky, x86,
	H. Peter Anvin, linux-kernel

On Fri, Sep 09, 2022, Emanuele Giuseppe Esposito wrote:
> KVM is currently capable of receiving a single memslot update through
> the KVM_SET_USER_MEMORY_REGION ioctl.
> The problem arises when we want to atomically perform multiple updates,
> so that readers of memslot active list avoid seeing incomplete states.
> 
> For example, in RHBZ https://bugzilla.redhat.com/show_bug.cgi?id=1979276

I don't have access.  Can you provide a TL;DR?

> we see how non atomic updates cause boot failure, because vcpus
> will se a partial update (old memslot delete, new one not yet created)
> and will crash.

Why not simply pause vCPUs in this scenario?  This is an awful lot of complexity
to take on for something that appears to be solvable in userspace.

And if the issue is related to KVM disallowing the toggling of read-only (can't see
the bug), we can likely solve that without needing a new ioctl() that allows
userspace to batch an arbitrary number of updates.


* Re: [RFC PATCH 9/9] kvm_main.c: handle atomic memslot update
  2022-09-09 10:45 ` [RFC PATCH 9/9] kvm_main.c: handle atomic memslot update Emanuele Giuseppe Esposito
@ 2022-09-13  2:30   ` Yang, Weijiang
  2022-09-18 16:18     ` Emanuele Giuseppe Esposito
  2022-09-27  7:46   ` David Hildenbrand
  2022-09-28 17:29   ` Paolo Bonzini
  2 siblings, 1 reply; 58+ messages in thread
From: Yang, Weijiang @ 2022-09-13  2:30 UTC (permalink / raw)
  To: Emanuele Giuseppe Esposito, kvm
  Cc: Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, David Hildenbrand, Maxim Levitsky,
	x86, H. Peter Anvin, linux-kernel


On 9/9/2022 6:45 PM, Emanuele Giuseppe Esposito wrote:
> When kvm_vm_ioctl_set_memory_region_list() is invoked, we need
> to make sure that all memslots are updated in the inactive list
> and then swap (preferreably only once) the lists, so that all
> changes are visible immediately.
[...]
> +static int kvm_vm_ioctl_set_memory_region_list(struct kvm *kvm,
> +		       struct kvm_userspace_memory_region_list *list,
> +		       struct kvm_userspace_memory_region_entry __user *mem_arg)
> +{
> +	struct kvm_userspace_memory_region_entry *mem, *m_iter;
> +	struct kvm_userspace_memory_region *mem_region;
> +	struct kvm_internal_memory_region_list *batch, *b_iter;
> +	int i, r = 0;
> +	bool *as_to_swap;
> +
> +	/* TODO: limit the number of mem to a max? */
> +
> +	if (!list->nent)
> +		return r;
> +
> +	mem = vmemdup_user(mem_arg, array_size(sizeof(*mem), list->nent));
> +	if (IS_ERR(mem)) {
> +		r = PTR_ERR(mem);
> +		goto out;
> +	}

IMO, it's more natural to dup the user memory in the first place, i.e. in
kvm_vm_ioctl(); it also makes the error-exit paths shorter.


[...]


* Re: [RFC PATCH 5/9] kvm_main.c: split __kvm_set_memory_region logic in kvm_check_mem and kvm_prepare_batch
  2022-09-09 10:45 ` [RFC PATCH 5/9] kvm_main.c: split __kvm_set_memory_region logic in kvm_check_mem and kvm_prepare_batch Emanuele Giuseppe Esposito
@ 2022-09-13  2:56   ` Yang, Weijiang
  2022-09-18 16:22     ` Emanuele Giuseppe Esposito
  2022-09-28 17:11   ` Paolo Bonzini
  1 sibling, 1 reply; 58+ messages in thread
From: Yang, Weijiang @ 2022-09-13  2:56 UTC (permalink / raw)
  To: Emanuele Giuseppe Esposito
  Cc: Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, David Hildenbrand, Maxim Levitsky,
	x86, H. Peter Anvin, linux-kernel, kvm


On 9/9/2022 6:45 PM, Emanuele Giuseppe Esposito wrote:
> Just a function split. No functional change intended,
> except for the fact that kvm_prepare_batch() does not
> immediately call kvm_set_memslot() if batch->change is
> KVM_MR_DELETE, but delegates the caller (__kvm_set_memory_region).
>
>
> +static int kvm_check_memory_region(struct kvm *kvm,
> +				const struct kvm_userspace_memory_region *mem,
> +				struct kvm_internal_memory_region_list *batch)
> +{
> +	int r;
> +
> +	r = check_memory_region_flags(mem);
> +	if (r)
> +		return r;
>   
> -	r = kvm_set_memslot(kvm, batch);
> +	r = kvm_check_mem(mem);
>   	if (r)
> -		kfree(new);
> +		return r;
> +
> +	r = kvm_prepare_batch(kvm, mem, batch);
> +	if (r && batch->new)
> +		kfree(batch->new);
 From the patch, r != 0 and batch->new != NULL are mutually exclusive, so the
kfree() here is not reachable.
> +
>   	return r;
>   }
[...]
>   


* Re: [RFC PATCH 0/9] kvm: implement atomic memslot updates
  2022-09-09 14:30 ` [RFC PATCH 0/9] kvm: implement atomic memslot updates Sean Christopherson
@ 2022-09-18 16:13   ` Emanuele Giuseppe Esposito
  2022-09-19  7:38     ` Like Xu
  2022-09-19  7:53     ` David Hildenbrand
  0 siblings, 2 replies; 58+ messages in thread
From: Emanuele Giuseppe Esposito @ 2022-09-18 16:13 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Maxim Levitsky
  Cc: kvm, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	David Hildenbrand, x86, H. Peter Anvin, linux-kernel



Am 09/09/2022 um 16:30 schrieb Sean Christopherson:
> On Fri, Sep 09, 2022, Emanuele Giuseppe Esposito wrote:
>> KVM is currently capable of receiving a single memslot update through
>> the KVM_SET_USER_MEMORY_REGION ioctl.
>> The problem arises when we want to atomically perform multiple updates,
>> so that readers of memslot active list avoid seeing incomplete states.
>>
>> For example, in RHBZ https://bugzilla.redhat.com/show_bug.cgi?id=1979276
> 
> I don't have access.  Can you provide a TL;DR?

You should be able to have access to it now.

> 
>> we see how non atomic updates cause boot failure, because vcpus
>> will se a partial update (old memslot delete, new one not yet created)
>> and will crash.
> 
> Why not simply pause vCPUs in this scenario?  This is an awful lot of a complexity
> to take on for something that appears to be solvable in userspace. 
> 

I think it is not that easy to solve in userspace: see
https://lore.kernel.org/qemu-devel/20200312161217.3590-1-david@redhat.com/


"Using pause_all_vcpus()/resume_all_vcpus() is not possible, as it will
temporarily drop the BQL - something most callers can't handle (esp.
when called from vcpu context e.g., in virtio code)."

Probably @Paolo and @Maxim can add more to this.

Thank you,
Emanuele



* Re: [RFC PATCH 9/9] kvm_main.c: handle atomic memslot update
  2022-09-13  2:30   ` Yang, Weijiang
@ 2022-09-18 16:18     ` Emanuele Giuseppe Esposito
  0 siblings, 0 replies; 58+ messages in thread
From: Emanuele Giuseppe Esposito @ 2022-09-18 16:18 UTC (permalink / raw)
  To: Yang, Weijiang, kvm
  Cc: Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, David Hildenbrand, Maxim Levitsky,
	x86, H. Peter Anvin, linux-kernel



Am 13/09/2022 um 04:30 schrieb Yang, Weijiang:
> 
> On 9/9/2022 6:45 PM, Emanuele Giuseppe Esposito wrote:
>> When kvm_vm_ioctl_set_memory_region_list() is invoked, we need
>> to make sure that all memslots are updated in the inactive list
>> and then swap (preferreably only once) the lists, so that all
>> changes are visible immediately.
> [...]
>> +static int kvm_vm_ioctl_set_memory_region_list(struct kvm *kvm,
>> +               struct kvm_userspace_memory_region_list *list,
>> +               struct kvm_userspace_memory_region_entry __user *mem_arg)
>> +{
>> +    struct kvm_userspace_memory_region_entry *mem, *m_iter;
>> +    struct kvm_userspace_memory_region *mem_region;
>> +    struct kvm_internal_memory_region_list *batch, *b_iter;
>> +    int i, r = 0;
>> +    bool *as_to_swap;
>> +
>> +    /* TODO: limit the number of mem to a max? */
>> +
>> +    if (!list->nent)
>> +        return r;
>> +
>> +    mem = vmemdup_user(mem_arg, array_size(sizeof(*mem), list->nent));
>> +    if (IS_ERR(mem)) {
>> +        r = PTR_ERR(mem);
>> +        goto out;
>> +    }
> 
> IMO, it's more natural to dup the user memory at the first place, i.e.,
> kvm_vm_ioctl,
> 
> it also makes the outlets shorter.
> 

I followed the same pattern as kvm_vcpu_ioctl_set_cpuid2, which performs
the user memory dup inside the call :)

I see your point but I guess it's better to keep all ioctl
implementations similar.

Thank you,
Emanuele


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH 5/9] kvm_main.c: split __kvm_set_memory_region logic in kvm_check_mem and kvm_prepare_batch
  2022-09-13  2:56   ` Yang, Weijiang
@ 2022-09-18 16:22     ` Emanuele Giuseppe Esposito
  0 siblings, 0 replies; 58+ messages in thread
From: Emanuele Giuseppe Esposito @ 2022-09-18 16:22 UTC (permalink / raw)
  To: Yang, Weijiang
  Cc: Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, David Hildenbrand, Maxim Levitsky,
	x86, H. Peter Anvin, linux-kernel, kvm



Am 13/09/2022 um 04:56 schrieb Yang, Weijiang:
> 
> On 9/9/2022 6:45 PM, Emanuele Giuseppe Esposito wrote:
>> Just a function split. No functional change intended,
>> except for the fact that kvm_prepare_batch() does not
>> immediately call kvm_set_memslot() if batch->change is
>> KVM_MR_DELETE, but delegates the caller (__kvm_set_memory_region).
>>
>>
>> +static int kvm_check_memory_region(struct kvm *kvm,
>> +                const struct kvm_userspace_memory_region *mem,
>> +                struct kvm_internal_memory_region_list *batch)
>> +{
>> +    int r;
>> +
>> +    r = check_memory_region_flags(mem);
>> +    if (r)
>> +        return r;
>>   -    r = kvm_set_memslot(kvm, batch);
>> +    r = kvm_check_mem(mem);
>>       if (r)
>> -        kfree(new);
>> +        return r;
>> +
>> +    r = kvm_prepare_batch(kvm, mem, batch);
>> +    if (r && batch->new)
>> +        kfree(batch->new);
> From the patch, r !=0 and batch->new !=NULL are exclusive, so free()
> here is not reachable.

Good point, I'll get rid of this.

Thank you,
Emanuele
>> +
>>       return r;
>>   }
> [...]
>>   
> 


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH 0/9] kvm: implement atomic memslot updates
  2022-09-18 16:13   ` Emanuele Giuseppe Esposito
@ 2022-09-19  7:38     ` Like Xu
  2022-09-19  7:53     ` David Hildenbrand
  1 sibling, 0 replies; 58+ messages in thread
From: Like Xu @ 2022-09-19  7:38 UTC (permalink / raw)
  To: Emanuele Giuseppe Esposito, Sean Christopherson
  Cc: kvm, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	David Hildenbrand, x86, H. Peter Anvin, linux-kernel,
	Maxim Levitsky, Paolo Bonzini

On 19/9/2022 12:13 am, Emanuele Giuseppe Esposito wrote:
> 
> Am 09/09/2022 um 16:30 schrieb Sean Christopherson:
>> On Fri, Sep 09, 2022, Emanuele Giuseppe Esposito wrote:
>>> KVM is currently capable of receiving a single memslot update through
>>> the KVM_SET_USER_MEMORY_REGION ioctl.
>>> The problem arises when we want to atomically perform multiple updates,
>>> so that readers of memslot active list avoid seeing incomplete states.
>>>
>>> For example, in RHBZhttps://bugzilla.redhat.com/show_bug.cgi?id=1979276

Oh, thanks for stepping up to try to address it.

As it turns out, this issue was discovered "long" before
https://bugzilla.kernel.org/show_bug.cgi?id=213781

As a comment, relevant selftests are necessary and required.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH 0/9] kvm: implement atomic memslot updates
  2022-09-18 16:13   ` Emanuele Giuseppe Esposito
  2022-09-19  7:38     ` Like Xu
@ 2022-09-19  7:53     ` David Hildenbrand
  2022-09-19 17:30       ` David Hildenbrand
  1 sibling, 1 reply; 58+ messages in thread
From: David Hildenbrand @ 2022-09-19  7:53 UTC (permalink / raw)
  To: Emanuele Giuseppe Esposito, Sean Christopherson, Paolo Bonzini,
	Maxim Levitsky
  Cc: kvm, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, linux-kernel

On 18.09.22 18:13, Emanuele Giuseppe Esposito wrote:
> 
> 
> Am 09/09/2022 um 16:30 schrieb Sean Christopherson:
>> On Fri, Sep 09, 2022, Emanuele Giuseppe Esposito wrote:
>>> KVM is currently capable of receiving a single memslot update through
>>> the KVM_SET_USER_MEMORY_REGION ioctl.
>>> The problem arises when we want to atomically perform multiple updates,
>>> so that readers of memslot active list avoid seeing incomplete states.
>>>
>>> For example, in RHBZ https://bugzilla.redhat.com/show_bug.cgi?id=1979276
>>
>> I don't have access.  Can you provide a TL;DR?
> 
> You should be able to have access to it now.
> 
>>
>>> we see how non atomic updates cause boot failure, because vcpus
>>> will se a partial update (old memslot delete, new one not yet created)
>>> and will crash.
>>
>> Why not simply pause vCPUs in this scenario?  This is an awful lot of a complexity
>> to take on for something that appears to be solvable in userspace.
>>
> 
> I think it is not that easy to solve in userspace: see
> https://lore.kernel.org/qemu-devel/20200312161217.3590-1-david@redhat.com/
> 
> 
> "Using pause_all_vcpus()/resume_all_vcpus() is not possible, as it will
> temporarily drop the BQL - something most callers can't handle (esp.
> when called from vcpu context e.g., in virtio code)."

Can you please comment on the bigger picture? My patch works 
around *exactly that*, and for that reason contains that comment.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH 0/9] kvm: implement atomic memslot updates
  2022-09-19  7:53     ` David Hildenbrand
@ 2022-09-19 17:30       ` David Hildenbrand
  2022-09-23 13:10         ` Emanuele Giuseppe Esposito
  0 siblings, 1 reply; 58+ messages in thread
From: David Hildenbrand @ 2022-09-19 17:30 UTC (permalink / raw)
  To: Emanuele Giuseppe Esposito, Sean Christopherson, Paolo Bonzini,
	Maxim Levitsky
  Cc: kvm, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, linux-kernel, Like Xu

On 19.09.22 09:53, David Hildenbrand wrote:
> On 18.09.22 18:13, Emanuele Giuseppe Esposito wrote:
>>
>>
>> Am 09/09/2022 um 16:30 schrieb Sean Christopherson:
>>> On Fri, Sep 09, 2022, Emanuele Giuseppe Esposito wrote:
>>>> KVM is currently capable of receiving a single memslot update through
>>>> the KVM_SET_USER_MEMORY_REGION ioctl.
>>>> The problem arises when we want to atomically perform multiple updates,
>>>> so that readers of memslot active list avoid seeing incomplete states.
>>>>
>>>> For example, in RHBZ https://bugzilla.redhat.com/show_bug.cgi?id=1979276
>>>
>>> I don't have access.  Can you provide a TL;DR?
>>
>> You should be able to have access to it now.
>>
>>>
>>>> we see how non atomic updates cause boot failure, because vcpus
>>>> will se a partial update (old memslot delete, new one not yet created)
>>>> and will crash.
>>>
>>> Why not simply pause vCPUs in this scenario?  This is an awful lot of a complexity
>>> to take on for something that appears to be solvable in userspace.
>>>
>>
>> I think it is not that easy to solve in userspace: see
>> https://lore.kernel.org/qemu-devel/20200312161217.3590-1-david@redhat.com/
>>
>>
>> "Using pause_all_vcpus()/resume_all_vcpus() is not possible, as it will
>> temporarily drop the BQL - something most callers can't handle (esp.
>> when called from vcpu context e.g., in virtio code)."
> 
> Can you please comment on the bigger picture? The patch from me works
> around *exactly that*, and for that reason, contains that comment.
> 

FWIW, I hacked up my RFC to perform atomic updates on any memslot 
transactions (not just resizes) where ranges to add overlap with ranges 
to remove.

https://github.com/davidhildenbrand/qemu/tree/memslot


I only performed a simple boot check under x86-64 (where I can see region 
resizes) and some make checks -- pretty sure it has some rough edges; 
but it should indicate what's possible and what the possible price might 
be. [one could wire up a new KVM ioctl and call it conditionally on 
support if really required]



-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH 0/9] kvm: implement atomic memslot updates
  2022-09-19 17:30       ` David Hildenbrand
@ 2022-09-23 13:10         ` Emanuele Giuseppe Esposito
  2022-09-23 13:21           ` David Hildenbrand
  0 siblings, 1 reply; 58+ messages in thread
From: Emanuele Giuseppe Esposito @ 2022-09-23 13:10 UTC (permalink / raw)
  To: David Hildenbrand, Sean Christopherson, Paolo Bonzini, Maxim Levitsky
  Cc: kvm, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, linux-kernel, Like Xu



Am 19/09/2022 um 19:30 schrieb David Hildenbrand:
> On 19.09.22 09:53, David Hildenbrand wrote:
>> On 18.09.22 18:13, Emanuele Giuseppe Esposito wrote:
>>>
>>>
>>> Am 09/09/2022 um 16:30 schrieb Sean Christopherson:
>>>> On Fri, Sep 09, 2022, Emanuele Giuseppe Esposito wrote:
>>>>> KVM is currently capable of receiving a single memslot update through
>>>>> the KVM_SET_USER_MEMORY_REGION ioctl.
>>>>> The problem arises when we want to atomically perform multiple
>>>>> updates,
>>>>> so that readers of memslot active list avoid seeing incomplete states.
>>>>>
>>>>> For example, in RHBZ
>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1979276
>>>>
>>>> I don't have access.  Can you provide a TL;DR?
>>>
>>> You should be able to have access to it now.
>>>
>>>>
>>>>> we see how non atomic updates cause boot failure, because vcpus
>>>>> will se a partial update (old memslot delete, new one not yet created)
>>>>> and will crash.
>>>>
>>>> Why not simply pause vCPUs in this scenario?  This is an awful lot
>>>> of a complexity
>>>> to take on for something that appears to be solvable in userspace.
>>>>
>>>
>>> I think it is not that easy to solve in userspace: see
>>> https://lore.kernel.org/qemu-devel/20200312161217.3590-1-david@redhat.com/
>>>
>>>
>>>
>>> "Using pause_all_vcpus()/resume_all_vcpus() is not possible, as it will
>>> temporarily drop the BQL - something most callers can't handle (esp.
>>> when called from vcpu context e.g., in virtio code)."
>>
>> Can you please comment on the bigger picture? The patch from me works
>> around *exactly that*, and for that reason, contains that comment.
>>
> 
> FWIW, I hacked up my RFC to perform atomic updates on any memslot
> transactions (not just resizes) where ranges do add overlap with ranges
> to remove.
> 
> https://github.com/davidhildenbrand/qemu/tree/memslot
> 
> 
> I only performed simple boot check under x86-64 (where I can see region
> resizes) and some make checks -- pretty sure it has some rough edges;
> but should indicate what's possible and what the possible price might
> be. [one could wire up a new KVM ioctl and call it conditionally on
> support if really required]
> 

A benefit of my ioctl implementation is that it could also be used by other
hypervisors, which then do not need to share this kind of "hack".
However, after also talking with Maxim and Paolo, we all agreed that the
main disadvantage of your approach is that it does not scale with the
number of vcpus. Stopping *all* vcpus just to allow a memslot update is
also inferior to what KVM does (KVM only pauses the vCPUs that access the
modified memslots instead).

So I took some measurements to see what the performance difference is
between my implementation and yours. I used a machine where I could
replicate the bug mentioned in the bugzilla, an AMD EPYC 7413 24-Core
Processor with kernel 5.19.0 (+ my patches).

Then I measured the time QEMU spends in kvm_commit (ie in
memslot updates) while booting a VM. In other words, if kvm_commit takes
10 ms and QEMU calls it 20 times, "time to boot" is 200 ms. kvm_commit is
not called anymore after boot, so this measurement is easy to compare
over multiple invocations of QEMU.

I ran the tests with different numbers of cores: 1,2,4,8,16,32. The QEMU
command is the same one used to replicate the bug:
./qemu-system-x86_64 --overcommit cpu-pm=on --smp $v --accel kvm
--display none >> ~/smp_$v;

Each boot is repeated 100 times, and from the results I compute the
average and stddev (in milliseconds).

ioctl:
-smp 1:        Average: 2.1ms        Stdev: 0.8ms
-smp 2:        Average: 2.5ms        Stdev: 1.5ms
-smp 4:        Average: 2.2ms        Stdev: 1.1ms
-smp 8:        Average: 2.4ms        Stdev: 0.7ms
-smp 16:       Average: 3.6ms        Stdev: 2.4ms  (1000 repetitions)
-smp 24:       Average: 12.5ms        Stdev: 0.9ms  (1000 repetitions)


pause/resume: (https://github.com/davidhildenbrand/qemu/tree/memslot)
-smp 1:        Average: 2.2ms        Stdev: 1.2ms
-smp 2:        Average: 3.0ms        Stdev: 1.4ms
-smp 4:        Average: 3.1ms        Stdev: 1.3m
-smp 8:        Average: 3.4ms        Stdev: 1.4ms
-smp 16:       Average: 12ms        Stdev: 7.0ms  (1000 repetitions)
-smp 24:       Average: 20ms        Stdev: 7.3ms  (1000 repetitions)


Above 24 vCPUs performance gets worse quickly, but I think it's already
quite clear that the ioctl results scale better as the number of
vcpus increases, while pausing the vCPUs already becomes slower with 16
vcpus.

Thank you,
Emanuele


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH 0/9] kvm: implement atomic memslot updates
  2022-09-23 13:10         ` Emanuele Giuseppe Esposito
@ 2022-09-23 13:21           ` David Hildenbrand
  2022-09-23 13:38             ` Emanuele Giuseppe Esposito
  0 siblings, 1 reply; 58+ messages in thread
From: David Hildenbrand @ 2022-09-23 13:21 UTC (permalink / raw)
  To: Emanuele Giuseppe Esposito, Sean Christopherson, Paolo Bonzini,
	Maxim Levitsky
  Cc: kvm, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, linux-kernel, Like Xu

On 23.09.22 15:10, Emanuele Giuseppe Esposito wrote:
> 
> 
> Am 19/09/2022 um 19:30 schrieb David Hildenbrand:
>> On 19.09.22 09:53, David Hildenbrand wrote:
>>> On 18.09.22 18:13, Emanuele Giuseppe Esposito wrote:
>>>>
>>>>
>>>> Am 09/09/2022 um 16:30 schrieb Sean Christopherson:
>>>>> On Fri, Sep 09, 2022, Emanuele Giuseppe Esposito wrote:
>>>>>> KVM is currently capable of receiving a single memslot update through
>>>>>> the KVM_SET_USER_MEMORY_REGION ioctl.
>>>>>> The problem arises when we want to atomically perform multiple
>>>>>> updates,
>>>>>> so that readers of memslot active list avoid seeing incomplete states.
>>>>>>
>>>>>> For example, in RHBZ
>>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1979276
>>>>>
>>>>> I don't have access.  Can you provide a TL;DR?
>>>>
>>>> You should be able to have access to it now.
>>>>
>>>>>
>>>>>> we see how non atomic updates cause boot failure, because vcpus
>>>>>> will se a partial update (old memslot delete, new one not yet created)
>>>>>> and will crash.
>>>>>
>>>>> Why not simply pause vCPUs in this scenario?  This is an awful lot
>>>>> of a complexity
>>>>> to take on for something that appears to be solvable in userspace.
>>>>>
>>>>
>>>> I think it is not that easy to solve in userspace: see
>>>> https://lore.kernel.org/qemu-devel/20200312161217.3590-1-david@redhat.com/
>>>>
>>>>
>>>>
>>>> "Using pause_all_vcpus()/resume_all_vcpus() is not possible, as it will
>>>> temporarily drop the BQL - something most callers can't handle (esp.
>>>> when called from vcpu context e.g., in virtio code)."
>>>
>>> Can you please comment on the bigger picture? The patch from me works
>>> around *exactly that*, and for that reason, contains that comment.
>>>
>>
>> FWIW, I hacked up my RFC to perform atomic updates on any memslot
>> transactions (not just resizes) where ranges do add overlap with ranges
>> to remove.
>>
>> https://github.com/davidhildenbrand/qemu/tree/memslot
>>
>>
>> I only performed simple boot check under x86-64 (where I can see region
>> resizes) and some make checks -- pretty sure it has some rough edges;
>> but should indicate what's possible and what the possible price might
>> be. [one could wire up a new KVM ioctl and call it conditionally on
>> support if really required]
>>
> 
> A benefit of my ioctl implementation is that could be also used by other
> hypervisors, which then do not need to share this kind of "hack".
> However, after also talking with Maxim and Paolo, we all agreed that the
> main disadvantage of your approach is that is not scalable with the
> number of vcpus. It is also inferior to stop *all* vcpus just to allow a
> memslot update (KVM only pauses vCPUs that access the modified memslots
> instead).
> 
> So I took some measurements, to see what is the performance difference
> between my implementation and yours. I used a machine where I could
> replicate the bug mentioned in bugzilla, an AMD EPYC 7413 24-Core
> Processor with kernel 5.19.0 (+ my patches).
> 
> Then I measured the time it takes that QEMU spends in kvm_commit (ie in
> memslot updates) while booting a VM. In other words, if kvm_commit takes
> 10 ms and QEMU calls it 20 times, "time to boot" is 200ms. kvm_commit is
> not called anymore after boot, so this measurement is easy to compare
> over multiple invocations of QEMU.
> 
> I ran the tests with different amount of cores: 1,2,4,8,16,32. QEMU
> command is the same to replicate the bug:
> ./qemu-system-x86_64 --overcommit cpu-pm=on --smp $v --accel kvm
> --display none >> ~/smp_$v;
> 
> Each boot is reproduced 100 times, and then from results I measure
> average and stddev (in milliseconds).
> 
> ioctl:
> -smp 1:        Average: 2.1ms        Stdev: 0.8ms
> -smp 2:        Average: 2.5ms        Stdev: 1.5ms
> -smp 4:        Average: 2.2ms        Stdev: 1.1ms
> -smp 8:        Average: 2.4ms        Stdev: 0.7ms
> -smp 16:       Average: 3.6ms        Stdev: 2.4ms  (1000 repetitions)
> -smp 24:       Average: 12.5ms        Stdev: 0.9ms  (1000 repetitions)
> 
> 
> pause/resume: (https://github.com/davidhildenbrand/qemu/tree/memslot)
> -smp 1:        Average: 2.2ms        Stdev: 1.2ms
> -smp 2:        Average: 3.0ms        Stdev: 1.4ms
> -smp 4:        Average: 3.1ms        Stdev: 1.3m
> -smp 8:        Average: 3.4ms        Stdev: 1.4ms
> -smp 16:       Average: 12ms        Stdev: 7.0ms  (1000 repetitions)
> -smp 24:       Average: 20ms        Stdev: 7.3ms  (1000 repetitions)
> 
> 
> Above 24 vCPUs performance gets worse quickly but I think it's already
> quite clear that the results for ioctl scale better as the number of
> vcpus increases, while pausing the vCPUs becomes slower already with 16
> vcpus.

Right, the question is whether it happens often enough that we even 
care, and whether there are no ways to mitigate it.

It doesn't necessarily have to scale with the #VCPUs I think. What 
should dominate the overall time, in theory, is how long it takes for one 
VCPU (the slowest one) to leave the kernel.

I wondered if

1) it might be easier to have a single KVM mechanism/call to kick all 
VCPUs out of KVM instead of doing it per VCPU. That alone might already 
speed things up considerably.

2) One thing I wondered is whether the biggest overhead is actually 
taking the locks in QEMU rather than waiting for the VCPUs. Maybe 
we could optimize that as well. (for now I use one lock per VCPU because 
it felt like it would reduce the ioctl overhead; maybe there is a better 
alternative to balance between both users)

So treat my patch as a completely unoptimized version.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH 0/9] kvm: implement atomic memslot updates
  2022-09-23 13:21           ` David Hildenbrand
@ 2022-09-23 13:38             ` Emanuele Giuseppe Esposito
  2022-09-26  9:03               ` David Hildenbrand
  0 siblings, 1 reply; 58+ messages in thread
From: Emanuele Giuseppe Esposito @ 2022-09-23 13:38 UTC (permalink / raw)
  To: David Hildenbrand, Sean Christopherson, Paolo Bonzini, Maxim Levitsky
  Cc: kvm, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, linux-kernel, Like Xu



Am 23/09/2022 um 15:21 schrieb David Hildenbrand:
> On 23.09.22 15:10, Emanuele Giuseppe Esposito wrote:
>>
>>
>> Am 19/09/2022 um 19:30 schrieb David Hildenbrand:
>>> On 19.09.22 09:53, David Hildenbrand wrote:
>>>> On 18.09.22 18:13, Emanuele Giuseppe Esposito wrote:
>>>>>
>>>>>
>>>>> Am 09/09/2022 um 16:30 schrieb Sean Christopherson:
>>>>>> On Fri, Sep 09, 2022, Emanuele Giuseppe Esposito wrote:
>>>>>>> KVM is currently capable of receiving a single memslot update
>>>>>>> through
>>>>>>> the KVM_SET_USER_MEMORY_REGION ioctl.
>>>>>>> The problem arises when we want to atomically perform multiple
>>>>>>> updates,
>>>>>>> so that readers of memslot active list avoid seeing incomplete
>>>>>>> states.
>>>>>>>
>>>>>>> For example, in RHBZ
>>>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1979276
>>>>>>
>>>>>> I don't have access.  Can you provide a TL;DR?
>>>>>
>>>>> You should be able to have access to it now.
>>>>>
>>>>>>
>>>>>>> we see how non atomic updates cause boot failure, because vcpus
>>>>>>> will se a partial update (old memslot delete, new one not yet
>>>>>>> created)
>>>>>>> and will crash.
>>>>>>
>>>>>> Why not simply pause vCPUs in this scenario?  This is an awful lot
>>>>>> of a complexity
>>>>>> to take on for something that appears to be solvable in userspace.
>>>>>>
>>>>>
>>>>> I think it is not that easy to solve in userspace: see
>>>>> https://lore.kernel.org/qemu-devel/20200312161217.3590-1-david@redhat.com/
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> "Using pause_all_vcpus()/resume_all_vcpus() is not possible, as it
>>>>> will
>>>>> temporarily drop the BQL - something most callers can't handle (esp.
>>>>> when called from vcpu context e.g., in virtio code)."
>>>>
>>>> Can you please comment on the bigger picture? The patch from me works
>>>> around *exactly that*, and for that reason, contains that comment.
>>>>
>>>
>>> FWIW, I hacked up my RFC to perform atomic updates on any memslot
>>> transactions (not just resizes) where ranges do add overlap with ranges
>>> to remove.
>>>
>>> https://github.com/davidhildenbrand/qemu/tree/memslot
>>>
>>>
>>> I only performed simple boot check under x86-64 (where I can see region
>>> resizes) and some make checks -- pretty sure it has some rough edges;
>>> but should indicate what's possible and what the possible price might
>>> be. [one could wire up a new KVM ioctl and call it conditionally on
>>> support if really required]
>>>
>>
>> A benefit of my ioctl implementation is that could be also used by other
>> hypervisors, which then do not need to share this kind of "hack".
>> However, after also talking with Maxim and Paolo, we all agreed that the
>> main disadvantage of your approach is that is not scalable with the
>> number of vcpus. It is also inferior to stop *all* vcpus just to allow a
>> memslot update (KVM only pauses vCPUs that access the modified memslots
>> instead).
>>
>> So I took some measurements, to see what is the performance difference
>> between my implementation and yours. I used a machine where I could
>> replicate the bug mentioned in bugzilla, an AMD EPYC 7413 24-Core
>> Processor with kernel 5.19.0 (+ my patches).
>>
>> Then I measured the time it takes that QEMU spends in kvm_commit (ie in
>> memslot updates) while booting a VM. In other words, if kvm_commit takes
>> 10 ms and QEMU calls it 20 times, "time to boot" is 200ms. kvm_commit is
>> not called anymore after boot, so this measurement is easy to compare
>> over multiple invocations of QEMU.
>>
>> I ran the tests with different amount of cores: 1,2,4,8,16,32. QEMU
>> command is the same to replicate the bug:
>> ./qemu-system-x86_64 --overcommit cpu-pm=on --smp $v --accel kvm
>> --display none >> ~/smp_$v;
>>
>> Each boot is reproduced 100 times, and then from results I measure
>> average and stddev (in milliseconds).
>>
>> ioctl:
>> -smp 1:        Average: 2.1ms        Stdev: 0.8ms
>> -smp 2:        Average: 2.5ms        Stdev: 1.5ms
>> -smp 4:        Average: 2.2ms        Stdev: 1.1ms
>> -smp 8:        Average: 2.4ms        Stdev: 0.7ms
>> -smp 16:       Average: 3.6ms        Stdev: 2.4ms  (1000 repetitions)
>> -smp 24:       Average: 12.5ms        Stdev: 0.9ms  (1000 repetitions)
>>
>>
>> pause/resume: (https://github.com/davidhildenbrand/qemu/tree/memslot)
>> -smp 1:        Average: 2.2ms        Stdev: 1.2ms
>> -smp 2:        Average: 3.0ms        Stdev: 1.4ms
>> -smp 4:        Average: 3.1ms        Stdev: 1.3m
>> -smp 8:        Average: 3.4ms        Stdev: 1.4ms
>> -smp 16:       Average: 12ms        Stdev: 7.0ms  (1000 repetitions)
>> -smp 24:       Average: 20ms        Stdev: 7.3ms  (1000 repetitions)
>>
>>
>> Above 24 vCPUs performance gets worse quickly but I think it's already
>> quite clear that the results for ioctl scale better as the number of
>> vcpus increases, while pausing the vCPUs becomes slower already with 16
>> vcpus.
> 
> Right, the question is if it happens sufficiently enough that we even
> care and if there are not ways to mitigate.
> 
> It doesn't necessarily have to scale with the #VCPUs I think. What
> should dominate the overall time in theory how long it takes for one
> VCPU (the slowest one) to leave the kernel.
> 
> I wondered if
> 
> 1) it might be easier to have a single KVM mechanism/call to kick all
> VCPUs out of KVM instead of doing it per VCPU. That might speed up
> things eventually heavily already.

So if I understand correctly, this implies creating a new ioctl in KVM
anyway? What would then be the difference with what I do? We are
touching the kernel anyway.

> 
> 2) One thing I wondered is whether the biggest overhead is actually
> taking the locks in QEMU and not actually waiting for the VCPUs. Maybe
> we could optimize that as well. (for now I use one lock per VCPU because
> it felt like it would reduce the ioctl overhead; maybe there is a better
> alternative to balance between both users)
> 
> So treat my patch as a completely unoptimized version.
> 
For what it's worth, my version also performs #invalidate+1 swaps, which
is not optimized.

Honestly, I don't see how the above is easier or simpler than what is
being proposed here.

Thank you,
Emanuele


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH 0/9] kvm: implement atomic memslot updates
  2022-09-23 13:38             ` Emanuele Giuseppe Esposito
@ 2022-09-26  9:03               ` David Hildenbrand
  2022-09-26 21:28                 ` Sean Christopherson
  0 siblings, 1 reply; 58+ messages in thread
From: David Hildenbrand @ 2022-09-26  9:03 UTC (permalink / raw)
  To: Emanuele Giuseppe Esposito, Sean Christopherson, Paolo Bonzini,
	Maxim Levitsky
  Cc: kvm, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, linux-kernel, Like Xu

On 23.09.22 15:38, Emanuele Giuseppe Esposito wrote:
> 
> 
> Am 23/09/2022 um 15:21 schrieb David Hildenbrand:
>> On 23.09.22 15:10, Emanuele Giuseppe Esposito wrote:
>>>
>>>
>>> Am 19/09/2022 um 19:30 schrieb David Hildenbrand:
>>>> On 19.09.22 09:53, David Hildenbrand wrote:
>>>>> On 18.09.22 18:13, Emanuele Giuseppe Esposito wrote:
>>>>>>
>>>>>>
>>>>>> Am 09/09/2022 um 16:30 schrieb Sean Christopherson:
>>>>>>> On Fri, Sep 09, 2022, Emanuele Giuseppe Esposito wrote:
>>>>>>>> KVM is currently capable of receiving a single memslot update
>>>>>>>> through
>>>>>>>> the KVM_SET_USER_MEMORY_REGION ioctl.
>>>>>>>> The problem arises when we want to atomically perform multiple
>>>>>>>> updates,
>>>>>>>> so that readers of memslot active list avoid seeing incomplete
>>>>>>>> states.
>>>>>>>>
>>>>>>>> For example, in RHBZ
>>>>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1979276
>>>>>>>
>>>>>>> I don't have access.  Can you provide a TL;DR?
>>>>>>
>>>>>> You should be able to have access to it now.
>>>>>>
>>>>>>>
>>>>>>>> we see how non atomic updates cause boot failure, because vcpus
>>>>>>>> will se a partial update (old memslot delete, new one not yet
>>>>>>>> created)
>>>>>>>> and will crash.
>>>>>>>
>>>>>>> Why not simply pause vCPUs in this scenario?  This is an awful lot
>>>>>>> of a complexity
>>>>>>> to take on for something that appears to be solvable in userspace.
>>>>>>>
>>>>>>
>>>>>> I think it is not that easy to solve in userspace: see
>>>>>> https://lore.kernel.org/qemu-devel/20200312161217.3590-1-david@redhat.com/
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> "Using pause_all_vcpus()/resume_all_vcpus() is not possible, as it
>>>>>> will
>>>>>> temporarily drop the BQL - something most callers can't handle (esp.
>>>>>> when called from vcpu context e.g., in virtio code)."
>>>>>
>>>>> Can you please comment on the bigger picture? The patch from me works
>>>>> around *exactly that*, and for that reason, contains that comment.
>>>>>
>>>>
>>>> FWIW, I hacked up my RFC to perform atomic updates on any memslot
>>>> transactions (not just resizes) where ranges do add overlap with ranges
>>>> to remove.
>>>>
>>>> https://github.com/davidhildenbrand/qemu/tree/memslot
>>>>
>>>>
>>>> I only performed simple boot check under x86-64 (where I can see region
>>>> resizes) and some make checks -- pretty sure it has some rough edges;
>>>> but should indicate what's possible and what the possible price might
>>>> be. [one could wire up a new KVM ioctl and call it conditionally on
>>>> support if really required]
>>>>
>>>
>>> A benefit of my ioctl implementation is that could be also used by other
>>> hypervisors, which then do not need to share this kind of "hack".
>>> However, after also talking with Maxim and Paolo, we all agreed that the
>>> main disadvantage of your approach is that is not scalable with the
>>> number of vcpus. It is also inferior to stop *all* vcpus just to allow a
>>> memslot update (KVM only pauses vCPUs that access the modified memslots
>>> instead).
>>>
>>> So I took some measurements, to see what is the performance difference
>>> between my implementation and yours. I used a machine where I could
>>> replicate the bug mentioned in bugzilla, an AMD EPYC 7413 24-Core
>>> Processor with kernel 5.19.0 (+ my patches).
>>>
>>> Then I measured the time it takes that QEMU spends in kvm_commit (ie in
>>> memslot updates) while booting a VM. In other words, if kvm_commit takes
>>> 10 ms and QEMU calls it 20 times, "time to boot" is 200ms. kvm_commit is
>>> not called anymore after boot, so this measurement is easy to compare
>>> over multiple invocations of QEMU.
>>>
>>> I ran the tests with different amount of cores: 1,2,4,8,16,32. QEMU
>>> command is the same to replicate the bug:
>>> ./qemu-system-x86_64 --overcommit cpu-pm=on --smp $v --accel kvm
>>> --display none >> ~/smp_$v;
>>>
>>> Each boot is reproduced 100 times, and then from results I measure
>>> average and stddev (in milliseconds).
>>>
>>> ioctl:
>>> -smp 1:        Average: 2.1ms        Stdev: 0.8ms
>>> -smp 2:        Average: 2.5ms        Stdev: 1.5ms
>>> -smp 4:        Average: 2.2ms        Stdev: 1.1ms
>>> -smp 8:        Average: 2.4ms        Stdev: 0.7ms
>>> -smp 16:       Average: 3.6ms        Stdev: 2.4ms  (1000 repetitions)
>>> -smp 24:       Average: 12.5ms        Stdev: 0.9ms  (1000 repetitions)
>>>
>>>
>>> pause/resume: (https://github.com/davidhildenbrand/qemu/tree/memslot)
>>> -smp 1:        Average: 2.2ms        Stdev: 1.2ms
>>> -smp 2:        Average: 3.0ms        Stdev: 1.4ms
>>> -smp 4:        Average: 3.1ms        Stdev: 1.3m
>>> -smp 8:        Average: 3.4ms        Stdev: 1.4ms
>>> -smp 16:       Average: 12ms        Stdev: 7.0ms  (1000 repetitions)
>>> -smp 24:       Average: 20ms        Stdev: 7.3ms  (1000 repetitions)
>>>
>>>
>>> Above 24 vCPUs performance gets worse quickly but I think it's already
>>> quite clear that the results for ioctl scale better as the number of
>>> vcpus increases, while pausing the vCPUs becomes slower already with 16
>>> vcpus.
>>
>> Right, the question is if it happens sufficiently enough that we even
>> care and if there are not ways to mitigate.
>>
>> It doesn't necessarily have to scale with the #VCPUs I think. What
>> should dominate the overall time in theory how long it takes for one
>> VCPU (the slowest one) to leave the kernel.
>>
>> I wondered if
>>
>> 1) it might be easier to have a single KVM mechanism/call to kick all
>> VCPUs out of KVM instead of doing it per VCPU. That might speed up
>> things eventually heavily already.
> 
> So if I understand correclty, this implies creating a new ioctl in KVM
> anyways? What would be then the difference with what I do? We are
> affecting the kernel anyways.
> 

Only if that turns out to be the actual bottleneck. I haven't analyzed what 
exactly is causing the delay with an increasing # of VCPUs, so I'm just guessing.

Right now, we primarily use pthread_kill(SIG_IPI) to send a signal to 
each VCPU thread.

>>
>> 2) One thing I wondered is whether the biggest overhead is actually
>> taking the locks in QEMU and not actually waiting for the VCPUs. Maybe
>> we could optimize that as well. (for now I use one lock per VCPU because
>> it felt like it would reduce the ioctl overhead; maybe there is a better
>> alternative to balance between both users)
>>
>> So treat my patch as a completely unoptimized version.
>>
> For what is worth, also my version performs #invalidate+1 swaps, which
> is not optimized.
> 
> Honestly, I don't see how the above is easier or simpler than what is
> being proposed here.

As Sean said "This is an awful lot of a complexity to take on for 
something that appears to be solvable in userspace."

I showed that it's possible. So from that point on, there has to be a good 
justification for why that complexity has to exist in kernel space to 
optimize this scenario.

Now, it's not my call to make. I'm just pointing out that it can be done 
and that there might be room for improvement in my quick prototype.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH 0/9] kvm: implement atomic memslot updates
  2022-09-26  9:03               ` David Hildenbrand
@ 2022-09-26 21:28                 ` Sean Christopherson
  2022-09-27  7:38                   ` Emanuele Giuseppe Esposito
  0 siblings, 1 reply; 58+ messages in thread
From: Sean Christopherson @ 2022-09-26 21:28 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Emanuele Giuseppe Esposito, Paolo Bonzini, Maxim Levitsky, kvm,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, linux-kernel, Like Xu

On Mon, Sep 26, 2022, David Hildenbrand wrote:
> On 23.09.22 15:38, Emanuele Giuseppe Esposito wrote:
> > 
> > 
> > Am 23/09/2022 um 15:21 schrieb David Hildenbrand:
> > > On 23.09.22 15:10, Emanuele Giuseppe Esposito wrote:
> > > > 
> > > > 
> > > > Am 19/09/2022 um 19:30 schrieb David Hildenbrand:
> > > > > On 19.09.22 09:53, David Hildenbrand wrote:
> > > > > > On 18.09.22 18:13, Emanuele Giuseppe Esposito wrote:
> > > > > > > 
> > > > > > > 
> > > > > > > Am 09/09/2022 um 16:30 schrieb Sean Christopherson:
> > > > > > > > On Fri, Sep 09, 2022, Emanuele Giuseppe Esposito wrote:
> > > > > > > > > KVM is currently capable of receiving a single memslot update
> > > > > > > > > through
> > > > > > > > > the KVM_SET_USER_MEMORY_REGION ioctl.
> > > > > > > > > The problem arises when we want to atomically perform multiple
> > > > > > > > > updates,
> > > > > > > > > so that readers of memslot active list avoid seeing incomplete
> > > > > > > > > states.
> > > > > > > > > 
> > > > > > > > > For example, in RHBZ
> > > > > > > > > https://bugzilla.redhat.com/show_bug.cgi?id=1979276

...

> As Sean said "This is an awful lot of a complexity to take on for something
> that appears to be solvable in userspace."

And if the userspace solution is unpalatable for whatever reason, I'd like to
understand exactly what KVM behavior is problematic for userspace.  E.g. the
above RHBZ bug should no longer be an issue as the buggy commit has since been
reverted.

If the issue is KVM doing something nonsensical on a code fetch to MMIO, then I'd
much rather fix _that_ bug and improve KVM's user exit ABI to let userspace handle
the race _if_ userspace chooses not to pause vCPUs.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH 0/9] kvm: implement atomic memslot updates
  2022-09-26 21:28                 ` Sean Christopherson
@ 2022-09-27  7:38                   ` Emanuele Giuseppe Esposito
  2022-09-27 15:58                     ` Sean Christopherson
  0 siblings, 1 reply; 58+ messages in thread
From: Emanuele Giuseppe Esposito @ 2022-09-27  7:38 UTC (permalink / raw)
  To: Sean Christopherson, David Hildenbrand
  Cc: Paolo Bonzini, Maxim Levitsky, kvm, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, linux-kernel,
	Like Xu



Am 26/09/2022 um 23:28 schrieb Sean Christopherson:
> On Mon, Sep 26, 2022, David Hildenbrand wrote:
>> On 23.09.22 15:38, Emanuele Giuseppe Esposito wrote:
>>>
>>>
>>> Am 23/09/2022 um 15:21 schrieb David Hildenbrand:
>>>> On 23.09.22 15:10, Emanuele Giuseppe Esposito wrote:
>>>>>
>>>>>
>>>>> Am 19/09/2022 um 19:30 schrieb David Hildenbrand:
>>>>>> On 19.09.22 09:53, David Hildenbrand wrote:
>>>>>>> On 18.09.22 18:13, Emanuele Giuseppe Esposito wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> Am 09/09/2022 um 16:30 schrieb Sean Christopherson:
>>>>>>>>> On Fri, Sep 09, 2022, Emanuele Giuseppe Esposito wrote:
>>>>>>>>>> KVM is currently capable of receiving a single memslot update
>>>>>>>>>> through
>>>>>>>>>> the KVM_SET_USER_MEMORY_REGION ioctl.
>>>>>>>>>> The problem arises when we want to atomically perform multiple
>>>>>>>>>> updates,
>>>>>>>>>> so that readers of memslot active list avoid seeing incomplete
>>>>>>>>>> states.
>>>>>>>>>>
>>>>>>>>>> For example, in RHBZ
>>>>>>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1979276
> 
> ...
> 
>> As Sean said "This is an awful lot of a complexity to take on for something
>> that appears to be solvable in userspace."
> 
> And if the userspace solution is unpalatable for whatever reason, I'd like to
> understand exactly what KVM behavior is problematic for userspace.  E.g. the
> above RHBZ bug should no longer be an issue as the buggy commit has since been
> reverted.

It still is because I can reproduce the bug, as also pointed out in
multiple comments below.

> 
> If the issue is KVM doing something nonsensical on a code fetch to MMIO, then I'd
> much rather fix _that_ bug and improve KVM's user exit ABI to let userspace handle
> the race _if_ userspace chooses not to pause vCPUs.
> 

Also, on the BZ they all seem (Paolo included) to agree that the issue is
the non-atomic memslot update.

To be more precise, what I did mostly follows what Peter explained in
comment 19 : https://bugzilla.redhat.com/show_bug.cgi?id=1979276#c19


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH 9/9] kvm_main.c: handle atomic memslot update
  2022-09-09 10:45 ` [RFC PATCH 9/9] kvm_main.c: handle atomic memslot update Emanuele Giuseppe Esposito
  2022-09-13  2:30   ` Yang, Weijiang
@ 2022-09-27  7:46   ` David Hildenbrand
  2022-09-27  8:35     ` Emanuele Giuseppe Esposito
  2022-09-28 17:29   ` Paolo Bonzini
  2 siblings, 1 reply; 58+ messages in thread
From: David Hildenbrand @ 2022-09-27  7:46 UTC (permalink / raw)
  To: Emanuele Giuseppe Esposito, kvm
  Cc: Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, Maxim Levitsky, x86,
	H. Peter Anvin, linux-kernel

On 09.09.22 12:45, Emanuele Giuseppe Esposito wrote:
> When kvm_vm_ioctl_set_memory_region_list() is invoked, we need
> to make sure that all memslots are updated in the inactive list
> and then swap (preferreably only once) the lists, so that all
> changes are visible immediately.
> 
> The only issue is that DELETE and MOVE need to perform 2 swaps:
> firstly replace old memslot with invalid, and then remove invalid.
> 

I'm curious, how would a resize (grow/shrink) or a split be handled?

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH 9/9] kvm_main.c: handle atomic memslot update
  2022-09-27  7:46   ` David Hildenbrand
@ 2022-09-27  8:35     ` Emanuele Giuseppe Esposito
  2022-09-27  9:22       ` David Hildenbrand
  0 siblings, 1 reply; 58+ messages in thread
From: Emanuele Giuseppe Esposito @ 2022-09-27  8:35 UTC (permalink / raw)
  To: David Hildenbrand, kvm
  Cc: Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, Maxim Levitsky, x86,
	H. Peter Anvin, linux-kernel



Am 27/09/2022 um 09:46 schrieb David Hildenbrand:
> On 09.09.22 12:45, Emanuele Giuseppe Esposito wrote:
>> When kvm_vm_ioctl_set_memory_region_list() is invoked, we need
>> to make sure that all memslots are updated in the inactive list
>> and then swap (preferreably only once) the lists, so that all
>> changes are visible immediately.
>>
>> The only issue is that DELETE and MOVE need to perform 2 swaps:
>> firstly replace old memslot with invalid, and then remove invalid.
>>
> 
> I'm curious, how would a resize (grow/shrink) or a split be handled?
> 

There are only 4 operations possible in KVM: KVM_MR_{DELETE, MOVE,
CREATE, FLAGS_ONLY}.

A resize should be implemented in QEMU as DELETE+CREATE.

Therefore a resize on memslot X will be implemented as:
First pass on the userspace operations:
	invalidate memslot X;
	swap_memslot_list(); // NOW it is visible to the guest

What guest sees: memslot X is invalid, so MMU keeps retrying the page fault

Second pass:
	create new memslot X
	delete old memslot X


What guest sees: memslot X is invalid, so MMU keeps retrying the page fault

Third pass:
	swap() // 1 for each address space affected

What guest sees: new memslot replacing the old one



A split is DELETE+CREATE+CREATE, so it's the same:

First pass on the userspace operations:
	invalidate memslot X;
	swap_memslot_list(); // NOW it is visible to the guest

What guest sees: memslot X is invalid, so MMU keeps retrying the page fault

Second pass:
	delete old memslot X
	create new memslot X1
	create new memslot X2

What guest sees: memslot X is invalid, so MMU keeps retrying the page fault

Third pass:
	swap() // 1 for each address space affected

What guest sees: two new memslots replacing the old one


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH 9/9] kvm_main.c: handle atomic memslot update
  2022-09-27  8:35     ` Emanuele Giuseppe Esposito
@ 2022-09-27  9:22       ` David Hildenbrand
  2022-09-27  9:32         ` Emanuele Giuseppe Esposito
  0 siblings, 1 reply; 58+ messages in thread
From: David Hildenbrand @ 2022-09-27  9:22 UTC (permalink / raw)
  To: Emanuele Giuseppe Esposito, kvm
  Cc: Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, Maxim Levitsky, x86,
	H. Peter Anvin, linux-kernel

On 27.09.22 10:35, Emanuele Giuseppe Esposito wrote:
> 
> 
> Am 27/09/2022 um 09:46 schrieb David Hildenbrand:
>> On 09.09.22 12:45, Emanuele Giuseppe Esposito wrote:
>>> When kvm_vm_ioctl_set_memory_region_list() is invoked, we need
>>> to make sure that all memslots are updated in the inactive list
>>> and then swap (preferreably only once) the lists, so that all
>>> changes are visible immediately.
>>>
>>> The only issue is that DELETE and MOVE need to perform 2 swaps:
>>> firstly replace old memslot with invalid, and then remove invalid.
>>>
>>
>> I'm curious, how would a resize (grow/shrink) or a split be handled?
>>
> 
> There are only 4 operations possible in KVM: KVM_MR_{DELETE, MOVE,
> CREATE, FLAGS_ONLY}.
> 
> A resize should be implemented in QEMU as DELETE+CREATE.
> 
> Therefore a resize on memslot X will be implemented as:
> First pass on the userspace operations:
> 	invalidate memslot X;
> 	swap_memslot_list(); // NOW it is visible to the guest
> 
> What guest sees: memslot X is invalid, so MMU keeps retrying the page fault
> 
> Second pass:
> 	create new memslot X
> 	delete old memslot X

Thanks a lot for the very nice explanation!

Does the invalidation already free up memslot metadata (especially the 
rmaps) or will we end up temporarily allocating twice the memslot metadata?

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH 9/9] kvm_main.c: handle atomic memslot update
  2022-09-27  9:22       ` David Hildenbrand
@ 2022-09-27  9:32         ` Emanuele Giuseppe Esposito
  2022-09-27 14:52           ` David Hildenbrand
  0 siblings, 1 reply; 58+ messages in thread
From: Emanuele Giuseppe Esposito @ 2022-09-27  9:32 UTC (permalink / raw)
  To: David Hildenbrand, kvm
  Cc: Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, Maxim Levitsky, x86,
	H. Peter Anvin, linux-kernel



Am 27/09/2022 um 11:22 schrieb David Hildenbrand:
> On 27.09.22 10:35, Emanuele Giuseppe Esposito wrote:
>>
>>
>> Am 27/09/2022 um 09:46 schrieb David Hildenbrand:
>>> On 09.09.22 12:45, Emanuele Giuseppe Esposito wrote:
>>>> When kvm_vm_ioctl_set_memory_region_list() is invoked, we need
>>>> to make sure that all memslots are updated in the inactive list
>>>> and then swap (preferreably only once) the lists, so that all
>>>> changes are visible immediately.
>>>>
>>>> The only issue is that DELETE and MOVE need to perform 2 swaps:
>>>> firstly replace old memslot with invalid, and then remove invalid.
>>>>
>>>
>>> I'm curious, how would a resize (grow/shrink) or a split be handled?
>>>
>>
>> There are only 4 operations possible in KVM: KVM_MR_{DELETE, MOVE,
>> CREATE, FLAGS_ONLY}.
>>
>> A resize should be implemented in QEMU as DELETE+CREATE.
>>
>> Therefore a resize on memslot X will be implemented as:
>> First pass on the userspace operations:
>>     invalidate memslot X;
>>     swap_memslot_list(); // NOW it is visible to the guest
>>
>> What guest sees: memslot X is invalid, so MMU keeps retrying the page
>> fault
>>
>> Second pass:
>>     create new memslot X
>>     delete old memslot X
> 
> Thanks a lot for the very nice explanation!

Anytime :)

> Does the invalidation already free up memslot metadata (especially the
> rmaps) or will we end up temporarily allocating twice the memslot metadata?
> 

Invalidation creates a new temporary identical memslot; I am not sure
about the rmaps. It is anyway the same code as before, and if I
understand correctly a new slot is required to keep the old one
intact, in case something goes wrong and we need to revert.

Thanks,
Emanuele


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH 9/9] kvm_main.c: handle atomic memslot update
  2022-09-27  9:32         ` Emanuele Giuseppe Esposito
@ 2022-09-27 14:52           ` David Hildenbrand
  0 siblings, 0 replies; 58+ messages in thread
From: David Hildenbrand @ 2022-09-27 14:52 UTC (permalink / raw)
  To: Emanuele Giuseppe Esposito, kvm
  Cc: Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, Maxim Levitsky, x86,
	H. Peter Anvin, linux-kernel

>> Does the invalidation already free up memslot metadata (especially the
>> rmaps) or will we end up temporarily allocating twice the memslot metadata?
>>
> 
> Invalidation creates a new temporary identical memslot, I am not sure
> about the rmaps. It is anyways the same code as it was done before and
> if I understand correctly, a new slot is required to keep the old
> intact, in case something goes wrong and we need to revert.

Okay, thanks!

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH 0/9] kvm: implement atomic memslot updates
  2022-09-27  7:38                   ` Emanuele Giuseppe Esposito
@ 2022-09-27 15:58                     ` Sean Christopherson
  2022-09-28  9:11                       ` Emanuele Giuseppe Esposito
  2022-09-28 15:07                       ` Paolo Bonzini
  0 siblings, 2 replies; 58+ messages in thread
From: Sean Christopherson @ 2022-09-27 15:58 UTC (permalink / raw)
  To: Emanuele Giuseppe Esposito
  Cc: David Hildenbrand, Paolo Bonzini, Maxim Levitsky, kvm,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, linux-kernel, Like Xu

On Tue, Sep 27, 2022, Emanuele Giuseppe Esposito wrote:
> 
> Am 26/09/2022 um 23:28 schrieb Sean Christopherson:
> > On Mon, Sep 26, 2022, David Hildenbrand wrote:
> >> As Sean said "This is an awful lot of a complexity to take on for something
> >> that appears to be solvable in userspace."
> > 
> > And if the userspace solution is unpalatable for whatever reason, I'd like to
> > understand exactly what KVM behavior is problematic for userspace.  E.g. the
> > above RHBZ bug should no longer be an issue as the buggy commit has since been
> > reverted.
> 
> It still is because I can reproduce the bug, as also pointed out in
> multiple comments below.

You can reproduce _a_ bug, but it's obviously not the original bug, because the
last comment says:

  Second, indeed the patch was reverted and somehow accepted without generating
  too much noise:

  ...

  The underlying issue of course as we both know is still there.

  You might have luck reproducing it with this bug

  https://bugzilla.redhat.com/show_bug.cgi?id=1855298

  But for me it looks like it is 'working' as well, so you might have
  to write a unit test to trigger the issue.

> > If the issue is KVM doing something nonsensical on a code fetch to MMIO, then I'd
> > much rather fix _that_ bug and improve KVM's user exit ABI to let userspace handle
> > the race _if_ userspace chooses not to pause vCPUs.
> > 
> 
> Also on the BZ they all seem (Paolo included) to agree that the issue is
> non-atomic memslots update.

Yes, a non-atomic memslot update likely results in the guest fetching from a GPA
without a memslot.  I'm asking for an explanation of exactly what happens when that occurs,
because it should be possible to adjust KVM and/or QEMU to play nice with the
fetch, e.g. to resume the guest until the new memslot is installed, in which case
an atomic update isn't needed.

I assume the issue is that KVM exits with KVM_EXIT_INTERNAL_ERROR because the
guest is running at CPL=0, and QEMU kills the guest in response.  If that's correct,
then that problem can be solved by exiting to userspace with KVM_EXIT_MMIO instead
of KVM_EXIT_INTERNAL_ERROR so that userspace can do something sane in response to
the MMIO code fetch.

I'm pretty sure this patch will Just Work for QEMU, because QEMU simply resumes
the vCPU if mmio.len==0.  It's a bit of a hack, but I don't think it violates KVM's
ABI in any way, and it can even become "official" behavior since KVM x86 doesn't
otherwise exit with mmio.len==0.
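
For reference, QEMU's KVM_EXIT_MMIO dispatch is roughly the following
(paraphrased from accel/kvm/kvm-all.c, not verbatim); with mmio.len == 0
the address_space_rw() call is a no-op and the run loop simply re-enters
the guest:

	case KVM_EXIT_MMIO:
		/* a zero-length access does nothing, so the vCPU just resumes */
		address_space_rw(&address_space_memory,
				 run->mmio.phys_addr, attrs,
				 run->mmio.data,
				 run->mmio.len,
				 run->mmio.is_write);
		ret = 0;
		break;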

Compile tested only...

---
From: Sean Christopherson <seanjc@google.com>
Date: Tue, 27 Sep 2022 08:16:03 -0700
Subject: [PATCH] KVM: x86: Exit to userspace with zero-length MMIO "read" on
 MMIO fetch

Exit to userspace with KVM_EXIT_MMIO if emulation fails due to not being
able to fetch instruction bytes, e.g. if the resolved GPA isn't backed by
a memslot.  If userspace is manipulating memslots without pausing vCPUs,
e.g. to emulate BIOS relocation, then a vCPU may fetch while there is no
valid memslot installed.  Depending on guest context, KVM will either
exit to userspace with KVM_EXIT_INTERNAL_ERROR (L1, CPL=0) or simply
resume the guest (L2 or CPL>0), neither of which is desirable as exiting
with "emulation error" effectively kills the VM, and resuming the guest
doesn't provide userspace an opportunity to react to the fetch.

Use "mmio.len == 0" to indicate "fetch".  This is a bit of a hack, but
there is no other way to communicate "fetch" to userspace without
defining an entirely new exit reason, e.g. "mmio.is_write" is a boolean
and not a flag, and there is no known use case for actually supporting
code fetches from MMIO, i.e. there's no need to allow userspace to fill
in the instruction bytes.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/emulate.c     | 2 ++
 arch/x86/kvm/kvm_emulate.h | 1 +
 arch/x86/kvm/x86.c         | 9 ++++++++-
 3 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index f092c54d1a2f..e141238d93b0 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -5353,6 +5353,8 @@ int x86_decode_insn(struct x86_emulate_ctxt *ctxt, void *insn, int insn_len, int
 done:
 	if (rc == X86EMUL_PROPAGATE_FAULT)
 		ctxt->have_exception = true;
+	if (rc == X86EMUL_IO_NEEDED)
+		return EMULATION_IO_FETCH;
 	return (rc != X86EMUL_CONTINUE) ? EMULATION_FAILED : EMULATION_OK;
 }
 
diff --git a/arch/x86/kvm/kvm_emulate.h b/arch/x86/kvm/kvm_emulate.h
index 89246446d6aa..3cb2e321fcd2 100644
--- a/arch/x86/kvm/kvm_emulate.h
+++ b/arch/x86/kvm/kvm_emulate.h
@@ -516,6 +516,7 @@ bool x86_page_table_writing_insn(struct x86_emulate_ctxt *ctxt);
 #define EMULATION_OK 0
 #define EMULATION_RESTART 1
 #define EMULATION_INTERCEPTED 2
+#define EMULATION_IO_FETCH 3
 void init_decode_cache(struct x86_emulate_ctxt *ctxt);
 int x86_emulate_insn(struct x86_emulate_ctxt *ctxt);
 int emulator_task_switch(struct x86_emulate_ctxt *ctxt,
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index aa5ab0c620de..7eb72694c601 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -7129,8 +7129,13 @@ static int kvm_fetch_guest_virt(struct x86_emulate_ctxt *ctxt,
 		bytes = (unsigned)PAGE_SIZE - offset;
 	ret = kvm_vcpu_read_guest_page(vcpu, gpa >> PAGE_SHIFT, val,
 				       offset, bytes);
-	if (unlikely(ret < 0))
+	if (unlikely(ret < 0)) {
+		vcpu->run->mmio.phys_addr = gpa;
+		vcpu->run->mmio.len = 0;
+		vcpu->run->mmio.is_write = 0;
+		vcpu->run->exit_reason = KVM_EXIT_MMIO;
 		return X86EMUL_IO_NEEDED;
+	}
 
 	return X86EMUL_CONTINUE;
 }
@@ -8665,6 +8670,8 @@ int x86_emulate_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
 		r = x86_decode_emulated_instruction(vcpu, emulation_type,
 						    insn, insn_len);
 		if (r != EMULATION_OK)  {
+			if (r == EMULATION_IO_FETCH)
+				return 0;
 			if ((emulation_type & EMULTYPE_TRAP_UD) ||
 			    (emulation_type & EMULTYPE_TRAP_UD_FORCED)) {
 				kvm_queue_exception(vcpu, UD_VECTOR);

base-commit: 39d9b48cc777bdf6d67d01ed24f1f89b13f5fbb2
-- 


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH 0/9] kvm: implement atomic memslot updates
  2022-09-27 15:58                     ` Sean Christopherson
@ 2022-09-28  9:11                       ` Emanuele Giuseppe Esposito
  2022-09-28 11:14                         ` Maxim Levitsky
  2022-09-28 15:07                       ` Paolo Bonzini
  1 sibling, 1 reply; 58+ messages in thread
From: Emanuele Giuseppe Esposito @ 2022-09-28  9:11 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: David Hildenbrand, Paolo Bonzini, Maxim Levitsky, kvm,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, linux-kernel, Like Xu



Am 27/09/2022 um 17:58 schrieb Sean Christopherson:
> On Tue, Sep 27, 2022, Emanuele Giuseppe Esposito wrote:
>>
>> Am 26/09/2022 um 23:28 schrieb Sean Christopherson:
>>> On Mon, Sep 26, 2022, David Hildenbrand wrote:
>>>> As Sean said "This is an awful lot of a complexity to take on for something
>>>> that appears to be solvable in userspace."
>>>
>>> And if the userspace solution is unpalatable for whatever reason, I'd like to
>>> understand exactly what KVM behavior is problematic for userspace.  E.g. the
>>> above RHBZ bug should no longer be an issue as the buggy commit has since been
>>> reverted.
>>
>> It still is because I can reproduce the bug, as also pointed out in
>> multiple comments below.
> 
> You can reproduce _a_ bug, but it's obviously not the original bug, because the
> last comment says:
> 
>   Second, indeed the patch was reverted and somehow accepted without generating
>   too much noise:
> 
>   ...
> 
>   The underlying issue of course as we both know is still there.
> 
>   You might have luck reproducing it with this bug
> 
>   https://bugzilla.redhat.com/show_bug.cgi?id=1855298
> 
>   But for me it looks like it is 'working' as well, so you might have
>   to write a unit test to trigger the issue.
> 
>>> If the issue is KVM doing something nonsensical on a code fetch to MMIO, then I'd
>>> much rather fix _that_ bug and improve KVM's user exit ABI to let userspace handle
>>> the race _if_ userspace chooses not to pause vCPUs.
>>>
>>
>> Also on the BZ they all seem (Paolo included) to agree that the issue is
>> non-atomic memslots update.
> 
> Yes, non-atomic memslot likely results in the guest fetching from a GPA without a
> memslot.  I'm asking for an explanation of exactly what happens when that occurs,
> because it should be possible to adjust KVM and/or QEMU to play nice with the
> fetch, e.g. to resume the guest until the new memslot is installed, in which case
> an atomic update isn't needed.
> 
> I assume the issue is that KVM exits with KVM_EXIT_INTERNAL_ERROR because the
> guest is running at CPL=0, and QEMU kills the guest in response.  If that's correct,
> then that problem can be solved by exiting to userspace with KVM_EXIT_MMIO instead
> of KVM_EXIT_INTERNAL_ERROR so that userspace can do something sane in response to
> the MMIO code fetch.
> 
> I'm pretty sure this patch will Just Work for QEMU, because QEMU simply resumes
> the vCPU if mmio.len==0.  It's a bit of a hack, but I don't think it violates KVM's
> ABI in any way, and it can even become "official" behavior since KVM x86 doesn't
> otherwise exit with mmio.len==0.
> 
> Compile tested only...

So basically you are just making KVM catch the failed
kvm_vcpu_read_guest_page() by returning mmio.len = 0 to QEMU, which
basically ends up doing nothing and retrying execution of the
instruction?
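
For reference, here is a minimal sketch of the userspace side being
relied on, assuming a generic VMM run loop rather than QEMU's actual
code (emulate_mmio() is a made-up placeholder for the VMM's MMIO
dispatch):

#include <linux/kvm.h>
#include <stdio.h>

/* Placeholder for the VMM's real MMIO dispatch routine. */
static void emulate_mmio(__u64 gpa, __u8 *data, __u32 len, __u8 is_write)
{
	printf("MMIO %s at 0x%llx, %u bytes\n",
	       is_write ? "write" : "read", (unsigned long long)gpa, len);
}

static int handle_exit(struct kvm_run *run)
{
	switch (run->exit_reason) {
	case KVM_EXIT_MMIO:
		if (run->mmio.len)
			emulate_mmio(run->mmio.phys_addr, run->mmio.data,
				     run->mmio.len, run->mmio.is_write);
		/* len == 0 carries no data: nothing to emulate, re-enter the guest */
		return 0;
	default:
		return -1;	/* unexpected exit, let the caller stop the VM */
	}
}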

I wonder if there are some performance implications in this, but it's
definitely simpler than what I did.

Tested on the same failing machine used for the BZ, fixes the bug.

Do you want me to re-send the patch on your behalf (and probably add
some brief documentation in Documentation/virt/kvm/api.rst)?

Emanuele
> 
> ---
> From: Sean Christopherson <seanjc@google.com>
> Date: Tue, 27 Sep 2022 08:16:03 -0700
> Subject: [PATCH] KVM: x86: Exit to userspace with zero-length MMIO "read" on
>  MMIO fetch
> 
> Exit to userspace with KVM_EXIT_MMIO if emulation fails due to not being
> able to fetch instruction bytes, e.g. if the resolved GPA isn't backed by
> a memslot.  If userspace is manipulating memslots without pausing vCPUs,
> e.g. to emulate BIOS relocation, then a vCPU may fetch while there is no
> valid memslot installed.  Depending on guest context, KVM will either
> exit to userspace with KVM_EXIT_INTERNAL_ERROR (L1, CPL=0) or simply
> resume the guest (L2 or CPL>0), neither of which is desirable as exiting
> with "emulation error" effectively kills the VM, and resuming the guest
> doesn't provide userspace an opportunity to react the to fetch.
> 
> Use "mmio.len == 0" to indicate "fetch".  This is a bit of a hack, but
> there is no other way to communicate "fetch" to userspace without
> defining an entirely new exit reason, e.g. "mmio.is_write" is a boolean
> and not a flag, and there is no known use case for actually supporting
> code fetches from MMIO, i.e. there's no need to allow userspace to fill
> in the instruction bytes.
> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  arch/x86/kvm/emulate.c     | 2 ++
>  arch/x86/kvm/kvm_emulate.h | 1 +
>  arch/x86/kvm/x86.c         | 9 ++++++++-
>  3 files changed, 11 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
> index f092c54d1a2f..e141238d93b0 100644
> --- a/arch/x86/kvm/emulate.c
> +++ b/arch/x86/kvm/emulate.c
> @@ -5353,6 +5353,8 @@ int x86_decode_insn(struct x86_emulate_ctxt *ctxt, void *insn, int insn_len, int
>  done:
>  	if (rc == X86EMUL_PROPAGATE_FAULT)
>  		ctxt->have_exception = true;
> +	if (rc == X86EMUL_IO_NEEDED)
> +		return EMULATION_IO_FETCH;
>  	return (rc != X86EMUL_CONTINUE) ? EMULATION_FAILED : EMULATION_OK;
>  }
>  
> diff --git a/arch/x86/kvm/kvm_emulate.h b/arch/x86/kvm/kvm_emulate.h
> index 89246446d6aa..3cb2e321fcd2 100644
> --- a/arch/x86/kvm/kvm_emulate.h
> +++ b/arch/x86/kvm/kvm_emulate.h
> @@ -516,6 +516,7 @@ bool x86_page_table_writing_insn(struct x86_emulate_ctxt *ctxt);
>  #define EMULATION_OK 0
>  #define EMULATION_RESTART 1
>  #define EMULATION_INTERCEPTED 2
> +#define EMULATION_IO_FETCH 3
>  void init_decode_cache(struct x86_emulate_ctxt *ctxt);
>  int x86_emulate_insn(struct x86_emulate_ctxt *ctxt);
>  int emulator_task_switch(struct x86_emulate_ctxt *ctxt,
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index aa5ab0c620de..7eb72694c601 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -7129,8 +7129,13 @@ static int kvm_fetch_guest_virt(struct x86_emulate_ctxt *ctxt,
>  		bytes = (unsigned)PAGE_SIZE - offset;
>  	ret = kvm_vcpu_read_guest_page(vcpu, gpa >> PAGE_SHIFT, val,
>  				       offset, bytes);
> -	if (unlikely(ret < 0))
> +	if (unlikely(ret < 0)) {
> +		vcpu->run->mmio.phys_addr = gpa;
> +		vcpu->run->mmio.len = 0;
> +		vcpu->run->mmio.is_write = 0;
> +		vcpu->run->exit_reason = KVM_EXIT_MMIO;
>  		return X86EMUL_IO_NEEDED;
> +	}
>  
>  	return X86EMUL_CONTINUE;
>  }
> @@ -8665,6 +8670,8 @@ int x86_emulate_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
>  		r = x86_decode_emulated_instruction(vcpu, emulation_type,
>  						    insn, insn_len);
>  		if (r != EMULATION_OK)  {
> +			if (r == EMULATION_IO_FETCH)
> +				return 0;
>  			if ((emulation_type & EMULTYPE_TRAP_UD) ||
>  			    (emulation_type & EMULTYPE_TRAP_UD_FORCED)) {
>  				kvm_queue_exception(vcpu, UD_VECTOR);
> 
> base-commit: 39d9b48cc777bdf6d67d01ed24f1f89b13f5fbb2
> 


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH 0/9] kvm: implement atomic memslot updates
  2022-09-28  9:11                       ` Emanuele Giuseppe Esposito
@ 2022-09-28 11:14                         ` Maxim Levitsky
  2022-09-28 12:52                           ` David Hildenbrand
  0 siblings, 1 reply; 58+ messages in thread
From: Maxim Levitsky @ 2022-09-28 11:14 UTC (permalink / raw)
  To: Emanuele Giuseppe Esposito, Sean Christopherson
  Cc: David Hildenbrand, Paolo Bonzini, kvm, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
	linux-kernel, Like Xu

On Wed, 2022-09-28 at 11:11 +0200, Emanuele Giuseppe Esposito wrote:
> 
> Am 27/09/2022 um 17:58 schrieb Sean Christopherson:
> > On Tue, Sep 27, 2022, Emanuele Giuseppe Esposito wrote:
> > > Am 26/09/2022 um 23:28 schrieb Sean Christopherson:
> > > > On Mon, Sep 26, 2022, David Hildenbrand wrote:
> > > > > As Sean said "This is an awful lot of a complexity to take on for something
> > > > > that appears to be solvable in userspace."
> > > > 
> > > > And if the userspace solution is unpalatable for whatever reason, I'd like to
> > > > understand exactly what KVM behavior is problematic for userspace.  E.g. the
> > > > above RHBZ bug should no longer be an issue as the buggy commit has since been
> > > > reverted.
> > > 
> > > It still is because I can reproduce the bug, as also pointed out in
> > > multiple comments below.
> > 
> > You can reproduce _a_ bug, but it's obviously not the original bug, because the
> > last comment says:
> > 
> >   Second, indeed the patch was reverted and somehow accepted without generating
> >   too much noise:
> > 
> >   ...
> > 
> >   The underlying issue of course as we both know is still there.
> > 
> >   You might have luck reproducing it with this bug
> > 
> >   https://bugzilla.redhat.com/show_bug.cgi?id=1855298
> > 
> >   But for me it looks like it is 'working' as well, so you might have
> >   to write a unit test to trigger the issue.
> > 
> > > > If the issue is KVM doing something nonsensical on a code fetch to MMIO, then I'd
> > > > much rather fix _that_ bug and improve KVM's user exit ABI to let userspace handle
> > > > the race _if_ userspace chooses not to pause vCPUs.
> > > > 
> > > 
> > > Also on the BZ they all seem (Paolo included) to agree that the issue is
> > > non-atomic memslots update.
> > 
> > Yes, non-atomic memslot likely results in the guest fetching from a GPA without a
> > memslot.  I'm asking for an explanation of exactly what happens when that occurs,
> > because it should be possible to adjust KVM and/or QEMU to play nice with the
> > fetch, e.g. to resume the guest until the new memslot is installed, in which case
> > an atomic update isn't needed.
> > 
> > I assume the issue is that KVM exits with KVM_EXIT_INTERNAL_ERROR because the
> > guest is running at CPL=0, and QEMU kills the guest in response.  If that's correct,
> > then that problem can be solved by exiting to userspace with KVM_EXIT_MMIO instead
> > of KVM_EXIT_INTERNAL_ERROR so that userspace can do something sane in response to
> > the MMIO code fetch.
> > 
> > I'm pretty sure this patch will Just Work for QEMU, because QEMU simply resumes
> > the vCPU if mmio.len==0.  It's a bit of a hack, but I don't think it violates KVM's
> > ABI in any way, and it can even become "official" behavior since KVM x86 doesn't
> > otherwise exit with mmio.len==0.
> > 
> > Compile tested only...
> 
> So basically you are just making KVM catch the failed
> kvm_vcpu_read_guest_page() by retuning mmio.len = 0 to QEMU which
> basically ends up in doing nothing and retry again executing the
> instruction?
> 
> I wonder if there are some performance implications in this, but it's
> definitely simpler than what I did.
> 
> Tested on the same failing machine used for the BZ, fixes the bug.
> 
> Do you want me to re-send the patch on your behalf (and add probably a
> small documentation on Documentation/virt/kvm/api.rst)?
> 
> Emanuele
> > ---
> > From: Sean Christopherson <seanjc@google.com>
> > Date: Tue, 27 Sep 2022 08:16:03 -0700
> > Subject: [PATCH] KVM: x86: Exit to userspace with zero-length MMIO "read" on
> >  MMIO fetch
> > 
> > Exit to userspace with KVM_EXIT_MMIO if emulation fails due to not being
> > able to fetch instruction bytes, e.g. if the resolved GPA isn't backed by
> > a memslot.  If userspace is manipulating memslots without pausing vCPUs,
> > e.g. to emulate BIOS relocation, then a vCPU may fetch while there is no
> > valid memslot installed.  Depending on guest context, KVM will either
> > exit to userspace with KVM_EXIT_INTERNAL_ERROR (L1, CPL=0) or simply
> > resume the guest (L2 or CPL>0), neither of which is desirable as exiting
> > with "emulation error" effectively kills the VM, and resuming the guest
> > doesn't provide userspace an opportunity to react the to fetch.
> > 
> > Use "mmio.len == 0" to indicate "fetch".  This is a bit of a hack, but
> > there is no other way to communicate "fetch" to userspace without
> > defining an entirely new exit reason, e.g. "mmio.is_write" is a boolean
> > and not a flag, and there is no known use case for actually supporting
> > code fetches from MMIO, i.e. there's no need to allow userspace to fill
> > in the instruction bytes.
> > 
> > Signed-off-by: Sean Christopherson <seanjc@google.com>
> > ---
> >  arch/x86/kvm/emulate.c     | 2 ++
> >  arch/x86/kvm/kvm_emulate.h | 1 +
> >  arch/x86/kvm/x86.c         | 9 ++++++++-
> >  3 files changed, 11 insertions(+), 1 deletion(-)
> > 
> > diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
> > index f092c54d1a2f..e141238d93b0 100644
> > --- a/arch/x86/kvm/emulate.c
> > +++ b/arch/x86/kvm/emulate.c
> > @@ -5353,6 +5353,8 @@ int x86_decode_insn(struct x86_emulate_ctxt *ctxt, void *insn, int insn_len, int
> >  done:
> >  	if (rc == X86EMUL_PROPAGATE_FAULT)
> >  		ctxt->have_exception = true;
> > +	if (rc == X86EMUL_IO_NEEDED)
> > +		return EMULATION_IO_FETCH;
> >  	return (rc != X86EMUL_CONTINUE) ? EMULATION_FAILED : EMULATION_OK;
> >  }
> >  
> > diff --git a/arch/x86/kvm/kvm_emulate.h b/arch/x86/kvm/kvm_emulate.h
> > index 89246446d6aa..3cb2e321fcd2 100644
> > --- a/arch/x86/kvm/kvm_emulate.h
> > +++ b/arch/x86/kvm/kvm_emulate.h
> > @@ -516,6 +516,7 @@ bool x86_page_table_writing_insn(struct x86_emulate_ctxt *ctxt);
> >  #define EMULATION_OK 0
> >  #define EMULATION_RESTART 1
> >  #define EMULATION_INTERCEPTED 2
> > +#define EMULATION_IO_FETCH 3
> >  void init_decode_cache(struct x86_emulate_ctxt *ctxt);
> >  int x86_emulate_insn(struct x86_emulate_ctxt *ctxt);
> >  int emulator_task_switch(struct x86_emulate_ctxt *ctxt,
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index aa5ab0c620de..7eb72694c601 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -7129,8 +7129,13 @@ static int kvm_fetch_guest_virt(struct x86_emulate_ctxt *ctxt,
> >  		bytes = (unsigned)PAGE_SIZE - offset;
> >  	ret = kvm_vcpu_read_guest_page(vcpu, gpa >> PAGE_SHIFT, val,
> >  				       offset, bytes);
> > -	if (unlikely(ret < 0))
> > +	if (unlikely(ret < 0)) {
> > +		vcpu->run->mmio.phys_addr = gpa;
> > +		vcpu->run->mmio.len = 0;
> > +		vcpu->run->mmio.is_write = 0;
> > +		vcpu->run->exit_reason = KVM_EXIT_MMIO;
> >  		return X86EMUL_IO_NEEDED;
> > +	}
> >  
> >  	return X86EMUL_CONTINUE;
> >  }
> > @@ -8665,6 +8670,8 @@ int x86_emulate_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
> >  		r = x86_decode_emulated_instruction(vcpu, emulation_type,
> >  						    insn, insn_len);
> >  		if (r != EMULATION_OK)  {
> > +			if (r == EMULATION_IO_FETCH)
> > +				return 0;
> >  			if ((emulation_type & EMULTYPE_TRAP_UD) ||
> >  			    (emulation_type & EMULTYPE_TRAP_UD_FORCED)) {
> >  				kvm_queue_exception(vcpu, UD_VECTOR);
> > 
> > base-commit: 39d9b48cc777bdf6d67d01ed24f1f89b13f5fbb2
> > 

Note that AFAIK there is another case (and probably more): if TDP is disabled
and the MMU root is in MMIO, we kill the guest.


mmu_alloc_shadow_roots -> mmu_check_root


I used to have a few hacks in KVM to cope with this, but AFAIK
I gave up on them, because the issue would show up again and again.

Best regards,
	Maxim Levitsky



^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH 0/9] kvm: implement atomic memslot updates
  2022-09-28 11:14                         ` Maxim Levitsky
@ 2022-09-28 12:52                           ` David Hildenbrand
  0 siblings, 0 replies; 58+ messages in thread
From: David Hildenbrand @ 2022-09-28 12:52 UTC (permalink / raw)
  To: Maxim Levitsky, Emanuele Giuseppe Esposito, Sean Christopherson
  Cc: Paolo Bonzini, kvm, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin, linux-kernel, Like Xu

On 28.09.22 13:14, Maxim Levitsky wrote:
> On Wed, 2022-09-28 at 11:11 +0200, Emanuele Giuseppe Esposito wrote:
>>
>> Am 27/09/2022 um 17:58 schrieb Sean Christopherson:
>>> On Tue, Sep 27, 2022, Emanuele Giuseppe Esposito wrote:
>>>> Am 26/09/2022 um 23:28 schrieb Sean Christopherson:
>>>>> On Mon, Sep 26, 2022, David Hildenbrand wrote:
>>>>>> As Sean said "This is an awful lot of a complexity to take on for something
>>>>>> that appears to be solvable in userspace."
>>>>>
>>>>> And if the userspace solution is unpalatable for whatever reason, I'd like to
>>>>> understand exactly what KVM behavior is problematic for userspace.  E.g. the
>>>>> above RHBZ bug should no longer be an issue as the buggy commit has since been
>>>>> reverted.
>>>>
>>>> It still is because I can reproduce the bug, as also pointed out in
>>>> multiple comments below.
>>>
>>> You can reproduce _a_ bug, but it's obviously not the original bug, because the
>>> last comment says:
>>>
>>>    Second, indeed the patch was reverted and somehow accepted without generating
>>>    too much noise:
>>>
>>>    ...
>>>
>>>    The underlying issue of course as we both know is still there.
>>>
>>>    You might have luck reproducing it with this bug
>>>
>>>    https://bugzilla.redhat.com/show_bug.cgi?id=1855298
>>>
>>>    But for me it looks like it is 'working' as well, so you might have
>>>    to write a unit test to trigger the issue.
>>>
>>>>> If the issue is KVM doing something nonsensical on a code fetch to MMIO, then I'd
>>>>> much rather fix _that_ bug and improve KVM's user exit ABI to let userspace handle
>>>>> the race _if_ userspace chooses not to pause vCPUs.
>>>>>
>>>>
>>>> Also on the BZ they all seem (Paolo included) to agree that the issue is
>>>> non-atomic memslots update.
>>>
>>> Yes, non-atomic memslot likely results in the guest fetching from a GPA without a
>>> memslot.  I'm asking for an explanation of exactly what happens when that occurs,
>>> because it should be possible to adjust KVM and/or QEMU to play nice with the
>>> fetch, e.g. to resume the guest until the new memslot is installed, in which case
>>> an atomic update isn't needed.
>>>
>>> I assume the issue is that KVM exits with KVM_EXIT_INTERNAL_ERROR because the
>>> guest is running at CPL=0, and QEMU kills the guest in response.  If that's correct,
>>> then that problem can be solved by exiting to userspace with KVM_EXIT_MMIO instead
>>> of KVM_EXIT_INTERNAL_ERROR so that userspace can do something sane in response to
>>> the MMIO code fetch.
>>>
>>> I'm pretty sure this patch will Just Work for QEMU, because QEMU simply resumes
>>> the vCPU if mmio.len==0.  It's a bit of a hack, but I don't think it violates KVM's
>>> ABI in any way, and it can even become "official" behavior since KVM x86 doesn't
>>> otherwise exit with mmio.len==0.
>>>
>>> Compile tested only...
>>
>> So basically you are just making KVM catch the failed
>> kvm_vcpu_read_guest_page() by retuning mmio.len = 0 to QEMU which
>> basically ends up in doing nothing and retry again executing the
>> instruction?
>>
>> I wonder if there are some performance implications in this, but it's
>> definitely simpler than what I did.
>>
>> Tested on the same failing machine used for the BZ, fixes the bug.
>>
>> Do you want me to re-send the patch on your behalf (and add probably a
>> small documentation on Documentation/virt/kvm/api.rst)?
>>
>> Emanuele
>>> ---
>>> From: Sean Christopherson <seanjc@google.com>
>>> Date: Tue, 27 Sep 2022 08:16:03 -0700
>>> Subject: [PATCH] KVM: x86: Exit to userspace with zero-length MMIO "read" on
>>>   MMIO fetch
>>>
>>> Exit to userspace with KVM_EXIT_MMIO if emulation fails due to not being
>>> able to fetch instruction bytes, e.g. if the resolved GPA isn't backed by
>>> a memslot.  If userspace is manipulating memslots without pausing vCPUs,
>>> e.g. to emulate BIOS relocation, then a vCPU may fetch while there is no
>>> valid memslot installed.  Depending on guest context, KVM will either
>>> exit to userspace with KVM_EXIT_INTERNAL_ERROR (L1, CPL=0) or simply
>>> resume the guest (L2 or CPL>0), neither of which is desirable as exiting
>>> with "emulation error" effectively kills the VM, and resuming the guest
>>> doesn't provide userspace an opportunity to react the to fetch.
>>>
>>> Use "mmio.len == 0" to indicate "fetch".  This is a bit of a hack, but
>>> there is no other way to communicate "fetch" to userspace without
>>> defining an entirely new exit reason, e.g. "mmio.is_write" is a boolean
>>> and not a flag, and there is no known use case for actually supporting
>>> code fetches from MMIO, i.e. there's no need to allow userspace to fill
>>> in the instruction bytes.
>>>
>>> Signed-off-by: Sean Christopherson <seanjc@google.com>
>>> ---
>>>   arch/x86/kvm/emulate.c     | 2 ++
>>>   arch/x86/kvm/kvm_emulate.h | 1 +
>>>   arch/x86/kvm/x86.c         | 9 ++++++++-
>>>   3 files changed, 11 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
>>> index f092c54d1a2f..e141238d93b0 100644
>>> --- a/arch/x86/kvm/emulate.c
>>> +++ b/arch/x86/kvm/emulate.c
>>> @@ -5353,6 +5353,8 @@ int x86_decode_insn(struct x86_emulate_ctxt *ctxt, void *insn, int insn_len, int
>>>   done:
>>>   	if (rc == X86EMUL_PROPAGATE_FAULT)
>>>   		ctxt->have_exception = true;
>>> +	if (rc == X86EMUL_IO_NEEDED)
>>> +		return EMULATION_IO_FETCH;
>>>   	return (rc != X86EMUL_CONTINUE) ? EMULATION_FAILED : EMULATION_OK;
>>>   }
>>>   
>>> diff --git a/arch/x86/kvm/kvm_emulate.h b/arch/x86/kvm/kvm_emulate.h
>>> index 89246446d6aa..3cb2e321fcd2 100644
>>> --- a/arch/x86/kvm/kvm_emulate.h
>>> +++ b/arch/x86/kvm/kvm_emulate.h
>>> @@ -516,6 +516,7 @@ bool x86_page_table_writing_insn(struct x86_emulate_ctxt *ctxt);
>>>   #define EMULATION_OK 0
>>>   #define EMULATION_RESTART 1
>>>   #define EMULATION_INTERCEPTED 2
>>> +#define EMULATION_IO_FETCH 3
>>>   void init_decode_cache(struct x86_emulate_ctxt *ctxt);
>>>   int x86_emulate_insn(struct x86_emulate_ctxt *ctxt);
>>>   int emulator_task_switch(struct x86_emulate_ctxt *ctxt,
>>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>>> index aa5ab0c620de..7eb72694c601 100644
>>> --- a/arch/x86/kvm/x86.c
>>> +++ b/arch/x86/kvm/x86.c
>>> @@ -7129,8 +7129,13 @@ static int kvm_fetch_guest_virt(struct x86_emulate_ctxt *ctxt,
>>>   		bytes = (unsigned)PAGE_SIZE - offset;
>>>   	ret = kvm_vcpu_read_guest_page(vcpu, gpa >> PAGE_SHIFT, val,
>>>   				       offset, bytes);
>>> -	if (unlikely(ret < 0))
>>> +	if (unlikely(ret < 0)) {
>>> +		vcpu->run->mmio.phys_addr = gpa;
>>> +		vcpu->run->mmio.len = 0;
>>> +		vcpu->run->mmio.is_write = 0;
>>> +		vcpu->run->exit_reason = KVM_EXIT_MMIO;
>>>   		return X86EMUL_IO_NEEDED;
>>> +	}
>>>   
>>>   	return X86EMUL_CONTINUE;
>>>   }
>>> @@ -8665,6 +8670,8 @@ int x86_emulate_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
>>>   		r = x86_decode_emulated_instruction(vcpu, emulation_type,
>>>   						    insn, insn_len);
>>>   		if (r != EMULATION_OK)  {
>>> +			if (r == EMULATION_IO_FETCH)
>>> +				return 0;
>>>   			if ((emulation_type & EMULTYPE_TRAP_UD) ||
>>>   			    (emulation_type & EMULTYPE_TRAP_UD_FORCED)) {
>>>   				kvm_queue_exception(vcpu, UD_VECTOR);
>>>
>>> base-commit: 39d9b48cc777bdf6d67d01ed24f1f89b13f5fbb2
>>>
> 
> Note that AFAIK, there is another case (and probably more), if TDP is disabled,
> and MMU root is in mmio, we kill the guest.
> 
> 
> mmu_alloc_shadow_roots -> mmu_check_root
> 
> 
> I used to have few hacks in KVM to cope with this, but AFAIK,
> I gave up on it, because the issue would show up again and again.

IIRC, s390x can have real problems if we temporarily remove a memslot
and the emulation/interpretation code tries accessing guest memory
and fails because there is no memslot describing that portion of guest RAM.

Note that resizing/merging/splitting currently shouldn't happen on 
s390x, though. But resizing of memslots might happen in the near future 
with virtio-mem in QEMU.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH 0/9] kvm: implement atomic memslot updates
  2022-09-27 15:58                     ` Sean Christopherson
  2022-09-28  9:11                       ` Emanuele Giuseppe Esposito
@ 2022-09-28 15:07                       ` Paolo Bonzini
  2022-09-28 15:33                         ` David Hildenbrand
  2022-09-28 15:58                         ` Sean Christopherson
  1 sibling, 2 replies; 58+ messages in thread
From: Paolo Bonzini @ 2022-09-28 15:07 UTC (permalink / raw)
  To: Sean Christopherson, Emanuele Giuseppe Esposito
  Cc: David Hildenbrand, Maxim Levitsky, kvm, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
	linux-kernel, Like Xu

On 9/27/22 17:58, Sean Christopherson wrote:
> On Tue, Sep 27, 2022, Emanuele Giuseppe Esposito wrote:
>>
>> Am 26/09/2022 um 23:28 schrieb Sean Christopherson:
>>> On Mon, Sep 26, 2022, David Hildenbrand wrote:
>>>> As Sean said "This is an awful lot of a complexity to take on for something
>>>> that appears to be solvable in userspace."
>>>
>>> And if the userspace solution is unpalatable for whatever reason, I'd like to
>>> understand exactly what KVM behavior is problematic for userspace.  E.g. the
>>> above RHBZ bug should no longer be an issue as the buggy commit has since been
>>> reverted.
>>
>> It still is because I can reproduce the bug, as also pointed out in
>> multiple comments below.
> 
> You can reproduce _a_ bug, but it's obviously not the original bug, because the
> last comment says:
> 
>    Second, indeed the patch was reverted and somehow accepted without generating
>    too much noise:
> 
>    ...
> 
>    The underlying issue of course as we both know is still there.
> 
>    You might have luck reproducing it with this bug
> 
>    https://bugzilla.redhat.com/show_bug.cgi?id=1855298
> 
>    But for me it looks like it is 'working' as well, so you might have
>    to write a unit test to trigger the issue.
> 
>>> If the issue is KVM doing something nonsensical on a code fetch to MMIO, then I'd
>>> much rather fix _that_ bug and improve KVM's user exit ABI to let userspace handle
>>> the race _if_ userspace chooses not to pause vCPUs.
>>>
>>
>> Also on the BZ they all seem (Paolo included) to agree that the issue is
>> non-atomic memslots update.
> 
> Yes, non-atomic memslot likely results in the guest fetching from a GPA without a
> memslot.  I'm asking for an explanation of exactly what happens when that occurs,
> because it should be possible to adjust KVM and/or QEMU to play nice with the
> fetch, e.g. to resume the guest until the new memslot is installed, in which case
> an atomic update isn't needed.
> 
> I assume the issue is that KVM exits with KVM_EXIT_INTERNAL_ERROR because the
> guest is running at CPL=0, and QEMU kills the guest in response.  If that's correct,
> then that problem can be solved by exiting to userspace with KVM_EXIT_MMIO instead
> of KVM_EXIT_INTERNAL_ERROR so that userspace can do something sane in response to
> the MMIO code fetch.
> 
> I'm pretty sure this patch will Just Work for QEMU, because QEMU simply resumes
> the vCPU if mmio.len==0.  It's a bit of a hack, but I don't think it violates KVM's
> ABI in any way, and it can even become "official" behavior since KVM x86 doesn't
> otherwise exit with mmio.len==0.

I think this patch is not a good idea for two reasons:

1) we don't know how userspace behaves if mmio.len is zero.  It is of 
course reasonable to do nothing, but an assertion failure is also a 
valid behavior

2) more importantly, there is no way to distinguish a failure due to the 
guest going in the weeds (and then KVM_EXIT_INTERNAL_ERROR is fine) from 
one due to the KVM_SET_USER_MEMORY_REGION race condition.  So this will 
cause a guest that correctly caused an internal error to loop forever.

While the former could be handled in a "wait and see" manner, the latter 
in particular is part of the KVM_RUN contract.  Of course it is possible 
for a guest to just loop forever, but in general all of KVM, QEMU and 
upper userspace layers want a crashed guest to be detected and stopped 
forever.

Yes, QEMU could loop only if memslot updates are in progress, but 
honestly all the alternatives I have seen to atomic memslot updates are 
really *awful*.  David's patches even invent a new kind of mutex for 
which I have absolutely no idea what kind of deadlocks one should worry 
about and why they should not exist; QEMU's locking is already pretty 
crappy, it's certainly not on my wishlist to make it worse!

This is clearly a deficiency in the KVM kernel API, and (thanks to SRCU) 
the kernel is the only place where you can have a *good* fix.  It should 
have been fixed years ago.

Paolo


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH 0/9] kvm: implement atomic memslot updates
  2022-09-28 15:07                       ` Paolo Bonzini
@ 2022-09-28 15:33                         ` David Hildenbrand
  2022-09-28 15:58                         ` Sean Christopherson
  1 sibling, 0 replies; 58+ messages in thread
From: David Hildenbrand @ 2022-09-28 15:33 UTC (permalink / raw)
  To: Paolo Bonzini, Sean Christopherson, Emanuele Giuseppe Esposito
  Cc: Maxim Levitsky, kvm, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin, linux-kernel, Like Xu

On 28.09.22 17:07, Paolo Bonzini wrote:
> On 9/27/22 17:58, Sean Christopherson wrote:
>> On Tue, Sep 27, 2022, Emanuele Giuseppe Esposito wrote:
>>>
>>> Am 26/09/2022 um 23:28 schrieb Sean Christopherson:
>>>> On Mon, Sep 26, 2022, David Hildenbrand wrote:
>>>>> As Sean said "This is an awful lot of a complexity to take on for something
>>>>> that appears to be solvable in userspace."
>>>>
>>>> And if the userspace solution is unpalatable for whatever reason, I'd like to
>>>> understand exactly what KVM behavior is problematic for userspace.  E.g. the
>>>> above RHBZ bug should no longer be an issue as the buggy commit has since been
>>>> reverted.
>>>
>>> It still is because I can reproduce the bug, as also pointed out in
>>> multiple comments below.
>>
>> You can reproduce _a_ bug, but it's obviously not the original bug, because the
>> last comment says:
>>
>>     Second, indeed the patch was reverted and somehow accepted without generating
>>     too much noise:
>>
>>     ...
>>
>>     The underlying issue of course as we both know is still there.
>>
>>     You might have luck reproducing it with this bug
>>
>>     https://bugzilla.redhat.com/show_bug.cgi?id=1855298
>>
>>     But for me it looks like it is 'working' as well, so you might have
>>     to write a unit test to trigger the issue.
>>
>>>> If the issue is KVM doing something nonsensical on a code fetch to MMIO, then I'd
>>>> much rather fix _that_ bug and improve KVM's user exit ABI to let userspace handle
>>>> the race _if_ userspace chooses not to pause vCPUs.
>>>>
>>>
>>> Also on the BZ they all seem (Paolo included) to agree that the issue is
>>> non-atomic memslots update.
>>
>> Yes, non-atomic memslot likely results in the guest fetching from a GPA without a
>> memslot.  I'm asking for an explanation of exactly what happens when that occurs,
>> because it should be possible to adjust KVM and/or QEMU to play nice with the
>> fetch, e.g. to resume the guest until the new memslot is installed, in which case
>> an atomic update isn't needed.
>>
>> I assume the issue is that KVM exits with KVM_EXIT_INTERNAL_ERROR because the
>> guest is running at CPL=0, and QEMU kills the guest in response.  If that's correct,
>> then that problem can be solved by exiting to userspace with KVM_EXIT_MMIO instead
>> of KVM_EXIT_INTERNAL_ERROR so that userspace can do something sane in response to
>> the MMIO code fetch.
>>
>> I'm pretty sure this patch will Just Work for QEMU, because QEMU simply resumes
>> the vCPU if mmio.len==0.  It's a bit of a hack, but I don't think it violates KVM's
>> ABI in any way, and it can even become "official" behavior since KVM x86 doesn't
>> otherwise exit with mmio.len==0.
> 
> I think this patch is not a good idea for two reasons:
> 
> 1) we don't know how userspace behaves if mmio.len is zero.  It is of
> course reasonable to do nothing, but an assertion failure is also a
> valid behavior
> 
> 2) more important, there is no way to distinguish a failure due to the
> guest going in the weeds (and then KVM_EXIT_INTERNAL_ERROR is fine) from
> one due to the KVM_SET_USER_MEMORY_REGION race condition.  So this will
> cause a guest that correctly caused an internal error to loop forever.
> 
> While the former could be handled in a "wait and see" manner, the latter
> in particular is part of the KVM_RUN contract.  Of course it is possible
> for a guest to just loop forever, but in general all of KVM, QEMU and
> upper userspace layers want a crashed guest to be detected and stopped
> forever.
> 
> Yes, QEMU could loop only if memslot updates are in progress, but
> honestly all the alternatives I have seen to atomic memslot updates are
> really *awful*.  David's patches even invent a new kind of mutex for
> which I have absolutely no idea what kind of deadlocks one should worry
> about and why they should not exist; QEMU's locking is already pretty
> crappy, it's certainly not on my wishlist to make it worse!

Just to comment on that (I'm happy as long as this gets fixed), a simple 
mutex with trylock should get the thing done as well -- kicking the VCPU 
if the trylock fails. But I did not look further into locking alternatives.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH 0/9] kvm: implement atomic memslot updates
  2022-09-28 15:07                       ` Paolo Bonzini
  2022-09-28 15:33                         ` David Hildenbrand
@ 2022-09-28 15:58                         ` Sean Christopherson
  2022-09-28 16:38                           ` Paolo Bonzini
  1 sibling, 1 reply; 58+ messages in thread
From: Sean Christopherson @ 2022-09-28 15:58 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Emanuele Giuseppe Esposito, David Hildenbrand, Maxim Levitsky,
	kvm, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, linux-kernel, Like Xu

On Wed, Sep 28, 2022, Paolo Bonzini wrote:
> On 9/27/22 17:58, Sean Christopherson wrote:
> > I'm pretty sure this patch will Just Work for QEMU, because QEMU simply resumes
> > the vCPU if mmio.len==0.  It's a bit of a hack, but I don't think it violates KVM's
> > ABI in any way, and it can even become "official" behavior since KVM x86 doesn't
> > otherwise exit with mmio.len==0.
> 
> I think this patch is not a good idea for two reasons:
> 
> 1) we don't know how userspace behaves if mmio.len is zero.  It is of course
> reasonable to do nothing, but an assertion failure is also a valid behavior

Except that KVM currently does neither.  If the fetch happens at CPL>0 and/or in
L2, KVM injects #UD.  That's flat out architecturally invalid.  If it's a sticking
point, the mmio.len==0 hack can be avoided by defining a new exit reason.
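
A purely hypothetical sketch of that alternative, with a made-up exit
number and struct that do not exist in KVM today:

#include <linux/types.h>

/* Illustrative only: a dedicated exit for "instruction fetch hit MMIO". */
#define KVM_EXIT_CODE_FETCH_MMIO	40

struct kvm_code_fetch_mmio {
	__u64 phys_addr;	/* GPA of the failed instruction fetch */
};

Userspace would then resume the vCPU (or report an error) based solely
on the exit reason, without overloading mmio.len.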

> 2) more important, there is no way to distinguish a failure due to the guest
> going in the weeds (and then KVM_EXIT_INTERNAL_ERROR is fine) from one due
> to the KVM_SET_USER_MEMORY_REGION race condition.  So this will cause a
> guest that correctly caused an internal error to loop forever.

Userspace has the GPA and absolutely should be able to detect if the MMIO may have
been due to its memslot manipulation versus the guest jumping into the weeds.
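
For what it's worth, a minimal sketch of such a check, with made-up
names: the VMM records the GPA range it is currently re-mapping and only
retries when the faulting GPA falls inside it.

#include <stdint.h>
#include <stdbool.h>

struct inflight_update {
	uint64_t gpa_start;
	uint64_t gpa_len;
	bool active;
};

static bool fetch_hit_inflight_update(const struct inflight_update *u,
				      uint64_t fault_gpa)
{
	return u->active &&
	       fault_gpa >= u->gpa_start &&
	       fault_gpa - u->gpa_start < u->gpa_len;
}

/* true  -> likely the VMM's own memslot manipulation: re-enter the guest
 * false -> the guest jumped into the weeds: report the error as before */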

> While the former could be handled in a "wait and see" manner, the latter in
> particular is part of the KVM_RUN contract.  Of course it is possible for a
> guest to just loop forever, but in general all of KVM, QEMU and upper
> userspace layers want a crashed guest to be detected and stopped forever.
> 
> Yes, QEMU could loop only if memslot updates are in progress, but honestly
> all the alternatives I have seen to atomic memslot updates are really
> *awful*.  David's patches even invent a new kind of mutex for which I have
> absolutely no idea what kind of deadlocks one should worry about and why
> they should not exist; QEMU's locking is already pretty crappy, it's
> certainly not on my wishlist to make it worse!
> 
> This is clearly a deficiency in the KVM kernel API, and (thanks to SRCU) the
> kernel is the only place where you can have a *good* fix.  It should have
> been fixed years ago.

I don't disagree that the memslots API is lacking, but IMO that is somewhat
orthogonal to fixing KVM x86's "code fetch to MMIO" mess.  Such a massive new API
should be viewed and prioritized as a new feature, not as a bug fix, e.g. I'd
like to have the luxury of being able to explore ideas beyond "let userspace
batch memslot updates", and I really don't want to feel pressured to get this
code reviewed and merged.

E.g. why do a batch update and not provide KVM_SET_ALL_USER_MEMORY_REGIONS to
do wholesale replacement?  That seems like it would be vastly simpler to handle
on KVM's end.  Or maybe there's a solution in the opposite direction, e.g. an
API that allows 1->N or N->1 conversions but not arbitrary batching.
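
Purely to illustrate the shape of that idea (nothing below exists in KVM;
the struct and ioctl number are made up), a wholesale-replacement API
could be as small as:

#include <linux/kvm.h>

struct kvm_userspace_all_memory_regions {
	__u32 nent;	/* number of slots in the complete new layout */
	__u32 padding;
	struct kvm_userspace_memory_region regions[];
};

#define KVM_SET_ALL_USER_MEMORY_REGIONS \
	_IOW(KVMIO, 0x4a, struct kvm_userspace_all_memory_regions)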

And just because QEMU's locking is "already pretty crappy", that's not a good
reason to drag KVM down into the mud.  E.g. taking a lock and conditionally
releasing it...  I get that this is an RFC, but IMO anything that requires such
shenanigans simply isn't acceptable.

  /*
   * Takes kvm->slots_arch_lock, and releases it only if
   * invalid_slot allocation, kvm_prepare_memory_region failed
   * or batch->is_move_delete is true.
   */
  static int kvm_prepare_memslot(struct kvm *kvm,
			         struct kvm_internal_memory_region_list *batch)

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH 0/9] kvm: implement atomic memslot updates
  2022-09-28 15:58                         ` Sean Christopherson
@ 2022-09-28 16:38                           ` Paolo Bonzini
  2022-09-28 20:41                             ` Sean Christopherson
  0 siblings, 1 reply; 58+ messages in thread
From: Paolo Bonzini @ 2022-09-28 16:38 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Emanuele Giuseppe Esposito, David Hildenbrand, Maxim Levitsky,
	kvm, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, linux-kernel, Like Xu

On 9/28/22 17:58, Sean Christopherson wrote:
> On Wed, Sep 28, 2022, Paolo Bonzini wrote:
>> On 9/27/22 17:58, Sean Christopherson wrote:
>>> I'm pretty sure this patch will Just Work for QEMU, because QEMU simply resumes
>>> the vCPU if mmio.len==0.  It's a bit of a hack, but I don't think it violates KVM's
>>> ABI in any way, and it can even become "official" behavior since KVM x86 doesn't
>>> otherwise exit with mmio.len==0.
>>
>> I think this patch is not a good idea for two reasons:
>>
>> 1) we don't know how userspace behaves if mmio.len is zero.  It is of course
>> reasonable to do nothing, but an assertion failure is also a valid behavior
> 
> Except that KVM currently does neither.  If the fetch happens at CPL>0 and/or in
> L2, KVM injects #UD.  That's flat out architecturally invalid.  If it's a sticking
> point, the mmio.len==0 hack can be avoided by defining a new exit reason.

I agree that doing this at CPL>0 or in L2 is invalid and makes little 
sense (because either way the invalid address cannot be reached without 
help from the supervisor or L1's page tables).

>> 2) more important, there is no way to distinguish a failure due to the guest
>> going in the weeds (and then KVM_EXIT_INTERNAL_ERROR is fine) from one due
>> to the KVM_SET_USER_MEMORY_REGION race condition.  So this will cause a
>> guest that correctly caused an internal error to loop forever.
> 
> Userspace has the GPA and absolutely should be able to detect if the MMIO may have
> been due to its memslot manipulation versus the guest jumping into the weeds.
> 
>> While the former could be handled in a "wait and see" manner, the latter in
>> particular is part of the KVM_RUN contract.  Of course it is possible for a
>> guest to just loop forever, but in general all of KVM, QEMU and upper
>> userspace layers want a crashed guest to be detected and stopped forever.
>>
>> Yes, QEMU could loop only if memslot updates are in progress, but honestly
>> all the alternatives I have seen to atomic memslot updates are really
>> *awful*.  David's patches even invent a new kind of mutex for which I have
>> absolutely no idea what kind of deadlocks one should worry about and why
>> they should not exist; QEMU's locking is already pretty crappy, it's
>> certainly not on my wishlist to make it worse!
>>
>> This is clearly a deficiency in the KVM kernel API, and (thanks to SRCU) the
>> kernel is the only place where you can have a *good* fix.  It should have
>> been fixed years ago.
> 
> I don't disagree that the memslots API is lacking, but IMO that is somewhat
> orthogonal to fixing KVM x86's "code fetch to MMIO" mess.  Such a massive new API
> should be viewed and prioritized as a new feature, not as a bug fix, e.g. I'd
> like to have the luxury of being able to explore ideas beyond "let userspace
> batch memslot updates", and I really don't want to feel pressured to get this
> code reviewed and merge.

I absolutely agree that this is not a bugfix.  Most new features for KVM 
can be seen as bug fixes if you squint hard enough, but they're still 
features.

> E.g. why do a batch update and not provide KVM_SET_ALL_USER_MEMORY_REGIONS to
> do wholesale replacement?  That seems like it would be vastly simpler to handle
> on KVM's end.  Or maybe there's a solution in the opposite direction, e.g. an
> API that allows 1->N or N->1 conversions but not arbitrary batching.

Wholesale replacement was my first idea when I looked at the issue, I 
think at the end of 2020.  I never got to a full implementation, but my 
impression was that allocating/deallocating dirty bitmaps, rmaps etc. 
would not make it any easier than arbitrary batch updates.

> And just because QEMU's locking is "already pretty crappy", that's not a good
> reason to drag KVM down into the mud.  E.g. taking a lock and conditionally
> releasing it...  I get that this is an RFC, but IMO anything that requires such
> shenanigans simply isn't acceptable.
> 
>    /*
>     * Takes kvm->slots_arch_lock, and releases it only if
>     * invalid_slot allocation, kvm_prepare_memory_region failed
>     * or batch->is_move_delete is true.
>     */
>    static int kvm_prepare_memslot(struct kvm *kvm,
> 			         struct kvm_internal_memory_region_list *batch)
> 

No objection about that. :)

Paolo


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH 1/9] kvm_main.c: move slot check in kvm_set_memory_region
  2022-09-09 10:44 ` [RFC PATCH 1/9] kvm_main.c: move slot check in kvm_set_memory_region Emanuele Giuseppe Esposito
@ 2022-09-28 16:41   ` Paolo Bonzini
  0 siblings, 0 replies; 58+ messages in thread
From: Paolo Bonzini @ 2022-09-28 16:41 UTC (permalink / raw)
  To: Emanuele Giuseppe Esposito, kvm
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, David Hildenbrand, Maxim Levitsky, x86,
	H. Peter Anvin, linux-kernel

On 9/9/22 12:44, Emanuele Giuseppe Esposito wrote:
> And make kvm_set_memory_region static, since it is not used outside
> kvm_main.c
> 
> Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
> ---
>   include/linux/kvm_host.h |  2 --
>   virt/kvm/kvm_main.c      | 11 +++++------
>   2 files changed, 5 insertions(+), 8 deletions(-)
> 
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 3b40f8d68fbb..1c5b7b2e35dd 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -1108,8 +1108,6 @@ enum kvm_mr_change {
>   	KVM_MR_FLAGS_ONLY,
>   };
>   
> -int kvm_set_memory_region(struct kvm *kvm,
> -			  const struct kvm_userspace_memory_region *mem);
>   int __kvm_set_memory_region(struct kvm *kvm,
>   			    const struct kvm_userspace_memory_region *mem);
>   void kvm_arch_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot);
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index da263c370d00..339de0ed4557 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -2007,24 +2007,23 @@ int __kvm_set_memory_region(struct kvm *kvm,
>   }
>   EXPORT_SYMBOL_GPL(__kvm_set_memory_region);
>   
> -int kvm_set_memory_region(struct kvm *kvm,
> -			  const struct kvm_userspace_memory_region *mem)
> +static int kvm_set_memory_region(struct kvm *kvm,
> +				 const struct kvm_userspace_memory_region *mem)
>   {
>   	int r;
>   
> +	if ((u16)mem->slot >= KVM_USER_MEM_SLOTS)
> +		return -EINVAL;
> +
>   	mutex_lock(&kvm->slots_lock);
>   	r = __kvm_set_memory_region(kvm, mem);
>   	mutex_unlock(&kvm->slots_lock);
>   	return r;
>   }
> -EXPORT_SYMBOL_GPL(kvm_set_memory_region);
>   
>   static int kvm_vm_ioctl_set_memory_region(struct kvm *kvm,
>   					  struct kvm_userspace_memory_region *mem)
>   {
> -	if ((u16)mem->slot >= KVM_USER_MEM_SLOTS)
> -		return -EINVAL;
> -
>   	return kvm_set_memory_region(kvm, mem);
>   }
>   

The idea here was that kvm_set_memory_region could be used to set 
private memory slots while not taking kvm->slots_lock.

So, I would instead:

1) rename __kvm_set_memory_region to kvm_set_memory_region;

2) inline the old kvm_set_memory_region into kvm_vm_ioctl_set_memory_region.

3) replace the comment "Must be called holding kvm->slots_lock for 
write." with a proper lockdep_assert_held() now that the function 
doesn't have the __ warning sign in front of it.
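
Putting the three points together, the end result would be roughly the
following (an untested sketch, just to make the suggestion concrete):

static int kvm_vm_ioctl_set_memory_region(struct kvm *kvm,
					  struct kvm_userspace_memory_region *mem)
{
	int r;

	if ((u16)mem->slot >= KVM_USER_MEM_SLOTS)
		return -EINVAL;

	mutex_lock(&kvm->slots_lock);
	r = kvm_set_memory_region(kvm, mem);	/* was __kvm_set_memory_region */
	mutex_unlock(&kvm->slots_lock);
	return r;
}

with kvm_set_memory_region() itself starting with
lockdep_assert_held(&kvm->slots_lock) in place of the comment.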

Paolo


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH 2/9] kvm.h: introduce KVM_SET_USER_MEMORY_REGION_LIST ioctl
  2022-09-09 10:44 ` [RFC PATCH 2/9] kvm.h: introduce KVM_SET_USER_MEMORY_REGION_LIST ioctl Emanuele Giuseppe Esposito
@ 2022-09-28 16:42   ` Paolo Bonzini
  0 siblings, 0 replies; 58+ messages in thread
From: Paolo Bonzini @ 2022-09-28 16:42 UTC (permalink / raw)
  To: Emanuele Giuseppe Esposito, kvm
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, David Hildenbrand, Maxim Levitsky, x86,
	H. Peter Anvin, linux-kernel

On 9/9/22 12:44, Emanuele Giuseppe Esposito wrote:
> This IOCTL enables atomic update of multiple memslots.
> The userspace application provides a kvm_userspace_memory_region_list
> containing a list of entries, each representing a modification to be
> performed to a memslot.
> 
> Requests with invalidate_slot == 1 are pre-processed, because they
> are ther DELETE or MOVE, and therefore the memslot must be first
> replaced with a copy marked as KVM_MEMSLOT_INVALID, and then replaced.
> 
> Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
> ---
>   include/uapi/linux/kvm.h | 21 ++++++++++++++++++++-
>   1 file changed, 20 insertions(+), 1 deletion(-)
> 
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index a36e78710382..673496b91a25 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -103,6 +103,24 @@ struct kvm_userspace_memory_region {
>   	__u64 userspace_addr; /* start of the userspace allocated memory */
>   };
>   
> +/* for KVM_SET_USER_MEMORY_REGION_LIST */
> +struct kvm_userspace_memory_region_entry {
> +	__u32 slot;
> +	__u32 flags;
> +	__u64 guest_phys_addr;
> +	__u64 memory_size; /* bytes */
> +	__u64 userspace_addr; /* start of the userspace allocated memory */
> +	__u8 invalidate_slot;
> +	__u8 padding[31];
> +};
> +
> +/* for KVM_SET_USER_MEMORY_REGION_LIST */
> +struct kvm_userspace_memory_region_list {
> +	__u32 nent;
> +	__u32 flags;
> +	struct kvm_userspace_memory_region_entry entries[0];
> +};
> +
>   /*
>    * The bit 0 ~ bit 15 of kvm_memory_region::flags are visible for userspace,
>    * other bits are reserved for kvm internal use which are defined in
> @@ -1444,7 +1462,8 @@ struct kvm_vfio_spapr_tce {
>   					struct kvm_userspace_memory_region)
>   #define KVM_SET_TSS_ADDR          _IO(KVMIO,   0x47)
>   #define KVM_SET_IDENTITY_MAP_ADDR _IOW(KVMIO,  0x48, __u64)
> -
> +#define KVM_SET_USER_MEMORY_REGION_LIST _IOW(KVMIO, 0x49, \
> +					struct kvm_userspace_memory_region_list)
>   /* enable ucontrol for s390 */
>   struct kvm_s390_ucas_mapping {
>   	__u64 user_addr;

Looks good; however, in the non-RFC I suggest adding documentation in 
this patch already (so no Reviewed-by yet).

Paolo


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH 3/9] kvm_main.c: introduce kvm_internal_memory_region_list
  2022-09-09 10:45 ` [RFC PATCH 3/9] kvm_main.c: introduce kvm_internal_memory_region_list Emanuele Giuseppe Esposito
@ 2022-09-28 16:48   ` Paolo Bonzini
  0 siblings, 0 replies; 58+ messages in thread
From: Paolo Bonzini @ 2022-09-28 16:48 UTC (permalink / raw)
  To: Emanuele Giuseppe Esposito, kvm
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, David Hildenbrand, Maxim Levitsky, x86,
	H. Peter Anvin, linux-kernel

On 9/9/22 12:45, Emanuele Giuseppe Esposito wrote:
> +struct kvm_internal_memory_region_list {
> +	/* Fields initialized in __kvm_set_memory_region() */
> +	struct kvm_memory_slot *old;
> +	struct kvm_memory_slot *new;
> +	struct kvm_memory_slot *invalid;
> +	enum kvm_mr_change change;
> +};
> +

Alternative name: kvm_memslot_change

>  int __kvm_set_memory_region(struct kvm *kvm,
> -			    const struct kvm_userspace_memory_region *mem);
> +			    const struct kvm_userspace_memory_region *mem,
> +			    struct kvm_internal_memory_region_list *batch);

A bit weird to have this passed by the caller.  I'd rather have
__kvm_set_memory_region take the *batch argument (which I'd call *change
instead), and design kvm_set_memory_region so that, at the end of the
series, it is something like:

int kvm_set_memory_region(struct kvm *kvm,
			  const struct kvm_userspace_memory_region *mem)
{
	struct kvm_memslot_change change = {};
	r = __kvm_set_memory_region(kvm, mem, &change);
	if (r < 0)
		return r;
	r = invalidate_memslot_for_change(kvm, &change);
	if (r < 0)
		return r;
	return commit_memslot_change(kvm, &change);
}

Paolo


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH 4/9] kvm_main.c: split logic in kvm_set_memslots
  2022-09-09 10:45 ` [RFC PATCH 4/9] kvm_main.c: split logic in kvm_set_memslots Emanuele Giuseppe Esposito
@ 2022-09-28 17:04   ` Paolo Bonzini
  0 siblings, 0 replies; 58+ messages in thread
From: Paolo Bonzini @ 2022-09-28 17:04 UTC (permalink / raw)
  To: Emanuele Giuseppe Esposito, kvm
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, David Hildenbrand, Maxim Levitsky, x86,
	H. Peter Anvin, linux-kernel

On 9/9/22 12:45, Emanuele Giuseppe Esposito wrote:
> +/*
> + * Takes kvm->slots_arch_lock, and releases it only if
> + * invalid_slot allocation or kvm_prepare_memory_region failed.
> +*/

Much simpler: "kvm->slots_arch_lock is taken on a successful return."

This is a small change in phrasing, but it makes a lot more sense: on 
success you are preparing for the final commit operation, otherwise you 
just want the caller to return your errno value.

[...]

> +/* Must be called with kvm->slots_arch_lock held, but releases it. */
s/but/and/.  Even better, "and releases it before returning".  "But" 
tells the reader that something strange is going on, "and" tells them that 
something simple is going on.

I would also rename the functions along the lines of my review to patch 
3, to highlight that these functions prepare/commit a *change* to a memslot.

> +static void kvm_finish_memslot(struct kvm *kvm,
> +			       struct kvm_internal_memory_region_list *batch)
> +{
> +	struct kvm_memory_slot *invalid_slot = batch->invalid;
> +	struct kvm_memory_slot *old = batch->old;
> +	struct kvm_memory_slot *new = batch->new;
> +	enum kvm_mr_change change = batch->change;

lockdep_assert_held(&kvm->slots_arch_lock);

>   
>   	/*
>   	 * For DELETE and MOVE, the working slot is now active as the INVALID
> @@ -1883,6 +1898,18 @@ static int kvm_set_memslot(struct kvm *kvm,
>   	 * responsible for knowing that new->arch may be stale.
>   	 */
>   	kvm_commit_memory_region(kvm, batch);
> +}
> +
> +static int kvm_set_memslot(struct kvm *kvm,
> +			   struct kvm_internal_memory_region_list *batch)
> +{
> +	int r;
> +
> +	r = kvm_prepare_memslot(kvm, batch);
> +	if (r)
> +		return r;
> +
> +	kvm_finish_memslot(kvm, batch);
>   
>   	return 0;

Ok, these are the functions I hinted at in the review of the previous 
patch, so we're not far away.  You should be able to move the 
kvm_set_memslot call to kvm_set_memory_region in patch 3, and then 
replace it with the two calls here directly in kvm_set_memory_region.

Paolo


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH 5/9] kvm_main.c: split __kvm_set_memory_region logic in kvm_check_mem and kvm_prepare_batch
  2022-09-09 10:45 ` [RFC PATCH 5/9] kvm_main.c: split __kvm_set_memory_region logic in kvm_check_mem and kvm_prepare_batch Emanuele Giuseppe Esposito
  2022-09-13  2:56   ` Yang, Weijiang
@ 2022-09-28 17:11   ` Paolo Bonzini
  1 sibling, 0 replies; 58+ messages in thread
From: Paolo Bonzini @ 2022-09-28 17:11 UTC (permalink / raw)
  To: Emanuele Giuseppe Esposito, kvm
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, David Hildenbrand, Maxim Levitsky, x86,
	H. Peter Anvin, linux-kernel

On 9/9/22 12:45, Emanuele Giuseppe Esposito wrote:
> Just a function split. No functional change intended,
> except for the fact that kvm_prepare_batch() does not
> immediately call kvm_set_memslot() if batch->change is
> KVM_MR_DELETE, but delegates the caller (__kvm_set_memory_region).
> 
> Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
> ---
>   virt/kvm/kvm_main.c | 120 +++++++++++++++++++++++++++++---------------
>   1 file changed, 79 insertions(+), 41 deletions(-)
> 
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 17f07546d591..9d917af30593 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -1927,19 +1927,9 @@ static bool kvm_check_memslot_overlap(struct kvm_memslots *slots, int id,
>   	return false;
>   }
>   
> -/*
> - * Allocate some memory and give it an address in the guest physical address
> - * space.
> - *
> - * Discontiguous memory is allowed, mostly for framebuffers.
> - * This function takes also care of initializing batch->new/old/invalid/change
> - * fields.
> - *
> - * Must be called holding kvm->slots_lock for write.
> - */
> -int __kvm_set_memory_region(struct kvm *kvm,
> -			    const struct kvm_userspace_memory_region *mem,
> -			    struct kvm_internal_memory_region_list *batch)
> +static int kvm_prepare_batch(struct kvm *kvm,
> +			     const struct kvm_userspace_memory_region *mem,
> +			     struct kvm_internal_memory_region_list *batch)
>   {
>   	struct kvm_memory_slot *old, *new;
>   	struct kvm_memslots *slots;
> @@ -1947,34 +1937,10 @@ int __kvm_set_memory_region(struct kvm *kvm,
>   	unsigned long npages;
>   	gfn_t base_gfn;
>   	int as_id, id;
> -	int r;
> -
> -	r = check_memory_region_flags(mem);
> -	if (r)
> -		return r;
>   
>   	as_id = mem->slot >> 16;
>   	id = (u16)mem->slot;
>   
> -	/* General sanity checks */
> -	if ((mem->memory_size & (PAGE_SIZE - 1)) ||
> -	    (mem->memory_size != (unsigned long)mem->memory_size))
> -		return -EINVAL;
> -	if (mem->guest_phys_addr & (PAGE_SIZE - 1))
> -		return -EINVAL;
> -	/* We can read the guest memory with __xxx_user() later on. */
> -	if ((mem->userspace_addr & (PAGE_SIZE - 1)) ||
> -	    (mem->userspace_addr != untagged_addr(mem->userspace_addr)) ||
> -	     !access_ok((void __user *)(unsigned long)mem->userspace_addr,
> -			mem->memory_size))
> -		return -EINVAL;
> -	if (as_id >= KVM_ADDRESS_SPACE_NUM || id >= KVM_MEM_SLOTS_NUM)
> -		return -EINVAL;
> -	if (mem->guest_phys_addr + mem->memory_size < mem->guest_phys_addr)
> -		return -EINVAL;
> -	if ((mem->memory_size >> PAGE_SHIFT) > KVM_MEM_MAX_NR_PAGES)
> -		return -EINVAL;
> -
>   	slots = __kvm_memslots(kvm, as_id);
>   
>   	/*
> @@ -1993,7 +1959,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
>   
>   		batch->change = KVM_MR_DELETE;
>   		batch->new = NULL;
> -		return kvm_set_memslot(kvm, batch);
> +		return 0;
>   	}
>   
>   	base_gfn = (mem->guest_phys_addr >> PAGE_SHIFT);
> @@ -2020,7 +1986,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
>   		else if (mem->flags != old->flags)
>   			change = KVM_MR_FLAGS_ONLY;
>   		else /* Nothing to change. */
> -			return 0;
> +			return 1;
>   	}
>   
>   	if ((change == KVM_MR_CREATE || change == KVM_MR_MOVE) &&
> @@ -2041,12 +2007,81 @@ int __kvm_set_memory_region(struct kvm *kvm,
>   
>   	batch->new = new;
>   	batch->change = change;
> +	return 0;
> +}
> +
> +static int kvm_check_mem(const struct kvm_userspace_memory_region *mem)
> +{
> +	int as_id, id;
> +
> +	as_id = mem->slot >> 16;
> +	id = (u16)mem->slot;
> +
> +	/* General sanity checks */
> +	if ((mem->memory_size & (PAGE_SIZE - 1)) ||
> +	    (mem->memory_size != (unsigned long)mem->memory_size))
> +		return -EINVAL;
> +	if (mem->guest_phys_addr & (PAGE_SIZE - 1))
> +		return -EINVAL;
> +	/* We can read the guest memory with __xxx_user() later on. */
> +	if ((mem->userspace_addr & (PAGE_SIZE - 1)) ||
> +	    (mem->userspace_addr != untagged_addr(mem->userspace_addr)) ||
> +	     !access_ok((void __user *)(unsigned long)mem->userspace_addr,
> +			mem->memory_size))
> +		return -EINVAL;
> +	if (as_id >= KVM_ADDRESS_SPACE_NUM || id >= KVM_MEM_SLOTS_NUM)
> +		return -EINVAL;
> +	if (mem->guest_phys_addr + mem->memory_size < mem->guest_phys_addr)
> +		return -EINVAL;
> +	if ((mem->memory_size >> PAGE_SHIFT) > KVM_MEM_MAX_NR_PAGES)
> +		return -EINVAL;
> +
> +	return 0;
> +}
> +
> +static int kvm_check_memory_region(struct kvm *kvm,
> +				const struct kvm_userspace_memory_region *mem,
> +				struct kvm_internal_memory_region_list *batch)
> +{
> +	int r;
> +
> +	r = check_memory_region_flags(mem);
> +	if (r)
> +		return r;
>   
> -	r = kvm_set_memslot(kvm, batch);
> +	r = kvm_check_mem(mem);
>   	if (r)
> -		kfree(new);
> +		return r;
> +
> +	r = kvm_prepare_batch(kvm, mem, batch);
> +	if (r && batch->new)
> +		kfree(batch->new);
> +
>   	return r;
>   }
> +
> +/*
> + * Allocate some memory and give it an address in the guest physical address
> + * space.
> + *
> + * Discontiguous memory is allowed, mostly for framebuffers.
> + * This function takes also care of initializing batch->new/old/invalid/change
> + * fields.
> + *
> + * Must be called holding kvm->slots_lock for write.
> + */
> +int __kvm_set_memory_region(struct kvm *kvm,
> +			    const struct kvm_userspace_memory_region *mem,
> +			    struct kvm_internal_memory_region_list *batch)
> +{
> +	int r;
> +
> +	r = kvm_check_memory_region(kvm, mem, batch);
> +	if (r)
> +		return r;
> +
> +	return kvm_set_memslot(kvm, batch);
> +}
>   EXPORT_SYMBOL_GPL(__kvm_set_memory_region);
>   
>   static int kvm_set_memory_region(struct kvm *kvm,
> @@ -2061,6 +2096,9 @@ static int kvm_set_memory_region(struct kvm *kvm,
>   	mutex_lock(&kvm->slots_lock);
>   	r = __kvm_set_memory_region(kvm, mem, batch);
>   	mutex_unlock(&kvm->slots_lock);
> +	/* r == 1 means empty request, nothing to do but no error */
> +	if (r == 1)
> +		r = 0;

This is weird...  I think you have the order of the split slightly 
messed up.  Doing this check earlier, roughly at the same time as the 
introduction of the new struct, is probably clearer.

After having reviewed more of the code, I do think you should 
disaggregate __kvm_set_memory_region() into separate functions (check, 
build entry, prepare, commit) completely.  kvm_set_memory_region() calls 
them and __kvm_set_memory_region() disappears entirely.
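
Just to sketch the idea (kvm_prepare_memslot()/kvm_commit_memslot() are
placeholder names here, not necessarily what the series should end up
with), the caller could look roughly like:

	static int kvm_set_memory_region(struct kvm *kvm,
			const struct kvm_userspace_memory_region *mem,
			struct kvm_internal_memory_region_list *batch)
	{
		int r;

		mutex_lock(&kvm->slots_lock);

		/* check: validate mem and fill batch->old/new/change */
		r = kvm_check_memory_region(kvm, mem, batch);
		if (r)
			goto out;

		/* prepare: invalidate the old slot if needed, arch hooks */
		r = kvm_prepare_memslot(kvm, batch);
		if (r)
			goto out;

		/* commit: swap the memslot lists and activate the new slot */
		kvm_commit_memslot(kvm, batch);
	out:
		mutex_unlock(&kvm->slots_lock);
		/* r == 1 means "nothing to do", not an error */
		return r == 1 ? 0 : r;
	}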

Paolo

>   	return r;
>   }
>   


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH 7/9] kvm_main.c: duplicate invalid memslot also in inactive list
  2022-09-09 10:45 ` [RFC PATCH 7/9] kvm_main.c: duplicate invalid memslot also in inactive list Emanuele Giuseppe Esposito
@ 2022-09-28 17:18   ` Paolo Bonzini
  0 siblings, 0 replies; 58+ messages in thread
From: Paolo Bonzini @ 2022-09-28 17:18 UTC (permalink / raw)
  To: Emanuele Giuseppe Esposito, kvm
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, David Hildenbrand, Maxim Levitsky, x86,
	H. Peter Anvin, linux-kernel

On 9/9/22 12:45, Emanuele Giuseppe Esposito wrote:
>   	/*
> -	 * if change is DELETE or MOVE, invalid is in active memslots
> -	 * and old in inactive, so replace old with new.
> +	 * if change is DELETE or MOVE, invalid is in both active and inactive
> +	 * memslot list. This means that we don't need old anymore, and
> +	 * we should replace invalid with new.
>   	 */
> -	kvm_replace_memslot(kvm, batch->old, batch->new);
> +	if (batch->change == KVM_MR_DELETE || batch->change == KVM_MR_MOVE)
> +		kvm_replace_memslot(kvm, batch->invalid, batch->new);
> +	else
> +		kvm_replace_memslot(kvm, batch->old, batch->new);

This is also

	kvm_replace_memslot(kvm, batch->invalid ?: batch->old,
			    batch->new);

with no need to look at batch->change.

Paolo


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH 9/9] kvm_main.c: handle atomic memslot update
  2022-09-09 10:45 ` [RFC PATCH 9/9] kvm_main.c: handle atomic memslot update Emanuele Giuseppe Esposito
  2022-09-13  2:30   ` Yang, Weijiang
  2022-09-27  7:46   ` David Hildenbrand
@ 2022-09-28 17:29   ` Paolo Bonzini
  2 siblings, 0 replies; 58+ messages in thread
From: Paolo Bonzini @ 2022-09-28 17:29 UTC (permalink / raw)
  To: Emanuele Giuseppe Esposito, kvm
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, David Hildenbrand, Maxim Levitsky, x86,
	H. Peter Anvin, linux-kernel

On 9/9/22 12:45, Emanuele Giuseppe Esposito wrote:
> @@ -1782,7 +1782,8 @@ static void kvm_update_flags_memslot(struct kvm *kvm,
>   
>   /*
>    * Takes kvm->slots_arch_lock, and releases it only if
> - * invalid_slot allocation or kvm_prepare_memory_region failed.
> + * invalid_slot allocation, kvm_prepare_memory_region failed
> + * or batch->is_move_delete is true.
>    */

This _is_ getting out of hand, though. :)  It is not clear to me why you 
need to do multiple swaps.  If you did a single swap, you could do all 
the allocations of invalid slots in a separate loop, called with 
slots_arch_lock taken and not released until the final swap.  In other 
words, something like

	mutex_lock(&kvm->slots_arch_lock);
	r = create_invalid_slots();
	if (r < 0)
		return r;

	replace_old_slots();

	// also handles abort on failure
 	r = prepare_memory_regions();
	if (r < 0)
		return r;
	swap();
	finish_memslots();

where each function is little more than a loop over the corresponding 
function called by KVM_SET_MEMORY_REGION.

Paolo


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH 0/9] kvm: implement atomic memslot updates
  2022-09-28 16:38                           ` Paolo Bonzini
@ 2022-09-28 20:41                             ` Sean Christopherson
  2022-09-29  8:05                               ` Emanuele Giuseppe Esposito
  2022-09-29 15:28                               ` Paolo Bonzini
  0 siblings, 2 replies; 58+ messages in thread
From: Sean Christopherson @ 2022-09-28 20:41 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Emanuele Giuseppe Esposito, David Hildenbrand, Maxim Levitsky,
	kvm, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, linux-kernel, Like Xu

On Wed, Sep 28, 2022, Paolo Bonzini wrote:
> On 9/28/22 17:58, Sean Christopherson wrote:
> > I don't disagree that the memslots API is lacking, but IMO that is somewhat
> > orthogonal to fixing KVM x86's "code fetch to MMIO" mess.  Such a massive new API
> > should be viewed and prioritized as a new feature, not as a bug fix, e.g. I'd
> > like to have the luxury of being able to explore ideas beyond "let userspace
> > batch memslot updates", and I really don't want to feel pressured to get this
> > code reviewed and merge.
> 
> I absolutely agree that this is not a bugfix.  Most new features for KVM can
> be seen as bug fixes if you squint hard enough, but they're still features.

I guess I'm complaining that there isn't sufficient justification for this new
feature.  The cover letter provides a bug that would be fixed by having batched
updates, but as above, that's really due to deficiencies in a different KVM ABI.

Beyond that, there's no explanation of why this exact API is necessary, i.e. there
are no requirements given.

  - Why can't this be solved in userspace?

  - Is performance a concern?  I.e. are updates that need to be batched going to
    be high frequency operations?

  - What operations does userspace truly need?  E.g. if the only use case is to
    split/truncate/hole punch an existing memslot, can KVM instead provide a
    memslot flag and exit reason that allows kicking vCPUs to userspace if the
    memslot is accessed?  E.g. KVM_MEM_DISABLED that acts like an invalid memslot,
    but KVM exits with a dedicated exit reason instead of generating MMIO semantics.

  - What precisely needs to be atomic?  If updates only need to be "atomic" for
    an address space, does the API allow mixing non-SMM and SMM memslots?

  - When is KVM required to invalidate and flush?  E.g. if a memslot is deleted
    and recreated with a different HVA, how does KVM ensure that there are no
    outstanding references to the old HVA without introducing non-deterministic
    behavior.  The current API works by forcing userspace to fully delete the
    memslot, i.e. KVM can ensure all references are gone in all TLBs before
    allowing userspace to create new, conflicting entries.  I don't see how this
    can work with batched updates.  The update needs to be "atomic", i.e. vCPUs
    must never see an invalid/deleted memslot, but if the memslot is writable,
    how does KVM prevent some writes from hitting the old HVA and some from hitting
    the new HVA without a quiescent period?

  - If a memslot is truncated while dirty logging is enabled, what happens to
    the bitmap?  Is it preserved?  Dropped?

Again, I completely agree that the current memslots API is far from perfect, but
I'm not convinced that simply extending the existing API to batch updates is the
best solution from a KVM perspective.

> > E.g. why do a batch update and not provide KVM_SET_ALL_USER_MEMORY_REGIONS to
> > do wholesale replacement?  That seems like it would be vastly simpler to handle
> > on KVM's end.  Or maybe there's a solution in the opposite direction, e.g. an
> > API that allows 1->N or N->1 conversions but not arbitrary batching.
> 
> Wholesale replacement was my first idea when I looked at the issue, I think
> at the end of 2020.  I never got to a full implementation, but my impression
> was that allocating/deallocating dirty bitmaps, rmaps etc. would make it any
> easier than arbitrary batch updates.

It's not obvious to me that the memslot metadata is going to be easy to handle
regardless of what we do.  E.g. I'm pretty sure that batching updates will "corrupt"
the dirty bitmap if a hole is punched in a memslot that's being dirty logged.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH 0/9] kvm: implement atomic memslot updates
  2022-09-28 20:41                             ` Sean Christopherson
@ 2022-09-29  8:05                               ` Emanuele Giuseppe Esposito
  2022-09-29  8:24                                 ` David Hildenbrand
  2022-09-29 15:18                                 ` Sean Christopherson
  2022-09-29 15:28                               ` Paolo Bonzini
  1 sibling, 2 replies; 58+ messages in thread
From: Emanuele Giuseppe Esposito @ 2022-09-29  8:05 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: David Hildenbrand, Maxim Levitsky, kvm, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
	linux-kernel, Like Xu



On 28/09/2022 at 22:41, Sean Christopherson wrote:
> On Wed, Sep 28, 2022, Paolo Bonzini wrote:
>> On 9/28/22 17:58, Sean Christopherson wrote:
>>> I don't disagree that the memslots API is lacking, but IMO that is somewhat
>>> orthogonal to fixing KVM x86's "code fetch to MMIO" mess.  Such a massive new API
>>> should be viewed and prioritized as a new feature, not as a bug fix, e.g. I'd
>>> like to have the luxury of being able to explore ideas beyond "let userspace
>>> batch memslot updates", and I really don't want to feel pressured to get this
>>> code reviewed and merge.
>>
>> I absolutely agree that this is not a bugfix.  Most new features for KVM can
>> be seen as bug fixes if you squint hard enough, but they're still features.
> 
> I guess I'm complaining that there isn't sufficient justification for this new
> feature.  The cover letter provides a bug that would be fixed by having batched
> updates, but as above, that's really due to deficiencies in a different KVM ABI.
> 
> Beyond that, there's no explanation of why this exact API is necessary, i.e. there
> are no requirements given.
> 
>   - Why can't this be solved in userspace?

Because this would provide the "feature" only to QEMU, leaving every
other hypervisor to implement its own.

In addition (maybe you already answered this question but I couldn't
find an answer in the email thread), does it make sense to stop all
vcpus for a couple of memslot updates? What if we have 100 vCPUs?

> 
>   - Is performance a concern?  I.e. are updates that need to be batched going to
>     be high frequency operations?

Currently they are limited to run only at boot. In an unmodified
KVM/QEMU build, however, I count 86 memslot updates done at boot with

./qemu-system-x86_64 --overcommit cpu-pm=on --smp $v --accel kvm
--display none

> 
>   - What operations does userspace truly need?  E.g. if the only use case is to
>     split/truncate/hole punch an existing memslot, can KVM instead provide a
>     memslot flag and exit reason that allows kicking vCPUs to userspace if the
>     memslot is accessed?  E.g. KVM_MEM_DISABLED that acts like an invalid memslot,
>     but KVM exists with a dedicated exit reason instead of generating MMIO semantics.

Could be an alternative solution I guess?
> 
>   - What precisely needs to be atomic?  If updates only need to be "atomic" for
>     an address space, does the API allowing mixing non-SMM and SMM memslots?

Does QEMU pass a list of mixed non-SMM and SMM memslots?
Btw, is there a separate list for SMM memslots in KVM?
If not, and they are all thrown together, then for simplicity we make all
updates atomic.

> 
>   - When is KVM required to invalidate and flush?  E.g. if a memslot is deleted
>     and recreated with a different HVA, how does KVM ensure that there are no
>     outstanding references to the old HVA without introducing non-determinstic
>     behavior.  The current API works by forcing userspace to fully delete the
>     memslot, i.e. KVM can ensure all references are gone in all TLBs before
>     allowing userspace to create new, conflicting entries.  I don't see how this
>     can work with batched updates.  The update needs to be "atomic", i.e. vCPUs
>     must never see an invalid/deleted memslot, but if the memslot is writable,
>     how does KVM prevent some writes from hitting the old HVA and some from hitting
>     the new HVA without a quiescent period?

Sorry, this might be my fault in how I provided definitions, even though
I think I gave plenty of examples. Delete/move operations are not really
"atomic" in the sense that the slot disappears immediately.

The slot(s) is/are first "atomically" invalidated, allowing the guest to
retry the page fault and preparing for the cleanup you mention above,
and then are "atomically" deleted.
The behavior is the same, just instead of doing
invalidate memslot X
swap()
delete memslot X
swap()
invalidate memslot Y
swap()
delete memslot Y
swap()
...

I would do:

invalidate
invalidate
invalidate
swap()
delete
delete
delete
swap()
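
Roughly, in the ioctl handler this becomes (simplified pseudocode, the
helper names below are only indicative of what the series does):

	/* pass 1: replace every to-be-deleted/moved slot with an invalid copy */
	for (i = 0; i < nent; i++)
		if (batch[i].change == KVM_MR_DELETE || batch[i].change == KVM_MR_MOVE)
			mark_invalid(&batch[i]);

	swap_active_and_inactive_lists();	/* one swap publishes all invalidations */

	/* pass 2: apply the actual CREATE/DELETE/MOVE/FLAGS_ONLY changes */
	for (i = 0; i < nent; i++)
		apply_change(&batch[i]);

	swap_active_and_inactive_lists();	/* one swap publishes the final state */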
> 
>   - If a memslot is truncated while dirty logging is enabled, what happens to
>     the bitmap?  Is it preserved?  Dropped?

Can you explain what you mean by "truncated memslot"?
Regarding the bitmap, currently QEMU should (probably wrongly) update it
before even committing the changes to KVM. But I remember upstream
someone pointed out that this could be solved later.

> 
> Again, I completely agree that the current memslots API is far from perfect, but
> I'm not convinced that simply extending the existing API to batch updates is the
> best solution from a KVM perspective.
> 
>>> E.g. why do a batch update and not provide KVM_SET_ALL_USER_MEMORY_REGIONS to
>>> do wholesale replacement?  That seems like it would be vastly simpler to handle
>>> on KVM's end.  Or maybe there's a solution in the opposite direction, e.g. an
>>> API that allows 1->N or N->1 conversions but not arbitrary batching.
>>
>> Wholesale replacement was my first idea when I looked at the issue, I think
>> at the end of 2020.  I never got to a full implementation, but my impression
>> was that allocating/deallocating dirty bitmaps, rmaps etc. would make it any
>> easier than arbitrary batch updates.
> 
> It's not obvious to me that the memslot metadata is going to be easy to handle
> regardless of what we do.  E.g. I'm pretty sure that batching updates will "corrupt"
> the dirty bitmap if a hole is punched in a memslot that's being dirty logged.
> 

Could you provide an example?
I mean, I am not an expert, but to me it looks like I preserved the same
old functionality both here and in QEMU. I just batched and delayed
some operations, which in this case should cause no harm.

Thank you,
Emanuele


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH 0/9] kvm: implement atomic memslot updates
  2022-09-29  8:05                               ` Emanuele Giuseppe Esposito
@ 2022-09-29  8:24                                 ` David Hildenbrand
  2022-09-29 15:18                                 ` Sean Christopherson
  1 sibling, 0 replies; 58+ messages in thread
From: David Hildenbrand @ 2022-09-29  8:24 UTC (permalink / raw)
  To: Emanuele Giuseppe Esposito, Sean Christopherson, Paolo Bonzini
  Cc: Maxim Levitsky, kvm, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin, linux-kernel, Like Xu

On 29.09.22 10:05, Emanuele Giuseppe Esposito wrote:
> 
> 
> Am 28/09/2022 um 22:41 schrieb Sean Christopherson:
>> On Wed, Sep 28, 2022, Paolo Bonzini wrote:
>>> On 9/28/22 17:58, Sean Christopherson wrote:
>>>> I don't disagree that the memslots API is lacking, but IMO that is somewhat
>>>> orthogonal to fixing KVM x86's "code fetch to MMIO" mess.  Such a massive new API
>>>> should be viewed and prioritized as a new feature, not as a bug fix, e.g. I'd
>>>> like to have the luxury of being able to explore ideas beyond "let userspace
>>>> batch memslot updates", and I really don't want to feel pressured to get this
>>>> code reviewed and merge.
>>>
>>> I absolutely agree that this is not a bugfix.  Most new features for KVM can
>>> be seen as bug fixes if you squint hard enough, but they're still features.
>>
>> I guess I'm complaining that there isn't sufficient justification for this new
>> feature.  The cover letter provides a bug that would be fixed by having batched
>> updates, but as above, that's really due to deficiencies in a different KVM ABI.
>>
>> Beyond that, there's no explanation of why this exact API is necessary, i.e. there
>> are no requirements given.
>>
>>    - Why can't this be solved in userspace?
> 
> Because this would provide the "feature" only to QEMU, leaving each
> other hypervisor to implement its own.
> 
> In addition (maybe you already answered this question but I couldn't
> find an answer in the email thread), does it make sense to stop all
> vcpus for a couple of memslot update? What if we have 100 cpus?
> 
>>
>>    - Is performance a concern?  I.e. are updates that need to be batched going to
>>      be high frequency operations?
> 
> Currently they are limited to run only at boot. In an unmodified
> KVM/QEMU build, however, I count 86 memslot updates done at boot with
> 
> ./qemu-system-x86_64 --overcommit cpu-pm=on --smp $v --accel kvm
> --display none

I *think* there are only ~3 problematic ones (split/resize), where we 
temporarily delete something we will re-add. At least that's what I 
remember from working on my prototype.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH 0/9] kvm: implement atomic memslot updates
  2022-09-29  8:05                               ` Emanuele Giuseppe Esposito
  2022-09-29  8:24                                 ` David Hildenbrand
@ 2022-09-29 15:18                                 ` Sean Christopherson
  2022-09-29 15:41                                   ` Paolo Bonzini
  1 sibling, 1 reply; 58+ messages in thread
From: Sean Christopherson @ 2022-09-29 15:18 UTC (permalink / raw)
  To: Emanuele Giuseppe Esposito
  Cc: Paolo Bonzini, David Hildenbrand, Maxim Levitsky, kvm,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, linux-kernel, Like Xu

On Thu, Sep 29, 2022, Emanuele Giuseppe Esposito wrote:
> 
> Am 28/09/2022 um 22:41 schrieb Sean Christopherson:
> > Beyond that, there's no explanation of why this exact API is necessary, i.e. there
> > are no requirements given.
> > 
> >   - Why can't this be solved in userspace?
> 
> Because this would provide the "feature" only to QEMU, leaving each
> other hypervisor to implement its own.

But there's no evidence that any other VMM actually needs this feature.

> >   - When is KVM required to invalidate and flush?  E.g. if a memslot is deleted
> >     and recreated with a different HVA, how does KVM ensure that there are no
> >     outstanding references to the old HVA without introducing non-determinstic
> >     behavior.  The current API works by forcing userspace to fully delete the
> >     memslot, i.e. KVM can ensure all references are gone in all TLBs before
> >     allowing userspace to create new, conflicting entries.  I don't see how this
> >     can work with batched updates.  The update needs to be "atomic", i.e. vCPUs
> >     must never see an invalid/deleted memslot, but if the memslot is writable,
> >     how does KVM prevent some writes from hitting the old HVA and some from hitting
> >     the new HVA without a quiescent period?
> 
> Sorry this might be my fault in providing definitions, even though I
> think I made plenty of examples. Delete/move operations are not really
> "atomic" in the sense that the slot disappears immediately.
> 
> The slot(s) is/are first "atomically" invalidated, allowing the guest to
> retry the page fault

I completely forgot that KVM x86 now retries on KVM_MEMSLOT_INVALID[*].  That is
somewhat arbitrary behavior, e.g. if x86 didn't do MMIO SPTE caching it wouldn't
be "necessary" and x86 would still treat INVALID as MMIO.  It's also x86 specific,
no other architecture retries on invalid memslots, i.e. relying on that behavior
to provide "atomicity" won't work without updating all other architectures.

That's obviously doable, but since these updates aren't actually atomic, I don't
see any meaningful benefit over adding e.g. KVM_MEM_DISABLED.

[*] e0c378684b65 ("KVM: x86/mmu: Retry page faults that hit an invalid memslot")

> >   - If a memslot is truncated while dirty logging is enabled, what happens to
> >     the bitmap?  Is it preserved?  Dropped?
> 
> Can you explain what you mean with "truncated memslot"?

Shrink the size of the memslot.

> Regarding the bitmap, currently QEMU should (probably wrongly) update it
> before even committing the changes to KVM. But I remember upstream
> someone pointed that this could be solved later.

These details can't be punted, as they affect the ABI.  My interpretation of
"atomic" memslot updates was that KVM would truly honor the promise of atomic
updates, e.g. that a DELETE=>CREATE sequence would effectively allow truncating
a memslot without losing any data.  Those details aren't really something that
can be punted down the road to be fixed at a later point because they very much
matter from an ABI perspective.

> > Again, I completely agree that the current memslots API is far from perfect, but
> > I'm not convinced that simply extending the existing API to batch updates is the
> > best solution from a KVM perspective.
> > 
> >>> E.g. why do a batch update and not provide KVM_SET_ALL_USER_MEMORY_REGIONS to
> >>> do wholesale replacement?  That seems like it would be vastly simpler to handle
> >>> on KVM's end.  Or maybe there's a solution in the opposite direction, e.g. an
> >>> API that allows 1->N or N->1 conversions but not arbitrary batching.
> >>
> >> Wholesale replacement was my first idea when I looked at the issue, I think
> >> at the end of 2020.  I never got to a full implementation, but my impression
> >> was that allocating/deallocating dirty bitmaps, rmaps etc. would make it any
> >> easier than arbitrary batch updates.
> > 
> > It's not obvious to me that the memslot metadata is going to be easy to handle
> > regardless of what we do.  E.g. I'm pretty sure that batching updates will "corrupt"
> > the dirty bitmap if a hole is punched in a memslot that's being dirty logged.
> > 
> 
> Could you provide an example?
> I mean I am not an expert but to me it looks like I preserved the same
> old functionalities both here and in QEMU. I just batched and delayed
> some operations, which in this case should cause no harm.

As above, my confusion is due to calling these updates "atomic".  The new API is
really just "allow batching memslot updates and subtly rely on restarting page
faults on KVM_MEMSLOT_INVALID".

IMO, KVM_MEM_DISABLED or similar is the way to go.  I.e. formalize the "restart
page faults" semantics without taking on the complexity of batched updates.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH 0/9] kvm: implement atomic memslot updates
  2022-09-28 20:41                             ` Sean Christopherson
  2022-09-29  8:05                               ` Emanuele Giuseppe Esposito
@ 2022-09-29 15:28                               ` Paolo Bonzini
  2022-09-29 15:40                                 ` Maxim Levitsky
                                                   ` (2 more replies)
  1 sibling, 3 replies; 58+ messages in thread
From: Paolo Bonzini @ 2022-09-29 15:28 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Emanuele Giuseppe Esposito, David Hildenbrand, Maxim Levitsky,
	kvm, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, linux-kernel, Like Xu


On 9/28/22 22:41, Sean Christopherson wrote:
> On Wed, Sep 28, 2022, Paolo Bonzini wrote:
>> On 9/28/22 17:58, Sean Christopherson wrote:
>>> I don't disagree that the memslots API is lacking, but IMO that is somewhat
>>> orthogonal to fixing KVM x86's "code fetch to MMIO" mess.  Such a massive new API
>>> should be viewed and prioritized as a new feature, not as a bug fix, e.g. I'd
>>> like to have the luxury of being able to explore ideas beyond "let userspace
>>> batch memslot updates", and I really don't want to feel pressured to get this
>>> code reviewed and merge.
>>
>> I absolutely agree that this is not a bugfix.  Most new features for KVM can
>> be seen as bug fixes if you squint hard enough, but they're still features.
> 
> I guess I'm complaining that there isn't sufficient justification for this new
> feature.  The cover letter provides a bug that would be fixed by having batched
> updates, but as above, that's really due to deficiencies in a different KVM ABI.

I disagree.  Failure to fetch should be fixed but is otherwise a red 
herring.  It's the whole memslot API (including dirty logging) that is a 
mess.

If you think we should overhaul it even more than just providing atomic 
batched updates, that's fine.  But still, the impossibility to perform 
atomic updates in batches *is* a suboptimal part of the KVM API.

>    - Why can't this be solved in userspace?

I don't think *can't* is the right word.  If the metric of choice was 
"what can be solved in userspace", we'd all be using microkernels.  The 
question is why userspace would be a better place to solve it.

The only reason to do it in userspace would be if failure to fetch is 
something that is interesting to userspace, other than between two 
KVM_SET_USER_MEMORY_REGION.  Unless you provide an API to pass failures 
to fetch down to userspace, the locking in userspace is going to be 
inferior, because it would have to be unconditional.  This means worse 
performance and more complication, not to mention having to do it N 
times instead of 1 for N implementations.
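
(Concretely, the userspace-only version is just "stop the world" -- using
QEMU function names purely as an example, every VMM would have to wrap
every split/merge like this, whether or not any vCPU ever touches the
affected range:

	pause_all_vcpus();
	/* delete + recreate the affected memslots via KVM_SET_USER_MEMORY_REGION */
	resume_all_vcpus();

and that unconditional cost is exactly what I am referring to.)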

Not forcing userspace to do "tricks" is in my opinion a strong part of 
deciding whether an API belongs in KVM.

>    - What operations does userspace truly need?  E.g. if the only use case is to
>      split/truncate/hole punch an existing memslot, can KVM instead provide a
>      memslot flag and exit reason that allows kicking vCPUs to userspace if the
>      memslot is accessed?  E.g. KVM_MEM_DISABLED that acts like an invalid memslot,
>      but KVM exists with a dedicated exit reason instead of generating MMIO semantics.

The main cases are:

- for the boot case, splitting and merging existing memslots.  QEMU 
likes to merge adjacent memory regions into a single memslot, so if 
something goes from read-write to read-only it has to be split and vice 
versa.  I guess a "don't merge this memory region" flag would be the 
less hideous way to solve it in userspace.

- however, there is also the case of resizing an existing memslot which 
is what David would like to have for virtio-mem.  This is not really 
fixable because part of the appeal of virtio-mem is to have a single 
huge memslot instead of many smaller ones, in order to reduce the 
granularity of add/remove (David, correct me if I'm wrong).

(The latter _would_ be needed by other VMMs).

> If updates only need to be "atomic" for an address space, does the API allowing
> mixing non-SMM and SMM memslots?

I agree that the address space should be moved out of the single entries 
and into the header if we follow through with this approach.

> The update needs to be "atomic", i.e. vCPUs
> must never see an invalid/deleted memslot, but if the memslot is writable,
> how does KVM prevent some writes from hitting the old HVA and some from hitting
> the new HVA without a quiescent period?

(Heh, and I forgot likewise that non-x86 does not retry on 
KVM_MEMSLOT_INVALID.  Yes, that would be treated as a bug on other 
architectures).

>> Wholesale replacement was my first idea when I looked at the issue, I think
>> at the end of 2020.  I never got to a full implementation, but my impression
>> was that allocating/deallocating dirty bitmaps, rmaps etc. would make it any
>> easier than arbitrary batch updates.
> 
> It's not obvious to me that the memslot metadata is going to be easy to handle
> regardless of what we do.  E.g. I'm pretty sure that batching updates will "corrupt"
> the dirty bitmap if a hole is punched in a memslot that's being dirty logged.

Indeed; I would have thought that it is clear with the batch updates API 
(which requires the update to be split into delete and insert), but 
apparently it's not and it's by no means optimal.

Paolo


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH 0/9] kvm: implement atomic memslot updates
  2022-09-29 15:28                               ` Paolo Bonzini
@ 2022-09-29 15:40                                 ` Maxim Levitsky
  2022-09-29 16:00                                 ` David Hildenbrand
  2022-09-29 21:39                                 ` Sean Christopherson
  2 siblings, 0 replies; 58+ messages in thread
From: Maxim Levitsky @ 2022-09-29 15:40 UTC (permalink / raw)
  To: Paolo Bonzini, Sean Christopherson
  Cc: Emanuele Giuseppe Esposito, David Hildenbrand, kvm,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, linux-kernel, Like Xu

On Thu, 2022-09-29 at 17:28 +0200, Paolo Bonzini wrote:
> On 9/28/22 22:41, Sean Christopherson wrote:
> > On Wed, Sep 28, 2022, Paolo Bonzini wrote:
> > > On 9/28/22 17:58, Sean Christopherson wrote:
> > > > I don't disagree that the memslots API is lacking, but IMO that is somewhat
> > > > orthogonal to fixing KVM x86's "code fetch to MMIO" mess.  Such a massive new API
> > > > should be viewed and prioritized as a new feature, not as a bug fix, e.g. I'd
> > > > like to have the luxury of being able to explore ideas beyond "let userspace
> > > > batch memslot updates", and I really don't want to feel pressured to get this
> > > > code reviewed and merge.
> > > 
> > > I absolutely agree that this is not a bugfix.  Most new features for KVM can
> > > be seen as bug fixes if you squint hard enough, but they're still features.
> > 
> > I guess I'm complaining that there isn't sufficient justification for this new
> > feature.  The cover letter provides a bug that would be fixed by having batched
> > updates, but as above, that's really due to deficiencies in a different KVM ABI.
> 
> I disagree.  Failure to fetch should be fixed but is otherwise a red 
> herring.  It's the whole memslot API (including dirty logging) that is a 
> mess.
> 
> If you think we should overhaul it even more than just providing atomic 
> batched updates, that's fine.  But still, the impossibility to perform 
> atomic updates in batches *is* a suboptimal part of the KVM API.
> 
> >    - Why can't this be solved in userspace?
> 
> I don't think *can't* is the right word.  If the metric of choice was 
> "what can be solved in userspace", we'd all be using microkernels.  The 
> question is why userspace would be a better place to solve it.
> 
> The only reason to do it in userspace would be if failure to fetch is 
> something that is interesting to userspace, other than between two 
> KVM_SET_USER_MEMORY_REGION.  Unless you provide an API to pass failures 
> to fetch down to userspace, the locking in userspace is going to be 
> inferior, because it would have to be unconditional.  This means worse 
> performance and more complication, not to mention having to do it N 
> times instead of 1 for N implementations.
> 
> Not forcing userspace to do "tricks" is in my opinion a strong part of 
> deciding whether an API belongs in KVM.
> 
> >    - What operations does userspace truly need?  E.g. if the only use case is to
> >      split/truncate/hole punch an existing memslot, can KVM instead provide a
> >      memslot flag and exit reason that allows kicking vCPUs to userspace if the
> >      memslot is accessed?  E.g. KVM_MEM_DISABLED that acts like an invalid memslot,
> >      but KVM exists with a dedicated exit reason instead of generating MMIO semantics.
> 
> The main cases are:
> 
> - for the boot case, splitting and merging existing memslots.  QEMU 
> likes to merge adjacent memory regions into a single memslot, so if 
> something goes from read-write to read-only it has to be split and vice 
> versa.  I guess a "don't merge this memory region" flag would be the 
> less hideous way to solve it in userspace.
> 
> - however, there is also the case of resizing an existing memslot which 
> is what David would like to have for virtio-mem.  This is not really 
> fixable because part of the appeal of virtio-mem is to have a single 
> huge memslot instead of many smaller ones, in order to reduce the 
> granularity of add/remove (David, correct me if I'm wrong).
> 
> (The latter _would_ be needed by other VMMs).
> 
> > If updates only need to be "atomic" for an address space, does the API allowing
> > mixing non-SMM and SMM memslots?
> 
> I agree that the address space should be moved out of the single entries 
> and into the header if we follow through with this approach.
> 
> > The update needs to be "atomic", i.e. vCPUs
> > must never see an invalid/deleted memslot, but if the memslot is writable,
> > how does KVM prevent some writes from hitting the old HVA and some from hitting
> > the new HVA without a quiescent period?
> 
> (Heh, and I forgot likewise that non-x86 does not retry on 
> KVM_MEMSLOT_INVALID.  Yes, that would be treated as a bug on other 
> architectures).
> 
> > > Wholesale replacement was my first idea when I looked at the issue, I think
> > > at the end of 2020.  I never got to a full implementation, but my impression
> > > was that allocating/deallocating dirty bitmaps, rmaps etc. would make it any
> > > easier than arbitrary batch updates.
> > 
> > It's not obvious to me that the memslot metadata is going to be easy to handle
> > regardless of what we do.  E.g. I'm pretty sure that batching updates will "corrupt"
> > the dirty bitmap if a hole is punched in a memslot that's being dirty logged.
> 
> Indeed; I would have thought that it is clear with the batch updates API 
> (which requires the update to be split into delete and insert), but 
> apparently it's not and it's by no means optimal.


I 100% agree with everything Paolo said.


Best regards,
	Maxim Levitsky
> 
> Paolo
> 



^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH 0/9] kvm: implement atomic memslot updates
  2022-09-29 15:18                                 ` Sean Christopherson
@ 2022-09-29 15:41                                   ` Paolo Bonzini
  0 siblings, 0 replies; 58+ messages in thread
From: Paolo Bonzini @ 2022-09-29 15:41 UTC (permalink / raw)
  To: Sean Christopherson, Emanuele Giuseppe Esposito
  Cc: David Hildenbrand, Maxim Levitsky, kvm, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
	linux-kernel, Like Xu

On 9/29/22 17:18, Sean Christopherson wrote:
> IMO, KVM_MEM_DISABLED or similar is the way to go.  I.e. formalize the "restart
> page faults" semantics without taking on the complexity of batched updates.

If userspace has to collaborate, KVM_MEM_DISABLED (or KVM_MEM_USER_EXIT 
would be a better name) is not needed either except as an optimization; 
you can just kick all CPUs unconditionally.

And in fact KVM_MEM_DISABLED is not particularly easy to implement 
either; in order to allow split/merge, it should be possible for a new 
memslot to replace multiple disabled memslots, and to only partially 
overlap the first/last disabled memslot it replaces.

None of this is allowed for other memslots, so exactly the same metadata 
complications exist as for other options such as wholesale replacement 
or batched updates.  The only semantics with a sane implementation would 
be to destroy the dirty bitmap of disabled memslots when they are 
replaced.  At least it would be possible for userspace to set 
KVM_MEM_DISABLED, issue KVM_GET_DIRTY_LOG and then finally create the 
new memslot.  That would be _correct_, but still not very appealing.
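
To make the ordering concrete (KVM_MEM_DISABLED is of course hypothetical
here, and error handling and struct setup are omitted):

	region.flags |= KVM_MEM_DISABLED;
	ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region);	/* vCPUs now exit instead of using the slot */

	log.slot = region.slot;
	log.dirty_bitmap = bitmap;
	ioctl(vm_fd, KVM_GET_DIRTY_LOG, &log);			/* salvage the old dirty bitmap */

	region.memory_size = 0;
	ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region);	/* delete the disabled slot */
	/* ... finally create the replacement slot(s) ... */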

I don't exclude suffering from tunnel vision, but batched updates to me 
still seem to be the least bad option.

Paolo


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH 0/9] kvm: implement atomic memslot updates
  2022-09-29 15:28                               ` Paolo Bonzini
  2022-09-29 15:40                                 ` Maxim Levitsky
@ 2022-09-29 16:00                                 ` David Hildenbrand
  2022-09-29 21:39                                 ` Sean Christopherson
  2 siblings, 0 replies; 58+ messages in thread
From: David Hildenbrand @ 2022-09-29 16:00 UTC (permalink / raw)
  To: Paolo Bonzini, Sean Christopherson
  Cc: Emanuele Giuseppe Esposito, Maxim Levitsky, kvm,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, linux-kernel, Like Xu

> The main cases are:
> 
> - for the boot case, splitting and merging existing memslots.  QEMU
> likes to merge adjacent memory regions into a single memslot, so if
> something goes from read-write to read-only it has to be split and vice
> versa.  I guess a "don't merge this memory region" flag would be the
> less hideous way to solve it in userspace.
> 
> - however, there is also the case of resizing an existing memslot which
> is what David would like to have for virtio-mem.  This is not really
> fixable because part of the appeal of virtio-mem is to have a single
> huge memslot instead of many smaller ones, in order to reduce the
> granularity of add/remove (David, correct me if I'm wrong).

Yes, the most important case I am working on in that regard is reducing 
the memslot size/overhead when only exposing comparatively little memory 
towards a VM using virtio-mem (say, a virtio-mem device that could grow 
to 1 TiB, but we initially only expose 1 GiB to the VM).

One approach I prototyped in the past (where my RFC for atomic updates 
came into play because I ran into this issue) to achieve that was 
dynamically growing (and eventually shrinking) a single memslot on demand.

An alternative [1] uses multiple individual memslots, and exposes them 
on demand. There, I have to make sure that QEMU won't merge adjacent 
memslots into a bigger one -- because otherwise, we'd need atomic 
memslot updates again, which KVM, vhost-user, ... don't support. But in 
the future, I think we want to have that: if there is no fragmentation, 
we should only have a single large memslot and all memslot consumers 
should be able to cope with atomic updates.


So in any case, I will have good use for atomic memslot updates. Just 
like other hypervisors that (will) implement virtio-mem or something 
comparable :)

[1] https://lore.kernel.org/all/20211013103330.26869-1-david@redhat.com/T/#u

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH 0/9] kvm: implement atomic memslot updates
  2022-09-29 15:28                               ` Paolo Bonzini
  2022-09-29 15:40                                 ` Maxim Levitsky
  2022-09-29 16:00                                 ` David Hildenbrand
@ 2022-09-29 21:39                                 ` Sean Christopherson
  2022-10-13  7:43                                   ` Emanuele Giuseppe Esposito
  2 siblings, 1 reply; 58+ messages in thread
From: Sean Christopherson @ 2022-09-29 21:39 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Emanuele Giuseppe Esposito, David Hildenbrand, Maxim Levitsky,
	kvm, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, linux-kernel, Like Xu

On Thu, Sep 29, 2022, Paolo Bonzini wrote:
> 
> On 9/28/22 22:41, Sean Christopherson wrote:
> If you think we should overhaul it even more than just providing atomic
> batched updates, that's fine.

Or maybe solve the problem(s) with more precision?  What I don't like about the
batching is that it's not "atomic", and AFAICT doesn't have line of sight to truly
becoming atomic.  When I hear "atomic" updates, I think "all transactions are
commited at once".  That is not what this proposed API does, and due to the TLB
flushing requirements, I don't see how it can be implemented.

> But still, the impossibility to perform
> atomic updates in batches *is* a suboptimal part of the KVM API.

I don't disagree, but I honestly don't see the difference between "batching" and
providing a KVM_MEM_USER_EXIT flag.  The batch API will provide better performance,
but I would be surprised if it's significant enough to matter.  I'm not saying
that KVM_MEM_USER_EXIT provides novel functionality, e.g. I think we're in
agreement that handling memslot metadata will suck regardless.  My point is that
it's simpler to implement in KVM, much easier to document, and for all intents and
purposes provides the same desired functionality: vCPUs that access in-flux memslots
are stalled and so userspace doesn't have to pause and rendezvous all vCPUs to
provide "atomic" updates.

> The main cases are:
> 
> - for the boot case, splitting and merging existing memslots.  QEMU likes to
> merge adjacent memory regions into a single memslot, so if something goes
> from read-write to read-only it has to be split and vice versa.  I guess a
> "don't merge this memory region" flag would be the less hideous way to solve
> it in userspace.
> 
> - however, there is also the case of resizing an existing memslot which is
> what David would like to have for virtio-mem.  This is not really fixable
> because part of the appeal of virtio-mem is to have a single huge memslot
> instead of many smaller ones, in order to reduce the granularity of
> add/remove (David, correct me if I'm wrong).
> 
> (The latter _would_ be needed by other VMMs).

If we really want to provide a better experience for userspace, why not provide
more primitives to address those specific use cases?  E.g. fix KVM's historic wart
of disallowing toggling of KVM_MEM_READONLY, and then add one or more ioctls to:

  - Merge 2+ memory regions that are contiguous in GPA and HVA
  - Split a memory region into 2+ memory regions
  - Truncate a memory region
  - Grow a memory region
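
Purely as a strawman for the last two (nothing like this exists today,
the layout is made up):

	/* hypothetical payload for a KVM_RESIZE_USER_MEMORY_REGION ioctl */
	struct kvm_resize_memory_region {
		__u32 slot;		/* existing slot to grow or truncate */
		__u32 flags;
		__u64 new_memory_size;	/* bytes; larger grows, smaller truncates */
		__u64 new_userspace_addr; /* must cover the new size, or 0 to keep the HVA */
	};

Userspace asks for exactly the transition it wants, and KVM never has to
see a delete+recreate for what is really a resize.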

That would probably require more KVM code overall, but each operation would be more
tightly bounded and thus simpler to define.  And I think more precise APIs would
provide other benefits, e.g. growing a region wouldn't require first deleting the
current region, and so could avoid zapping SPTEs and destroying metadata.  Merge,
split, and truncate would have similar benefits, though they might be more
difficult to realize in practice.

What I really don't like about the "batch" API, aside from the "atomic" name, is
that it's too open ended and creates conflicts.  E.g. to allow "atomic" merging
after converting from RO=>RW, KVM needs to allow DELETE=>DELETE=>CREATE, and that
implies that each operation in the batch operates on the working state created by
the previous operations.

But if userspace provides a batch that does DELETE=>CREATE=>DELETE, naively hoisting
all "invalidations" to the front half of the "atomic" operations breaks the
ordering.  And there's a ridiculous amount of potential weirdness with MOVE=>MOVE
operations.  

KVM could solve those issues by disallowing MOVE and disallowing DELETE after
CREATE or FLAGS_ONLY, but then userspace is effectively restricted to operating
on a single contiguous chunk of memslots.  At that point, it seems that providing
primitives to do each macro operation is an even better experience for userspace.

In other words, make memslots a bit more CISC ;-)

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH 0/9] kvm: implement atomic memslot updates
  2022-09-29 21:39                                 ` Sean Christopherson
@ 2022-10-13  7:43                                   ` Emanuele Giuseppe Esposito
  2022-10-13  8:44                                     ` David Hildenbrand
  0 siblings, 1 reply; 58+ messages in thread
From: Emanuele Giuseppe Esposito @ 2022-10-13  7:43 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, David Hildenbrand
  Cc: Maxim Levitsky, kvm, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin, linux-kernel, Like Xu



On 29/09/2022 at 23:39, Sean Christopherson wrote:
> If we really want to provide a better experience for userspace, why not provide
> more primitives to address those specific use cases?  E.g. fix KVM's historic wart
> of disallowing toggling of KVM_MEM_READONLY, and then add one or more ioctls to:
> 
>   - Merge 2+ memory regions that are contiguous in GPA and HVA
>   - Split a memory region into 2+ memory regions
>   - Truncate a memory region
>   - Grow a memory region

I looked again at the code and specifically the use case that triggers
the crash in bugzilla. I *think* (correct me if I am wrong) that the
only operation that QEMU performs right now is "grow/shrink".

So *if* we want to go this way, we could start with a simple grow/shrink
API.

We need to consider, though, that this could bring additional
complexity to QEMU. Currently, DELETE+CREATE (grow/shrink) is not
performed one after the other (in my innocent fantasy I was expecting to
find 2 subsequent ioctls in the code), but in QEMU's
address_space_set_flatview(), which seems to first remove all regions
and then add them when changing flatviews.

address_space_update_topology_pass(as, old_view2, new_view, adding=false);
address_space_update_topology_pass(as, old_view2, new_view, adding=true);

I don't think we can change this, as other listeners also rely on such
ordering, but we can still batch all callback requests like I currently
do and process them in kvm_commit(), figuring there which operation is
which.

In other words, we would have something like:

region_del() --> DELETE memslot X -> add it to the queue of operations
region_del() --> DELETE memslot Y -> add it to the queue of operations
region_add() --> CREATE memslot X (size doubled) -> add it to the queue
of operations
region_add() --> CREATE memslot Y (size halved) -> add it to the queue
of operations
...
commit() --> scan QUEUE and figure what to do -> GROW X (+size), SHRINK
Y (-size) -> issue 2 ioctls, GROW and SHRINK.
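
On the QEMU side the commit step would then be little more than pairing
each queued DELETE with the CREATE that covers the same range (very rough
sketch, all helper names are made up):

	QTAILQ_FOREACH(del, &pending_deletes, next) {
		add = find_matching_create(&pending_creates, del);
		if (add) {
			/* same slot, new size: one GROW or SHRINK ioctl */
			kvm_resize_slot(del->slot, add->size);
		} else {
			/* no replacement queued: plain delete */
			kvm_delete_slot(del->slot);
		}
	}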

> That would probably require more KVM code overall, but each operation would be more
> tightly bounded and thus simpler to define.  And I think more precise APIs would
> provide other benefits, e.g. growing a region wouldn't require first deleting the
> current region, and so could avoid zapping SPTEs and destroying metadata.  Merge,
> split, and truncate would have similar benefits, though they might be more
> difficult to realize in practice.

So essentially grow would not require INVALIDATE. Makes sense, but would
it also work with shrink? I guess so, as the memslot is still present
(but shrunk), right?

Paolo, would you be ok with this smaller API? Probably just starting
with grow and shrink first.

I am not against any of the two approaches:
- my approach has the disadvantage that the list could be arbitrarily
long, and it is difficult to roll back the intermediate changes if
something breaks during the request processing (but this could be
simplified by making KVM exit or crash).

- Sean approach could potentially provide more burden to the userspace,
as we need to figure which operation is which. Also from my
understanding split and merge won't be really straightforward to
implement, especially in userspace.

David, any concern from userspace prospective on this "CISC" approach?

Thank you,
Emanuele


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH 0/9] kvm: implement atomic memslot updates
  2022-10-13  7:43                                   ` Emanuele Giuseppe Esposito
@ 2022-10-13  8:44                                     ` David Hildenbrand
  2022-10-13 11:12                                       ` Emanuele Giuseppe Esposito
  0 siblings, 1 reply; 58+ messages in thread
From: David Hildenbrand @ 2022-10-13  8:44 UTC (permalink / raw)
  To: Emanuele Giuseppe Esposito, Sean Christopherson, Paolo Bonzini
  Cc: Maxim Levitsky, kvm, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin, linux-kernel, Like Xu

On 13.10.22 09:43, Emanuele Giuseppe Esposito wrote:
> 
> 
> Am 29/09/2022 um 23:39 schrieb Sean Christopherson:
>> If we really want to provide a better experience for userspace, why not provide
>> more primitives to address those specific use cases?  E.g. fix KVM's historic wart
>> of disallowing toggling of KVM_MEM_READONLY, and then add one or more ioctls to:
>>
>>    - Merge 2+ memory regions that are contiguous in GPA and HVA
>>    - Split a memory region into 2+ memory regions
>>    - Truncate a memory region
>>    - Grow a memory region
> 
> I looked again at the code and specifically the use case that triggers
> the crash in bugzilla. I *think* (correct me if I am wrong), that the
> only operation that QEMU performs right now is "grow/shrink".

I remember that there were BUG reports where we'd actually split and run 
into that problem. Just don't have them at hand. I think they happened 
during early boot when the OS re-configured some PCI thingies.

> 
> So *if* we want to go this way, we could start with a simple grow/shrink
> API.
> 
> Even though we need to consider that this could bring additional
> complexity in QEMU. Currently, DELETE+CREATE (grow/shrink) is not
> performed one after the other (in my innocent fantasy I was expecting to
> find 2 subsequent ioctls in the code), but in QEMU's
> address_space_set_flatview(), which seems to first remove all regions
> and then add them when changing flatviews.
> 
> address_space_update_topology_pass(as, old_view2, new_view, adding=false);
> address_space_update_topology_pass(as, old_view2, new_view, adding=true);
> 
> I don't think we can change this, as other listeners also rely on such
> ordering, but we can still batch all callback requests like I currently
> do and process them in kvm_commit(), figuring there which operation is
> which.
> 
> In other words, we would have something like:
> 
> region_del() --> DELETE memslot X -> add it to the queue of operations
> region_del() --> DELETE memslot Y -> add it to the queue of operations
> region_add() --> CREATE memslot X (size doubled) -> add it to the queue
> of operations
> region_add() --> CREATE memslot Y (size halved) -> add it to the queue
> of operations
> ...
> commit() --> scan QUEUE and figure what to do -> GROW X (+size), SHRINK
> Y (-size) -> issue 2 ioctls, GROW and SHRINK.

I communicated resizes (region_resize()) to the notifier in patch #3 of 
https://lore.kernel.org/qemu-devel/20200312161217.3590-1-david@redhat.com/

You could use that independently of the remainder. It should be less 
controversial ;)

But I think it only tackles part of the more generic problem (split/merge).

> 
>> That would probably require more KVM code overall, but each operation would be more
>> tightly bounded and thus simpler to define.  And I think more precise APIs would
>> provide other benefits, e.g. growing a region wouldn't require first deleting the
>> current region, and so could avoid zapping SPTEs and destroying metadata.  Merge,
>> split, and truncate would have similar benefits, though they might be more
>> difficult to realize in practice.
> 
> So essentially grow would not require INVALIDATE. Makes sense, but would
> it work also with shrink? I guess so, as the memslot is still present
> (but shrinked) right?
> 
> Paolo, would you be ok with this smaller API? Probably just starting
> with grow and shrink first.
> 
> I am not against any of the two approaches:
> - my approach has the disadvantage that the list could be arbitrarily
> long, and it is difficult to rollback the intermediate changes if
> something breaks during the request processing (but could be simplified
> by making kvm exit or crash).
> 
> - Sean approach could potentially provide more burden to the userspace,
> as we need to figure which operation is which. Also from my
> understanding split and merge won't be really straightforward to
> implement, especially in userspace.
> 
> David, any concern from userspace prospective on this "CISC" approach?

In contrast to resizes in QEMU that only affect a single memory 
region/slot, splitting/merging is harder to factor out and communicate 
to a notifier. As an alternative, we could handle it in the commit stage 
in the notifier itself, similar to what my prototype does, and figure 
out what needs to be done in there and how to call the proper KVM 
interface (and which interface to call).

With virtio-mem (in the future) we might see merges of 2 slots into a 
single one, by closing a gap in-between them. In "theory" we could 
combine some updates into a single transaction. But it won't be 100s ...

I think I'd prefer an API that doesn't require excessive ad-hoc 
extensions later once something in QEMU changes.


I think in essence, all we care about is performing atomic changes that 
*have to be atomic*, because something we add during a transaction 
overlaps with something we remove during a transaction. Not necessarily 
all updates in a transaction!

My prototype essentially detects that scenario, but doesn't call new KVM 
interfaces to deal with these.

I assume once we take that into consideration, we can mostly assume that 
any such atomic updates (only of the regions that really have to be part 
of an atomic update) won't involve more than a handful of memory 
regions. We could add a sane KVM API limit.

And if we run into that API limit in QEMU, we can print a warning and do 
it non-atomically.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH 0/9] kvm: implement atomic memslot updates
  2022-10-13  8:44                                     ` David Hildenbrand
@ 2022-10-13 11:12                                       ` Emanuele Giuseppe Esposito
  2022-10-13 14:45                                         ` David Hildenbrand
  0 siblings, 1 reply; 58+ messages in thread
From: Emanuele Giuseppe Esposito @ 2022-10-13 11:12 UTC (permalink / raw)
  To: David Hildenbrand, Sean Christopherson, Paolo Bonzini
  Cc: Maxim Levitsky, kvm, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin, linux-kernel, Like Xu



On 13/10/2022 at 10:44, David Hildenbrand wrote:
> On 13.10.22 09:43, Emanuele Giuseppe Esposito wrote:
>>
>>
>> Am 29/09/2022 um 23:39 schrieb Sean Christopherson:
>>> If we really want to provide a better experience for userspace, why
>>> not provide
>>> more primitives to address those specific use cases?  E.g. fix KVM's
>>> historic wart
>>> of disallowing toggling of KVM_MEM_READONLY, and then add one or more
>>> ioctls to:
>>>
>>>    - Merge 2+ memory regions that are contiguous in GPA and HVA
>>>    - Split a memory region into 2+ memory regions
>>>    - Truncate a memory region
>>>    - Grow a memory region
>>
>> I looked again at the code and specifically the use case that triggers
>> the crash in bugzilla. I *think* (correct me if I am wrong), that the
>> only operation that QEMU performs right now is "grow/shrink".
> 
> I remember that there were BUG reports where we'd actually split and run
> into that problem. Just don't have them at hand. I think they happened
> during early boot when the OS re-configured some PCI thingies.

If you could point me where this is happening, it would be nice. So far
I could not find or see any split/merge operation.

> 
>>
>> So *if* we want to go this way, we could start with a simple grow/shrink
>> API.
>>
>> Even though we need to consider that this could bring additional
>> complexity in QEMU. Currently, DELETE+CREATE (grow/shrink) is not
>> performed one after the other (in my innocent fantasy I was expecting to
>> find 2 subsequent ioctls in the code), but in QEMU's
>> address_space_set_flatview(), which seems to first remove all regions
>> and then add them when changing flatviews.
>>
>> address_space_update_topology_pass(as, old_view2, new_view,
>> adding=false);
>> address_space_update_topology_pass(as, old_view2, new_view, adding=true);
>>
>> I don't think we can change this, as other listeners also rely on such
>> ordering, but we can still batch all callback requests like I currently
>> do and process them in kvm_commit(), figuring there which operation is
>> which.
>>
>> In other words, we would have something like:
>>
>> region_del() --> DELETE memslot X -> add it to the queue of operations
>> region_del() --> DELETE memslot Y -> add it to the queue of operations
>> region_add() --> CREATE memslot X (size doubled) -> add it to the queue
>> of operations
>> region_add() --> CREATE memslot Y (size halved) -> add it to the queue
>> of operations
>> ...
>> commit() --> scan QUEUE and figure what to do -> GROW X (+size), SHRINK
>> Y (-size) -> issue 2 ioctls, GROW and SHRINK.
> 
> I communicated resizes (region_resize()) to the notifier in patch #3 of
> https://lore.kernel.org/qemu-devel/20200312161217.3590-1-david@redhat.com/
> 
> You could use that independently of the remainder. It should be less
> controversial ;)
> 
> But I think it only tackles part of the more generic problem (split/merge).

Yes, very useful! Using that patch we would have a single place from which 
to issue grow/shrink ioctls. I don't think we need to inhibit ioctls, since 
the operation will be atomic (this time in the true meaning of the word, 
since we don't need INVALIDATE).
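
Just for reference, without such a resize notifier the matching that
kvm_commit() would have to do on the queued example above could look
roughly like this. Purely illustrative: none of these names exist in QEMU
or KVM, and grow_slot()/shrink_slot() stand in for whatever the real
ioctl wrappers would be:

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

enum req_type { REQ_DELETE, REQ_CREATE };

struct slot_request {             /* invented for this sketch */
    enum req_type type;
    int slot_id;
    uint64_t gpa;
    uint64_t size;
};

/* Stand-ins for whatever the grow/shrink ioctl wrappers would end up being. */
static void grow_slot(int id, uint64_t new_size)
{
    printf("GROW slot %d to %#llx\n", id, (unsigned long long)new_size);
}

static void shrink_slot(int id, uint64_t new_size)
{
    printf("SHRINK slot %d to %#llx\n", id, (unsigned long long)new_size);
}

/* Pair each queued DELETE with a CREATE of the same slot at the same GPA and
 * turn the two requests into a single resize instead of two ioctls. */
static void commit_queue(const struct slot_request *reqs, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (reqs[i].type != REQ_DELETE)
            continue;
        for (size_t j = 0; j < n; j++) {
            if (reqs[j].type != REQ_CREATE ||
                reqs[j].slot_id != reqs[i].slot_id ||
                reqs[j].gpa != reqs[i].gpa)
                continue;
            if (reqs[j].size > reqs[i].size)
                grow_slot(reqs[i].slot_id, reqs[j].size);
            else
                shrink_slot(reqs[i].slot_id, reqs[j].size);
        }
    }
}

int main(void)
{
    /* region_del(X), region_del(Y), region_add(X doubled), region_add(Y halved) */
    struct slot_request q[] = {
        { REQ_DELETE, 0, 0x00000000ULL, 0x40000000ULL },
        { REQ_DELETE, 1, 0x80000000ULL, 0x40000000ULL },
        { REQ_CREATE, 0, 0x00000000ULL, 0x80000000ULL },
        { REQ_CREATE, 1, 0x80000000ULL, 0x20000000ULL },
    };

    commit_queue(q, sizeof(q) / sizeof(q[0]));
    return 0;
}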

> 
>>
>>> That would probably require more KVM code overall, but each operation
>>> would be more
>>> tightly bounded and thus simpler to define.  And I think more precise
>>> APIs would
>>> provide other benefits, e.g. growing a region wouldn't require first
>>> deleting the
>>> current region, and so could avoid zapping SPTEs and destroying
>>> metadata.  Merge,
>>> split, and truncate would have similar benefits, though they might be
>>> more
>>> difficult to realize in practice.
>>
>> So essentially grow would not require INVALIDATE. Makes sense, but would
>> it work also with shrink? I guess so, as the memslot is still present
>> (but shrinked) right?
>>
>> Paolo, would you be ok with this smaller API? Probably just starting
>> with grow and shrink first.
>>
>> I am not against any of the two approaches:
>> - my approach has the disadvantage that the list could be arbitrarily
>> long, and it is difficult to rollback the intermediate changes if
>> something breaks during the request processing (but could be simplified
>> by making kvm exit or crash).
>>
>> - Sean approach could potentially provide more burden to the userspace,
>> as we need to figure which operation is which. Also from my
>> understanding split and merge won't be really straightforward to
>> implement, especially in userspace.
>>
>> David, any concern from a userspace perspective on this "CISC" approach?
> 
> In contrast to resizes in QEMU that only affect a single memory
> region/slot, splitting/merging is harder to factor out and communicate
> to a notifier. As an alternative, we could handle it in the commit stage
> in the notifier itself, similar to what my prototype does, and figure
> out what needs to be done in there and how to call the proper KVM
> interface (and which interface to call).
> 
> With virtio-mem (in the future) we might see merges of 2 slots into a
> single one, by closing a gap in-between them. In "theory" we could
> combine some updates into a single transaction. But it won't be 100s ...
> 
> I think I'd prefer an API that doesn't require excessive ad-hoc
> extensions later once something in QEMU changes.
> 
> 
> I think in essence, all we care about is performing atomic changes that
> *have to be atomic*, because something we add during a transaction
> overlaps with something we remove during a transaction. Not necessarily
> all updates in a transaction!
> 
> My prototype essentially detects that scenario, but doesn't call new KVM
> interfaces to deal with these.

With "prototype" I assume you mean the patch linked above
(region_resize), not the userspace-only proposal you sent initially right?

> 
> I assume once we take that into consideration, we can mostly assume that
> any such atomic updates (only of the regions that really have to be part
> of an atomic update) won't involve more than a handful of memory
> regions. We could add a sane KVM API limit.
> 
> And if we run into that API limit in QEMU, we can print a warning and do
> it non-atomically.
> 
If we implement single operations (split/merge/grow/shrink), we don't
even need that limit. Except for merge, maybe.

Ok, if it's ok for you all I can try to use David's patch and implement some
simple grow/shrink. Then we need to figure out where and when exactly QEMU
performs split and merge operations, and maybe implement something
similar to what you did in your proposal?
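
Roughly, this is the userspace side I have in mind, assuming KVM grows some
kind of resize ioctl. KVM_RESIZE_USER_MEMORY_REGION, the struct and the
helpers below are all made-up names; nothing like this exists yet:

#include <stdint.h>
#include <stdio.h>

struct kvm_resize_region {          /* invented for this sketch */
    uint32_t slot;
    uint64_t guest_phys_addr;
    uint64_t old_memory_size;
    uint64_t new_memory_size;
};

static int kvm_resize_slot(int vm_fd, const struct kvm_resize_region *r)
{
    /* Placeholder: would be ioctl(vm_fd, KVM_RESIZE_USER_MEMORY_REGION, r)
     * if such an ioctl existed. */
    (void)vm_fd;
    printf("%s slot %u: %#llx -> %#llx bytes\n",
           r->new_memory_size > r->old_memory_size ? "grow" : "shrink",
           r->slot,
           (unsigned long long)r->old_memory_size,
           (unsigned long long)r->new_memory_size);
    return 0;
}

/* What the KVM memory listener's resize hook (David's region_resize()
 * notifier) could boil down to: same slot, same GPA, only the size changes. */
static int kvm_region_resize(int vm_fd, uint32_t slot, uint64_t gpa,
                             uint64_t old_size, uint64_t new_size)
{
    struct kvm_resize_region r = {
        .slot = slot,
        .guest_phys_addr = gpa,
        .old_memory_size = old_size,
        .new_memory_size = new_size,
    };

    return kvm_resize_slot(vm_fd, &r);
}

int main(void)
{
    /* grow a 1 GiB slot to 2 GiB */
    return kvm_region_resize(-1, 0, 0x100000000ULL, 0x40000000ULL, 0x80000000ULL);
}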

Thank you,
Emanuele


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH 0/9] kvm: implement atomic memslot updates
  2022-10-13 11:12                                       ` Emanuele Giuseppe Esposito
@ 2022-10-13 14:45                                         ` David Hildenbrand
  0 siblings, 0 replies; 58+ messages in thread
From: David Hildenbrand @ 2022-10-13 14:45 UTC (permalink / raw)
  To: Emanuele Giuseppe Esposito, Sean Christopherson, Paolo Bonzini
  Cc: Maxim Levitsky, kvm, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin, linux-kernel, Like Xu

>> I remember that there were BUG reports where we'd actually split and run
>> into that problem. Just don't have them at hand. I think they happened
>> during early boot when the OS re-configured some PCI thingies.
> 
> If you could point me where this is happening, it would be nice. So far
> I could not find or see any split/merge operation.

I remember some bugzilla, but it might be hard to find. Essentially, it 
happened so early during boot that it wasn't really a problem so far 
(IIRC only a single CPU is running, and it's the one that triggers it).

But I might be messing up some details.

>>
>>>
>>>> That would probably require more KVM code overall, but each operation
>>>> would be more
>>>> tightly bounded and thus simpler to define.  And I think more precise
>>>> APIs would
>>>> provide other benefits, e.g. growing a region wouldn't require first
>>>> deleting the
>>>> current region, and so could avoid zapping SPTEs and destroying
>>>> metadata.  Merge,
>>>> split, and truncate would have similar benefits, though they might be
>>>> more
>>>> difficult to realize in practice.
>>>
>>> So essentially grow would not require INVALIDATE. Makes sense, but would
>>> it work also with shrink? I guess so, as the memslot is still present
>>> (but shrinked) right?
>>>
>>> Paolo, would you be ok with this smaller API? Probably just starting
>>> with grow and shrink first.
>>>
>>> I am not against any of the two approaches:
>>> - my approach has the disadvantage that the list could be arbitrarily
>>> long, and it is difficult to rollback the intermediate changes if
>>> something breaks during the request processing (but could be simplified
>>> by making kvm exit or crash).
>>>
>>> - Sean approach could potentially provide more burden to the userspace,
>>> as we need to figure which operation is which. Also from my
>>> understanding split and merge won't be really straightforward to
>>> implement, especially in userspace.
>>>
>>> David, any concern from a userspace perspective on this "CISC" approach?
>>
>> In contrast to resizes in QEMU that only affect a single memory
>> region/slot, splitting/merging is harder to factor out and communicate
>> to a notifier. As an alternative, we could handle it in the commit stage
>> in the notifier itself, similar to what my prototype does, and figure
>> out what needs to be done in there and how to call the proper KVM
>> interface (and which interface to call).
>>
>> With virtio-mem (in the future) we might see merges of 2 slots into a
>> single one, by closing a gap in-between them. In "theory" we could
>> combine some updates into a single transaction. But it won't be 100s ...
>>
>> I think I'd prefer an API that doesn't require excessive ad-hoc
>> extensions later once something in QEMU changes.
>>
>>
>> I think in essence, all we care about is performing atomic changes that
>> *have to be atomic*, because something we add during a transaction
>> overlaps with something we remove during a transaction. Not necessarily
>> all updates in a transaction!
>>
>> My prototype essentially detects that scenario, but doesn't call new KVM
>> interfaces to deal with these.
> 
> With "prototype" I assume you mean the patch linked above
> (region_resize), not the userspace-only proposal you sent initially right?

I meant the part of the userspace-only proposal where we collect all 
add/remove callbacks in a list, and in the commit stage try to detect if 
there is an overlap. That way we could isolate the memblocks that really 
only need an atomic operation. Anything that doesn't overlap can just go 
via the old interface.

Not saying that it wins a beauty prize, but we wouldn't have to adjust 
the QEMU core even more to teach it these special cases (but maybe we 
should, and other notifiers might benefit from that in the long term).

> 
>>
>> I assume once we take that into consideration, we can mostly assume that
>> any such atomic updates (only of the regions that really have to be part
>> of an atomic update) won't involve more than a handful of memory
>> regions. We could add a sane KVM API limit.
>>
>> And if we run into that API limit in QEMU, we can print a warning and do
>> it non-atomically.
>>
> If we implement single operations (split/merge/grow/shrink), we don't
> even need that limit. Except for merge, maybe.

Right, in theory you could have multiple splits/merges in a single 
commit that all interact.

Like cutting 8 pieces out of an existing larger memblock.
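
Just to make that concrete, such a commit could boil down to a single
transaction along these lines (addresses made up, an 8 GiB slot split into
eight 1 GiB pieces):

  DELETE  A    [0x100000000, 0x300000000)
  CREATE  A0   [0x100000000, 0x140000000)
  CREATE  A1   [0x140000000, 0x180000000)
  ...
  CREATE  A7   [0x2c0000000, 0x300000000)

Every CREATE overlaps the DELETE, so none of them can go through the old
interface on its own.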

> 
> Ok, if it's ok for you all I can try to use David's patch and implement some
> simple grow/shrink. Then we need to figure out where and when exactly QEMU
> performs split and merge operations, and maybe implement something
> similar to what you did in your proposal?

Looking at grow/shrink first is certainly not a waste of time.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 58+ messages in thread


Thread overview: 58+ messages
2022-09-09 10:44 [RFC PATCH 0/9] kvm: implement atomic memslot updates Emanuele Giuseppe Esposito
2022-09-09 10:44 ` [RFC PATCH 1/9] kvm_main.c: move slot check in kvm_set_memory_region Emanuele Giuseppe Esposito
2022-09-28 16:41   ` Paolo Bonzini
2022-09-09 10:44 ` [RFC PATCH 2/9] kvm.h: introduce KVM_SET_USER_MEMORY_REGION_LIST ioctl Emanuele Giuseppe Esposito
2022-09-28 16:42   ` Paolo Bonzini
2022-09-09 10:45 ` [RFC PATCH 3/9] kvm_main.c: introduce kvm_internal_memory_region_list Emanuele Giuseppe Esposito
2022-09-28 16:48   ` Paolo Bonzini
2022-09-09 10:45 ` [RFC PATCH 4/9] kvm_main.c: split logic in kvm_set_memslots Emanuele Giuseppe Esposito
2022-09-28 17:04   ` Paolo Bonzini
2022-09-09 10:45 ` [RFC PATCH 5/9] kvm_main.c: split __kvm_set_memory_region logic in kvm_check_mem and kvm_prepare_batch Emanuele Giuseppe Esposito
2022-09-13  2:56   ` Yang, Weijiang
2022-09-18 16:22     ` Emanuele Giuseppe Esposito
2022-09-28 17:11   ` Paolo Bonzini
2022-09-09 10:45 ` [RFC PATCH 6/9] kvm_main.c: simplify change-specific callbacks Emanuele Giuseppe Esposito
2022-09-09 10:45 ` [RFC PATCH 7/9] kvm_main.c: duplicate invalid memslot also in inactive list Emanuele Giuseppe Esposito
2022-09-28 17:18   ` Paolo Bonzini
2022-09-09 10:45 ` [RFC PATCH 8/9] kvm_main.c: find memslots from the inactive memslot list Emanuele Giuseppe Esposito
2022-09-09 10:45 ` [RFC PATCH 9/9] kvm_main.c: handle atomic memslot update Emanuele Giuseppe Esposito
2022-09-13  2:30   ` Yang, Weijiang
2022-09-18 16:18     ` Emanuele Giuseppe Esposito
2022-09-27  7:46   ` David Hildenbrand
2022-09-27  8:35     ` Emanuele Giuseppe Esposito
2022-09-27  9:22       ` David Hildenbrand
2022-09-27  9:32         ` Emanuele Giuseppe Esposito
2022-09-27 14:52           ` David Hildenbrand
2022-09-28 17:29   ` Paolo Bonzini
2022-09-09 14:30 ` [RFC PATCH 0/9] kvm: implement atomic memslot updates Sean Christopherson
2022-09-18 16:13   ` Emanuele Giuseppe Esposito
2022-09-19  7:38     ` Like Xu
2022-09-19  7:53     ` David Hildenbrand
2022-09-19 17:30       ` David Hildenbrand
2022-09-23 13:10         ` Emanuele Giuseppe Esposito
2022-09-23 13:21           ` David Hildenbrand
2022-09-23 13:38             ` Emanuele Giuseppe Esposito
2022-09-26  9:03               ` David Hildenbrand
2022-09-26 21:28                 ` Sean Christopherson
2022-09-27  7:38                   ` Emanuele Giuseppe Esposito
2022-09-27 15:58                     ` Sean Christopherson
2022-09-28  9:11                       ` Emanuele Giuseppe Esposito
2022-09-28 11:14                         ` Maxim Levitsky
2022-09-28 12:52                           ` David Hildenbrand
2022-09-28 15:07                       ` Paolo Bonzini
2022-09-28 15:33                         ` David Hildenbrand
2022-09-28 15:58                         ` Sean Christopherson
2022-09-28 16:38                           ` Paolo Bonzini
2022-09-28 20:41                             ` Sean Christopherson
2022-09-29  8:05                               ` Emanuele Giuseppe Esposito
2022-09-29  8:24                                 ` David Hildenbrand
2022-09-29 15:18                                 ` Sean Christopherson
2022-09-29 15:41                                   ` Paolo Bonzini
2022-09-29 15:28                               ` Paolo Bonzini
2022-09-29 15:40                                 ` Maxim Levitsky
2022-09-29 16:00                                 ` David Hildenbrand
2022-09-29 21:39                                 ` Sean Christopherson
2022-10-13  7:43                                   ` Emanuele Giuseppe Esposito
2022-10-13  8:44                                     ` David Hildenbrand
2022-10-13 11:12                                       ` Emanuele Giuseppe Esposito
2022-10-13 14:45                                         ` David Hildenbrand
