* [PATCH 0/4] KVM: API to block and resume all running vcpus in a vm
@ 2022-10-22 15:48 Emanuele Giuseppe Esposito
  2022-10-22 15:48 ` [PATCH 1/4] linux-headers/linux/kvm.h: introduce kvm_userspace_memory_region_list ioctl Emanuele Giuseppe Esposito
                   ` (4 more replies)
  0 siblings, 5 replies; 22+ messages in thread
From: Emanuele Giuseppe Esposito @ 2022-10-22 15:48 UTC (permalink / raw)
  To: kvm
  Cc: Paolo Bonzini, Jonathan Corbet, Maxim Levitsky,
	Sean Christopherson, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, David Hildenbrand, x86,
	H. Peter Anvin, linux-doc, linux-kernel,
	Emanuele Giuseppe Esposito

This new API allows userspace to stop all running
vcpus with the KVM_KICK_ALL_RUNNING_VCPUS ioctl and resume them with
KVM_RESUME_ALL_KICKED_VCPUS.
A "running" vcpu is a vcpu that is executing the KVM_RUN ioctl.

This series is especially helpful to userspace hypervisors like
QEMU when they need to perform operations on memslots without the
risk of a vcpu reading them in the meantime.
By "memslot operations" we mean growing, shrinking, merging and
splitting memslots, which are not atomic because there is a time
window between the DELETE memslot operation and the CREATE one.
Currently, each memslot operation is performed with one or more
ioctls.
For example, merging two memslots into one would imply:
DELETE(m1)
DELETE(m2)
CREATE(m1+m2)

And a vcpu could attempt to read m2 right after it is deleted, but
before the new one is created.

Therefore the simplest solution is to pause all vcpus on the KVM
side, so that:
- userspace just needs to call the new API before making memslot
changes, keeping modifications to a minimum
- dirty page updates can also be performed while vcpus are blocked,
so there is no time window between the dirty page ioctl and the
memslot modifications
- there is no need to modify the existing memslot API
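
As a rough sketch, the intended userspace flow for the merge example above
could look like this (vm_fd and the struct contents are placeholders, error
handling omitted):

  #include <sys/ioctl.h>
  #include <linux/kvm.h>   /* with the new ioctls from this series */

  struct kvm_userspace_memory_region m1 = { /* existing slot 1 */ };
  struct kvm_userspace_memory_region m2 = { /* existing slot 2 */ };
  struct kvm_userspace_memory_region merged = { /* covers m1+m2 */ };

  /* Kick every vcpu out of KVM_RUN and keep it out. */
  ioctl(vm_fd, KVM_KICK_ALL_RUNNING_VCPUS, 0);

  /* No vcpu can observe the window between DELETE and CREATE. */
  m1.memory_size = 0;                                  /* DELETE(m1) */
  ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &m1);
  m2.memory_size = 0;                                  /* DELETE(m2) */
  ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &m2);
  ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &merged);   /* CREATE(m1+m2) */

  /* Let the vcpus re-enter KVM_RUN. */
  ioctl(vm_fd, KVM_RESUME_ALL_KICKED_VCPUS, 0);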

Emanuele Giuseppe Esposito (4):
  linux-headers/linux/kvm.h: introduce kvm_userspace_memory_region_list
    ioctl
  KVM: introduce kvm_clear_all_cpus_request
  KVM: introduce memory transaction semaphore
  KVM: use signals to abort enter_guest/blocking and retry

 Documentation/virt/kvm/vcpu-requests.rst |  3 ++
 arch/x86/include/asm/kvm_host.h          |  2 ++
 arch/x86/kvm/x86.c                       |  8 +++++
 include/uapi/linux/kvm.h                 |  3 ++
 virt/kvm/kvm_main.c                      | 45 ++++++++++++++++++++++++
 5 files changed, 61 insertions(+)

-- 
2.31.1



* [PATCH 1/4] linux-headers/linux/kvm.h: introduce kvm_userspace_memory_region_list ioctl
  2022-10-22 15:48 [PATCH 0/4] KVM: API to block and resume all running vcpus in a vm Emanuele Giuseppe Esposito
@ 2022-10-22 15:48 ` Emanuele Giuseppe Esposito
  2022-10-22 15:48 ` [PATCH 2/4] KVM: introduce kvm_clear_all_cpus_request Emanuele Giuseppe Esposito
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 22+ messages in thread
From: Emanuele Giuseppe Esposito @ 2022-10-22 15:48 UTC (permalink / raw)
  To: kvm
  Cc: Paolo Bonzini, Jonathan Corbet, Maxim Levitsky,
	Sean Christopherson, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, David Hildenbrand, x86,
	H. Peter Anvin, linux-doc, linux-kernel,
	Emanuele Giuseppe Esposito

Introduce the new KVM_KICK_ALL_RUNNING_VCPUS and KVM_RESUME_ALL_KICKED_VCPUS
ioctls, which will be used respectively to pause and then resume all vcpus
currently executing KVM_RUN.

Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
---
 include/uapi/linux/kvm.h | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index eed0315a77a6..d3cba8d4ca91 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -2227,4 +2227,7 @@ struct kvm_s390_zpci_op {
 /* flags for kvm_s390_zpci_op->u.reg_aen.flags */
 #define KVM_S390_ZPCIOP_REGAEN_HOST    (1 << 0)
 
+#define KVM_KICK_ALL_RUNNING_VCPUS		_IO(KVMIO,  0xd2)
+#define KVM_RESUME_ALL_KICKED_VCPUS		_IO(KVMIO,  0xd3)
+
 #endif /* __LINUX_KVM_H */
-- 
2.31.1



* [PATCH 2/4] KVM: introduce kvm_clear_all_cpus_request
  2022-10-22 15:48 [PATCH 0/4] KVM: API to block and resume all running vcpus in a vm Emanuele Giuseppe Esposito
  2022-10-22 15:48 ` [PATCH 1/4] linux-headers/linux/kvm.h: introduce kvm_userspace_memory_region_list ioctl Emanuele Giuseppe Esposito
@ 2022-10-22 15:48 ` Emanuele Giuseppe Esposito
  2022-10-22 15:48 ` [PATCH 3/4] KVM: introduce memory transaction semaphore Emanuele Giuseppe Esposito
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 22+ messages in thread
From: Emanuele Giuseppe Esposito @ 2022-10-22 15:48 UTC (permalink / raw)
  To: kvm
  Cc: Paolo Bonzini, Jonathan Corbet, Maxim Levitsky,
	Sean Christopherson, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, David Hildenbrand, x86,
	H. Peter Anvin, linux-doc, linux-kernel,
	Emanuele Giuseppe Esposito

Clear the given request in all vcpus of the VM with struct kvm.

Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
---
 Documentation/virt/kvm/vcpu-requests.rst |  3 +++
 virt/kvm/kvm_main.c                      | 10 ++++++++++
 2 files changed, 13 insertions(+)

diff --git a/Documentation/virt/kvm/vcpu-requests.rst b/Documentation/virt/kvm/vcpu-requests.rst
index 31f62b64e07b..468410dfe84d 100644
--- a/Documentation/virt/kvm/vcpu-requests.rst
+++ b/Documentation/virt/kvm/vcpu-requests.rst
@@ -36,6 +36,9 @@ its TLB with a VCPU request.  The API consists of the following functions::
   /* Make request @req of all VCPUs of the VM with struct kvm @kvm. */
   bool kvm_make_all_cpus_request(struct kvm *kvm, unsigned int req);
 
+  /* Clear request @req of all VCPUs of the VM with struct kvm @kvm. */
+  void kvm_clear_all_cpus_request(struct kvm *kvm, unsigned int req);
+
 Typically a requester wants the VCPU to perform the activity as soon
 as possible after making the request.  This means most requests
 (kvm_make_request() calls) are followed by a call to kvm_vcpu_kick(),
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 584a5bab3af3..c080b93edc0d 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -355,6 +355,16 @@ bool kvm_make_all_cpus_request(struct kvm *kvm, unsigned int req)
 }
 EXPORT_SYMBOL_GPL(kvm_make_all_cpus_request);
 
+void kvm_clear_all_cpus_request(struct kvm *kvm, unsigned int req)
+{
+	unsigned long i;
+	struct kvm_vcpu *vcpu;
+
+	kvm_for_each_vcpu(i, vcpu, kvm)
+		kvm_clear_request(req, vcpu);
+}
+EXPORT_SYMBOL_GPL(kvm_clear_all_cpus_request);
+
 #ifndef CONFIG_HAVE_KVM_ARCH_TLB_FLUSH_ALL
 void kvm_flush_remote_tlbs(struct kvm *kvm)
 {
-- 
2.31.1



* [PATCH 3/4] KVM: introduce memory transaction semaphore
  2022-10-22 15:48 [PATCH 0/4] KVM: API to block and resume all running vcpus in a vm Emanuele Giuseppe Esposito
  2022-10-22 15:48 ` [PATCH 1/4] linux-headers/linux/kvm.h: introduce kvm_userspace_memory_region_list ioctl Emanuele Giuseppe Esposito
  2022-10-22 15:48 ` [PATCH 2/4] KVM: introduce kvm_clear_all_cpus_request Emanuele Giuseppe Esposito
@ 2022-10-22 15:48 ` Emanuele Giuseppe Esposito
  2022-10-23 17:50   ` Paolo Bonzini
  2022-10-22 15:48 ` [PATCH 4/4] KVM: use signals to abort enter_guest/blocking and retry Emanuele Giuseppe Esposito
  2022-10-24  7:56 ` [PATCH 0/4] KVM: API to block and resume all running vcpus in a vm Christian Borntraeger
  4 siblings, 1 reply; 22+ messages in thread
From: Emanuele Giuseppe Esposito @ 2022-10-22 15:48 UTC (permalink / raw)
  To: kvm
  Cc: Paolo Bonzini, Jonathan Corbet, Maxim Levitsky,
	Sean Christopherson, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, David Hildenbrand, x86,
	H. Peter Anvin, linux-doc, linux-kernel,
	Emanuele Giuseppe Esposito

Right now the semaphore is only used to signal that a vcpu
entered KVM_RUN (not necessarily in guest mode; it could also be
blocked/halted).
Later it will be used by specific ioctls (writers) to wait until
all vcpus (readers) have exited from KVM_RUN.
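
As an illustrative sketch (the writer side is only added by a later patch
in this series), the intended pattern is:

  /* reader side: each vcpu, for the whole duration of KVM_RUN */
  down_read(&memory_transaction);
  r = kvm_arch_vcpu_ioctl_run(vcpu);
  up_read(&memory_transaction);

  /* writer side: after asking all vcpus to stop, taking the rwsem for
   * write waits until every reader has left KVM_RUN */
  down_write(&memory_transaction);
  up_write(&memory_transaction);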

Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
---
 virt/kvm/kvm_main.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index c080b93edc0d..ae0240928a4a 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -119,6 +119,8 @@ static const struct file_operations stat_fops_per_vm;
 
 static struct file_operations kvm_chardev_ops;
 
+static DECLARE_RWSEM(memory_transaction);
+
 static long kvm_vcpu_ioctl(struct file *file, unsigned int ioctl,
 			   unsigned long arg);
 #ifdef CONFIG_KVM_COMPAT
@@ -4074,7 +4076,19 @@ static long kvm_vcpu_ioctl(struct file *filp,
 				synchronize_rcu();
 			put_pid(oldpid);
 		}
+		/*
+		 * Notify that a vcpu wants to run, and thus could be reading
+		 * memslots.
+		 * If KVM_KICK_ALL_RUNNING_VCPUS runs afterwards, it will have
+		 * to wait until KVM_RUN exits and up_read() is called.
+		 * If KVM_KICK_ALL_RUNNING_VCPUS already returned but
+		 * KVM_RESUME_ALL_KICKED_VCPUS didn't start yet, then there
+		 * is a request pending for the vcpu that will cause it to
+		 * exit KVM_RUN.
+		 */
+		down_read(&memory_transaction);
 		r = kvm_arch_vcpu_ioctl_run(vcpu);
+		up_read(&memory_transaction);
 		trace_kvm_userspace_exit(vcpu->run->exit_reason, r);
 		break;
 	}
-- 
2.31.1



* [PATCH 4/4] KVM: use signals to abort enter_guest/blocking and retry
  2022-10-22 15:48 [PATCH 0/4] KVM: API to block and resume all running vcpus in a vm Emanuele Giuseppe Esposito
                   ` (2 preceding siblings ...)
  2022-10-22 15:48 ` [PATCH 3/4] KVM: introduce memory transaction semaphore Emanuele Giuseppe Esposito
@ 2022-10-22 15:48 ` Emanuele Giuseppe Esposito
  2022-10-23 17:48   ` Paolo Bonzini
  2022-10-24  7:56 ` [PATCH 0/4] KVM: API to block and resume all running vcpus in a vm Christian Borntraeger
  4 siblings, 1 reply; 22+ messages in thread
From: Emanuele Giuseppe Esposito @ 2022-10-22 15:48 UTC (permalink / raw)
  To: kvm
  Cc: Paolo Bonzini, Jonathan Corbet, Maxim Levitsky,
	Sean Christopherson, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, David Hildenbrand, x86,
	H. Peter Anvin, linux-doc, linux-kernel,
	Emanuele Giuseppe Esposito

Once a vcpu executes KVM_RUN, it can be in one of two states:
guest mode, or blocked/halted.
Use a vcpu request (KVM_REQ_USERSPACE_KICK) as the signal that makes
a vcpu exit guest mode or unblock, so that it can exit KVM_RUN and
release the read semaphore, allowing a pending
KVM_KICK_ALL_RUNNING_VCPUS to continue.

Note that the request is not cleared here; it is used to propagate
the exit reason up to vcpu_run(), and it will be cleared only by
KVM_RESUME_ALL_KICKED_VCPUS. This allows the vcpu to keep trying to
enter KVM_RUN and to perform again all the checks done in
kvm_arch_vcpu_ioctl_run() before entering guest mode, where it will
return again if the request is still set.

However, the userspace hypervisor should also try to avoid
continuously calling KVM_RUN after invoking KVM_KICK_ALL_RUNNING_VCPUS,
because each such call will just translate into a back-to-back
down_read() and up_read() (thanks to the still-pending request).
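
As a sketch of how a vcpu thread in userspace could handle the resulting
-EINTR without busy-looping (vcpu_fd, vm_lock, paused and resume_cond are
hypothetical VMM-side state, not part of this series):

  for (;;) {
          int ret = ioctl(vcpu_fd, KVM_RUN, 0);

          if (ret < 0 && errno == EINTR) {
                  /* Kicked by KVM_KICK_ALL_RUNNING_VCPUS (or a signal):
                   * wait for KVM_RESUME_ALL_KICKED_VCPUS instead of
                   * re-entering KVM_RUN right away. */
                  pthread_mutex_lock(&vm_lock);
                  while (paused)
                          pthread_cond_wait(&resume_cond, &vm_lock);
                  pthread_mutex_unlock(&vm_lock);
                  continue;
          }
          if (ret < 0)
                  break;          /* other errors: handle as usual */

          /* process the kvm_run exit reason as usual */
  }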

Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
---
 arch/x86/include/asm/kvm_host.h |  2 ++
 arch/x86/kvm/x86.c              |  8 ++++++++
 virt/kvm/kvm_main.c             | 21 +++++++++++++++++++++
 3 files changed, 31 insertions(+)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index aa381ab69a19..d5c37f344d65 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -108,6 +108,8 @@
 	KVM_ARCH_REQ_FLAGS(30, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
 #define KVM_REQ_MMU_FREE_OBSOLETE_ROOTS \
 	KVM_ARCH_REQ_FLAGS(31, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
+#define KVM_REQ_USERSPACE_KICK		\
+	KVM_ARCH_REQ_FLAGS(32, KVM_REQUEST_WAIT)
 
 #define CR0_RESERVED_BITS                                               \
 	(~(unsigned long)(X86_CR0_PE | X86_CR0_MP | X86_CR0_EM | X86_CR0_TS \
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index b0c47b41c264..2af5f427b4e9 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -10270,6 +10270,10 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 	}
 
 	if (kvm_request_pending(vcpu)) {
+		if (kvm_test_request(KVM_REQ_USERSPACE_KICK, vcpu)) {
+			r = -EINTR;
+			goto out;
+		}
 		if (kvm_check_request(KVM_REQ_VM_DEAD, vcpu)) {
 			r = -EIO;
 			goto out;
@@ -10701,6 +10705,10 @@ static int vcpu_run(struct kvm_vcpu *vcpu)
 			r = vcpu_block(vcpu);
 		}
 
+		/* vcpu exited guest/unblocked because of this request */
+		if (kvm_test_request(KVM_REQ_USERSPACE_KICK, vcpu))
+			return -EINTR;
+
 		if (r <= 0)
 			break;
 
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index ae0240928a4a..13fa7229b85d 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -3431,6 +3431,8 @@ static int kvm_vcpu_check_block(struct kvm_vcpu *vcpu)
 		goto out;
 	if (kvm_check_request(KVM_REQ_UNBLOCK, vcpu))
 		goto out;
+	if (kvm_test_request(KVM_REQ_USERSPACE_KICK, vcpu))
+		goto out;
 
 	ret = 0;
 out:
@@ -4668,6 +4670,25 @@ static long kvm_vm_ioctl(struct file *filp,
 		r = kvm_vm_ioctl_enable_cap_generic(kvm, &cap);
 		break;
 	}
+	case KVM_KICK_ALL_RUNNING_VCPUS: {
+		/*
+		 * Notify all running vcpus that they have to stop.
+		 * Caught in kvm_arch_vcpu_ioctl_run()
+		 */
+		kvm_make_all_cpus_request(kvm, KVM_REQ_USERSPACE_KICK);
+
+		/*
+		 * Use wr semaphore to wait for all vcpus to exit from KVM_RUN.
+		 */
+		down_write(&memory_transaction);
+		up_write(&memory_transaction);
+		break;
+	}
+	case KVM_RESUME_ALL_KICKED_VCPUS: {
+		/* Remove all requests sent with KVM_KICK_ALL_RUNNING_VCPUS */
+		kvm_clear_all_cpus_request(kvm, KVM_REQ_USERSPACE_KICK);
+		break;
+	}
 	case KVM_SET_USER_MEMORY_REGION: {
 		struct kvm_userspace_memory_region kvm_userspace_mem;
 
-- 
2.31.1



* Re: [PATCH 4/4] KVM: use signals to abort enter_guest/blocking and retry
  2022-10-22 15:48 ` [PATCH 4/4] KVM: use signals to abort enter_guest/blocking and retry Emanuele Giuseppe Esposito
@ 2022-10-23 17:48   ` Paolo Bonzini
  2022-10-24  7:43     ` Emanuele Giuseppe Esposito
  0 siblings, 1 reply; 22+ messages in thread
From: Paolo Bonzini @ 2022-10-23 17:48 UTC (permalink / raw)
  To: Emanuele Giuseppe Esposito, kvm
  Cc: Jonathan Corbet, Maxim Levitsky, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	David Hildenbrand, x86, H. Peter Anvin, linux-doc, linux-kernel

On 10/22/22 17:48, Emanuele Giuseppe Esposito wrote:
> Once a vcpu exectues KVM_RUN, it could enter two states:
> enter guest mode, or block/halt.
> Use a signal to allow a vcpu to exit the guest state or unblock,
> so that it can exit KVM_RUN and release the read semaphore,
> allowing a pending KVM_KICK_ALL_RUNNING_VCPUS to continue.
> 
> Note that the signal is not deleted and used to propagate the
> exit reason till vcpu_run(). It will be clearead only by
> KVM_RESUME_ALL_KICKED_VCPUS. This allows the vcpu to keep try
> entering KVM_RUN and perform again all checks done in
> kvm_arch_vcpu_ioctl_run() before entering the guest state,
> where it will return again if the request is still set.
> 
> However, the userspace hypervisor should also try to avoid
> continuously calling KVM_RUN after invoking KVM_KICK_ALL_RUNNING_VCPUS,
> because such call will just translate in a back-to-back down_read()
> and up_read() (thanks to the signal).

Since the userspace should anyway avoid going into this effectively-busy 
wait, what about clearing the request after the first exit?  The 
cancellation ioctl can be kept for vCPUs that are never entered after 
KVM_KICK_ALL_RUNNING_VCPUS.  Alternatively, kvm_clear_all_cpus_request 
could be done right before up_write().

Paolo

> Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
> ---
>   arch/x86/include/asm/kvm_host.h |  2 ++
>   arch/x86/kvm/x86.c              |  8 ++++++++
>   virt/kvm/kvm_main.c             | 21 +++++++++++++++++++++
>   3 files changed, 31 insertions(+)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index aa381ab69a19..d5c37f344d65 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -108,6 +108,8 @@
>   	KVM_ARCH_REQ_FLAGS(30, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
>   #define KVM_REQ_MMU_FREE_OBSOLETE_ROOTS \
>   	KVM_ARCH_REQ_FLAGS(31, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
> +#define KVM_REQ_USERSPACE_KICK		\
> +	KVM_ARCH_REQ_FLAGS(32, KVM_REQUEST_WAIT)
>   
>   #define CR0_RESERVED_BITS                                               \
>   	(~(unsigned long)(X86_CR0_PE | X86_CR0_MP | X86_CR0_EM | X86_CR0_TS \
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index b0c47b41c264..2af5f427b4e9 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -10270,6 +10270,10 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
>   	}
>   
>   	if (kvm_request_pending(vcpu)) {
> +		if (kvm_test_request(KVM_REQ_USERSPACE_KICK, vcpu)) {
> +			r = -EINTR;
> +			goto out;
> +		}
>   		if (kvm_check_request(KVM_REQ_VM_DEAD, vcpu)) {
>   			r = -EIO;
>   			goto out;
> @@ -10701,6 +10705,10 @@ static int vcpu_run(struct kvm_vcpu *vcpu)
>   			r = vcpu_block(vcpu);
>   		}
>   
> +		/* vcpu exited guest/unblocked because of this request */
> +		if (kvm_test_request(KVM_REQ_USERSPACE_KICK, vcpu))
> +			return -EINTR;
> +
>   		if (r <= 0)
>   			break;
>   
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index ae0240928a4a..13fa7229b85d 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -3431,6 +3431,8 @@ static int kvm_vcpu_check_block(struct kvm_vcpu *vcpu)
>   		goto out;
>   	if (kvm_check_request(KVM_REQ_UNBLOCK, vcpu))
>   		goto out;
> +	if (kvm_test_request(KVM_REQ_USERSPACE_KICK, vcpu))
> +		goto out;
>   
>   	ret = 0;
>   out:
> @@ -4668,6 +4670,25 @@ static long kvm_vm_ioctl(struct file *filp,
>   		r = kvm_vm_ioctl_enable_cap_generic(kvm, &cap);
>   		break;
>   	}
> +	case KVM_KICK_ALL_RUNNING_VCPUS: {
> +		/*
> +		 * Notify all running vcpus that they have to stop.
> +		 * Caught in kvm_arch_vcpu_ioctl_run()
> +		 */
> +		kvm_make_all_cpus_request(kvm, KVM_REQ_USERSPACE_KICK);
> +
> +		/*
> +		 * Use wr semaphore to wait for all vcpus to exit from KVM_RUN.
> +		 */
> +		down_write(&memory_transaction);
> +		up_write(&memory_transaction);
> +		break;
> +	}
> +	case KVM_RESUME_ALL_KICKED_VCPUS: {
> +		/* Remove all requests sent with KVM_KICK_ALL_RUNNING_VCPUS */
> +		kvm_clear_all_cpus_request(kvm, KVM_REQ_USERSPACE_KICK);
> +		break;
> +	}
>   	case KVM_SET_USER_MEMORY_REGION: {
>   		struct kvm_userspace_memory_region kvm_userspace_mem;
>   



* Re: [PATCH 3/4] KVM: introduce memory transaction semaphore
  2022-10-22 15:48 ` [PATCH 3/4] KVM: introduce memory transaction semaphore Emanuele Giuseppe Esposito
@ 2022-10-23 17:50   ` Paolo Bonzini
  2022-10-24 12:57     ` Emanuele Giuseppe Esposito
  0 siblings, 1 reply; 22+ messages in thread
From: Paolo Bonzini @ 2022-10-23 17:50 UTC (permalink / raw)
  To: Emanuele Giuseppe Esposito, kvm
  Cc: Jonathan Corbet, Maxim Levitsky, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	David Hildenbrand, x86, H. Peter Anvin, linux-doc, linux-kernel

On 10/22/22 17:48, Emanuele Giuseppe Esposito wrote:
> +static DECLARE_RWSEM(memory_transaction);

This cannot be global, it must be per-struct kvm.  Otherwise one VM can 
keep the rwsem indefinitely while a second VM hangs in 
KVM_KICK_ALL_RUNNING_VCPUS.

It can also be changed to an SRCU (with the down_write+up_write sequence 
changed to synchronize_srcu_expedited) which has similar characteristics 
to your use of the rwsem.
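
Roughly, the SRCU variant would look like this (the field name is only
illustrative):

  /* per-VM, e.g. a new field in struct kvm */
  struct srcu_struct memory_transaction_srcu;

  /* vcpu side (reader) */
  idx = srcu_read_lock(&kvm->memory_transaction_srcu);
  r = kvm_arch_vcpu_ioctl_run(vcpu);
  srcu_read_unlock(&kvm->memory_transaction_srcu, idx);

  /* KVM_KICK_ALL_RUNNING_VCPUS side: wait for all readers to drain */
  kvm_make_all_cpus_request(kvm, KVM_REQ_USERSPACE_KICK);
  synchronize_srcu_expedited(&kvm->memory_transaction_srcu);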

Paolo



* Re: [PATCH 4/4] KVM: use signals to abort enter_guest/blocking and retry
  2022-10-23 17:48   ` Paolo Bonzini
@ 2022-10-24  7:43     ` Emanuele Giuseppe Esposito
  2022-10-24  7:49       ` Emanuele Giuseppe Esposito
  2022-10-25 10:05       ` Paolo Bonzini
  0 siblings, 2 replies; 22+ messages in thread
From: Emanuele Giuseppe Esposito @ 2022-10-24  7:43 UTC (permalink / raw)
  To: Paolo Bonzini, kvm
  Cc: Jonathan Corbet, Maxim Levitsky, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	David Hildenbrand, x86, H. Peter Anvin, linux-doc, linux-kernel



On 23/10/2022 at 19:48, Paolo Bonzini wrote:
> On 10/22/22 17:48, Emanuele Giuseppe Esposito wrote:
>> Once a vcpu exectues KVM_RUN, it could enter two states:
>> enter guest mode, or block/halt.
>> Use a signal to allow a vcpu to exit the guest state or unblock,
>> so that it can exit KVM_RUN and release the read semaphore,
>> allowing a pending KVM_KICK_ALL_RUNNING_VCPUS to continue.
>>
>> Note that the signal is not deleted and used to propagate the
>> exit reason till vcpu_run(). It will be clearead only by
>> KVM_RESUME_ALL_KICKED_VCPUS. This allows the vcpu to keep try
>> entering KVM_RUN and perform again all checks done in
>> kvm_arch_vcpu_ioctl_run() before entering the guest state,
>> where it will return again if the request is still set.
>>
>> However, the userspace hypervisor should also try to avoid
>> continuously calling KVM_RUN after invoking KVM_KICK_ALL_RUNNING_VCPUS,
>> because such call will just translate in a back-to-back down_read()
>> and up_read() (thanks to the signal).
> 
> Since the userspace should anyway avoid going into this effectively-busy
> wait, what about clearing the request after the first exit?  The
> cancellation ioctl can be kept for vCPUs that are never entered after
> KVM_KICK_ALL_RUNNING_VCPUS.  Alternatively, kvm_clear_all_cpus_request
> could be done right before up_write().

Clearing makes sense, but should we "trust" the userspace not to go into
busy wait?
What's the typical "contract" between KVM and the userspace? Meaning,
should we cover basic usage mistakes like busy waiting on
KVM_RUN?

If we don't, I can add a comment when clearing and of course also
mention it in the API documentation (that I forgot to update, sorry :D)

Emanuele

> 
> Paolo
> 
>> Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
>> ---
>>   arch/x86/include/asm/kvm_host.h |  2 ++
>>   arch/x86/kvm/x86.c              |  8 ++++++++
>>   virt/kvm/kvm_main.c             | 21 +++++++++++++++++++++
>>   3 files changed, 31 insertions(+)
>>
>> diff --git a/arch/x86/include/asm/kvm_host.h
>> b/arch/x86/include/asm/kvm_host.h
>> index aa381ab69a19..d5c37f344d65 100644
>> --- a/arch/x86/include/asm/kvm_host.h
>> +++ b/arch/x86/include/asm/kvm_host.h
>> @@ -108,6 +108,8 @@
>>       KVM_ARCH_REQ_FLAGS(30, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
>>   #define KVM_REQ_MMU_FREE_OBSOLETE_ROOTS \
>>       KVM_ARCH_REQ_FLAGS(31, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
>> +#define KVM_REQ_USERSPACE_KICK        \
>> +    KVM_ARCH_REQ_FLAGS(32, KVM_REQUEST_WAIT)
>>     #define
>> CR0_RESERVED_BITS                                               \
>>       (~(unsigned long)(X86_CR0_PE | X86_CR0_MP | X86_CR0_EM |
>> X86_CR0_TS \
>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>> index b0c47b41c264..2af5f427b4e9 100644
>> --- a/arch/x86/kvm/x86.c
>> +++ b/arch/x86/kvm/x86.c
>> @@ -10270,6 +10270,10 @@ static int vcpu_enter_guest(struct kvm_vcpu
>> *vcpu)
>>       }
>>         if (kvm_request_pending(vcpu)) {
>> +        if (kvm_test_request(KVM_REQ_USERSPACE_KICK, vcpu)) {
>> +            r = -EINTR;
>> +            goto out;
>> +        }
>>           if (kvm_check_request(KVM_REQ_VM_DEAD, vcpu)) {
>>               r = -EIO;
>>               goto out;
>> @@ -10701,6 +10705,10 @@ static int vcpu_run(struct kvm_vcpu *vcpu)
>>               r = vcpu_block(vcpu);
>>           }
>>   +        /* vcpu exited guest/unblocked because of this request */
>> +        if (kvm_test_request(KVM_REQ_USERSPACE_KICK, vcpu))
>> +            return -EINTR;
>> +
>>           if (r <= 0)
>>               break;
>>   diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
>> index ae0240928a4a..13fa7229b85d 100644
>> --- a/virt/kvm/kvm_main.c
>> +++ b/virt/kvm/kvm_main.c
>> @@ -3431,6 +3431,8 @@ static int kvm_vcpu_check_block(struct kvm_vcpu
>> *vcpu)
>>           goto out;
>>       if (kvm_check_request(KVM_REQ_UNBLOCK, vcpu))
>>           goto out;
>> +    if (kvm_test_request(KVM_REQ_USERSPACE_KICK, vcpu))
>> +        goto out;
>>         ret = 0;
>>   out:
>> @@ -4668,6 +4670,25 @@ static long kvm_vm_ioctl(struct file *filp,
>>           r = kvm_vm_ioctl_enable_cap_generic(kvm, &cap);
>>           break;
>>       }
>> +    case KVM_KICK_ALL_RUNNING_VCPUS: {
>> +        /*
>> +         * Notify all running vcpus that they have to stop.
>> +         * Caught in kvm_arch_vcpu_ioctl_run()
>> +         */
>> +        kvm_make_all_cpus_request(kvm, KVM_REQ_USERSPACE_KICK);
>> +
>> +        /*
>> +         * Use wr semaphore to wait for all vcpus to exit from KVM_RUN.
>> +         */
>> +        down_write(&memory_transaction);
>> +        up_write(&memory_transaction);
>> +        break;
>> +    }
>> +    case KVM_RESUME_ALL_KICKED_VCPUS: {
>> +        /* Remove all requests sent with KVM_KICK_ALL_RUNNING_VCPUS */
>> +        kvm_clear_all_cpus_request(kvm, KVM_REQ_USERSPACE_KICK);
>> +        break;
>> +    }
>>       case KVM_SET_USER_MEMORY_REGION: {
>>           struct kvm_userspace_memory_region kvm_userspace_mem;
>>   
> 



* Re: [PATCH 4/4] KVM: use signals to abort enter_guest/blocking and retry
  2022-10-24  7:43     ` Emanuele Giuseppe Esposito
@ 2022-10-24  7:49       ` Emanuele Giuseppe Esposito
  2022-10-25 10:05       ` Paolo Bonzini
  1 sibling, 0 replies; 22+ messages in thread
From: Emanuele Giuseppe Esposito @ 2022-10-24  7:49 UTC (permalink / raw)
  To: Paolo Bonzini, kvm
  Cc: Jonathan Corbet, Maxim Levitsky, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	David Hildenbrand, x86, H. Peter Anvin, linux-doc, linux-kernel



On 24/10/2022 at 09:43, Emanuele Giuseppe Esposito wrote:
> 
> 
> On 23/10/2022 at 19:48, Paolo Bonzini wrote:
>> On 10/22/22 17:48, Emanuele Giuseppe Esposito wrote:
>>> Once a vcpu exectues KVM_RUN, it could enter two states:
>>> enter guest mode, or block/halt.
>>> Use a signal to allow a vcpu to exit the guest state or unblock,
>>> so that it can exit KVM_RUN and release the read semaphore,
>>> allowing a pending KVM_KICK_ALL_RUNNING_VCPUS to continue.
>>>
>>> Note that the signal is not deleted and used to propagate the
>>> exit reason till vcpu_run(). It will be clearead only by
>>> KVM_RESUME_ALL_KICKED_VCPUS. This allows the vcpu to keep try
>>> entering KVM_RUN and perform again all checks done in
>>> kvm_arch_vcpu_ioctl_run() before entering the guest state,
>>> where it will return again if the request is still set.
>>>
>>> However, the userspace hypervisor should also try to avoid
>>> continuously calling KVM_RUN after invoking KVM_KICK_ALL_RUNNING_VCPUS,
>>> because such call will just translate in a back-to-back down_read()
>>> and up_read() (thanks to the signal).
>>
>> Since the userspace should anyway avoid going into this effectively-busy
>> wait, what about clearing the request after the first exit?  The
>> cancellation ioctl can be kept for vCPUs that are never entered after
>> KVM_KICK_ALL_RUNNING_VCPUS.  Alternatively, kvm_clear_all_cpus_request
>> could be done right before up_write().
> 
> Clearing makes sense, but should we "trust" the userspace not to go into
> busy wait?

Actually since this change is just a s/test/check, I would rather keep
test. If userspace does things wrong, this mechanism would still work
properly.

> What's the typical "contract" between KVM and the userspace? Meaning,
> should we cover the basic usage mistakes like forgetting to busy wait on
> KVM_RUN?
> 
> If we don't, I can add a comment when clearing and of course also
> mention it in the API documentation (that I forgot to update, sorry :D)
> 
> Emanuele
> 
>>
>> Paolo
>>
>>> Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
>>> ---
>>>   arch/x86/include/asm/kvm_host.h |  2 ++
>>>   arch/x86/kvm/x86.c              |  8 ++++++++
>>>   virt/kvm/kvm_main.c             | 21 +++++++++++++++++++++
>>>   3 files changed, 31 insertions(+)
>>>
>>> diff --git a/arch/x86/include/asm/kvm_host.h
>>> b/arch/x86/include/asm/kvm_host.h
>>> index aa381ab69a19..d5c37f344d65 100644
>>> --- a/arch/x86/include/asm/kvm_host.h
>>> +++ b/arch/x86/include/asm/kvm_host.h
>>> @@ -108,6 +108,8 @@
>>>       KVM_ARCH_REQ_FLAGS(30, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
>>>   #define KVM_REQ_MMU_FREE_OBSOLETE_ROOTS \
>>>       KVM_ARCH_REQ_FLAGS(31, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
>>> +#define KVM_REQ_USERSPACE_KICK        \
>>> +    KVM_ARCH_REQ_FLAGS(32, KVM_REQUEST_WAIT)
>>>     #define
>>> CR0_RESERVED_BITS                                               \
>>>       (~(unsigned long)(X86_CR0_PE | X86_CR0_MP | X86_CR0_EM |
>>> X86_CR0_TS \
>>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>>> index b0c47b41c264..2af5f427b4e9 100644
>>> --- a/arch/x86/kvm/x86.c
>>> +++ b/arch/x86/kvm/x86.c
>>> @@ -10270,6 +10270,10 @@ static int vcpu_enter_guest(struct kvm_vcpu
>>> *vcpu)
>>>       }
>>>         if (kvm_request_pending(vcpu)) {
>>> +        if (kvm_test_request(KVM_REQ_USERSPACE_KICK, vcpu)) {
>>> +            r = -EINTR;
>>> +            goto out;
>>> +        }
>>>           if (kvm_check_request(KVM_REQ_VM_DEAD, vcpu)) {
>>>               r = -EIO;
>>>               goto out;
>>> @@ -10701,6 +10705,10 @@ static int vcpu_run(struct kvm_vcpu *vcpu)
>>>               r = vcpu_block(vcpu);
>>>           }
>>>   +        /* vcpu exited guest/unblocked because of this request */
>>> +        if (kvm_test_request(KVM_REQ_USERSPACE_KICK, vcpu))
>>> +            return -EINTR;
>>> +
>>>           if (r <= 0)
>>>               break;
>>>   diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
>>> index ae0240928a4a..13fa7229b85d 100644
>>> --- a/virt/kvm/kvm_main.c
>>> +++ b/virt/kvm/kvm_main.c
>>> @@ -3431,6 +3431,8 @@ static int kvm_vcpu_check_block(struct kvm_vcpu
>>> *vcpu)
>>>           goto out;
>>>       if (kvm_check_request(KVM_REQ_UNBLOCK, vcpu))
>>>           goto out;
>>> +    if (kvm_test_request(KVM_REQ_USERSPACE_KICK, vcpu))
>>> +        goto out;
>>>         ret = 0;
>>>   out:
>>> @@ -4668,6 +4670,25 @@ static long kvm_vm_ioctl(struct file *filp,
>>>           r = kvm_vm_ioctl_enable_cap_generic(kvm, &cap);
>>>           break;
>>>       }
>>> +    case KVM_KICK_ALL_RUNNING_VCPUS: {
>>> +        /*
>>> +         * Notify all running vcpus that they have to stop.
>>> +         * Caught in kvm_arch_vcpu_ioctl_run()
>>> +         */
>>> +        kvm_make_all_cpus_request(kvm, KVM_REQ_USERSPACE_KICK);
>>> +
>>> +        /*
>>> +         * Use wr semaphore to wait for all vcpus to exit from KVM_RUN.
>>> +         */
>>> +        down_write(&memory_transaction);
>>> +        up_write(&memory_transaction);
>>> +        break;
>>> +    }
>>> +    case KVM_RESUME_ALL_KICKED_VCPUS: {
>>> +        /* Remove all requests sent with KVM_KICK_ALL_RUNNING_VCPUS */
>>> +        kvm_clear_all_cpus_request(kvm, KVM_REQ_USERSPACE_KICK);
>>> +        break;
>>> +    }
>>>       case KVM_SET_USER_MEMORY_REGION: {
>>>           struct kvm_userspace_memory_region kvm_userspace_mem;
>>>   
>>



* Re: [PATCH 0/4] KVM: API to block and resume all running vcpus in a vm
  2022-10-22 15:48 [PATCH 0/4] KVM: API to block and resume all running vcpus in a vm Emanuele Giuseppe Esposito
                   ` (3 preceding siblings ...)
  2022-10-22 15:48 ` [PATCH 4/4] KVM: use signals to abort enter_guest/blocking and retry Emanuele Giuseppe Esposito
@ 2022-10-24  7:56 ` Christian Borntraeger
  2022-10-24  8:33   ` Emanuele Giuseppe Esposito
  4 siblings, 1 reply; 22+ messages in thread
From: Christian Borntraeger @ 2022-10-24  7:56 UTC (permalink / raw)
  To: Emanuele Giuseppe Esposito, kvm
  Cc: Paolo Bonzini, Jonathan Corbet, Maxim Levitsky,
	Sean Christopherson, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, David Hildenbrand, x86,
	H. Peter Anvin, linux-doc, linux-kernel

On 22.10.22 at 17:48, Emanuele Giuseppe Esposito wrote:
> This new API allows the userspace to stop all running
> vcpus using KVM_KICK_ALL_RUNNING_VCPUS ioctl, and resume them with
> KVM_RESUME_ALL_KICKED_VCPUS.
> A "running" vcpu is a vcpu that is executing the KVM_RUN ioctl.
> 
> This serie is especially helpful to userspace hypervisors like
> QEMU when they need to perform operations on memslots without the
> risk of having a vcpu reading them in the meanwhile.
> With "memslots operations" we mean grow, shrink, merge and split
> memslots, which are not "atomic" because there is a time window
> between the DELETE memslot operation and the CREATE one.
> Currently, each memslot operation is performed with one or more
> ioctls.
> For example, merging two memslots into one would imply:
> DELETE(m1)
> DELETE(m2)
> CREATE(m1+m2)
> 
> And a vcpu could attempt to read m2 right after it is deleted, but
> before the new one is created.
> 
> Therefore the simplest solution is to pause all vcpus in the kvm
> side, so that:
> - userspace just needs to call the new API before making memslots
> changes, keeping modifications to the minimum
> - dirty page updates are also performed when vcpus are blocked, so
> there is no time window between the dirty page ioctl and memslots
> modifications, since vcpus are all stopped.
> - no need to modify the existing memslots API
Isn't QEMU able to achieve the same goal today by forcing all vCPUs
into userspace with a signal? Can you provide some rationale why this
is better in the cover letter or patch description?
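
For reference, the signal-based approach would look roughly like this on
the VMM side (an illustrative sketch; kick_handler, vcpu_thread[] and
nr_vcpus are placeholder names), with the VMM still having to wait on its
own for every thread to actually leave KVM_RUN:

  /* No-op handler: a pending, unblocked signal makes KVM_RUN return -EINTR. */
  static void kick_handler(int sig)
  {
  }

  /* once, in every vCPU thread */
  struct sigaction sa = { .sa_handler = kick_handler };
  sigaction(SIGUSR1, &sa, NULL);

  /* to force all vCPUs back into userspace */
  for (i = 0; i < nr_vcpus; i++)
          pthread_kill(vcpu_thread[i], SIGUSR1);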


* Re: [PATCH 0/4] KVM: API to block and resume all running vcpus in a vm
  2022-10-24  7:56 ` [PATCH 0/4] KVM: API to block and resume all running vcpus in a vm Christian Borntraeger
@ 2022-10-24  8:33   ` Emanuele Giuseppe Esposito
  2022-10-24  9:09     ` Christian Borntraeger
  0 siblings, 1 reply; 22+ messages in thread
From: Emanuele Giuseppe Esposito @ 2022-10-24  8:33 UTC (permalink / raw)
  To: Christian Borntraeger, kvm
  Cc: Paolo Bonzini, Jonathan Corbet, Maxim Levitsky,
	Sean Christopherson, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, David Hildenbrand, x86,
	H. Peter Anvin, linux-doc, linux-kernel



On 24/10/2022 at 09:56, Christian Borntraeger wrote:
> On 22.10.22 at 17:48, Emanuele Giuseppe Esposito wrote:
>> This new API allows the userspace to stop all running
>> vcpus using KVM_KICK_ALL_RUNNING_VCPUS ioctl, and resume them with
>> KVM_RESUME_ALL_KICKED_VCPUS.
>> A "running" vcpu is a vcpu that is executing the KVM_RUN ioctl.
>>
>> This serie is especially helpful to userspace hypervisors like
>> QEMU when they need to perform operations on memslots without the
>> risk of having a vcpu reading them in the meanwhile.
>> With "memslots operations" we mean grow, shrink, merge and split
>> memslots, which are not "atomic" because there is a time window
>> between the DELETE memslot operation and the CREATE one.
>> Currently, each memslot operation is performed with one or more
>> ioctls.
>> For example, merging two memslots into one would imply:
>> DELETE(m1)
>> DELETE(m2)
>> CREATE(m1+m2)
>>
>> And a vcpu could attempt to read m2 right after it is deleted, but
>> before the new one is created.
>>
>> Therefore the simplest solution is to pause all vcpus in the kvm
>> side, so that:
>> - userspace just needs to call the new API before making memslots
>> changes, keeping modifications to the minimum
>> - dirty page updates are also performed when vcpus are blocked, so
>> there is no time window between the dirty page ioctl and memslots
>> modifications, since vcpus are all stopped.
>> - no need to modify the existing memslots API
> Isnt QEMU able to achieve the same goal today by forcing all vCPUs
> into userspace with a signal? Can you provide some rationale why this
> is better in the cover letter or patch description?
> 
David Hildenbrand tried to propose something similar here:
https://github.com/davidhildenbrand/qemu/commit/86b1bf546a8d00908e33f7362b0b61e2be8dbb7a

While it is not optimized, I think it's more complex than the current
series, since QEMU would also have to make sure all running ioctls finish
and prevent new ones from being executed.

Also we can't use pause_all_vcpus()/resume_all_vcpus() because they drop
the BQL.

Would that be ok as rationale?

Thank you,
Emanuele



* Re: [PATCH 0/4] KVM: API to block and resume all running vcpus in a vm
  2022-10-24  8:33   ` Emanuele Giuseppe Esposito
@ 2022-10-24  9:09     ` Christian Borntraeger
  2022-10-24 22:45       ` Sean Christopherson
  0 siblings, 1 reply; 22+ messages in thread
From: Christian Borntraeger @ 2022-10-24  9:09 UTC (permalink / raw)
  To: Emanuele Giuseppe Esposito, kvm
  Cc: Paolo Bonzini, Jonathan Corbet, Maxim Levitsky,
	Sean Christopherson, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, David Hildenbrand, x86,
	H. Peter Anvin, linux-doc, linux-kernel



On 24.10.22 at 10:33, Emanuele Giuseppe Esposito wrote:
> 
> 
> On 24/10/2022 at 09:56, Christian Borntraeger wrote:
> > On 22.10.22 at 17:48, Emanuele Giuseppe Esposito wrote:
>>> This new API allows the userspace to stop all running
>>> vcpus using KVM_KICK_ALL_RUNNING_VCPUS ioctl, and resume them with
>>> KVM_RESUME_ALL_KICKED_VCPUS.
>>> A "running" vcpu is a vcpu that is executing the KVM_RUN ioctl.
>>>
>>> This serie is especially helpful to userspace hypervisors like
>>> QEMU when they need to perform operations on memslots without the
>>> risk of having a vcpu reading them in the meanwhile.
>>> With "memslots operations" we mean grow, shrink, merge and split
>>> memslots, which are not "atomic" because there is a time window
>>> between the DELETE memslot operation and the CREATE one.
>>> Currently, each memslot operation is performed with one or more
>>> ioctls.
>>> For example, merging two memslots into one would imply:
>>> DELETE(m1)
>>> DELETE(m2)
>>> CREATE(m1+m2)
>>>
>>> And a vcpu could attempt to read m2 right after it is deleted, but
>>> before the new one is created.
>>>
>>> Therefore the simplest solution is to pause all vcpus in the kvm
>>> side, so that:
>>> - userspace just needs to call the new API before making memslots
>>> changes, keeping modifications to the minimum
>>> - dirty page updates are also performed when vcpus are blocked, so
>>> there is no time window between the dirty page ioctl and memslots
>>> modifications, since vcpus are all stopped.
>>> - no need to modify the existing memslots API
>> Isnt QEMU able to achieve the same goal today by forcing all vCPUs
>> into userspace with a signal? Can you provide some rationale why this
>> is better in the cover letter or patch description?
>>
> David Hildenbrand tried to propose something similar here:
> https://github.com/davidhildenbrand/qemu/commit/86b1bf546a8d00908e33f7362b0b61e2be8dbb7a
> 
> While it is not optimized, I think it's more complex that the current
> serie, since qemu should also make sure all running ioctls finish and
> prevent the new ones from getting executed.
> 
> Also we can't use pause_all_vcpus()/resume_all_vcpus() because they drop
> the BQL.
> 
> Would that be ok as rationale?

Yes that helps and should be part of the cover letter for the next iterations.


* Re: [PATCH 3/4] KVM: introduce memory transaction semaphore
  2022-10-23 17:50   ` Paolo Bonzini
@ 2022-10-24 12:57     ` Emanuele Giuseppe Esposito
  2022-10-25 10:01       ` Paolo Bonzini
  0 siblings, 1 reply; 22+ messages in thread
From: Emanuele Giuseppe Esposito @ 2022-10-24 12:57 UTC (permalink / raw)
  To: Paolo Bonzini, kvm
  Cc: Jonathan Corbet, Maxim Levitsky, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	David Hildenbrand, x86, H. Peter Anvin, linux-doc, linux-kernel



On 23/10/2022 at 19:50, Paolo Bonzini wrote:
> On 10/22/22 17:48, Emanuele Giuseppe Esposito wrote:
>> +static DECLARE_RWSEM(memory_transaction);
> 
> This cannot be global, it must be per-struct kvm.  Otherwise one VM can
> keep the rwsem indefinitely while a second VM hangs in
> KVM_KICK_ALL_RUNNING_VCPUS.
> 
> It can also be changed to an SRCU (with the down_write+up_write sequence
> changed to synchronize_srcu_expedited) which has similar characteristics
> to your use of the rwsem.
> 

Makes sense, but why synchronize_srcu_expedited and not synchronize_srcu?

Thank you,
Emanuele



* Re: [PATCH 0/4] KVM: API to block and resume all running vcpus in a vm
  2022-10-24  9:09     ` Christian Borntraeger
@ 2022-10-24 22:45       ` Sean Christopherson
  2022-10-25  9:33         ` Paolo Bonzini
  0 siblings, 1 reply; 22+ messages in thread
From: Sean Christopherson @ 2022-10-24 22:45 UTC (permalink / raw)
  To: Christian Borntraeger
  Cc: Emanuele Giuseppe Esposito, kvm, Paolo Bonzini, Jonathan Corbet,
	Maxim Levitsky, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, David Hildenbrand, x86, H. Peter Anvin, linux-doc,
	linux-kernel

On Mon, Oct 24, 2022, Christian Borntraeger wrote:
> On 24.10.22 at 10:33, Emanuele Giuseppe Esposito wrote:
> > On 24/10/2022 at 09:56, Christian Borntraeger wrote:
> > > > Therefore the simplest solution is to pause all vcpus in the kvm
> > > > side, so that:

Simplest for QEMU maybe, most definitely not simplest for KVM.

> > > > - userspace just needs to call the new API before making memslots
> > > > changes, keeping modifications to the minimum
> > > > - dirty page updates are also performed when vcpus are blocked, so
> > > > there is no time window between the dirty page ioctl and memslots
> > > > modifications, since vcpus are all stopped.
> > > > - no need to modify the existing memslots API
> > > Isnt QEMU able to achieve the same goal today by forcing all vCPUs
> > > into userspace with a signal? Can you provide some rationale why this
> > > is better in the cover letter or patch description?
> > > 
> > David Hildenbrand tried to propose something similar here:
> > https://github.com/davidhildenbrand/qemu/commit/86b1bf546a8d00908e33f7362b0b61e2be8dbb7a
> > 
> > While it is not optimized, I think it's more complex that the current
> > serie, since qemu should also make sure all running ioctls finish and
> > prevent the new ones from getting executed.
> > 
> > Also we can't use pause_all_vcpus()/resume_all_vcpus() because they drop
> > the BQL.
> > 
> > Would that be ok as rationale?
> 
> Yes that helps and should be part of the cover letter for the next iterations.

But that doesn't explain why KVM needs to get involved, it only explains why QEMU
can't use its existing pause_all_vcpus().  I do not understand why this is a
problem QEMU needs KVM's help to solve.


* Re: [PATCH 0/4] KVM: API to block and resume all running vcpus in a vm
  2022-10-24 22:45       ` Sean Christopherson
@ 2022-10-25  9:33         ` Paolo Bonzini
  2022-10-25 15:55           ` Sean Christopherson
  0 siblings, 1 reply; 22+ messages in thread
From: Paolo Bonzini @ 2022-10-25  9:33 UTC (permalink / raw)
  To: Sean Christopherson, Christian Borntraeger
  Cc: Emanuele Giuseppe Esposito, kvm, Jonathan Corbet, Maxim Levitsky,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	David Hildenbrand, x86, H. Peter Anvin, linux-doc, linux-kernel

On 10/25/22 00:45, Sean Christopherson wrote:
>> Yes that helps and should be part of the cover letter for the next iterations.
> But that doesn't explain why KVM needs to get involved, it only explains why QEMU
> can't use its existing pause_all_vcpus().  I do not understand why this is a
> problem QEMU needs KVM's help to solve.

I agree that it's not KVM's problem that QEMU cannot use 
pause_all_vcpus().  Having an ioctl in KVM, rather than coding the same 
in QEMU, is *mostly* a matter of programmer and computer efficiency 
because the code is pretty simple.

That said, I believe the limited memslot API makes it more than just a 
QEMU problem.  Because KVM_GET_DIRTY_LOG cannot be combined atomically 
with KVM_SET_USER_MEMORY_REGION(MR_DELETE), any VMM that uses dirty-log 
regions while the VM is running is liable to lose the dirty status of 
some pages.  That's also a reason to provide this API in KVM.

Paolo



* Re: [PATCH 3/4] KVM: introduce memory transaction semaphore
  2022-10-24 12:57     ` Emanuele Giuseppe Esposito
@ 2022-10-25 10:01       ` Paolo Bonzini
  0 siblings, 0 replies; 22+ messages in thread
From: Paolo Bonzini @ 2022-10-25 10:01 UTC (permalink / raw)
  To: Emanuele Giuseppe Esposito, kvm
  Cc: Jonathan Corbet, Maxim Levitsky, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	David Hildenbrand, x86, H. Peter Anvin, linux-doc, linux-kernel

On 10/24/22 14:57, Emanuele Giuseppe Esposito wrote:
> 
> 
> On 23/10/2022 at 19:50, Paolo Bonzini wrote:
>> On 10/22/22 17:48, Emanuele Giuseppe Esposito wrote:
>>> +static DECLARE_RWSEM(memory_transaction);
>>
>> This cannot be global, it must be per-struct kvm.  Otherwise one VM can
>> keep the rwsem indefinitely while a second VM hangs in
>> KVM_KICK_ALL_RUNNING_VCPUS.
>>
>> It can also be changed to an SRCU (with the down_write+up_write sequence
>> changed to synchronize_srcu_expedited) which has similar characteristics
>> to your use of the rwsem.
>>
> 
> Makes sense, but why synchronize_srcu_expedited and not synchronize_srcu?

Because (thanks to the kick) you expect the grace period to end almost 
immediately, while synchronize_srcu() would noticeably slow down the changes 
to the memory map.

Paolo



* Re: [PATCH 4/4] KVM: use signals to abort enter_guest/blocking and retry
  2022-10-24  7:43     ` Emanuele Giuseppe Esposito
  2022-10-24  7:49       ` Emanuele Giuseppe Esposito
@ 2022-10-25 10:05       ` Paolo Bonzini
  1 sibling, 0 replies; 22+ messages in thread
From: Paolo Bonzini @ 2022-10-25 10:05 UTC (permalink / raw)
  To: Emanuele Giuseppe Esposito, kvm
  Cc: Jonathan Corbet, Maxim Levitsky, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	David Hildenbrand, x86, H. Peter Anvin, linux-doc, linux-kernel

On 10/24/22 09:43, Emanuele Giuseppe Esposito wrote:
>> Since the userspace should anyway avoid going into this effectively-busy
>> wait, what about clearing the request after the first exit?  The
>> cancellation ioctl can be kept for vCPUs that are never entered after
>> KVM_KICK_ALL_RUNNING_VCPUS.  Alternatively, kvm_clear_all_cpus_request
>> could be done right before up_write().
>
> Clearing makes sense, but should we "trust" the userspace not to go into
> busy wait?

I think so; there are many other ways for userspace to screw up.

> What's the typical "contract" between KVM and the userspace? Meaning,
> should we cover the basic usage mistakes like forgetting to busy wait on
> KVM_RUN?

Being able to remove the second ioctl if you do (sort-of pseudocode 
based on this v1)

	kvm_make_all_cpus_request(kvm, KVM_REQ_USERSPACE_KICK);
	down_write(&kvm->memory_transaction);
	up_write(&kvm->memory_transaction);
	kvm_clear_all_cpus_request(kvm, KVM_REQ_USERSPACE_KICK);

would be worth it, I think.

Paolo



* Re: [PATCH 0/4] KVM: API to block and resume all running vcpus in a vm
  2022-10-25  9:33         ` Paolo Bonzini
@ 2022-10-25 15:55           ` Sean Christopherson
  2022-10-25 21:34             ` Paolo Bonzini
  0 siblings, 1 reply; 22+ messages in thread
From: Sean Christopherson @ 2022-10-25 15:55 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Christian Borntraeger, Emanuele Giuseppe Esposito, kvm,
	Jonathan Corbet, Maxim Levitsky, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, David Hildenbrand, x86,
	H. Peter Anvin, linux-doc, linux-kernel

On Tue, Oct 25, 2022, Paolo Bonzini wrote:
> On 10/25/22 00:45, Sean Christopherson wrote:
> > > Yes that helps and should be part of the cover letter for the next iterations.
> > But that doesn't explain why KVM needs to get involved, it only explains why QEMU
> > can't use its existing pause_all_vcpus().  I do not understand why this is a
> > problem QEMU needs KVM's help to solve.
> 
> I agree that it's not KVM's problem that QEMU cannot use pause_all_vcpus().
> Having a ioctl in KVM, rather than coding the same in QEMU, is *mostly* a
> matter of programmer and computer efficiency because the code is pretty
> simple.
> 
> That said, I believe the limited memslot API makes it more than just a QEMU
> problem.  Because KVM_GET_DIRTY_LOG cannot be combined atomically with
> KVM_SET_USER_MEMORY_REGION(MR_DELETE), any VMM that uses dirty-log regions
> while the VM is running is liable to losing the dirty status of some pages.

... and doesn't already do the sane thing and pause vCPUs _and anything else that
can touch guest memory_ before modifying memslots.  I honestly think QEMU is the
only VMM that would ever use this API.

> That's also a reason to provide this API in KVM.

It's frankly a terrible API though.  Providing a way to force vCPUs out of KVM_RUN
is at best half of the solution.  

Userspace still needs:

  - a refcounting scheme to track the number of "holds" put on the system
  - serialization to ensure KVM_RESUME_ALL_KICKED_VCPUS completes before a new
    KVM_KICK_ALL_RUNNING_VCPUS is initiated
  - to prevent _all_ ioctls() because it's not just KVM_RUN that consumes memslots
  - to stop anything else in the system that consumes KVM memslots, e.g. KVM GT
  - to signal vCPU tasks so that the system doesn't livelock if a vCPU is stuck
    outside of KVM, e.g. in get_user_pages_unlocked() (Peter Xu's series)

And because of the nature of KVM, to support this API on all architectures, KVM
needs to make change on all architectures, whereas userspace should be able to
implement a generic solution.


* Re: [PATCH 0/4] KVM: API to block and resume all running vcpus in a vm
  2022-10-25 15:55           ` Sean Christopherson
@ 2022-10-25 21:34             ` Paolo Bonzini
  2022-10-25 23:07               ` Sean Christopherson
  0 siblings, 1 reply; 22+ messages in thread
From: Paolo Bonzini @ 2022-10-25 21:34 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Christian Borntraeger, Emanuele Giuseppe Esposito, kvm,
	Jonathan Corbet, Maxim Levitsky, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, David Hildenbrand, x86,
	H. Peter Anvin, linux-doc, linux-kernel

On 10/25/22 17:55, Sean Christopherson wrote:
> On Tue, Oct 25, 2022, Paolo Bonzini wrote:
>> That said, I believe the limited memslot API makes it more than just a QEMU
>> problem.  Because KVM_GET_DIRTY_LOG cannot be combined atomically with
>> KVM_SET_USER_MEMORY_REGION(MR_DELETE), any VMM that uses dirty-log regions
>> while the VM is running is liable to losing the dirty status of some pages.
> 
> ... and doesn't already do the sane thing and pause vCPUs _and anything else that
> can touch guest memory_ before modifying memslots.  I honestly think QEMU is the
> only VMM that would ever use this API.  Providing a way to force vCPUs out of
> KVM_RUN is at best half of the solution.

I agree this is not a full solution (and I do want to remove 
KVM_RESUME_ALL_KICKED_VCPUS).

>    - a refcounting scheme to track the number of "holds" put on the system
>    - serialization to ensure KVM_RESUME_ALL_KICKED_VCPUS completes before a new
>      KVM_KICK_ALL_RUNNING_VCPUS is initiated

Both of these can be just a mutex; the others are potentially more 
interesting, but I'm not sure I understand them:

>    - to prevent _all_ ioctls() because it's not just KVM_RUN that consumes memslots

This is perhaps an occasion to solve another disagreement: I still think 
that accessing memory outside KVM_RUN (for example KVM_SET_NESTED_STATE 
loading the APICv pages from VMCS12) is a bug; on the other hand, we 
disagreed on that and you wanted to kill KVM_REQ_GET_NESTED_STATE_PAGES.

>    - to stop anything else in the system that consumes KVM memslots, e.g. KVM GT

Is this true if you only look at the KVM_GET_DIRTY_LOG case and consider 
it a guest bug to access the memory (i.e. ignore the strange read-only 
changes which only happen at boot, and which I agree are QEMU-specific)?

>    - to signal vCPU tasks so that the system doesn't livelock if a vCPU is stuck
>      outside of KVM, e.g. in get_user_pages_unlocked() (Peter Xu's series)

This is the more important one but why would it livelock?

> And because of the nature of KVM, to support this API on all architectures, KVM
> needs to make change on all architectures, whereas userspace should be able to
> implement a generic solution.

Yes, I agree that this is essentially just a more efficient kill(). 
Emanuele, perhaps you can put together a patch to x86/vmexit.c in 
kvm-unit-tests, where CPU0 keeps changing memslots and the other CPUs 
are in a for(;;) busy wait, to measure the various ways to do it?

Paolo



* Re: [PATCH 0/4] KVM: API to block and resume all running vcpus in a vm
  2022-10-25 21:34             ` Paolo Bonzini
@ 2022-10-25 23:07               ` Sean Christopherson
  2022-10-26 17:52                 ` Hyper-V VTLs, permission bitmaps and userspace exits (was Re: [PATCH 0/4] KVM: API to block and resume all running vcpus in a vm) Paolo Bonzini
  0 siblings, 1 reply; 22+ messages in thread
From: Sean Christopherson @ 2022-10-25 23:07 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Christian Borntraeger, Emanuele Giuseppe Esposito, kvm,
	Jonathan Corbet, Maxim Levitsky, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, David Hildenbrand, x86,
	H. Peter Anvin, linux-doc, linux-kernel

On Tue, Oct 25, 2022, Paolo Bonzini wrote:
> On 10/25/22 17:55, Sean Christopherson wrote:
> > On Tue, Oct 25, 2022, Paolo Bonzini wrote:
> >    - to prevent _all_ ioctls() because it's not just KVM_RUN that consumes memslots
> 
> This is perhaps an occasion to solve another disagreement: I still think
> that accessing memory outside KVM_RUN (for example KVM_SET_NESTED_STATE
> loading the APICv pages from VMCS12) is a bug, on the other hand we
> disagreed on that and you wanted to kill KVM_REQ_GET_NESTED_STATE_PAGES.

I don't think it's realistic to make accesses outside of KVM_RUN go away, e.g.
see the ARM ITS discussion in the dirty ring thread.  kvm_xen_set_evtchn() also
explicitly depends on writing guest memory without going through KVM_RUN (and
apparently can be invoked from a kernel thread?!?).

In theory, I do actually like the idea of restricting memory access to KVM_RUN,
but in reality I just think that forcing everything into KVM_RUN creates far more
problems than it solves.  E.g. my complaint with KVM_REQ_GET_NESTED_STATE_PAGES
is that instead of synchronously telling userspace it has a problem, KVM chugs
along as if everything is fine and only fails at a later point in time.  I doubt
userspace would actually do anything differently, i.e. the VM is likely hosed no
matter what, but deferring work adds complexity in KVM and makes it more difficult
to debug problems when they occur.

> >    - to stop anything else in the system that consumes KVM memslots, e.g. KVM GT
> 
> Is this true if you only look at the KVM_GET_DIRTY_LOG case and consider it
> a guest bug to access the memory (i.e. ignore the strange read-only changes
> which only happen at boot, and which I agree are QEMU-specific)?

Yes?  I don't know exactly what "the KVM_GET_DIRTY_LOG case" is. 
 
> >    - to signal vCPU tasks so that the system doesn't livelock if a vCPU is stuck
> >      outside of KVM, e.g. in get_user_pages_unlocked() (Peter Xu's series)
> 
> This is the more important one but why would it livelock?

Livelock may not be the right word.  Peter's series is addressing a scenario where
a vCPU gets stuck faulting in a page because the page never arrives over the
network.  The solution is to recognize non-fatal signals while trying to fault in
the page.

KVM_KICK_ALL_RUNNING_VCPUS doesn't handle that case because it's obviously not
realistic to check for pending KVM requests while buried deep in mm/ code.  I.e.
userspace also needs to send SIGUSR1 or whatever to ensure all vCPUs get kicked
out of non-KVM code.

That's not the end of the world, and they probably end up being orthogonal things
in userspace code, but it yields a weird API because KVM_KICK_ALL_RUNNING_VCPUS
ends up with the caveat of "oh, by the way, userspace also needs to signal all
vCPU tasks too, otherwise KVM_KICK_ALL_RUNNING_VCPUS might hang".
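
I.e. roughly this on the userspace side (sketch only; vcpu_threads[] and
n_vcpus stand in for whatever thread bookkeeping the VMM already has):

  #include <pthread.h>
  #include <signal.h>
  #include <string.h>

  /* Deliberately a no-op: the only goal is to interrupt a vCPU task
   * that is stuck in non-KVM code (e.g. faulting in a page) with a
   * non-fatal signal so it backs out and re-checks for pending work. */
  static void kick_handler(int sig)
  {
  }

  static void kick_all_vcpu_threads(pthread_t *vcpu_threads, int n_vcpus)
  {
          struct sigaction sa;

          memset(&sa, 0, sizeof(sa));
          sa.sa_handler = kick_handler;
          sigaction(SIGUSR1, &sa, NULL);  /* normally done once at setup */

          for (int i = 0; i < n_vcpus; i++)
                  pthread_kill(vcpu_threads[i], SIGUSR1);
  }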

> > And because of the nature of KVM, to support this API on all architectures, KVM
> > needs to make changes on all architectures, whereas userspace should be able to
> > implement a generic solution.
> 
> Yes, I agree that this is essentially just a more efficient kill().
> Emanuele, perhaps you can put together a patch to x86/vmexit.c in
> kvm-unit-tests, where CPU0 keeps changing memslots and the other CPUs are in
> a for(;;) busy wait, to measure the various ways to do it?

I'm a bit confused.  Is the goal of this to simplify QEMU, dedup VMM code, provide
a more performant solution, something else entirely?  I.e. why measure the
performance of x86/vmexit.c?  I have a hard time believing the overhead of pausing
vCPUs is going to be the long pole when it comes to memslot changes.  I assume
rebuilding KVM's page tables because of the "zap all" behavior would
completely dwarf any overhead from pausing vCPUs.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Hyper-V VTLs, permission bitmaps and userspace exits (was Re: [PATCH 0/4] KVM: API to block and resume all running vcpus in a vm)
  2022-10-25 23:07               ` Sean Christopherson
@ 2022-10-26 17:52                 ` Paolo Bonzini
  2022-10-26 19:33                   ` Sean Christopherson
  0 siblings, 1 reply; 22+ messages in thread
From: Paolo Bonzini @ 2022-10-26 17:52 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Christian Borntraeger, Emanuele Giuseppe Esposito, kvm,
	Jonathan Corbet, Maxim Levitsky, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, David Hildenbrand, x86,
	H. Peter Anvin, linux-doc, linux-kernel

On 10/26/22 01:07, Sean Christopherson wrote:
> I don't think it's realistic to make accesses outside of KVM_RUN go away, e.g.
> see the ARM ITS discussion in the dirty ring thread.  kvm_xen_set_evtchn() also
> explicitly depends on writing guest memory without going through KVM_RUN (and
> apparently can be invoked from a kernel thread?!?).

Yeah, those are the pages that must be considered dirty when using the 
dirty ring.

> In theory, I do actually like the idea of restricting memory access to KVM_RUN,
> but in reality I just think that forcing everything into KVM_RUN creates far more
> problems than it solves.  E.g. my complaint with KVM_REQ_GET_NESTED_STATE_PAGES
> is that instead of synchronously telling userspace it has a problem, KVM chugs
> along as if everything is fine and only fails at a later point in time.  I doubt
> userspace would actually do anything differently, i.e. the VM is likely hosed no
> matter what, but deferring work adds complexity in KVM and makes it more difficult
> to debug problems when they occur.
>
>>>     - to stop anything else in the system that consumes KVM memslots, e.g. KVM GT
>>
>> Is this true if you only look at the KVM_GET_DIRTY_LOG case and consider it
>> a guest bug to access the memory (i.e. ignore the strange read-only changes
>> which only happen at boot, and which I agree are QEMU-specific)?
> 
> Yes?  I don't know exactly what "the KVM_GET_DIRTY_LOG case" is.

It is not possible to atomically read the dirty bitmap and delete a 
memslot.  When you delete a memslot, the bitmap is gone.  In this case 
however memory accesses to the deleted memslot are a guest bug, so 
stopping KVM-GT would not be necessary.

So while I'm being slowly convinced that QEMU should find a way to pause 
its vCPUs around memslot changes, I'm not sure that pausing everything 
is needed in general.

>>> And because of the nature of KVM, to support this API on all architectures, KVM
>>> needs to make changes on all architectures, whereas userspace should be able to
>>> implement a generic solution.
>>
>> Yes, I agree that this is essentially just a more efficient kill().
>> Emanuele, perhaps you can put together a patch to x86/vmexit.c in
>> kvm-unit-tests, where CPU0 keeps changing memslots and the other CPUs are in
>> a for(;;) busy wait, to measure the various ways to do it?
> 
> I'm a bit confused.  Is the goal of this to simplify QEMU, dedup VMM code, provide
> a more performant solution, something else entirely?

Well, a bit of all of them and perhaps that's the problem.  And while 
the issues at hand *are* self-inflicted wounds on the part of QEMU, it seems 
to me that the underlying issues are general.

For example, Alex Graf and I looked back at your proposal of a userspace 
exit for "bad" accesses to memory, wondering if it could help with 
Hyper-V VTLs too.  To recap, the "higher privileged" code at VTL1 can 
set up VM-wide restrictions on access to some pages through a hypercall 
(HvModifyVtlProtectionMask).  After the hypercall, VTL0 would not be 
able to access those pages.  The hypercall would be handled in userspace 
and would invoke a KVM_SET_MEMORY_REGION_PERM ioctl to restrict the RWX 
permissions, and this ioctl would set up a VM-wide permission bitmap 
that would be used when building page tables.
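
Purely as a strawman, since no such ioctl exists today, the uapi could
look something like this (names, numbers and layout are entirely made
up):

  /* Hypothetical -- nothing below exists in the kernel, it only
   * illustrates the shape of the proposal. */
  struct kvm_memory_region_perm {
          __u64 guest_phys_addr;  /* start of the range, page aligned */
          __u64 size;             /* length in bytes, page aligned */
          __u32 perm;             /* KVM_MEM_PERM_{R,W,X} bits */
          __u32 pad;
  };

  #define KVM_SET_MEMORY_REGION_PERM \
          _IOW(KVMIO, 0xd0, struct kvm_memory_region_perm)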

Using such a bitmap instead of memslots makes it possible to cause 
userspace vmexits on VTL mapping violations with efficient data 
structures.  And it would also be possible to use this mechanism around 
KVM_GET_DIRTY_LOG, to read the KVM dirty bitmap just before removing a 
memslot.

However, external accesses to the regions (ITS, Xen, KVM-GT, non KVM_RUN 
ioctls) would not be blocked, due to the lack of a way to report the 
exit.  The intersection of these features with VTLs should be very small 
(sometimes zero since VTLs are x86 only), but the ioctls would be a 
problem so I'm wondering what your thoughts are on this.

Also, while the exit API could be the same, it is not clear to me that 
the permission bitmap would be a good match for entirely "void" memslots 
used to work around non-atomic memslot changes.  So for now let's leave 
this aside and only consider the KVM_GET_DIRTY_LOG case.

Paolo


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Hyper-V VTLs, permission bitmaps and userspace exits (was Re: [PATCH 0/4] KVM: API to block and resume all running vcpus in a vm)
  2022-10-26 17:52                 ` Hyper-V VTLs, permission bitmaps and userspace exits (was Re: [PATCH 0/4] KVM: API to block and resume all running vcpus in a vm) Paolo Bonzini
@ 2022-10-26 19:33                   ` Sean Christopherson
  0 siblings, 0 replies; 22+ messages in thread
From: Sean Christopherson @ 2022-10-26 19:33 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Christian Borntraeger, Emanuele Giuseppe Esposito, kvm,
	Jonathan Corbet, Maxim Levitsky, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, David Hildenbrand, x86,
	H. Peter Anvin, linux-doc, linux-kernel

On Wed, Oct 26, 2022, Paolo Bonzini wrote:
> On 10/26/22 01:07, Sean Christopherson wrote:
> > > >     - to stop anything else in the system that consumes KVM memslots, e.g. KVM GT
> > > 
> > > Is this true if you only look at the KVM_GET_DIRTY_LOG case and consider it
> > > a guest bug to access the memory (i.e. ignore the strange read-only changes
> > > which only happen at boot, and which I agree are QEMU-specific)?
> > 
> > Yes?  I don't know exactly what "the KVM_GET_DIRTY_LOG case" is.
> 
> It is not possible to atomically read the dirty bitmap and delete a memslot.
> When you delete a memslot, the bitmap is gone.  In this case however memory
> accesses to the deleted memslot are a guest bug, so stopping KVM-GT would
> not be necessary.

If accesses to the deleted memslot are a guest bug, why do you care about pausing
vCPUs?  I don't mean to be belligerent, I'm genuinely confused.

> So while I'm being slowly convinced that QEMU should find a way to pause its
> vCPUs around memslot changes, I'm not sure that pausing everything is needed
> in general.
> 
> > > > And because of the nature of KVM, to support this API on all architectures, KVM
> > > > needs to make changes on all architectures, whereas userspace should be able to
> > > > implement a generic solution.
> > > 
> > > Yes, I agree that this is essentially just a more efficient kill().
> > > Emanuele, perhaps you can put together a patch to x86/vmexit.c in
> > > kvm-unit-tests, where CPU0 keeps changing memslots and the other CPUs are in
> > > a for(;;) busy wait, to measure the various ways to do it?
> > 
> > I'm a bit confused.  Is the goal of this to simplify QEMU, dedup VMM code, provide
> > a more performant solution, something else entirely?
> 
> Well, a bit of all of them and perhaps that's the problem.  And while the
> issues at hand *are* self-inflicted wounds on the part of QEMU, it seems to me
> that the underlying issues are general.
> 
> For example, Alex Graf and I looked back at your proposal of a userspace
> exit for "bad" accesses to memory, wondering if it could help with Hyper-V
> VTLs too.  To recap, the "higher privileged" code at VTL1 can set up VM-wide
> restrictions on access to some pages through a hypercall
> (HvModifyVtlProtectionMask).  After the hypercall, VTL0 would not be able to
> access those pages.  The hypercall would be handled in userspace and would
> invoke a KVM_SET_MEMORY_REGION_PERM ioctl to restrict the RWX permissions,
> and this ioctl would set up a VM-wide permission bitmap that would be used
> when building page tables.
> 
> Using such a bitmap instead of memslots makes it possible to cause userspace
> vmexits on VTL mapping violations with efficient data structures.  And it
> would also be possible to use this mechanism around KVM_GET_DIRTY_LOG, to
> read the KVM dirty bitmap just before removing a memslot.

What exactly is the behavior you're trying to achieve for KVM_GET_DIRTY_LOG => delete?
If KVM provides KVM_EXIT_MEMORY_FAULT, can you not achieve the desired behavior by
doing mprotect(PROT_NONE) => KVM_GET_DIRTY_LOG => delete?  If PROT_NONE causes the
memory to be freed, won't mprotect(PROT_READ) do what you want even without
KVM_EXIT_MEMORY_FAULT?
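
I.e. something like the below, as a sketch (error handling omitted;
hva, len, slot and bitmap are whatever the VMM already tracks for that
memslot):

  #include <linux/kvm.h>
  #include <stddef.h>
  #include <sys/ioctl.h>
  #include <sys/mman.h>

  static void dirty_log_then_delete(int vm_fd, int slot, void *hva,
                                    size_t len, void *bitmap)
  {
          struct kvm_dirty_log log = {
                  .slot = slot,
                  .dirty_bitmap = bitmap,
          };
          struct kvm_userspace_memory_region del = {
                  .slot = slot,   /* .memory_size == 0 deletes the slot */
          };

          /* Block writes so nothing can dirty the range after the log
           * is read; what a vCPU access does here is exactly the
           * KVM_EXIT_MEMORY_FAULT question above. */
          mprotect(hva, len, PROT_NONE);

          ioctl(vm_fd, KVM_GET_DIRTY_LOG, &log);
          ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &del);

          /* Restore access for whatever backs the region next. */
          mprotect(hva, len, PROT_READ | PROT_WRITE);
  }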

> However, external accesses to the regions (ITS, Xen, KVM-GT, non KVM_RUN
> ioctls) would not be blocked, due to the lack of a way to report the exit.

Aren't all of those out of scope?  E.g. in a very hypothetical world where XEN's
event channel is being used with VTLs, if VTL1 makes the event channel inaccessible,
that's a guest and/or userspace configuration issue and the guest is hosed no matter
what KVM does.  Ditto for these case where KVM-GT's buffer is blocked.  I'm guessing
the ITS is similar?

> The intersection of these features with VTLs should be very small (sometimes
> zero since VTLs are x86 only), but the ioctls would be a problem so I'm
> wondering what your thoughts are on this.

How do the ioctls() map to VTLs?  I.e. are they considered VTL0, VTL1, out-of-band?

> Also, while the exit API could be the same, it is not clear to me that the
> permission bitmap would be a good match for entirely "void" memslots used to
> work around non-atomic memslot changes.  So for now let's leave this aside
> and only consider the KVM_GET_DIRTY_LOG case.

As above, can't userspace just mprotect() the entire memslot to prevent writes
between getting the dirty log and deleting the memslot?

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2022-10-26 19:33 UTC | newest]

Thread overview: 22+ messages
2022-10-22 15:48 [PATCH 0/4] KVM: API to block and resume all running vcpus in a vm Emanuele Giuseppe Esposito
2022-10-22 15:48 ` [PATCH 1/4] linux-headers/linux/kvm.h: introduce kvm_userspace_memory_region_list ioctl Emanuele Giuseppe Esposito
2022-10-22 15:48 ` [PATCH 2/4] KVM: introduce kvm_clear_all_cpus_request Emanuele Giuseppe Esposito
2022-10-22 15:48 ` [PATCH 3/4] KVM: introduce memory transaction semaphore Emanuele Giuseppe Esposito
2022-10-23 17:50   ` Paolo Bonzini
2022-10-24 12:57     ` Emanuele Giuseppe Esposito
2022-10-25 10:01       ` Paolo Bonzini
2022-10-22 15:48 ` [PATCH 4/4] KVM: use signals to abort enter_guest/blocking and retry Emanuele Giuseppe Esposito
2022-10-23 17:48   ` Paolo Bonzini
2022-10-24  7:43     ` Emanuele Giuseppe Esposito
2022-10-24  7:49       ` Emanuele Giuseppe Esposito
2022-10-25 10:05       ` Paolo Bonzini
2022-10-24  7:56 ` [PATCH 0/4] KVM: API to block and resume all running vcpus in a vm Christian Borntraeger
2022-10-24  8:33   ` Emanuele Giuseppe Esposito
2022-10-24  9:09     ` Christian Borntraeger
2022-10-24 22:45       ` Sean Christopherson
2022-10-25  9:33         ` Paolo Bonzini
2022-10-25 15:55           ` Sean Christopherson
2022-10-25 21:34             ` Paolo Bonzini
2022-10-25 23:07               ` Sean Christopherson
2022-10-26 17:52                 ` Hyper-V VTLs, permission bitmaps and userspace exits (was Re: [PATCH 0/4] KVM: API to block and resume all running vcpus in a vm) Paolo Bonzini
2022-10-26 19:33                   ` Sean Christopherson
