* [PATCH v2 0/5] kvm: deliver async_pf to L1 only
From: Roman Kagan @ 2016-12-12 14:32 UTC (permalink / raw)
  To: Radim Krčmář, Paolo Bonzini, kvm; +Cc: Denis Lunev, Roman Kagan

Async pagefaults should only be injected into L1 guests.

This series ensures that async_pfs are not injected while L2 is executing
on a vCPU, and that L2 exits to L1 as soon as there are ready pages, to
allow L1 to make better scheduling decisions.

Roman Kagan (5):
  kvm/x86: skip async_pf when in guest mode
  kvm: add helper for testing ready async_pf's
  kvm: kick vcpu when async_pf is resolved
  kvm/vmx: kick L2 guest to L1 by ready async_pf
  kvm/svm: kick L2 guest to L1 by ready async_pf

Signed-off-by: Roman Kagan <rkagan@virtuozzo.com>
---
v1 -> v2:
 - dropped bogus patch for async_pf msr check
 - included patch to skip async_pf in guest mode as it disappeared from
   kvm/queue
 - avoid breaking s390 build in patch 3

 include/linux/kvm_host.h |  7 +++++++
 arch/x86/kvm/mmu.c       |  2 +-
 arch/x86/kvm/svm.c       | 10 ++++++++++
 arch/x86/kvm/vmx.c       |  9 +++++----
 arch/x86/kvm/x86.c       |  5 +++--
 virt/kvm/async_pf.c      |  8 ++++++--
 6 files changed, 32 insertions(+), 9 deletions(-)

-- 
2.9.3


* [PATCH v2 1/5] kvm/x86: skip async_pf when in guest mode
From: Roman Kagan @ 2016-12-12 14:32 UTC (permalink / raw)
  To: Radim Krčmář, Paolo Bonzini, kvm; +Cc: Denis Lunev, Roman Kagan

The async pagefault machinery assumes communication with L1 guests only:
all the state -- MSRs, apf area addresses, etc. -- is for L1.  However, it
currently doesn't check whether the vCPU is running L1 or L2, and may
inject a #PF into whatever context is currently executing.

In vmx this just results in crashing the L2 on bogus #PFs and hanging
tasks in L1 due to missing PAGE_READY async_pfs.  To reproduce it, use a
host with swap enabled, run a VM on it, run a nested VM on top, and set
RSS limit for L1 on the host via
/sys/fs/cgroup/memory/machine.slice/machine-*.scope/memory.limit_in_bytes
to swap it out (you may need to tighten and loosen it once or twice, or
create some memory load inside L1).  Very quickly L2 guest starts
receiving pagefaults with bogus %cr2 (apf tokens from the host
actually), and L1 guest starts accumulating tasks stuck in D state in
kvm_async_pf_task_wait.

In svm such #PFs are converted into a #PF vmexit from L2 to L1, which is
then handled by L1 similarly to an ordinary async_pf.  However, this only
works with KVM running in L1; another hypervisor may not expect this
(e.g. VirtualBox asserts on a #PF vmexit when NPT is on).

To avoid that, only do async_pf stuff when executing L1 guest.

Note: this patch only fixes x86; other async_pf-capable arches may also
need something similar.

Signed-off-by: Roman Kagan <rkagan@virtuozzo.com>
---
v1 -> v2:
 - more verbose commit log

 arch/x86/kvm/mmu.c | 2 +-
 arch/x86/kvm/x86.c | 3 ++-
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index d9c7e98..cdafc61 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -3510,7 +3510,7 @@ static bool try_async_pf(struct kvm_vcpu *vcpu, bool prefault, gfn_t gfn,
 	if (!async)
 		return false; /* *pfn has correct page already */
 
-	if (!prefault && can_do_async_pf(vcpu)) {
+	if (!prefault && !is_guest_mode(vcpu) && can_do_async_pf(vcpu)) {
 		trace_kvm_try_async_get_page(gva, gfn);
 		if (kvm_find_async_pf_gfn(vcpu, gfn)) {
 			trace_kvm_async_pf_doublefault(gva, gfn);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 04c5d96..bf11fe4 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -6864,7 +6864,8 @@ static int vcpu_run(struct kvm_vcpu *vcpu)
 			break;
 		}
 
-		kvm_check_async_pf_completion(vcpu);
+		if (!is_guest_mode(vcpu))
+			kvm_check_async_pf_completion(vcpu);
 
 		if (signal_pending(current)) {
 			r = -EINTR;
-- 
2.9.3


* [PATCH v2 2/5] kvm: add helper for testing ready async_pf's
From: Roman Kagan @ 2016-12-12 14:32 UTC (permalink / raw)
  To: Radim Krčmář, Paolo Bonzini, kvm; +Cc: Denis Lunev, Roman Kagan

Add a helper inline function to test whether there are async_pf's ready
on a given vCPU.  There are already 3 call sites for it and I'm about to
add more.

Signed-off-by: Roman Kagan <rkagan@virtuozzo.com>
---
 include/linux/kvm_host.h | 7 +++++++
 arch/x86/kvm/x86.c       | 2 +-
 virt/kvm/async_pf.c      | 4 ++--
 3 files changed, 10 insertions(+), 3 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 01c0b9c..e10516f 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1207,4 +1207,11 @@ static inline bool vcpu_valid_wakeup(struct kvm_vcpu *vcpu)
 }
 #endif /* CONFIG_HAVE_KVM_INVALID_WAKEUPS */
 
+#ifdef CONFIG_KVM_ASYNC_PF
+static inline bool kvm_async_pf_has_ready(struct kvm_vcpu *vcpu)
+{
+	return !list_empty_careful(&vcpu->async_pf.done);
+}
+#endif
+
 #endif
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index bf11fe4..2536561 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -8182,7 +8182,7 @@ void kvm_arch_flush_shadow_memslot(struct kvm *kvm,
 
 static inline bool kvm_vcpu_has_events(struct kvm_vcpu *vcpu)
 {
-	if (!list_empty_careful(&vcpu->async_pf.done))
+	if (kvm_async_pf_has_ready(vcpu))
 		return true;
 
 	if (kvm_apic_has_events(vcpu))
diff --git a/virt/kvm/async_pf.c b/virt/kvm/async_pf.c
index efeceb0a..9cced14 100644
--- a/virt/kvm/async_pf.c
+++ b/virt/kvm/async_pf.c
@@ -159,7 +159,7 @@ void kvm_check_async_pf_completion(struct kvm_vcpu *vcpu)
 {
 	struct kvm_async_pf *work;
 
-	while (!list_empty_careful(&vcpu->async_pf.done) &&
+	while (kvm_async_pf_has_ready(vcpu) &&
 	      kvm_arch_can_inject_async_page_present(vcpu)) {
 		spin_lock(&vcpu->async_pf.lock);
 		work = list_first_entry(&vcpu->async_pf.done, typeof(*work),
@@ -227,7 +227,7 @@ int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu)
 {
 	struct kvm_async_pf *work;
 
-	if (!list_empty_careful(&vcpu->async_pf.done))
+	if (kvm_async_pf_has_ready(vcpu))
 		return 0;
 
 	work = kmem_cache_zalloc(async_pf_cache, GFP_ATOMIC);
-- 
2.9.3


* [PATCH v2 3/5] kvm: kick vcpu when async_pf is resolved
From: Roman Kagan @ 2016-12-12 14:32 UTC (permalink / raw)
  To: Radim Krčmář, Paolo Bonzini, kvm
  Cc: Denis Lunev, Roman Kagan, Christian Borntraeger

When an async_pf is ready, the guest needs to be made aware of it ASAP,
because it may be holding off a higher-priority task pending the async_pf
resolution in favor of a lower-priority one.

In case async_pf's are harvested in vcpu context (x86), we have to not
only wake the vcpu up but also kick it out into the host.

Signed-off-by: Roman Kagan <rkagan@virtuozzo.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
---
v1 -> v2:
 - do not break s390 build by using kvm_vcpu_wakeup (Christian)

 virt/kvm/async_pf.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/virt/kvm/async_pf.c b/virt/kvm/async_pf.c
index 9cced14..b5b9cca 100644
--- a/virt/kvm/async_pf.c
+++ b/virt/kvm/async_pf.c
@@ -105,8 +105,12 @@ static void async_pf_execute(struct work_struct *work)
 	 * This memory barrier pairs with prepare_to_wait's set_current_state()
 	 */
 	smp_mb();
+#ifdef CONFIG_KVM_ASYNC_PF_SYNC
 	if (swait_active(&vcpu->wq))
 		swake_up(&vcpu->wq);
+#else
+	kvm_vcpu_kick(vcpu);
+#endif
 
 	mmput(mm);
 	kvm_put_kvm(vcpu->kvm);
-- 
2.9.3


* [PATCH v2 4/5] kvm/vmx: kick L2 guest to L1 by ready async_pf
From: Roman Kagan @ 2016-12-12 14:32 UTC (permalink / raw)
  To: Radim Krčmář, Paolo Bonzini, kvm; +Cc: Denis Lunev, Roman Kagan

When an async pagefault is resolved, the vCPU may be executing an L2 guest.

In order to allow L1 to make better scheduling decisions in such cases,
make L2 exit to L1 on a fake external interrupt, without actually
injecting it (unless L2 has other reasons to vmexit).

This patch does that for x86/Intel.

Signed-off-by: Roman Kagan <rkagan@virtuozzo.com>
---
 arch/x86/kvm/vmx.c | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 5382b82..fc97489 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -10420,7 +10420,8 @@ static int vmx_check_nested_events(struct kvm_vcpu *vcpu, bool external_intr)
 		return 0;
 	}
 
-	if ((kvm_cpu_has_interrupt(vcpu) || external_intr) &&
+	if ((kvm_cpu_has_interrupt(vcpu) || external_intr ||
+	     kvm_async_pf_has_ready(vcpu)) &&
 	    nested_exit_on_intr(vcpu)) {
 		if (vmx->nested.nested_run_pending)
 			return -EBUSY;
@@ -10772,9 +10773,9 @@ static void nested_vmx_vmexit(struct kvm_vcpu *vcpu, u32 exit_reason,
 	if ((exit_reason == EXIT_REASON_EXTERNAL_INTERRUPT)
 	    && nested_exit_intr_ack_set(vcpu)) {
 		int irq = kvm_cpu_get_interrupt(vcpu);
-		WARN_ON(irq < 0);
-		vmcs12->vm_exit_intr_info = irq |
-			INTR_INFO_VALID_MASK | INTR_TYPE_EXT_INTR;
+		if (irq >= 0)
+			vmcs12->vm_exit_intr_info = irq |
+				INTR_INFO_VALID_MASK | INTR_TYPE_EXT_INTR;
 	}
 
 	trace_kvm_nested_vmexit_inject(vmcs12->vm_exit_reason,
-- 
2.9.3


* [PATCH v2 5/5] kvm/svm: kick L2 guest to L1 by ready async_pf
From: Roman Kagan @ 2016-12-12 14:32 UTC (permalink / raw)
  To: Radim Krčmář, Paolo Bonzini, kvm; +Cc: Denis Lunev, Roman Kagan

When an async pagefault is resolved, the vCPU may be executing an L2 guest.

In order to allow L1 to make better scheduling decisions in such cases,
make L2 exit to L1 on a fake external interrupt, without actually
injecting it (unless L2 has other reasons to vmexit).

This patch does that for x86/AMD.

Signed-off-by: Roman Kagan <rkagan@virtuozzo.com>
---
 arch/x86/kvm/svm.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 8ca1eca..1f6ae15 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -3723,6 +3723,15 @@ static int mwait_interception(struct vcpu_svm *svm)
 	return nop_interception(svm);
 }
 
+static int svm_check_nested_events(struct kvm_vcpu *vcpu, bool external_intr)
+{
+	struct vcpu_svm *svm = to_svm(vcpu);
+
+	if (kvm_async_pf_has_ready(vcpu))
+		nested_svm_intr(svm);
+	return 0;
+}
+
 enum avic_ipi_failure_cause {
 	AVIC_IPI_FAILURE_INVALID_INT_TYPE,
 	AVIC_IPI_FAILURE_TARGET_NOT_RUNNING,
@@ -5406,6 +5415,7 @@ static struct kvm_x86_ops svm_x86_ops __ro_after_init = {
 
 	.check_intercept = svm_check_intercept,
 	.handle_external_intr = svm_handle_external_intr,
+	.check_nested_events = svm_check_nested_events,
 
 	.sched_in = svm_sched_in,
 
-- 
2.9.3


* Re: [PATCH v2 1/5] kvm/x86: skip async_pf when in guest mode
From: Radim Krčmář @ 2016-12-14 21:21 UTC (permalink / raw)
  To: Roman Kagan; +Cc: Paolo Bonzini, kvm, Denis Lunev

2016-12-12 17:32+0300, Roman Kagan:
> Async pagefault machinery assumes communication with L1 guests only: all
> the state -- MSRs, apf area addresses, etc, -- are for L1.  However, it
> currently doesn't check if the vCPU is running L1 or L2, and may inject
> a #PF into whatever context is currently executing.
> 
> In vmx this just results in crashing the L2 on bogus #PFs and hanging
> tasks in L1 due to missing PAGE_READY async_pfs.  To reproduce it, use a
> host with swap enabled, run a VM on it, run a nested VM on top, and set
> RSS limit for L1 on the host via
> /sys/fs/cgroup/memory/machine.slice/machine-*.scope/memory.limit_in_bytes
> to swap it out (you may need to tighten and loosen it once or twice, or
> create some memory load inside L1).  Very quickly L2 guest starts
> receiving pagefaults with bogus %cr2 (apf tokens from the host
> actually), and L1 guest starts accumulating tasks stuck in D state in
> kvm_async_pf_task_wait.
> 
> In svm such #PFs are converted into vmexit from L2 to L1 on #PF which is
> then handled by L1 similar to ordinary async_pf.  However this only
> works with KVM running in L1; another hypervisor may not expect this
> (e.g.  VirtualBox asserts on #PF vmexit when NPT is on).

async_pf is an optional paravirtual device.  It is L1's fault if it
enabled something that it doesn't support ...

AMD's behavior makes sense and already works, therefore I'd like to see
the same on Intel as well.  (I thought that SVM was broken as well,
sorry for my misleading first review.)

> To avoid that, only do async_pf stuff when executing L1 guest.

The good thing is that we are already killing VMX L1 with async_pf, so
regressions don't prevent us from making Intel KVM do the same as AMD:
force a nested VM exit from nested_vmx_check_exception() if the injected
#PF is async_pf and handle the #PF VM exit in L1.
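
A rough sketch of what I mean (completely untested, not against any
particular tree, and the "async_pf_queued" flag below is made up -- it
would have to be set by kvm_arch_async_page_not_present()/_present()
before queueing the #PF):

  static int nested_vmx_check_exception(struct kvm_vcpu *vcpu, unsigned nr)
  {
          struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
          /* hypothetical flag: the queued #PF carries an async_pf token */
          bool apf = nr == PF_VECTOR && vcpu->arch.apf.async_pf_queued;

          if (!(vmcs12->exception_bitmap & (1u << nr)) && !apf)
                  return 0;

          /* force the L2->L1 exit even if L1 doesn't intercept #PF */
          nested_vmx_vmexit(vcpu, EXIT_REASON_EXCEPTION_NMI,
                            vmcs12->vm_exit_intr_info,
                            vmcs12->vm_exit_intr_error_code);
          return 1;
  }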

I remember that you already implemented this and chose not to post it --
were there other problems than asserts in current KVM/VirtualBox?

Thanks.

* Re: [PATCH v2 1/5] kvm/x86: skip async_pf when in guest mode
From: Roman Kagan @ 2016-12-15  6:55 UTC (permalink / raw)
  To: Radim Krčmář; +Cc: Paolo Bonzini, kvm, Denis Lunev

On Wed, Dec 14, 2016 at 10:21:11PM +0100, Radim Krčmář wrote:
> 2016-12-12 17:32+0300, Roman Kagan:
> > Async pagefault machinery assumes communication with L1 guests only: all
> > the state -- MSRs, apf area addresses, etc, -- are for L1.  However, it
> > currently doesn't check if the vCPU is running L1 or L2, and may inject
> > a #PF into whatever context is currently executing.
> > 
> > In vmx this just results in crashing the L2 on bogus #PFs and hanging
> > tasks in L1 due to missing PAGE_READY async_pfs.  To reproduce it, use a
> > host with swap enabled, run a VM on it, run a nested VM on top, and set
> > RSS limit for L1 on the host via
> > /sys/fs/cgroup/memory/machine.slice/machine-*.scope/memory.limit_in_bytes
> > to swap it out (you may need to tighten and loosen it once or twice, or
> > create some memory load inside L1).  Very quickly L2 guest starts
> > receiving pagefaults with bogus %cr2 (apf tokens from the host
> > actually), and L1 guest starts accumulating tasks stuck in D state in
> > kvm_async_pf_task_wait.
> > 
> > In svm such #PFs are converted into vmexit from L2 to L1 on #PF which is
> > then handled by L1 similar to ordinary async_pf.  However this only
> > works with KVM running in L1; another hypervisor may not expect this
> > (e.g.  VirtualBox asserts on #PF vmexit when NPT is on).
> 
> async_pf is an optional paravirtual device.  It is L1's fault if it
> enabled something that it doesn't support ...

async_pf in L1 is enabled by the core Linux; the hypervisor may be
third-party and have no control over it.

> AMD's behavior makes sense and already works, therefore I'd like to see
> the same on Intel as well.  (I thought that SVM was broken as well,
> sorry for my misleading first review.)
> 
> > To avoid that, only do async_pf stuff when executing L1 guest.
> 
> The good thing is that we are already killing VMX L1 with async_pf, so
> regressions don't prevent us from making Intel KVM do the same as AMD:
> force a nested VM exit from nested_vmx_check_exception() if the injected
> #PF is async_pf and handle the #PF VM exit in L1.

I'm not getting your point: the wealth of existing hypervisors running
in L1 which don't take #PF vmexits can be made not to hang or crash
their guests with a not so complex fix in L0 hypervisor.  Why do the
users need to update *both* their L0 and L1 hypervisors instead?

> I remember that you already implemented this and chose not to post it --
> were there other problems than asserts in current KVM/VirtualBox?

You must have confused me with someone else ;) I didn't implement this;
moreover I tend to think that L1 hypervisor cooperation is unnecessary
and the fix can be done in L0 only.

Roman.

* Re: [PATCH v2 1/5] kvm/x86: skip async_pf when in guest mode
From: Radim Krčmář @ 2016-12-15 15:09 UTC (permalink / raw)
  To: Roman Kagan, Paolo Bonzini, kvm, Denis Lunev

2016-12-15 09:55+0300, Roman Kagan:
> On Wed, Dec 14, 2016 at 10:21:11PM +0100, Radim Krčmář wrote:
>> 2016-12-12 17:32+0300, Roman Kagan:
>> > Async pagefault machinery assumes communication with L1 guests only: all
>> > the state -- MSRs, apf area addresses, etc, -- are for L1.  However, it
>> > currently doesn't check if the vCPU is running L1 or L2, and may inject
>> > a #PF into whatever context is currently executing.
>> > 
>> > In vmx this just results in crashing the L2 on bogus #PFs and hanging
>> > tasks in L1 due to missing PAGE_READY async_pfs.  To reproduce it, use a
>> > host with swap enabled, run a VM on it, run a nested VM on top, and set
>> > RSS limit for L1 on the host via
>> > /sys/fs/cgroup/memory/machine.slice/machine-*.scope/memory.limit_in_bytes
>> > to swap it out (you may need to tighten and loosen it once or twice, or
>> > create some memory load inside L1).  Very quickly L2 guest starts
>> > receiving pagefaults with bogus %cr2 (apf tokens from the host
>> > actually), and L1 guest starts accumulating tasks stuck in D state in
>> > kvm_async_pf_task_wait.
>> > 
>> > In svm such #PFs are converted into vmexit from L2 to L1 on #PF which is
>> > then handled by L1 similar to ordinary async_pf.  However this only
>> > works with KVM running in L1; another hypervisor may not expect this
>> > (e.g.  VirtualBox asserts on #PF vmexit when NPT is on).
>> 
>> async_pf is an optional paravirtual device.  It is L1's fault if it
>> enabled something that it doesn't support ...
> 
> async_pf in L1 is enabled by the core Linux; the hypervisor may be
> third-party and have no control over it.

Admin can pass no-kvmapf to Linux when planning to use a hypervisor that
doesn't support paravirtualized async_pf.  Linux allows only in-kernel
hypervisors that do have full control over it.

>> AMD's behavior makes sense and already works, therefore I'd like to see
>> the same on Intel as well.  (I thought that SVM was broken as well,
>> sorry for my misleading first review.)
>> 
>> > To avoid that, only do async_pf stuff when executing L1 guest.
>> 
>> The good thing is that we are already killing VMX L1 with async_pf, so
>> regressions don't prevent us from making Intel KVM do the same as AMD:
>> force a nested VM exit from nested_vmx_check_exception() if the injected
>> #PF is async_pf and handle the #PF VM exit in L1.
> 
> I'm not getting your point: the wealth of existing hypervisors running
> in L1 which don't take #PF vmexits can be made not to hang or crash
> their guests with a not so complex fix in L0 hypervisor.  Why do the
> users need to update *both* their L0 and L1 hypervisors instead?

L1 enables paravirtual async_pf to get notified about L0 page faults,
which would allow L1 to reschedule the blocked process and get better
performance.  Running a guest is just another process in L1, hence we
can assume that L1 is interested in being notified.

If you want a fix without changing L1 hypervisors, then you need to
regress KVM on SVM.
This series regresses needlessly, though -- it forces L1 to wait in L2
until the page for L2 is fetched by L0.
Even no-kvmapf in L1 is better, because L2 currently enters apf-halt
state and an event could trigger a nested VM exit to L1 or reschedule L2
to a task that isn't waiting for the page.

>> I remember that you already implemented this and chose not to post it --
>> were there other problems than asserts in current KVM/VirtualBox?
> 
> You must have confused me with someone else ;) I didn't implement this;
> moreover I tend to think that L1 hypervisor cooperation is unnecessary
> and the fix can be done in L0 only.

The feature already requires L1 cooperation, and the extension to handle
L1 as a hypervisor seems natural to me: if L1 benefits from paravirtual
async_pf, then it will likely benefit from it even when running L2s.

Because the current state is already broken, I think it is a good time
to actually do the best known solution right away, instead of fixing
this fix later.

* Re: [PATCH v2 1/5] kvm/x86: skip async_pf when in guest mode
From: Roman Kagan @ 2016-12-19  7:18 UTC (permalink / raw)
  To: Radim Krčmář; +Cc: Paolo Bonzini, kvm, Denis Lunev

On Thu, Dec 15, 2016 at 04:09:39PM +0100, Radim Krčmář wrote:
> 2016-12-15 09:55+0300, Roman Kagan:
> > On Wed, Dec 14, 2016 at 10:21:11PM +0100, Radim Krčmář wrote:
> >> 2016-12-12 17:32+0300, Roman Kagan:
> >> > Async pagefault machinery assumes communication with L1 guests only: all
> >> > the state -- MSRs, apf area addresses, etc, -- are for L1.  However, it
> >> > currently doesn't check if the vCPU is running L1 or L2, and may inject
> >> > a #PF into whatever context is currently executing.
> >> > 
> >> > In vmx this just results in crashing the L2 on bogus #PFs and hanging
> >> > tasks in L1 due to missing PAGE_READY async_pfs.  To reproduce it, use a
> >> > host with swap enabled, run a VM on it, run a nested VM on top, and set
> >> > RSS limit for L1 on the host via
> >> > /sys/fs/cgroup/memory/machine.slice/machine-*.scope/memory.limit_in_bytes
> >> > to swap it out (you may need to tighten and loosen it once or twice, or
> >> > create some memory load inside L1).  Very quickly L2 guest starts
> >> > receiving pagefaults with bogus %cr2 (apf tokens from the host
> >> > actually), and L1 guest starts accumulating tasks stuck in D state in
> >> > kvm_async_pf_task_wait.
> >> > 
> >> > In svm such #PFs are converted into vmexit from L2 to L1 on #PF which is
> >> > then handled by L1 similar to ordinary async_pf.  However this only
> >> > works with KVM running in L1; another hypervisor may not expect this
> >> > (e.g.  VirtualBox asserts on #PF vmexit when NPT is on).
> >> 
> >> async_pf is an optional paravirtual device.  It is L1's fault if it
> >> enabled something that it doesn't support ...
> > 
> > async_pf in L1 is enabled by the core Linux; the hypervisor may be
> > third-party and have no control over it.
> 
> Admin can pass no-kvmapf to Linux when planning to use a hypervisor that
> doesn't support paravirtualized async_pf.  Linux allows only in-kernel
> hypervisors that do have full control over it.

Imagine you are a hoster providing VPSes to your customers.  You have
basically no control over what they run there.  Now if you are brave
enough to enable nested, you most certainly won't want async_pf to
create problems for your customers only because they have a kernel with
async_pf support and a hypervisor without (which at the moment means a
significant fraction of VPS owners).

> >> AMD's behavior makes sense and already works, therefore I'd like to see
> >> the same on Intel as well.  (I thought that SVM was broken as well,
> >> sorry for my misleading first review.)
> >> 
> >> > To avoid that, only do async_pf stuff when executing L1 guest.
> >> 
> >> The good thing is that we are already killing VMX L1 with async_pf, so
> >> regressions don't prevent us from making Intel KVM do the same as AMD:
> >> force a nested VM exit from nested_vmx_check_exception() if the injected
> >> #PF is async_pf and handle the #PF VM exit in L1.
> > 
> > I'm not getting your point: the wealth of existing hypervisors running
> > in L1 which don't take #PF vmexits can be made not to hang or crash
> > their guests with a not so complex fix in L0 hypervisor.  Why do the
> > users need to update *both* their L0 and L1 hypervisors instead?
> 
> L1 enables paravirtual async_pf to get notified about L0 page faults,
> which would allow L1 to reschedule the blocked process and get better
> performance.  Running a guest is just another process in L1, hence we
> can assume that L1 is interested in being notified.

That's a nice theory, but in practice there is a fair number of installed
VMs with a kernel that requests async_pf and a hypervisor that can't
live with it.

> If you want a fix without changing L1 hypervisors, then you need to
> regress KVM on SVM.

I don't buy this argument.  I don't see any significant difference from
L0's viewpoint between emulating a #PF vmexit and emulating an external
interrupt vmexit combined with #PF injection into L1.  The latter,
however, will keep L1 getting along just fine with the existing kernels
and hypervisors.

> This series regresses needlessly, though -- it forces L1 to wait in L2
> until the page for L2 is fetched by L0.

Indeed, it's half-baked.  I also just realized that it incorrectly does
nested vmexit before L1 vmentry but #PF injection is attempted on the
next round which defeats the whole purpose.  I'll rework the series once
I have the time (hopefully before x-mas).

Thanks,
Roman.

* Re: [PATCH v2 1/5] kvm/x86: skip async_pf when in guest mode
From: Paolo Bonzini @ 2016-12-19  9:53 UTC (permalink / raw)
  To: Roman Kagan, Radim Krčmář, kvm, Denis Lunev

On 19/12/2016 08:18, Roman Kagan wrote:
> On Thu, Dec 15, 2016 at 04:09:39PM +0100, Radim Krčmář wrote:
>> 2016-12-15 09:55+0300, Roman Kagan:
>>> On Wed, Dec 14, 2016 at 10:21:11PM +0100, Radim Krčmář wrote:
>>>> async_pf is an optional paravirtual device.  It is L1's fault if it
>>>> enabled something that it doesn't support ...
>>>
>>> async_pf in L1 is enabled by the core Linux; the hypervisor may be
>>> third-party and have no control over it.
>>
>> Admin can pass no-kvmapf to Linux when planning to use a hypervisor that
>> doesn't support paravirtualized async_pf.  Linux allows only in-kernel
>> hypervisors that do have full control over it.
> 
> Imagine you are a hoster providing VPSes to your customers.  You have
> basically no control over what they run there.  Now if you are brave
> enough to enable nested, you most certainly won't want async_pf to
> create problems for your customers only because they have a kernel with
> async_pf support and a hypervisor without (which at the moment means a
> significant fraction of VPS owners).

If you enable nested you should have a (small) list of hypervisors that
you tested.  KVM and Xen should more or less work, ESX should (but I
never tested it), and Hyper-V is new in 4.10.  Anything else should be
tested beforehand by the hosting provider.

If VirtualBox requires no-kvmapf, you need to document that.  VirtualBox
for Linux is an out-of-tree module; I think it would be pretty crazy to
support it on nested virt, more than a proprietary hypervisor.

Paolo

>>>> AMD's behavior makes sense and already works, therefore I'd like to see
>>>> the same on Intel as well.  (I thought that SVM was broken as well,
>>>> sorry for my misleading first review.)
>>>>
>>>>> To avoid that, only do async_pf stuff when executing L1 guest.
>>>>
>>>> The good thing is that we are already killing VMX L1 with async_pf, so
>>>> regressions don't prevent us from making Intel KVM do the same as AMD:
>>>> force a nested VM exit from nested_vmx_check_exception() if the injected
>>>> #PF is async_pf and handle the #PF VM exit in L1.
>>>
>>> I'm not getting your point: the wealth of existing hypervisors running
>>> in L1 which don't take #PF vmexits can be made not to hang or crash
>>> their guests with a not so complex fix in L0 hypervisor.  Why do the
>>> users need to update *both* their L0 and L1 hypervisors instead?
>>
>> L1 enables paravirtual async_pf to get notified about L0 page faults,
>> which would allow L1 to reschedule the blocked process and get better
>> performance.  Running a guest is just another process in L1, hence we
>> can assume that L1 is interested in being notified.
> 
> That's a nice theory but in practice there is a fair amount of installed
> VMs with a kernel that requests async_pf and a hypervisor that can't
> live with it.
> 
>> If you want a fix without changing L1 hypervisors, then you need to
>> regress KVM on SVM.
> 
> I don't buy this argument.  I don't see any significant difference from
> L0's viewpoint between emulating a #PF vmexit and emulating an external
> interrupt vmexit combined with #PF injection into L1.  The latter,
> however, will keep L1 getting along just fine with the existing kernels
> and hypervisors.
> 
>> This series regresses needlessly, though -- it forces L1 to wait in L2
>> until the page for L2 is fetched by L0.
> 
> Indeed, it's half-baked.  I also just realized that it incorrectly does
> nested vmexit before L1 vmentry but #PF injection is attempted on the
> next round which defeats the whole purpose.  I'll rework the series once
> I have the time (hopefully before x-mas).
> 
> Thanks,
> Roman.

* Re: [PATCH v2 1/5] kvm/x86: skip async_pf when in guest mode
From: Radim Krčmář @ 2016-12-19 15:31 UTC (permalink / raw)
  To: Roman Kagan, Paolo Bonzini, kvm, Denis Lunev

2016-12-19 10:18+0300, Roman Kagan:
> On Thu, Dec 15, 2016 at 04:09:39PM +0100, Radim Krčmář wrote:
>> 2016-12-15 09:55+0300, Roman Kagan:
>> > On Wed, Dec 14, 2016 at 10:21:11PM +0100, Radim Krčmář wrote:
>> >> 2016-12-12 17:32+0300, Roman Kagan:
>> >> > Async pagefault machinery assumes communication with L1 guests only: all
>> >> > the state -- MSRs, apf area addresses, etc, -- are for L1.  However, it
>> >> > currently doesn't check if the vCPU is running L1 or L2, and may inject
>> >> > a #PF into whatever context is currently executing.
>> >> > 
>> >> > In vmx this just results in crashing the L2 on bogus #PFs and hanging
>> >> > tasks in L1 due to missing PAGE_READY async_pfs.  To reproduce it, use a
>> >> > host with swap enabled, run a VM on it, run a nested VM on top, and set
>> >> > RSS limit for L1 on the host via
>> >> > /sys/fs/cgroup/memory/machine.slice/machine-*.scope/memory.limit_in_bytes
>> >> > to swap it out (you may need to tighten and loosen it once or twice, or
>> >> > create some memory load inside L1).  Very quickly L2 guest starts
>> >> > receiving pagefaults with bogus %cr2 (apf tokens from the host
>> >> > actually), and L1 guest starts accumulating tasks stuck in D state in
>> >> > kvm_async_pf_task_wait.
>> >> > 
>> >> > In svm such #PFs are converted into vmexit from L2 to L1 on #PF which is
>> >> > then handled by L1 similar to ordinary async_pf.  However this only
>> >> > works with KVM running in L1; another hypervisor may not expect this
>> >> > (e.g.  VirtualBox asserts on #PF vmexit when NPT is on).
>> >> 
>> >> async_pf is an optional paravirtual device.  It is L1's fault if it
>> >> enabled something that it doesn't support ...
>> > 
>> > async_pf in L1 is enabled by the core Linux; the hypervisor may be
>> > third-party and have no control over it.
>> 
>> Admin can pass no-kvmapf to Linux when planning to use a hypervisor that
>> doesn't support paravirtualized async_pf.  Linux allows only in-kernel
>> hypervisors that do have full control over it.
> 
> Imagine you are a hoster providing VPSes to your customers.  You have
> basically no control over what they run there.  Now if you are brave
> enough to enable nested, you most certainly won't want async_pf to
> create problems for your customers only because they have a kernel with
> async_pf support and a hypervisor without (which at the moment means a
> significant fraction of VPS owners).

In that situation, you already told your customers to disable kvm-apf,
because it is broken (on VMX).  After updating the L0, you announce that
kvm-apf can be enabled and, depending on the fix that KVM uses, it is
either enabled only for a sufficiently new L1, or even for older ones.
Not a big difference from a VPS provider's point of view, IMO.

(Hm, and VPS providers could use a toggle to disable kvm-apf on L0,
 because it adds overhead in scenarios with CPU overcommit.)

>> >> AMD's behavior makes sense and already works, therefore I'd like to see
>> >> the same on Intel as well.  (I thought that SVM was broken as well,
>> >> sorry for my misleading first review.)
>> >> 
>> >> > To avoid that, only do async_pf stuff when executing L1 guest.
>> >> 
>> >> The good thing is that we are already killing VMX L1 with async_pf, so
>> >> regressions don't prevent us from making Intel KVM do the same as AMD:
>> >> force a nested VM exit from nested_vmx_check_exception() if the injected
>> >> #PF is async_pf and handle the #PF VM exit in L1.
>> > 
>> > I'm not getting your point: the wealth of existing hypervisors running
>> > in L1 which don't take #PF vmexits can be made not to hang or crash
>> > their guests with a not so complex fix in L0 hypervisor.  Why do the
>> > users need to update *both* their L0 and L1 hypervisors instead?
>> 
>> L1 enables paravirtual async_pf to get notified about L0 page faults,
>> which would allow L1 to reschedule the blocked process and get better
>> performance.  Running a guest is just another process in L1, hence we
>> can assume that L1 is interested in being notified.
> 
> That's a nice theory but in practice there is a fair amount of installed
> VMs with a kernel that requests async_pf and a hypervisor that can't
> live with it.

Yes, and we don't have to care -- they live now, when kvm-apf is broken.

We can fix them in a way that is backward compatible with known
hypervisors, but the solution is worse because of that.
kvm-apf is just for L1 performance, so it should waste as few cycles as
possible, and because users can't depend on working kvm-apf, I'd rather
not shackle ourselves with past mistakes.

>> If you want a fix without changing L1 hypervisors, then you need to
>> regress KVM on SVM.
> 
> I don't buy this argument.  I don't see any significant difference from
> L0's viewpoint between emulating a #PF vmexit and emulating an external
> interrupt vmexit combined with #PF injection into L1.  The latter,
> however, will keep L1 getting along just fine with the existing kernels
> and hypervisors.

Yes, the delivery method is not crucial; I'd accept another delivery
method if it doesn't regress performance for L1 on KVM+SVM.
The main regression is that L0 page faults are not forwarded to L1 while
nested, because of this condition:

  if (!prefault && !is_guest_mode(vcpu) && can_do_async_pf(vcpu)) {

>> This series regresses needlessly, though -- it forces L1 to wait in L2
>> until the page for L2 is fetched by L0.
> 
> Indeed, it's half-baked.  I also just realized that it incorrectly does
> nested vmexit before L1 vmentry but #PF injection is attempted on the
> next round which defeats the whole purpose.

I also see separating the nested VM exit from the kvm-apf event delivery
as a regression -- doesn't delivering interrupt vector 14 in the nested
VM exit work without losing backward compatibility?
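
Untested sketch, on top of patch 4, just to make the question concrete
(kvm_async_pf_has_ready() is the helper from patch 2 of this series):

          if (exit_reason == EXIT_REASON_EXTERNAL_INTERRUPT &&
              nested_exit_intr_ack_set(vcpu)) {
                  int irq = kvm_cpu_get_interrupt(vcpu);

                  if (irq >= 0)
                          vmcs12->vm_exit_intr_info = irq |
                                  INTR_INFO_VALID_MASK | INTR_TYPE_EXT_INTR;
                  else if (kvm_async_pf_has_ready(vcpu))
                          /* no real interrupt pending: advertise the
                           * async_pf as vector 14 (#PF) instead */
                          vmcs12->vm_exit_intr_info = PF_VECTOR |
                                  INTR_INFO_VALID_MASK | INTR_TYPE_EXT_INTR;
          }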

Thanks.
