linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/8] KVM: x86: Interrupt-based mechanism for async_pf 'page present' notifications
@ 2020-05-11 16:47 Vitaly Kuznetsov
  2020-05-11 16:47 ` [PATCH 1/8] Revert "KVM: async_pf: Fix #DF due to inject "Page not Present" and "Page Ready" exceptions simultaneously" Vitaly Kuznetsov
                   ` (9 more replies)
  0 siblings, 10 replies; 43+ messages in thread
From: Vitaly Kuznetsov @ 2020-05-11 16:47 UTC (permalink / raw)
  To: kvm, x86
  Cc: Paolo Bonzini, Andy Lutomirski, Thomas Gleixner, Borislav Petkov,
	H. Peter Anvin, Wanpeng Li, Sean Christopherson, Jim Mattson,
	Vivek Goyal, Gavin Shan, Peter Zijlstra, linux-kernel

Concerns were expressed around (ab)using #PF for KVM's async_pf mechanism,
it seems that re-using #PF exception for a PV mechanism wasn't a great
idea after all. The Grand Plan is to switch to using e.g. #VE for 'page
not present' events and normal APIC interrupts for 'page ready' events.
This series does the later.

Changes since RFC:
- Using #PF for 'page ready' is deprecated and removed [Paolo Bonzini]
- 'reason' field in 'struct kvm_vcpu_pv_apf_data' is not used for 'page ready'
  notifications and 'pageready_token' is not used for 'page not present' events
  [Paolo Bonzini]
- Renamed MSR_KVM_ASYNC_PF2 -> MSR_KVM_ASYNC_PF_INT [Peter Xu]
- Drop 'enabled' field from MSR_KVM_ASYNC_PF_INT [Peter Xu]
- Other minor changes supporting the above.

Vitaly Kuznetsov (8):
  Revert "KVM: async_pf: Fix #DF due to inject "Page not Present" and
    "Page Ready" exceptions simultaneously"
  KVM: x86: extend struct kvm_vcpu_pv_apf_data with token info
  KVM: introduce kvm_read_guest_offset_cached()
  KVM: x86: interrupt based APF page-ready event delivery
  KVM: x86: acknowledgment mechanism for async pf page ready
    notifications
  KVM: x86: announce KVM_FEATURE_ASYNC_PF_INT
  KVM: x86: Switch KVM guest to using interrupts for page ready APF
    delivery
  KVM: x86: drop KVM_PV_REASON_PAGE_READY case from
    kvm_handle_page_fault()

 Documentation/virt/kvm/cpuid.rst     |   6 ++
 Documentation/virt/kvm/msr.rst       | 106 ++++++++++++++------
 arch/s390/include/asm/kvm_host.h     |   2 +
 arch/x86/entry/entry_32.S            |   5 +
 arch/x86/entry/entry_64.S            |   5 +
 arch/x86/include/asm/hardirq.h       |   3 +
 arch/x86/include/asm/irq_vectors.h   |   6 +-
 arch/x86/include/asm/kvm_host.h      |   7 +-
 arch/x86/include/asm/kvm_para.h      |   6 ++
 arch/x86/include/uapi/asm/kvm_para.h |  11 ++-
 arch/x86/kernel/irq.c                |   9 ++
 arch/x86/kernel/kvm.c                |  42 ++++++--
 arch/x86/kvm/cpuid.c                 |   3 +-
 arch/x86/kvm/mmu/mmu.c               |  10 +-
 arch/x86/kvm/x86.c                   | 142 ++++++++++++++++++---------
 include/linux/kvm_host.h             |   3 +
 include/uapi/linux/kvm.h             |   1 +
 virt/kvm/async_pf.c                  |  10 ++
 virt/kvm/kvm_main.c                  |  19 +++-
 19 files changed, 295 insertions(+), 101 deletions(-)

-- 
2.25.4


^ permalink raw reply	[flat|nested] 43+ messages in thread

* [PATCH 1/8] Revert "KVM: async_pf: Fix #DF due to inject "Page not Present" and "Page Ready" exceptions simultaneously"
  2020-05-11 16:47 [PATCH 0/8] KVM: x86: Interrupt-based mechanism for async_pf 'page present' notifications Vitaly Kuznetsov
@ 2020-05-11 16:47 ` Vitaly Kuznetsov
  2020-05-11 16:47 ` [PATCH 2/8] KVM: x86: extend struct kvm_vcpu_pv_apf_data with token info Vitaly Kuznetsov
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 43+ messages in thread
From: Vitaly Kuznetsov @ 2020-05-11 16:47 UTC (permalink / raw)
  To: kvm, x86
  Cc: Paolo Bonzini, Andy Lutomirski, Thomas Gleixner, Borislav Petkov,
	H. Peter Anvin, Wanpeng Li, Sean Christopherson, Jim Mattson,
	Vivek Goyal, Gavin Shan, Peter Zijlstra, linux-kernel

Commit 9a6e7c39810e (""KVM: async_pf: Fix #DF due to inject "Page not
Present" and "Page Ready" exceptions simultaneously") added a protection
against 'page ready' notification coming before 'page not ready' is
delivered. This situation seems to be impossible since commit 2a266f23550b
("KVM MMU: check pending exception before injecting APF) which added
'vcpu->arch.exception.pending' check to kvm_can_do_async_pf.

On x86, kvm_arch_async_page_present() has only one call site:
kvm_check_async_pf_completion() loop and we only enter the loop when
kvm_arch_can_inject_async_page_present(vcpu) which when async pf msr
is enabled, translates into kvm_can_do_async_pf().

There is also one problem with the cancellation mechanism. We don't seem
to check that the 'page not ready' notification we're canceling matches
the 'page ready' notification so in theory, we may erroneously drop two
valid events.

Revert the commit.

Reviewed-by: Gavin Shan <gshan@redhat.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
---
 arch/x86/kvm/x86.c | 23 +----------------------
 1 file changed, 1 insertion(+), 22 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index c5835f9cb9ad..edd4a6415b92 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -10359,13 +10359,6 @@ static int apf_put_user(struct kvm_vcpu *vcpu, u32 val)
 				      sizeof(val));
 }
 
-static int apf_get_user(struct kvm_vcpu *vcpu, u32 *val)
-{
-
-	return kvm_read_guest_cached(vcpu->kvm, &vcpu->arch.apf.data, val,
-				      sizeof(u32));
-}
-
 static bool kvm_can_deliver_async_pf(struct kvm_vcpu *vcpu)
 {
 	if (!vcpu->arch.apf.delivery_as_pf_vmexit && is_guest_mode(vcpu))
@@ -10430,7 +10423,6 @@ void kvm_arch_async_page_present(struct kvm_vcpu *vcpu,
 				 struct kvm_async_pf *work)
 {
 	struct x86_exception fault;
-	u32 val;
 
 	if (work->wakeup_all)
 		work->arch.token = ~0; /* broadcast wakeup */
@@ -10439,19 +10431,7 @@ void kvm_arch_async_page_present(struct kvm_vcpu *vcpu,
 	trace_kvm_async_pf_ready(work->arch.token, work->cr2_or_gpa);
 
 	if (vcpu->arch.apf.msr_val & KVM_ASYNC_PF_ENABLED &&
-	    !apf_get_user(vcpu, &val)) {
-		if (val == KVM_PV_REASON_PAGE_NOT_PRESENT &&
-		    vcpu->arch.exception.pending &&
-		    vcpu->arch.exception.nr == PF_VECTOR &&
-		    !apf_put_user(vcpu, 0)) {
-			vcpu->arch.exception.injected = false;
-			vcpu->arch.exception.pending = false;
-			vcpu->arch.exception.nr = 0;
-			vcpu->arch.exception.has_error_code = false;
-			vcpu->arch.exception.error_code = 0;
-			vcpu->arch.exception.has_payload = false;
-			vcpu->arch.exception.payload = 0;
-		} else if (!apf_put_user(vcpu, KVM_PV_REASON_PAGE_READY)) {
+	    !apf_put_user(vcpu, KVM_PV_REASON_PAGE_READY)) {
 			fault.vector = PF_VECTOR;
 			fault.error_code_valid = true;
 			fault.error_code = 0;
@@ -10459,7 +10439,6 @@ void kvm_arch_async_page_present(struct kvm_vcpu *vcpu,
 			fault.address = work->arch.token;
 			fault.async_page_fault = true;
 			kvm_inject_page_fault(vcpu, &fault);
-		}
 	}
 	vcpu->arch.apf.halted = false;
 	vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;
-- 
2.25.4


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH 2/8] KVM: x86: extend struct kvm_vcpu_pv_apf_data with token info
  2020-05-11 16:47 [PATCH 0/8] KVM: x86: Interrupt-based mechanism for async_pf 'page present' notifications Vitaly Kuznetsov
  2020-05-11 16:47 ` [PATCH 1/8] Revert "KVM: async_pf: Fix #DF due to inject "Page not Present" and "Page Ready" exceptions simultaneously" Vitaly Kuznetsov
@ 2020-05-11 16:47 ` Vitaly Kuznetsov
  2020-05-12 15:27   ` Vivek Goyal
  2020-05-21 18:38   ` Vivek Goyal
  2020-05-11 16:47 ` [PATCH 3/8] KVM: introduce kvm_read_guest_offset_cached() Vitaly Kuznetsov
                   ` (7 subsequent siblings)
  9 siblings, 2 replies; 43+ messages in thread
From: Vitaly Kuznetsov @ 2020-05-11 16:47 UTC (permalink / raw)
  To: kvm, x86
  Cc: Paolo Bonzini, Andy Lutomirski, Thomas Gleixner, Borislav Petkov,
	H. Peter Anvin, Wanpeng Li, Sean Christopherson, Jim Mattson,
	Vivek Goyal, Gavin Shan, Peter Zijlstra, linux-kernel

Currently, APF mechanism relies on the #PF abuse where the token is being
passed through CR2. If we switch to using interrupts to deliver page-ready
notifications we need a different way to pass the data. Extent the existing
'struct kvm_vcpu_pv_apf_data' with token information for page-ready
notifications.

The newly introduced apf_put_user_ready() temporary puts both reason
and token information, this will be changed to put token only when we
switch to interrupt based notifications.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
---
 arch/x86/include/uapi/asm/kvm_para.h |  3 ++-
 arch/x86/kvm/x86.c                   | 17 +++++++++++++----
 2 files changed, 15 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/uapi/asm/kvm_para.h b/arch/x86/include/uapi/asm/kvm_para.h
index 2a8e0b6b9805..e3602a1de136 100644
--- a/arch/x86/include/uapi/asm/kvm_para.h
+++ b/arch/x86/include/uapi/asm/kvm_para.h
@@ -113,7 +113,8 @@ struct kvm_mmu_op_release_pt {
 
 struct kvm_vcpu_pv_apf_data {
 	__u32 reason;
-	__u8 pad[60];
+	__u32 pageready_token;
+	__u8 pad[56];
 	__u32 enabled;
 };
 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index edd4a6415b92..28868cc16e4d 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2662,7 +2662,7 @@ static int kvm_pv_enable_async_pf(struct kvm_vcpu *vcpu, u64 data)
 	}
 
 	if (kvm_gfn_to_hva_cache_init(vcpu->kvm, &vcpu->arch.apf.data, gpa,
-					sizeof(u32)))
+					sizeof(u64)))
 		return 1;
 
 	vcpu->arch.apf.send_user_only = !(data & KVM_ASYNC_PF_SEND_ALWAYS);
@@ -10352,8 +10352,17 @@ static void kvm_del_async_pf_gfn(struct kvm_vcpu *vcpu, gfn_t gfn)
 	}
 }
 
-static int apf_put_user(struct kvm_vcpu *vcpu, u32 val)
+static inline int apf_put_user_notpresent(struct kvm_vcpu *vcpu)
 {
+	u32 reason = KVM_PV_REASON_PAGE_NOT_PRESENT;
+
+	return kvm_write_guest_cached(vcpu->kvm, &vcpu->arch.apf.data, &reason,
+				      sizeof(reason));
+}
+
+static inline int apf_put_user_ready(struct kvm_vcpu *vcpu, u32 token)
+{
+	u64 val = (u64)token << 32 | KVM_PV_REASON_PAGE_READY;
 
 	return kvm_write_guest_cached(vcpu->kvm, &vcpu->arch.apf.data, &val,
 				      sizeof(val));
@@ -10398,7 +10407,7 @@ void kvm_arch_async_page_not_present(struct kvm_vcpu *vcpu,
 	kvm_add_async_pf_gfn(vcpu, work->arch.gfn);
 
 	if (kvm_can_deliver_async_pf(vcpu) &&
-	    !apf_put_user(vcpu, KVM_PV_REASON_PAGE_NOT_PRESENT)) {
+	    !apf_put_user_notpresent(vcpu)) {
 		fault.vector = PF_VECTOR;
 		fault.error_code_valid = true;
 		fault.error_code = 0;
@@ -10431,7 +10440,7 @@ void kvm_arch_async_page_present(struct kvm_vcpu *vcpu,
 	trace_kvm_async_pf_ready(work->arch.token, work->cr2_or_gpa);
 
 	if (vcpu->arch.apf.msr_val & KVM_ASYNC_PF_ENABLED &&
-	    !apf_put_user(vcpu, KVM_PV_REASON_PAGE_READY)) {
+	    !apf_put_user_ready(vcpu, work->arch.token)) {
 			fault.vector = PF_VECTOR;
 			fault.error_code_valid = true;
 			fault.error_code = 0;
-- 
2.25.4


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH 3/8] KVM: introduce kvm_read_guest_offset_cached()
  2020-05-11 16:47 [PATCH 0/8] KVM: x86: Interrupt-based mechanism for async_pf 'page present' notifications Vitaly Kuznetsov
  2020-05-11 16:47 ` [PATCH 1/8] Revert "KVM: async_pf: Fix #DF due to inject "Page not Present" and "Page Ready" exceptions simultaneously" Vitaly Kuznetsov
  2020-05-11 16:47 ` [PATCH 2/8] KVM: x86: extend struct kvm_vcpu_pv_apf_data with token info Vitaly Kuznetsov
@ 2020-05-11 16:47 ` Vitaly Kuznetsov
  2020-05-11 16:47 ` [PATCH 4/8] KVM: x86: interrupt based APF page-ready event delivery Vitaly Kuznetsov
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 43+ messages in thread
From: Vitaly Kuznetsov @ 2020-05-11 16:47 UTC (permalink / raw)
  To: kvm, x86
  Cc: Paolo Bonzini, Andy Lutomirski, Thomas Gleixner, Borislav Petkov,
	H. Peter Anvin, Wanpeng Li, Sean Christopherson, Jim Mattson,
	Vivek Goyal, Gavin Shan, Peter Zijlstra, linux-kernel

We already have kvm_write_guest_offset_cached(), introduce read analogue.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
---
 include/linux/kvm_host.h |  3 +++
 virt/kvm/kvm_main.c      | 19 ++++++++++++++-----
 2 files changed, 17 insertions(+), 5 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 01276e3d01b9..235109a23b8c 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -733,6 +733,9 @@ int kvm_read_guest_page(struct kvm *kvm, gfn_t gfn, void *data, int offset,
 int kvm_read_guest(struct kvm *kvm, gpa_t gpa, void *data, unsigned long len);
 int kvm_read_guest_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
 			   void *data, unsigned long len);
+int kvm_read_guest_offset_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
+				 void *data, unsigned int offset,
+				 unsigned long len);
 int kvm_write_guest_page(struct kvm *kvm, gfn_t gfn, const void *data,
 			 int offset, int len);
 int kvm_write_guest(struct kvm *kvm, gpa_t gpa, const void *data,
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 74bdb7bf3295..c425be6e6d98 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2503,13 +2503,15 @@ int kvm_write_guest_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
 }
 EXPORT_SYMBOL_GPL(kvm_write_guest_cached);
 
-int kvm_read_guest_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
-			   void *data, unsigned long len)
+int kvm_read_guest_offset_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
+				 void *data, unsigned int offset,
+				 unsigned long len)
 {
 	struct kvm_memslots *slots = kvm_memslots(kvm);
 	int r;
+	gpa_t gpa = ghc->gpa + offset;
 
-	BUG_ON(len > ghc->len);
+	BUG_ON(len + offset > ghc->len);
 
 	if (slots->generation != ghc->generation) {
 		if (__kvm_gfn_to_hva_cache_init(slots, ghc, ghc->gpa, ghc->len))
@@ -2520,14 +2522,21 @@ int kvm_read_guest_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
 		return -EFAULT;
 
 	if (unlikely(!ghc->memslot))
-		return kvm_read_guest(kvm, ghc->gpa, data, len);
+		return kvm_read_guest(kvm, gpa, data, len);
 
-	r = __copy_from_user(data, (void __user *)ghc->hva, len);
+	r = __copy_from_user(data, (void __user *)ghc->hva + offset, len);
 	if (r)
 		return -EFAULT;
 
 	return 0;
 }
+EXPORT_SYMBOL_GPL(kvm_read_guest_offset_cached);
+
+int kvm_read_guest_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
+			  void *data, unsigned long len)
+{
+	return kvm_read_guest_offset_cached(kvm, ghc, data, 0, len);
+}
 EXPORT_SYMBOL_GPL(kvm_read_guest_cached);
 
 int kvm_clear_guest_page(struct kvm *kvm, gfn_t gfn, int offset, int len)
-- 
2.25.4


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH 4/8] KVM: x86: interrupt based APF page-ready event delivery
  2020-05-11 16:47 [PATCH 0/8] KVM: x86: Interrupt-based mechanism for async_pf 'page present' notifications Vitaly Kuznetsov
                   ` (2 preceding siblings ...)
  2020-05-11 16:47 ` [PATCH 3/8] KVM: introduce kvm_read_guest_offset_cached() Vitaly Kuznetsov
@ 2020-05-11 16:47 ` Vitaly Kuznetsov
  2020-05-12 14:24   ` Vivek Goyal
  2020-05-11 16:47 ` [PATCH 5/8] KVM: x86: acknowledgment mechanism for async pf page ready notifications Vitaly Kuznetsov
                   ` (5 subsequent siblings)
  9 siblings, 1 reply; 43+ messages in thread
From: Vitaly Kuznetsov @ 2020-05-11 16:47 UTC (permalink / raw)
  To: kvm, x86
  Cc: Paolo Bonzini, Andy Lutomirski, Thomas Gleixner, Borislav Petkov,
	H. Peter Anvin, Wanpeng Li, Sean Christopherson, Jim Mattson,
	Vivek Goyal, Gavin Shan, Peter Zijlstra, linux-kernel

Concerns were expressed around APF delivery via synthetic #PF exception as
in some cases such delivery may collide with real page fault. For type 2
(page ready) notifications we can easily switch to using an interrupt
instead. Introduce new MSR_KVM_ASYNC_PF_INT mechanism and deprecate the
legacy one.

One notable difference between the two mechanisms is that interrupt may not
get handled immediately so whenever we would like to deliver next event
(regardless of its type) we must be sure the guest had read and cleared
previous event in the slot.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
---
 Documentation/virt/kvm/msr.rst       | 91 +++++++++++++++++---------
 arch/x86/include/asm/kvm_host.h      |  4 +-
 arch/x86/include/uapi/asm/kvm_para.h |  6 ++
 arch/x86/kvm/x86.c                   | 95 ++++++++++++++++++++--------
 4 files changed, 140 insertions(+), 56 deletions(-)

diff --git a/Documentation/virt/kvm/msr.rst b/Documentation/virt/kvm/msr.rst
index 33892036672d..f988a36f226a 100644
--- a/Documentation/virt/kvm/msr.rst
+++ b/Documentation/virt/kvm/msr.rst
@@ -190,35 +190,54 @@ MSR_KVM_ASYNC_PF_EN:
 	0x4b564d02
 
 data:
-	Bits 63-6 hold 64-byte aligned physical address of a
-	64 byte memory area which must be in guest RAM and must be
-	zeroed. Bits 5-3 are reserved and should be zero. Bit 0 is 1
-	when asynchronous page faults are enabled on the vcpu 0 when
-	disabled. Bit 1 is 1 if asynchronous page faults can be injected
-	when vcpu is in cpl == 0. Bit 2 is 1 if asynchronous page faults
-	are delivered to L1 as #PF vmexits.  Bit 2 can be set only if
-	KVM_FEATURE_ASYNC_PF_VMEXIT is present in CPUID.
-
-	First 4 byte of 64 byte memory location will be written to by
-	the hypervisor at the time of asynchronous page fault (APF)
-	injection to indicate type of asynchronous page fault. Value
-	of 1 means that the page referred to by the page fault is not
-	present. Value 2 means that the page is now available. Disabling
-	interrupt inhibits APFs. Guest must not enable interrupt
-	before the reason is read, or it may be overwritten by another
-	APF. Since APF uses the same exception vector as regular page
-	fault guest must reset the reason to 0 before it does
-	something that can generate normal page fault.  If during page
-	fault APF reason is 0 it means that this is regular page
-	fault.
-
-	During delivery of type 1 APF cr2 contains a token that will
-	be used to notify a guest when missing page becomes
-	available. When page becomes available type 2 APF is sent with
-	cr2 set to the token associated with the page. There is special
-	kind of token 0xffffffff which tells vcpu that it should wake
-	up all processes waiting for APFs and no individual type 2 APFs
-	will be sent.
+	Asynchronous page fault (APF) control MSR.
+
+	Bits 63-6 hold 64-byte aligned physical address of a 64 byte memory area
+	which must be in guest RAM and must be zeroed. This memory is expected
+	to hold a copy of the following structure::
+
+	  struct kvm_vcpu_pv_apf_data {
+		__u32 reason;
+		__u32 pageready_token;
+		__u8 pad[56];
+		__u32 enabled;
+	  };
+
+	Bits 5-4 of the MSR are reserved and should be zero. Bit 0 is set to 1
+	when asynchronous page faults are enabled on the vcpu, 0 when disabled.
+	Bit 1 is 1 if asynchronous page faults can be injected when vcpu is in
+	cpl == 0. Bit 2 is 1 if asynchronous page faults are delivered to L1 as
+	#PF vmexits.  Bit 2 can be set only if KVM_FEATURE_ASYNC_PF_VMEXIT is
+	present in CPUID. Bit 3 enables interrupt based delivery of type 2
+	(page present) events.
+
+	First 4 byte of 64 byte memory location ('reason') will be written to
+	by the hypervisor at the time APF type 1 (page not present) injection.
+	The only possible values are '0' and '1'. Type 1 events are currently
+	always delivered as synthetic #PF exception. During delivery of type 1
+	APF CR2 register contains a token that will be used to notify the guest
+	when missing page becomes available. Guest is supposed to write '0' to
+	the location when it is done handling type 1 event so the next one can
+	be delivered.
+
+	Note, since APF type 1 uses the same exception vector as regular page
+	fault, guest must reset the reason to '0' before it does something that
+	can generate normal page fault. If during a page fault APF reason is '0'
+	it means that this is regular page fault.
+
+	Bytes 5-7 of 64 byte memory location ('pageready_token') will be written
+	to by the hypervisor at the time of type 2 (page ready) event injection.
+	The content of these bytes is a token which was previously delivered as
+	type 1 event. The event indicates the page in now available. Guest is
+	supposed to write '0' to the location when it is done handling type 2
+	event so the next one can be delivered. MSR_KVM_ASYNC_PF_INT MSR
+	specifying the interrupt vector for type 2 APF delivery needs to be
+	written to before enabling APF mechanism in MSR_KVM_ASYNC_PF_EN.
+
+	Note, previously, type 2 (page present) events were delivered via the
+	same #PF exception as type 1 (page not present) events but this is
+	now deprecated. If bit 3 (interrupt based delivery) is not set APF
+	events are not delivered.
 
 	If APF is disabled while there are outstanding APFs, they will
 	not be delivered.
@@ -319,3 +338,17 @@ data:
 
 	KVM guests can request the host not to poll on HLT, for example if
 	they are performing polling themselves.
+
+MSR_KVM_ASYNC_PF_INT:
+	0x4b564d06
+
+data:
+	Second asynchronous page fault (APF) control MSR.
+
+	Bits 0-7: APIC vector for delivery of type 2 APF events (page ready
+	notifications).
+	Bits 8-63: Reserved
+
+	Interrupt vector for asynchnonous page ready notifications delivery.
+	The vector has to be set up before asynchronous page fault mechanism
+	is enabled in MSR_KVM_ASYNC_PF_EN.
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 42a2d0d3984a..4a72ba26a532 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -763,7 +763,9 @@ struct kvm_vcpu_arch {
 		bool halted;
 		gfn_t gfns[roundup_pow_of_two(ASYNC_PF_PER_VCPU)];
 		struct gfn_to_hva_cache data;
-		u64 msr_val;
+		u64 msr_en_val; /* MSR_KVM_ASYNC_PF_EN */
+		u64 msr_int_val; /* MSR_KVM_ASYNC_PF_INT */
+		u16 vec;
 		u32 id;
 		bool send_user_only;
 		u32 host_apf_reason;
diff --git a/arch/x86/include/uapi/asm/kvm_para.h b/arch/x86/include/uapi/asm/kvm_para.h
index e3602a1de136..c55ffa2c40b6 100644
--- a/arch/x86/include/uapi/asm/kvm_para.h
+++ b/arch/x86/include/uapi/asm/kvm_para.h
@@ -50,6 +50,7 @@
 #define MSR_KVM_STEAL_TIME  0x4b564d03
 #define MSR_KVM_PV_EOI_EN      0x4b564d04
 #define MSR_KVM_POLL_CONTROL	0x4b564d05
+#define MSR_KVM_ASYNC_PF_INT	0x4b564d06
 
 struct kvm_steal_time {
 	__u64 steal;
@@ -81,6 +82,11 @@ struct kvm_clock_pairing {
 #define KVM_ASYNC_PF_ENABLED			(1 << 0)
 #define KVM_ASYNC_PF_SEND_ALWAYS		(1 << 1)
 #define KVM_ASYNC_PF_DELIVERY_AS_PF_VMEXIT	(1 << 2)
+#define KVM_ASYNC_PF_DELIVERY_AS_INT		(1 << 3)
+
+/* MSR_KVM_ASYNC_PF_INT */
+#define KVM_ASYNC_PF_VEC_MASK			GENMASK(7, 0)
+
 
 /* Operations for KVM_HC_MMU_OP */
 #define KVM_MMU_OP_WRITE_PTE            1
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 28868cc16e4d..d1e9b072d279 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1243,7 +1243,7 @@ static const u32 emulated_msrs_all[] = {
 	HV_X64_MSR_TSC_EMULATION_STATUS,
 
 	MSR_KVM_ASYNC_PF_EN, MSR_KVM_STEAL_TIME,
-	MSR_KVM_PV_EOI_EN,
+	MSR_KVM_PV_EOI_EN, MSR_KVM_ASYNC_PF_INT,
 
 	MSR_IA32_TSC_ADJUST,
 	MSR_IA32_TSCDEADLINE,
@@ -2645,17 +2645,24 @@ static int xen_hvm_config(struct kvm_vcpu *vcpu, u64 data)
 	return r;
 }
 
+static inline bool kvm_pv_async_pf_enabled(struct kvm_vcpu *vcpu)
+{
+	u64 mask = KVM_ASYNC_PF_ENABLED | KVM_ASYNC_PF_DELIVERY_AS_INT;
+
+	return (vcpu->arch.apf.msr_en_val & mask) == mask;
+}
+
 static int kvm_pv_enable_async_pf(struct kvm_vcpu *vcpu, u64 data)
 {
 	gpa_t gpa = data & ~0x3f;
 
-	/* Bits 3:5 are reserved, Should be zero */
-	if (data & 0x38)
+	/* Bits 4:5 are reserved, Should be zero */
+	if (data & 0x30)
 		return 1;
 
-	vcpu->arch.apf.msr_val = data;
+	vcpu->arch.apf.msr_en_val = data;
 
-	if (!(data & KVM_ASYNC_PF_ENABLED)) {
+	if (!kvm_pv_async_pf_enabled(vcpu)) {
 		kvm_clear_async_pf_completion_queue(vcpu);
 		kvm_async_pf_hash_reset(vcpu);
 		return 0;
@@ -2667,7 +2674,25 @@ static int kvm_pv_enable_async_pf(struct kvm_vcpu *vcpu, u64 data)
 
 	vcpu->arch.apf.send_user_only = !(data & KVM_ASYNC_PF_SEND_ALWAYS);
 	vcpu->arch.apf.delivery_as_pf_vmexit = data & KVM_ASYNC_PF_DELIVERY_AS_PF_VMEXIT;
+
 	kvm_async_pf_wakeup_all(vcpu);
+
+	return 0;
+}
+
+static int kvm_pv_enable_async_pf_int(struct kvm_vcpu *vcpu, u64 data)
+{
+	/* Bits 8-63 are reserved */
+	if (data >> 8)
+		return 1;
+
+	if (!lapic_in_kernel(vcpu))
+		return 1;
+
+	vcpu->arch.apf.msr_int_val = data;
+
+	vcpu->arch.apf.vec = data & KVM_ASYNC_PF_VEC_MASK;
+
 	return 0;
 }
 
@@ -2883,6 +2908,10 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 		if (kvm_pv_enable_async_pf(vcpu, data))
 			return 1;
 		break;
+	case MSR_KVM_ASYNC_PF_INT:
+		if (kvm_pv_enable_async_pf_int(vcpu, data))
+			return 1;
+		break;
 	case MSR_KVM_STEAL_TIME:
 
 		if (unlikely(!sched_info_on()))
@@ -3157,7 +3186,10 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 		msr_info->data = vcpu->arch.time;
 		break;
 	case MSR_KVM_ASYNC_PF_EN:
-		msr_info->data = vcpu->arch.apf.msr_val;
+		msr_info->data = vcpu->arch.apf.msr_en_val;
+		break;
+	case MSR_KVM_ASYNC_PF_INT:
+		msr_info->data = vcpu->arch.apf.msr_int_val;
 		break;
 	case MSR_KVM_STEAL_TIME:
 		msr_info->data = vcpu->arch.st.msr_val;
@@ -9500,7 +9532,8 @@ void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
 	vcpu->arch.cr2 = 0;
 
 	kvm_make_request(KVM_REQ_EVENT, vcpu);
-	vcpu->arch.apf.msr_val = 0;
+	vcpu->arch.apf.msr_en_val = 0;
+	vcpu->arch.apf.msr_int_val = 0;
 	vcpu->arch.st.msr_val = 0;
 
 	kvmclock_reset(vcpu);
@@ -10362,10 +10395,24 @@ static inline int apf_put_user_notpresent(struct kvm_vcpu *vcpu)
 
 static inline int apf_put_user_ready(struct kvm_vcpu *vcpu, u32 token)
 {
-	u64 val = (u64)token << 32 | KVM_PV_REASON_PAGE_READY;
+	unsigned int offset = offsetof(struct kvm_vcpu_pv_apf_data,
+				       pageready_token);
 
-	return kvm_write_guest_cached(vcpu->kvm, &vcpu->arch.apf.data, &val,
-				      sizeof(val));
+	return kvm_write_guest_offset_cached(vcpu->kvm, &vcpu->arch.apf.data,
+					     &token, offset, sizeof(token));
+}
+
+static inline bool apf_pageready_slot_free(struct kvm_vcpu *vcpu)
+{
+	unsigned int offset = offsetof(struct kvm_vcpu_pv_apf_data,
+				       pageready_token);
+	u32 val;
+
+	if (kvm_read_guest_offset_cached(vcpu->kvm, &vcpu->arch.apf.data,
+					 &val, offset, sizeof(val)))
+		return false;
+
+	return !val;
 }
 
 static bool kvm_can_deliver_async_pf(struct kvm_vcpu *vcpu)
@@ -10373,9 +10420,8 @@ static bool kvm_can_deliver_async_pf(struct kvm_vcpu *vcpu)
 	if (!vcpu->arch.apf.delivery_as_pf_vmexit && is_guest_mode(vcpu))
 		return false;
 
-	if (!(vcpu->arch.apf.msr_val & KVM_ASYNC_PF_ENABLED) ||
-	    (vcpu->arch.apf.send_user_only &&
-	     kvm_x86_ops.get_cpl(vcpu) == 0))
+	if (!kvm_pv_async_pf_enabled(vcpu) ||
+	    (vcpu->arch.apf.send_user_only && kvm_x86_ops.get_cpl(vcpu) == 0))
 		return false;
 
 	return true;
@@ -10431,7 +10477,10 @@ void kvm_arch_async_page_not_present(struct kvm_vcpu *vcpu,
 void kvm_arch_async_page_present(struct kvm_vcpu *vcpu,
 				 struct kvm_async_pf *work)
 {
-	struct x86_exception fault;
+	struct kvm_lapic_irq irq = {
+		.delivery_mode = APIC_DM_FIXED,
+		.vector = vcpu->arch.apf.vec
+	};
 
 	if (work->wakeup_all)
 		work->arch.token = ~0; /* broadcast wakeup */
@@ -10439,26 +10488,20 @@ void kvm_arch_async_page_present(struct kvm_vcpu *vcpu,
 		kvm_del_async_pf_gfn(vcpu, work->arch.gfn);
 	trace_kvm_async_pf_ready(work->arch.token, work->cr2_or_gpa);
 
-	if (vcpu->arch.apf.msr_val & KVM_ASYNC_PF_ENABLED &&
-	    !apf_put_user_ready(vcpu, work->arch.token)) {
-			fault.vector = PF_VECTOR;
-			fault.error_code_valid = true;
-			fault.error_code = 0;
-			fault.nested_page_fault = false;
-			fault.address = work->arch.token;
-			fault.async_page_fault = true;
-			kvm_inject_page_fault(vcpu, &fault);
-	}
+	if (kvm_pv_async_pf_enabled(vcpu) &&
+	    !apf_put_user_ready(vcpu, work->arch.token))
+		kvm_apic_set_irq(vcpu, &irq, NULL);
+
 	vcpu->arch.apf.halted = false;
 	vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;
 }
 
 bool kvm_arch_can_inject_async_page_present(struct kvm_vcpu *vcpu)
 {
-	if (!(vcpu->arch.apf.msr_val & KVM_ASYNC_PF_ENABLED))
+	if (!kvm_pv_async_pf_enabled(vcpu))
 		return true;
 	else
-		return kvm_can_do_async_pf(vcpu);
+		return apf_pageready_slot_free(vcpu);
 }
 
 void kvm_arch_start_assignment(struct kvm *kvm)
-- 
2.25.4


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH 5/8] KVM: x86: acknowledgment mechanism for async pf page ready notifications
  2020-05-11 16:47 [PATCH 0/8] KVM: x86: Interrupt-based mechanism for async_pf 'page present' notifications Vitaly Kuznetsov
                   ` (3 preceding siblings ...)
  2020-05-11 16:47 ` [PATCH 4/8] KVM: x86: interrupt based APF page-ready event delivery Vitaly Kuznetsov
@ 2020-05-11 16:47 ` Vitaly Kuznetsov
  2020-05-11 16:47 ` [PATCH 6/8] KVM: x86: announce KVM_FEATURE_ASYNC_PF_INT Vitaly Kuznetsov
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 43+ messages in thread
From: Vitaly Kuznetsov @ 2020-05-11 16:47 UTC (permalink / raw)
  To: kvm, x86
  Cc: Paolo Bonzini, Andy Lutomirski, Thomas Gleixner, Borislav Petkov,
	H. Peter Anvin, Wanpeng Li, Sean Christopherson, Jim Mattson,
	Vivek Goyal, Gavin Shan, Peter Zijlstra, linux-kernel

If two page ready notifications happen back to back the second one is not
delivered and the only mechanism we currently have is
kvm_check_async_pf_completion() check in vcpu_run() loop. The check will
only be performed with the next vmexit when it happens and in some cases
it may take a while. With interrupt based page ready notification delivery
the situation is even worse: unlike exceptions, interrupts are not handled
immediately so we must check if the slot is empty. This is slow and
unnecessary. Introduce dedicated MSR_KVM_ASYNC_PF_ACK MSR to communicate
the fact that the slot is free and host should check its notification
queue. Mandate using it for interrupt based type 2 APF event delivery.

As kvm_check_async_pf_completion() is going away from vcpu_run() we need
a way to communicate the fact that vcpu->async_pf.done queue has
transitioned from empty to non-empty state. Introduce
kvm_arch_async_page_present_queued() and KVM_REQ_APF_READY to do the job.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
---
 Documentation/virt/kvm/msr.rst       | 18 +++++++++++++++---
 arch/s390/include/asm/kvm_host.h     |  2 ++
 arch/x86/include/asm/kvm_host.h      |  3 +++
 arch/x86/include/uapi/asm/kvm_para.h |  1 +
 arch/x86/kvm/x86.c                   | 26 ++++++++++++++++++++++----
 virt/kvm/async_pf.c                  | 10 ++++++++++
 6 files changed, 53 insertions(+), 7 deletions(-)

diff --git a/Documentation/virt/kvm/msr.rst b/Documentation/virt/kvm/msr.rst
index f988a36f226a..3c7afbcd243f 100644
--- a/Documentation/virt/kvm/msr.rst
+++ b/Documentation/virt/kvm/msr.rst
@@ -230,9 +230,10 @@ data:
 	The content of these bytes is a token which was previously delivered as
 	type 1 event. The event indicates the page in now available. Guest is
 	supposed to write '0' to the location when it is done handling type 2
-	event so the next one can be delivered. MSR_KVM_ASYNC_PF_INT MSR
-	specifying the interrupt vector for type 2 APF delivery needs to be
-	written to before enabling APF mechanism in MSR_KVM_ASYNC_PF_EN.
+	event so the next one can be delivered. It is also supposed to write
+	'1' to MSR_KVM_ASYNC_PF_ACK every time after clearing the location,
+	this forces KVM to re-scan its queue and deliver next pending
+	notification.
 
 	Note, previously, type 2 (page present) events were delivered via the
 	same #PF exception as type 1 (page not present) events but this is
@@ -352,3 +353,14 @@ data:
 	Interrupt vector for asynchnonous page ready notifications delivery.
 	The vector has to be set up before asynchronous page fault mechanism
 	is enabled in MSR_KVM_ASYNC_PF_EN.
+
+MSR_KVM_ASYNC_PF_ACK:
+	0x4b564d07
+
+data:
+	Asynchronous page fault (APF) acknowledgment.
+
+	When the guest is done processing type 2 APF event and 'pageready_token'
+	field in 'struct kvm_vcpu_pv_apf_data' is cleared it is supposed to
+	write '1' to bit 0 of the MSR, this caused the host to re-scan its queue
+	and check if there are more notifications pending.
diff --git a/arch/s390/include/asm/kvm_host.h b/arch/s390/include/asm/kvm_host.h
index d6bcd34f3ec3..c9c5b8b71175 100644
--- a/arch/s390/include/asm/kvm_host.h
+++ b/arch/s390/include/asm/kvm_host.h
@@ -982,6 +982,8 @@ void kvm_arch_async_page_not_present(struct kvm_vcpu *vcpu,
 void kvm_arch_async_page_present(struct kvm_vcpu *vcpu,
 				 struct kvm_async_pf *work);
 
+static inline void kvm_arch_async_page_present_queued(struct kvm_vcpu *vcpu) {}
+
 void kvm_arch_crypto_clear_masks(struct kvm *kvm);
 void kvm_arch_crypto_set_masks(struct kvm *kvm, unsigned long *apm,
 			       unsigned long *aqm, unsigned long *adm);
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 4a72ba26a532..0eeeb98ed02a 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -83,6 +83,7 @@
 #define KVM_REQ_GET_VMCS12_PAGES	KVM_ARCH_REQ(24)
 #define KVM_REQ_APICV_UPDATE \
 	KVM_ARCH_REQ_FLAGS(25, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
+#define KVM_REQ_APF_READY		KVM_ARCH_REQ(26)
 
 #define CR0_RESERVED_BITS                                               \
 	(~(unsigned long)(X86_CR0_PE | X86_CR0_MP | X86_CR0_EM | X86_CR0_TS \
@@ -771,6 +772,7 @@ struct kvm_vcpu_arch {
 		u32 host_apf_reason;
 		unsigned long nested_apf_token;
 		bool delivery_as_pf_vmexit;
+		bool pageready_pending;
 	} apf;
 
 	/* OSVW MSRs (AMD only) */
@@ -1643,6 +1645,7 @@ void kvm_arch_async_page_present(struct kvm_vcpu *vcpu,
 				 struct kvm_async_pf *work);
 void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu,
 			       struct kvm_async_pf *work);
+void kvm_arch_async_page_present_queued(struct kvm_vcpu *vcpu);
 bool kvm_arch_can_inject_async_page_present(struct kvm_vcpu *vcpu);
 extern bool kvm_find_async_pf_gfn(struct kvm_vcpu *vcpu, gfn_t gfn);
 
diff --git a/arch/x86/include/uapi/asm/kvm_para.h b/arch/x86/include/uapi/asm/kvm_para.h
index c55ffa2c40b6..941fdf4569a4 100644
--- a/arch/x86/include/uapi/asm/kvm_para.h
+++ b/arch/x86/include/uapi/asm/kvm_para.h
@@ -51,6 +51,7 @@
 #define MSR_KVM_PV_EOI_EN      0x4b564d04
 #define MSR_KVM_POLL_CONTROL	0x4b564d05
 #define MSR_KVM_ASYNC_PF_INT	0x4b564d06
+#define MSR_KVM_ASYNC_PF_ACK	0x4b564d07
 
 struct kvm_steal_time {
 	__u64 steal;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index d1e9b072d279..d21adc23a99f 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1243,7 +1243,7 @@ static const u32 emulated_msrs_all[] = {
 	HV_X64_MSR_TSC_EMULATION_STATUS,
 
 	MSR_KVM_ASYNC_PF_EN, MSR_KVM_STEAL_TIME,
-	MSR_KVM_PV_EOI_EN, MSR_KVM_ASYNC_PF_INT,
+	MSR_KVM_PV_EOI_EN, MSR_KVM_ASYNC_PF_INT, MSR_KVM_ASYNC_PF_ACK,
 
 	MSR_IA32_TSC_ADJUST,
 	MSR_IA32_TSCDEADLINE,
@@ -2912,6 +2912,12 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 		if (kvm_pv_enable_async_pf_int(vcpu, data))
 			return 1;
 		break;
+	case MSR_KVM_ASYNC_PF_ACK:
+		if (data & 0x1) {
+			vcpu->arch.apf.pageready_pending = false;
+			kvm_check_async_pf_completion(vcpu);
+		}
+		break;
 	case MSR_KVM_STEAL_TIME:
 
 		if (unlikely(!sched_info_on()))
@@ -3191,6 +3197,9 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 	case MSR_KVM_ASYNC_PF_INT:
 		msr_info->data = vcpu->arch.apf.msr_int_val;
 		break;
+	case MSR_KVM_ASYNC_PF_ACK:
+		msr_info->data = 0;
+		break;
 	case MSR_KVM_STEAL_TIME:
 		msr_info->data = vcpu->arch.st.msr_val;
 		break;
@@ -8336,6 +8345,8 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 			kvm_hv_process_stimers(vcpu);
 		if (kvm_check_request(KVM_REQ_APICV_UPDATE, vcpu))
 			kvm_vcpu_update_apicv(vcpu);
+		if (kvm_check_request(KVM_REQ_APF_READY, vcpu))
+			kvm_check_async_pf_completion(vcpu);
 	}
 
 	if (kvm_check_request(KVM_REQ_EVENT, vcpu) || req_int_win) {
@@ -8610,8 +8621,6 @@ static int vcpu_run(struct kvm_vcpu *vcpu)
 			break;
 		}
 
-		kvm_check_async_pf_completion(vcpu);
-
 		if (signal_pending(current)) {
 			r = -EINTR;
 			vcpu->run->exit_reason = KVM_EXIT_INTR;
@@ -10489,13 +10498,22 @@ void kvm_arch_async_page_present(struct kvm_vcpu *vcpu,
 	trace_kvm_async_pf_ready(work->arch.token, work->cr2_or_gpa);
 
 	if (kvm_pv_async_pf_enabled(vcpu) &&
-	    !apf_put_user_ready(vcpu, work->arch.token))
+	    !apf_put_user_ready(vcpu, work->arch.token)) {
+		vcpu->arch.apf.pageready_pending = true;
 		kvm_apic_set_irq(vcpu, &irq, NULL);
+	}
 
 	vcpu->arch.apf.halted = false;
 	vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;
 }
 
+void kvm_arch_async_page_present_queued(struct kvm_vcpu *vcpu)
+{
+	kvm_make_request(KVM_REQ_APF_READY, vcpu);
+	if (!vcpu->arch.apf.pageready_pending)
+		kvm_vcpu_kick(vcpu);
+}
+
 bool kvm_arch_can_inject_async_page_present(struct kvm_vcpu *vcpu)
 {
 	if (!kvm_pv_async_pf_enabled(vcpu))
diff --git a/virt/kvm/async_pf.c b/virt/kvm/async_pf.c
index 15e5b037f92d..746d95cf98ad 100644
--- a/virt/kvm/async_pf.c
+++ b/virt/kvm/async_pf.c
@@ -51,6 +51,7 @@ static void async_pf_execute(struct work_struct *work)
 	unsigned long addr = apf->addr;
 	gpa_t cr2_or_gpa = apf->cr2_or_gpa;
 	int locked = 1;
+	bool first;
 
 	might_sleep();
 
@@ -69,10 +70,14 @@ static void async_pf_execute(struct work_struct *work)
 		kvm_arch_async_page_present(vcpu, apf);
 
 	spin_lock(&vcpu->async_pf.lock);
+	first = list_empty(&vcpu->async_pf.done);
 	list_add_tail(&apf->link, &vcpu->async_pf.done);
 	apf->vcpu = NULL;
 	spin_unlock(&vcpu->async_pf.lock);
 
+	if (!IS_ENABLED(CONFIG_KVM_ASYNC_PF_SYNC) && first)
+		kvm_arch_async_page_present_queued(vcpu);
+
 	/*
 	 * apf may be freed by kvm_check_async_pf_completion() after
 	 * this point
@@ -202,6 +207,7 @@ int kvm_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
 int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu)
 {
 	struct kvm_async_pf *work;
+	bool first;
 
 	if (!list_empty_careful(&vcpu->async_pf.done))
 		return 0;
@@ -214,9 +220,13 @@ int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu)
 	INIT_LIST_HEAD(&work->queue); /* for list_del to work */
 
 	spin_lock(&vcpu->async_pf.lock);
+	first = list_empty(&vcpu->async_pf.done);
 	list_add_tail(&work->link, &vcpu->async_pf.done);
 	spin_unlock(&vcpu->async_pf.lock);
 
+	if (!IS_ENABLED(CONFIG_KVM_ASYNC_PF_SYNC) && first)
+		kvm_arch_async_page_present_queued(vcpu);
+
 	vcpu->async_pf.queued++;
 	return 0;
 }
-- 
2.25.4


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH 6/8] KVM: x86: announce KVM_FEATURE_ASYNC_PF_INT
  2020-05-11 16:47 [PATCH 0/8] KVM: x86: Interrupt-based mechanism for async_pf 'page present' notifications Vitaly Kuznetsov
                   ` (4 preceding siblings ...)
  2020-05-11 16:47 ` [PATCH 5/8] KVM: x86: acknowledgment mechanism for async pf page ready notifications Vitaly Kuznetsov
@ 2020-05-11 16:47 ` Vitaly Kuznetsov
  2020-05-11 16:47 ` [PATCH 7/8] KVM: x86: Switch KVM guest to using interrupts for page ready APF delivery Vitaly Kuznetsov
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 43+ messages in thread
From: Vitaly Kuznetsov @ 2020-05-11 16:47 UTC (permalink / raw)
  To: kvm, x86
  Cc: Paolo Bonzini, Andy Lutomirski, Thomas Gleixner, Borislav Petkov,
	H. Peter Anvin, Wanpeng Li, Sean Christopherson, Jim Mattson,
	Vivek Goyal, Gavin Shan, Peter Zijlstra, linux-kernel

Introduce new capability to indicate that KVM supports interrupt based
delivery of type 2 APF events (page ready notifications). This includes
support for both MSR_KVM_ASYNC_PF_INT and MSR_KVM_ASYNC_PF_ACK.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
---
 Documentation/virt/kvm/cpuid.rst     | 6 ++++++
 Documentation/virt/kvm/msr.rst       | 9 ++++++---
 arch/x86/include/uapi/asm/kvm_para.h | 1 +
 arch/x86/kvm/cpuid.c                 | 3 ++-
 arch/x86/kvm/x86.c                   | 1 +
 include/uapi/linux/kvm.h             | 1 +
 6 files changed, 17 insertions(+), 4 deletions(-)

diff --git a/Documentation/virt/kvm/cpuid.rst b/Documentation/virt/kvm/cpuid.rst
index 01b081f6e7ea..98cbd257ac33 100644
--- a/Documentation/virt/kvm/cpuid.rst
+++ b/Documentation/virt/kvm/cpuid.rst
@@ -86,6 +86,12 @@ KVM_FEATURE_PV_SCHED_YIELD        13          guest checks this feature bit
                                               before using paravirtualized
                                               sched yield.
 
+KVM_FEATURE_ASYNC_PF_INT          14          guest checks this feature bit
+                                              before using the second async
+                                              pf control msr 0x4b564d06 and
+                                              async pf acknowledgment msr
+                                              0x4b564d07.
+
 KVM_FEATURE_CLOCSOURCE_STABLE_BIT 24          host will warn if no guest-side
                                               per-cpu warps are expeced in
                                               kvmclock
diff --git a/Documentation/virt/kvm/msr.rst b/Documentation/virt/kvm/msr.rst
index 3c7afbcd243f..0e1ec0412714 100644
--- a/Documentation/virt/kvm/msr.rst
+++ b/Documentation/virt/kvm/msr.rst
@@ -209,7 +209,8 @@ data:
 	cpl == 0. Bit 2 is 1 if asynchronous page faults are delivered to L1 as
 	#PF vmexits.  Bit 2 can be set only if KVM_FEATURE_ASYNC_PF_VMEXIT is
 	present in CPUID. Bit 3 enables interrupt based delivery of type 2
-	(page present) events.
+	(page present) events. Bit 3 can be set only if KVM_FEATURE_ASYNC_PF_INT
+	is present in CPUID.
 
 	First 4 byte of 64 byte memory location ('reason') will be written to
 	by the hypervisor at the time APF type 1 (page not present) injection.
@@ -352,7 +353,8 @@ data:
 
 	Interrupt vector for asynchnonous page ready notifications delivery.
 	The vector has to be set up before asynchronous page fault mechanism
-	is enabled in MSR_KVM_ASYNC_PF_EN.
+	is enabled in MSR_KVM_ASYNC_PF_EN.  The MSR is only available if
+	KVM_FEATURE_ASYNC_PF_INT is present in CPUID.
 
 MSR_KVM_ASYNC_PF_ACK:
 	0x4b564d07
@@ -363,4 +365,5 @@ data:
 	When the guest is done processing type 2 APF event and 'pageready_token'
 	field in 'struct kvm_vcpu_pv_apf_data' is cleared it is supposed to
 	write '1' to bit 0 of the MSR, this caused the host to re-scan its queue
-	and check if there are more notifications pending.
+	and check if there are more notifications pending. The MSR is only
+	available if KVM_FEATURE_ASYNC_PF_INT is present in CPUID.
diff --git a/arch/x86/include/uapi/asm/kvm_para.h b/arch/x86/include/uapi/asm/kvm_para.h
index 941fdf4569a4..921a55de5004 100644
--- a/arch/x86/include/uapi/asm/kvm_para.h
+++ b/arch/x86/include/uapi/asm/kvm_para.h
@@ -31,6 +31,7 @@
 #define KVM_FEATURE_PV_SEND_IPI	11
 #define KVM_FEATURE_POLL_CONTROL	12
 #define KVM_FEATURE_PV_SCHED_YIELD	13
+#define KVM_FEATURE_ASYNC_PF_INT	14
 
 #define KVM_HINTS_REALTIME      0
 
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index 901cd1fdecd9..790fe4988001 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -712,7 +712,8 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function)
 			     (1 << KVM_FEATURE_ASYNC_PF_VMEXIT) |
 			     (1 << KVM_FEATURE_PV_SEND_IPI) |
 			     (1 << KVM_FEATURE_POLL_CONTROL) |
-			     (1 << KVM_FEATURE_PV_SCHED_YIELD);
+			     (1 << KVM_FEATURE_PV_SCHED_YIELD) |
+			     (1 << KVM_FEATURE_ASYNC_PF_INT);
 
 		if (sched_info_on())
 			entry->eax |= (1 << KVM_FEATURE_STEAL_TIME);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index d21adc23a99f..ad826d922368 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3412,6 +3412,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 	case KVM_CAP_X86_ROBUST_SINGLESTEP:
 	case KVM_CAP_XSAVE:
 	case KVM_CAP_ASYNC_PF:
+	case KVM_CAP_ASYNC_PF_INT:
 	case KVM_CAP_GET_TSC_KHZ:
 	case KVM_CAP_KVMCLOCK_CTRL:
 	case KVM_CAP_READONLY_MEM:
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 428c7dde6b4b..15012f78a691 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1017,6 +1017,7 @@ struct kvm_ppc_resize_hpt {
 #define KVM_CAP_S390_VCPU_RESETS 179
 #define KVM_CAP_S390_PROTECTED 180
 #define KVM_CAP_PPC_SECURE_GUEST 181
+#define KVM_CAP_ASYNC_PF_INT 182
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
-- 
2.25.4


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH 7/8] KVM: x86: Switch KVM guest to using interrupts for page ready APF delivery
  2020-05-11 16:47 [PATCH 0/8] KVM: x86: Interrupt-based mechanism for async_pf 'page present' notifications Vitaly Kuznetsov
                   ` (5 preceding siblings ...)
  2020-05-11 16:47 ` [PATCH 6/8] KVM: x86: announce KVM_FEATURE_ASYNC_PF_INT Vitaly Kuznetsov
@ 2020-05-11 16:47 ` Vitaly Kuznetsov
  2020-05-11 16:47 ` [PATCH 8/8] KVM: x86: drop KVM_PV_REASON_PAGE_READY case from kvm_handle_page_fault() Vitaly Kuznetsov
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 43+ messages in thread
From: Vitaly Kuznetsov @ 2020-05-11 16:47 UTC (permalink / raw)
  To: kvm, x86
  Cc: Paolo Bonzini, Andy Lutomirski, Thomas Gleixner, Borislav Petkov,
	H. Peter Anvin, Wanpeng Li, Sean Christopherson, Jim Mattson,
	Vivek Goyal, Gavin Shan, Peter Zijlstra, linux-kernel

KVM now supports using interrupt for type 2 APF event delivery (page ready
notifications) and legacy mechanism was deprecated. Switch KVM guests to
the new one.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
---
 arch/x86/entry/entry_32.S          |  5 ++++
 arch/x86/entry/entry_64.S          |  5 ++++
 arch/x86/include/asm/hardirq.h     |  3 +++
 arch/x86/include/asm/irq_vectors.h |  6 ++++-
 arch/x86/include/asm/kvm_para.h    |  6 +++++
 arch/x86/kernel/irq.c              |  9 +++++++
 arch/x86/kernel/kvm.c              | 42 +++++++++++++++++++++++-------
 7 files changed, 65 insertions(+), 11 deletions(-)

diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index b67bae7091d7..d574dadcb2a1 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -1475,6 +1475,11 @@ BUILD_INTERRUPT3(hv_stimer0_callback_vector, HYPERV_STIMER0_VECTOR,
 
 #endif /* CONFIG_HYPERV */
 
+#ifdef CONFIG_KVM_GUEST
+BUILD_INTERRUPT3(kvm_async_pf_vector, KVM_ASYNC_PF_VECTOR,
+		 kvm_async_pf_intr)
+#endif
+
 SYM_CODE_START(page_fault)
 	ASM_CLAC
 	pushl	$do_page_fault
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 0e9504fabe52..6f127c1a6547 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -1190,6 +1190,11 @@ apicinterrupt3 HYPERVISOR_CALLBACK_VECTOR \
 	acrn_hv_callback_vector acrn_hv_vector_handler
 #endif
 
+#ifdef CONFIG_KVM_GUEST
+apicinterrupt3 KVM_ASYNC_PF_VECTOR \
+	kvm_async_pf_vector kvm_async_pf_intr
+#endif
+
 idtentry debug			do_debug		has_error_code=0	paranoid=1 shift_ist=IST_INDEX_DB ist_offset=DB_STACK_OFFSET
 idtentry int3			do_int3			has_error_code=0	create_gap=1
 idtentry stack_segment		do_stack_segment	has_error_code=1
diff --git a/arch/x86/include/asm/hardirq.h b/arch/x86/include/asm/hardirq.h
index 07533795b8d2..be0fbb15ad7f 100644
--- a/arch/x86/include/asm/hardirq.h
+++ b/arch/x86/include/asm/hardirq.h
@@ -44,6 +44,9 @@ typedef struct {
 	unsigned int irq_hv_reenlightenment_count;
 	unsigned int hyperv_stimer0_count;
 #endif
+#ifdef CONFIG_KVM_GUEST
+	unsigned int kvm_async_pf_pageready_count;
+#endif
 } ____cacheline_aligned irq_cpustat_t;
 
 DECLARE_PER_CPU_SHARED_ALIGNED(irq_cpustat_t, irq_stat);
diff --git a/arch/x86/include/asm/irq_vectors.h b/arch/x86/include/asm/irq_vectors.h
index 889f8b1b5b7f..8879a9ecd908 100644
--- a/arch/x86/include/asm/irq_vectors.h
+++ b/arch/x86/include/asm/irq_vectors.h
@@ -104,7 +104,11 @@
 #define HYPERV_STIMER0_VECTOR		0xed
 #endif
 
-#define LOCAL_TIMER_VECTOR		0xec
+#ifdef CONFIG_KVM_GUEST
+#define KVM_ASYNC_PF_VECTOR		0xec
+#endif
+
+#define LOCAL_TIMER_VECTOR		0xeb
 
 #define NR_VECTORS			 256
 
diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index 9b4df6eaa11a..fde4f21607f9 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -4,6 +4,7 @@
 
 #include <asm/processor.h>
 #include <asm/alternative.h>
+#include <linux/interrupt.h>
 #include <uapi/asm/kvm_para.h>
 
 extern void kvmclock_init(void);
@@ -93,6 +94,11 @@ void kvm_async_pf_task_wake(u32 token);
 u32 kvm_read_and_reset_pf_reason(void);
 extern void kvm_disable_steal_time(void);
 void do_async_page_fault(struct pt_regs *regs, unsigned long error_code, unsigned long address);
+extern void kvm_async_pf_vector(void);
+#ifdef CONFIG_TRACING
+#define trace_kvm_async_pf_vector kvm_async_pf_vector
+#endif
+__visible void __irq_entry kvm_async_pf_intr(struct pt_regs *regs);
 
 #ifdef CONFIG_PARAVIRT_SPINLOCKS
 void __init kvm_spinlock_init(void);
diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c
index c7965ff429c5..a4c2f25ad74d 100644
--- a/arch/x86/kernel/irq.c
+++ b/arch/x86/kernel/irq.c
@@ -159,6 +159,15 @@ int arch_show_interrupts(struct seq_file *p, int prec)
 				   irq_stats(j)->hyperv_stimer0_count);
 		seq_puts(p, "  Hyper-V stimer0 interrupts\n");
 	}
+#endif
+#ifdef CONFIG_KVM_GUEST
+	if (test_bit(KVM_ASYNC_PF_VECTOR, system_vectors)) {
+		seq_printf(p, "%*s: ", prec, "APF");
+		for_each_online_cpu(j)
+			seq_printf(p, "%10u ",
+				   irq_stats(j)->kvm_async_pf_pageready_count);
+		seq_puts(p, "  KVM async PF page ready interrupts\n");
+	}
 #endif
 	seq_printf(p, "%*s: %10u\n", prec, "ERR", atomic_read(&irq_err_count));
 #if defined(CONFIG_X86_IO_APIC)
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 6efe0410fb72..b6453d227604 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -245,23 +245,40 @@ NOKPROBE_SYMBOL(kvm_read_and_reset_pf_reason);
 dotraplinkage void
 do_async_page_fault(struct pt_regs *regs, unsigned long error_code, unsigned long address)
 {
-	switch (kvm_read_and_reset_pf_reason()) {
-	default:
-		do_page_fault(regs, error_code, address);
-		break;
+	u32 reason = kvm_read_and_reset_pf_reason();
+
+	switch (reason) {
 	case KVM_PV_REASON_PAGE_NOT_PRESENT:
 		/* page is swapped out by the host. */
 		kvm_async_pf_task_wait((u32)address, !user_mode(regs));
 		break;
-	case KVM_PV_REASON_PAGE_READY:
-		rcu_irq_enter();
-		kvm_async_pf_task_wake((u32)address);
-		rcu_irq_exit();
+	case 0:
+		do_page_fault(regs, error_code, address);
 		break;
+	default:
+		WARN_ONCE(1, "Unexpected async PF reason: %x\n", reason);
 	}
 }
 NOKPROBE_SYMBOL(do_async_page_fault);
 
+__visible void __irq_entry kvm_async_pf_intr(struct pt_regs *regs)
+{
+	u32 token, reason;
+
+	entering_ack_irq();
+
+	inc_irq_stat(kvm_async_pf_pageready_count);
+
+	if (__this_cpu_read(apf_reason.enabled)) {
+		token = __this_cpu_read(apf_reason.pageready_token);
+		kvm_async_pf_task_wake(token);
+		__this_cpu_write(apf_reason.pageready_token, 0);
+		wrmsrl(MSR_KVM_ASYNC_PF_ACK, 1);
+	}
+
+	exiting_irq();
+}
+
 static void __init paravirt_ops_setup(void)
 {
 	pv_info.name = "KVM";
@@ -305,17 +322,19 @@ static notrace void kvm_guest_apic_eoi_write(u32 reg, u32 val)
 
 static void kvm_guest_cpu_init(void)
 {
-	if (kvm_para_has_feature(KVM_FEATURE_ASYNC_PF) && kvmapf) {
+	if (kvm_para_has_feature(KVM_FEATURE_ASYNC_PF_INT) && kvmapf) {
 		u64 pa = slow_virt_to_phys(this_cpu_ptr(&apf_reason));
 
 #ifdef CONFIG_PREEMPTION
 		pa |= KVM_ASYNC_PF_SEND_ALWAYS;
 #endif
-		pa |= KVM_ASYNC_PF_ENABLED;
+		pa |= KVM_ASYNC_PF_ENABLED | KVM_ASYNC_PF_DELIVERY_AS_INT;
 
 		if (kvm_para_has_feature(KVM_FEATURE_ASYNC_PF_VMEXIT))
 			pa |= KVM_ASYNC_PF_DELIVERY_AS_PF_VMEXIT;
 
+		wrmsrl(MSR_KVM_ASYNC_PF_INT, KVM_ASYNC_PF_VECTOR);
+
 		wrmsrl(MSR_KVM_ASYNC_PF_EN, pa);
 		__this_cpu_write(apf_reason.enabled, 1);
 		printk(KERN_INFO"KVM setup async PF for cpu %d\n",
@@ -649,6 +668,9 @@ static void __init kvm_guest_init(void)
 	if (kvm_para_has_feature(KVM_FEATURE_PV_EOI))
 		apic_set_eoi_write(kvm_guest_apic_eoi_write);
 
+	if (kvm_para_has_feature(KVM_FEATURE_ASYNC_PF_INT))
+		alloc_intr_gate(KVM_ASYNC_PF_VECTOR, kvm_async_pf_vector);
+
 #ifdef CONFIG_SMP
 	smp_ops.smp_prepare_cpus = kvm_smp_prepare_cpus;
 	smp_ops.smp_prepare_boot_cpu = kvm_smp_prepare_boot_cpu;
-- 
2.25.4


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH 8/8] KVM: x86: drop KVM_PV_REASON_PAGE_READY case from kvm_handle_page_fault()
  2020-05-11 16:47 [PATCH 0/8] KVM: x86: Interrupt-based mechanism for async_pf 'page present' notifications Vitaly Kuznetsov
                   ` (6 preceding siblings ...)
  2020-05-11 16:47 ` [PATCH 7/8] KVM: x86: Switch KVM guest to using interrupts for page ready APF delivery Vitaly Kuznetsov
@ 2020-05-11 16:47 ` Vitaly Kuznetsov
  2020-05-12 15:32 ` [PATCH 0/8] KVM: x86: Interrupt-based mechanism for async_pf 'page present' notifications Vivek Goyal
  2020-05-13 14:16 ` Vivek Goyal
  9 siblings, 0 replies; 43+ messages in thread
From: Vitaly Kuznetsov @ 2020-05-11 16:47 UTC (permalink / raw)
  To: kvm, x86
  Cc: Paolo Bonzini, Andy Lutomirski, Thomas Gleixner, Borislav Petkov,
	H. Peter Anvin, Wanpeng Li, Sean Christopherson, Jim Mattson,
	Vivek Goyal, Gavin Shan, Peter Zijlstra, linux-kernel

KVM guest code in Linux enables APF only when KVM_FEATURE_ASYNC_PF_INT
is supported, this means we will never see KVM_PV_REASON_PAGE_READY
when handling page fault vmexit in KVM.

While on it, make sure we only follow genuine page fault path when
APF reason is zero. If we happen to see something else this means
that the underlying hypervisor is misbehaving. Leave WARN_ON_ONCE()
to catch that.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
---
 arch/x86/kvm/mmu/mmu.c | 10 +++-------
 1 file changed, 3 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 8071952e9cf2..5a9fca908ca9 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4187,7 +4187,7 @@ int kvm_handle_page_fault(struct kvm_vcpu *vcpu, u64 error_code,
 
 	vcpu->arch.l1tf_flush_l1d = true;
 	switch (vcpu->arch.apf.host_apf_reason) {
-	default:
+	case 0:
 		trace_kvm_page_fault(fault_address, error_code);
 
 		if (kvm_event_needs_reinjection(vcpu))
@@ -4201,12 +4201,8 @@ int kvm_handle_page_fault(struct kvm_vcpu *vcpu, u64 error_code,
 		kvm_async_pf_task_wait(fault_address, 0);
 		local_irq_enable();
 		break;
-	case KVM_PV_REASON_PAGE_READY:
-		vcpu->arch.apf.host_apf_reason = 0;
-		local_irq_disable();
-		kvm_async_pf_task_wake(fault_address);
-		local_irq_enable();
-		break;
+	default:
+		WARN_ON_ONCE(1);
 	}
 	return r;
 }
-- 
2.25.4


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* Re: [PATCH 4/8] KVM: x86: interrupt based APF page-ready event delivery
  2020-05-11 16:47 ` [PATCH 4/8] KVM: x86: interrupt based APF page-ready event delivery Vitaly Kuznetsov
@ 2020-05-12 14:24   ` Vivek Goyal
  2020-05-12 15:50     ` Vitaly Kuznetsov
  0 siblings, 1 reply; 43+ messages in thread
From: Vivek Goyal @ 2020-05-12 14:24 UTC (permalink / raw)
  To: Vitaly Kuznetsov
  Cc: kvm, x86, Paolo Bonzini, Andy Lutomirski, Thomas Gleixner,
	Borislav Petkov, H. Peter Anvin, Wanpeng Li, Sean Christopherson,
	Jim Mattson, Gavin Shan, Peter Zijlstra, linux-kernel

On Mon, May 11, 2020 at 06:47:48PM +0200, Vitaly Kuznetsov wrote:
> Concerns were expressed around APF delivery via synthetic #PF exception as
> in some cases such delivery may collide with real page fault. For type 2
> (page ready) notifications we can easily switch to using an interrupt
> instead. Introduce new MSR_KVM_ASYNC_PF_INT mechanism and deprecate the
> legacy one.
> 
> One notable difference between the two mechanisms is that interrupt may not
> get handled immediately so whenever we would like to deliver next event
> (regardless of its type) we must be sure the guest had read and cleared
> previous event in the slot.
> 
> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
> ---
>  Documentation/virt/kvm/msr.rst       | 91 +++++++++++++++++---------
>  arch/x86/include/asm/kvm_host.h      |  4 +-
>  arch/x86/include/uapi/asm/kvm_para.h |  6 ++
>  arch/x86/kvm/x86.c                   | 95 ++++++++++++++++++++--------
>  4 files changed, 140 insertions(+), 56 deletions(-)
> 
> diff --git a/Documentation/virt/kvm/msr.rst b/Documentation/virt/kvm/msr.rst
> index 33892036672d..f988a36f226a 100644
> --- a/Documentation/virt/kvm/msr.rst
> +++ b/Documentation/virt/kvm/msr.rst
> @@ -190,35 +190,54 @@ MSR_KVM_ASYNC_PF_EN:
>  	0x4b564d02
>  
>  data:
> -	Bits 63-6 hold 64-byte aligned physical address of a
> -	64 byte memory area which must be in guest RAM and must be
> -	zeroed. Bits 5-3 are reserved and should be zero. Bit 0 is 1
> -	when asynchronous page faults are enabled on the vcpu 0 when
> -	disabled. Bit 1 is 1 if asynchronous page faults can be injected
> -	when vcpu is in cpl == 0. Bit 2 is 1 if asynchronous page faults
> -	are delivered to L1 as #PF vmexits.  Bit 2 can be set only if
> -	KVM_FEATURE_ASYNC_PF_VMEXIT is present in CPUID.
> -
> -	First 4 byte of 64 byte memory location will be written to by
> -	the hypervisor at the time of asynchronous page fault (APF)
> -	injection to indicate type of asynchronous page fault. Value
> -	of 1 means that the page referred to by the page fault is not
> -	present. Value 2 means that the page is now available. Disabling
> -	interrupt inhibits APFs. Guest must not enable interrupt
> -	before the reason is read, or it may be overwritten by another
> -	APF. Since APF uses the same exception vector as regular page
> -	fault guest must reset the reason to 0 before it does
> -	something that can generate normal page fault.  If during page
> -	fault APF reason is 0 it means that this is regular page
> -	fault.
> -
> -	During delivery of type 1 APF cr2 contains a token that will
> -	be used to notify a guest when missing page becomes
> -	available. When page becomes available type 2 APF is sent with
> -	cr2 set to the token associated with the page. There is special
> -	kind of token 0xffffffff which tells vcpu that it should wake
> -	up all processes waiting for APFs and no individual type 2 APFs
> -	will be sent.
> +	Asynchronous page fault (APF) control MSR.
> +
> +	Bits 63-6 hold 64-byte aligned physical address of a 64 byte memory area
> +	which must be in guest RAM and must be zeroed. This memory is expected
> +	to hold a copy of the following structure::
> +
> +	  struct kvm_vcpu_pv_apf_data {
> +		__u32 reason;
> +		__u32 pageready_token;
> +		__u8 pad[56];
> +		__u32 enabled;
> +	  };
> +
> +	Bits 5-4 of the MSR are reserved and should be zero. Bit 0 is set to 1
> +	when asynchronous page faults are enabled on the vcpu, 0 when disabled.
> +	Bit 1 is 1 if asynchronous page faults can be injected when vcpu is in
> +	cpl == 0. Bit 2 is 1 if asynchronous page faults are delivered to L1 as
> +	#PF vmexits.  Bit 2 can be set only if KVM_FEATURE_ASYNC_PF_VMEXIT is
> +	present in CPUID. Bit 3 enables interrupt based delivery of type 2
> +	(page present) events.

Hi Vitaly,

"Bit 3 enables interrupt based delivery of type 2 events". So one has to
opt in to enable it. If this bit is 0, we will continue to deliver
page ready events using #PF? This probably will be needed to ensure
backward compatibility also.

> +
> +	First 4 byte of 64 byte memory location ('reason') will be written to
> +	by the hypervisor at the time APF type 1 (page not present) injection.
> +	The only possible values are '0' and '1'.

What do "reason" values "0" and "1" signify?

Previously this value could be 1 for PAGE_NOT_PRESENT and 2 for
PAGE_READY. So looks like we took away reason "PAGE_READY" because it will
be delivered using interrupts.

But that seems like an opt in. If that's the case, then we should still
retain PAGE_READY reason. If we are getting rid of page_ready using
#PF, then interrupt based deliver should not be optional. What am I
missing.

Also previous text had following line.

"Guest must not enable interrupt before the reason is read, or it may be
 overwritten by another APF".

So this is not a requirement anymore?

> Type 1 events are currently
> +	always delivered as synthetic #PF exception. During delivery of type 1
> +	APF CR2 register contains a token that will be used to notify the guest
> +	when missing page becomes available. Guest is supposed to write '0' to
> +	the location when it is done handling type 1 event so the next one can
> +	be delivered.
> +
> +	Note, since APF type 1 uses the same exception vector as regular page
> +	fault, guest must reset the reason to '0' before it does something that
> +	can generate normal page fault. If during a page fault APF reason is '0'
> +	it means that this is regular page fault.
> +
> +	Bytes 5-7 of 64 byte memory location ('pageready_token') will be written
> +	to by the hypervisor at the time of type 2 (page ready) event injection.
> +	The content of these bytes is a token which was previously delivered as
> +	type 1 event. The event indicates the page in now available. Guest is
> +	supposed to write '0' to the location when it is done handling type 2
> +	event so the next one can be delivered. MSR_KVM_ASYNC_PF_INT MSR
> +	specifying the interrupt vector for type 2 APF delivery needs to be
> +	written to before enabling APF mechanism in MSR_KVM_ASYNC_PF_EN.

What is supposed to be value of "reason" field for type2 events. I
had liked previous values "KVM_PV_REASON_PAGE_READY" and
"KVM_PV_REASON_PAGE_NOT_PRESENT". Name itself made it plenty clear, what
it means. Also it allowed for easy extension where this protocol could
be extended to deliver other "reasons", like error.

So if we are using a common structure "kvm_vcpu_pv_apf_data" to deliver
type1 and type2 events, to me it makes sense to retain existing
KVM_PV_REASON_PAGE_READY and KVM_PV_REASON_PAGE_NOT_PRESENT. Just that
in new scheme of things, KVM_PV_REASON_PAGE_NOT_PRESENT will be delivered
using #PF (and later possibly using #VE) and KVM_PV_REASON_PAGE_READY
will be delivered using interrupt.

> +
> +	Note, previously, type 2 (page present) events were delivered via the
> +	same #PF exception as type 1 (page not present) events but this is
> +	now deprecated.

> If bit 3 (interrupt based delivery) is not set APF events are not delivered.

So all the old guests which were getting async pf will suddenly find
that async pf does not work anymore (after hypervisor update). And
some of them might report it as performance issue (if there were any
performance benefits to be had with async pf).

[..]
>  
>  bool kvm_arch_can_inject_async_page_present(struct kvm_vcpu *vcpu)
>  {
> -	if (!(vcpu->arch.apf.msr_val & KVM_ASYNC_PF_ENABLED))
> +	if (!kvm_pv_async_pf_enabled(vcpu))
>  		return true;

What does above mean. If async pf is not enabled, then it returns true,
implying one can inject async page present. But if async pf is not
enabled, there is no need to inject these events.

Thanks
Vivek


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 2/8] KVM: x86: extend struct kvm_vcpu_pv_apf_data with token info
  2020-05-11 16:47 ` [PATCH 2/8] KVM: x86: extend struct kvm_vcpu_pv_apf_data with token info Vitaly Kuznetsov
@ 2020-05-12 15:27   ` Vivek Goyal
  2020-05-12 15:40     ` Vitaly Kuznetsov
  2020-05-21 18:38   ` Vivek Goyal
  1 sibling, 1 reply; 43+ messages in thread
From: Vivek Goyal @ 2020-05-12 15:27 UTC (permalink / raw)
  To: Vitaly Kuznetsov
  Cc: kvm, x86, Paolo Bonzini, Andy Lutomirski, Thomas Gleixner,
	Borislav Petkov, H. Peter Anvin, Wanpeng Li, Sean Christopherson,
	Jim Mattson, Gavin Shan, Peter Zijlstra, linux-kernel

On Mon, May 11, 2020 at 06:47:46PM +0200, Vitaly Kuznetsov wrote:
> Currently, APF mechanism relies on the #PF abuse where the token is being
> passed through CR2. If we switch to using interrupts to deliver page-ready
> notifications we need a different way to pass the data. Extent the existing
> 'struct kvm_vcpu_pv_apf_data' with token information for page-ready
> notifications.
> 
> The newly introduced apf_put_user_ready() temporary puts both reason
> and token information, this will be changed to put token only when we
> switch to interrupt based notifications.
> 
> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
> ---
>  arch/x86/include/uapi/asm/kvm_para.h |  3 ++-
>  arch/x86/kvm/x86.c                   | 17 +++++++++++++----
>  2 files changed, 15 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/x86/include/uapi/asm/kvm_para.h b/arch/x86/include/uapi/asm/kvm_para.h
> index 2a8e0b6b9805..e3602a1de136 100644
> --- a/arch/x86/include/uapi/asm/kvm_para.h
> +++ b/arch/x86/include/uapi/asm/kvm_para.h
> @@ -113,7 +113,8 @@ struct kvm_mmu_op_release_pt {
>  
>  struct kvm_vcpu_pv_apf_data {
>  	__u32 reason;
> -	__u8 pad[60];
> +	__u32 pageready_token;

How about naming this just "token". That will allow me to deliver error
as well. pageready_token name seems to imply that this will always be
successful with page being ready.

And reason will tell whether page could successfully be ready or
it was an error. And token will help us identify the task which
is waiting for the event.

Thanks
Vivek


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 0/8] KVM: x86: Interrupt-based mechanism for async_pf 'page present' notifications
  2020-05-11 16:47 [PATCH 0/8] KVM: x86: Interrupt-based mechanism for async_pf 'page present' notifications Vitaly Kuznetsov
                   ` (7 preceding siblings ...)
  2020-05-11 16:47 ` [PATCH 8/8] KVM: x86: drop KVM_PV_REASON_PAGE_READY case from kvm_handle_page_fault() Vitaly Kuznetsov
@ 2020-05-12 15:32 ` Vivek Goyal
  2020-05-12 16:12   ` Vitaly Kuznetsov
  2020-05-13 14:16 ` Vivek Goyal
  9 siblings, 1 reply; 43+ messages in thread
From: Vivek Goyal @ 2020-05-12 15:32 UTC (permalink / raw)
  To: Vitaly Kuznetsov
  Cc: kvm, x86, Paolo Bonzini, Andy Lutomirski, Thomas Gleixner,
	Borislav Petkov, H. Peter Anvin, Wanpeng Li, Sean Christopherson,
	Jim Mattson, Gavin Shan, Peter Zijlstra, linux-kernel

Hi Vitaly,

Are there any corresponding qemu patches as well to enable new
functionality. Wanted to test it.

Thanks
Vivek

On Mon, May 11, 2020 at 06:47:44PM +0200, Vitaly Kuznetsov wrote:
> Concerns were expressed around (ab)using #PF for KVM's async_pf mechanism,
> it seems that re-using #PF exception for a PV mechanism wasn't a great
> idea after all. The Grand Plan is to switch to using e.g. #VE for 'page
> not present' events and normal APIC interrupts for 'page ready' events.
> This series does the later.
> 
> Changes since RFC:
> - Using #PF for 'page ready' is deprecated and removed [Paolo Bonzini]
> - 'reason' field in 'struct kvm_vcpu_pv_apf_data' is not used for 'page ready'
>   notifications and 'pageready_token' is not used for 'page not present' events
>   [Paolo Bonzini]
> - Renamed MSR_KVM_ASYNC_PF2 -> MSR_KVM_ASYNC_PF_INT [Peter Xu]
> - Drop 'enabled' field from MSR_KVM_ASYNC_PF_INT [Peter Xu]
> - Other minor changes supporting the above.
> 
> Vitaly Kuznetsov (8):
>   Revert "KVM: async_pf: Fix #DF due to inject "Page not Present" and
>     "Page Ready" exceptions simultaneously"
>   KVM: x86: extend struct kvm_vcpu_pv_apf_data with token info
>   KVM: introduce kvm_read_guest_offset_cached()
>   KVM: x86: interrupt based APF page-ready event delivery
>   KVM: x86: acknowledgment mechanism for async pf page ready
>     notifications
>   KVM: x86: announce KVM_FEATURE_ASYNC_PF_INT
>   KVM: x86: Switch KVM guest to using interrupts for page ready APF
>     delivery
>   KVM: x86: drop KVM_PV_REASON_PAGE_READY case from
>     kvm_handle_page_fault()
> 
>  Documentation/virt/kvm/cpuid.rst     |   6 ++
>  Documentation/virt/kvm/msr.rst       | 106 ++++++++++++++------
>  arch/s390/include/asm/kvm_host.h     |   2 +
>  arch/x86/entry/entry_32.S            |   5 +
>  arch/x86/entry/entry_64.S            |   5 +
>  arch/x86/include/asm/hardirq.h       |   3 +
>  arch/x86/include/asm/irq_vectors.h   |   6 +-
>  arch/x86/include/asm/kvm_host.h      |   7 +-
>  arch/x86/include/asm/kvm_para.h      |   6 ++
>  arch/x86/include/uapi/asm/kvm_para.h |  11 ++-
>  arch/x86/kernel/irq.c                |   9 ++
>  arch/x86/kernel/kvm.c                |  42 ++++++--
>  arch/x86/kvm/cpuid.c                 |   3 +-
>  arch/x86/kvm/mmu/mmu.c               |  10 +-
>  arch/x86/kvm/x86.c                   | 142 ++++++++++++++++++---------
>  include/linux/kvm_host.h             |   3 +
>  include/uapi/linux/kvm.h             |   1 +
>  virt/kvm/async_pf.c                  |  10 ++
>  virt/kvm/kvm_main.c                  |  19 +++-
>  19 files changed, 295 insertions(+), 101 deletions(-)
> 
> -- 
> 2.25.4
> 


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 2/8] KVM: x86: extend struct kvm_vcpu_pv_apf_data with token info
  2020-05-12 15:27   ` Vivek Goyal
@ 2020-05-12 15:40     ` Vitaly Kuznetsov
  2020-05-12 15:53       ` Vivek Goyal
  2020-05-12 21:15       ` Vivek Goyal
  0 siblings, 2 replies; 43+ messages in thread
From: Vitaly Kuznetsov @ 2020-05-12 15:40 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: kvm, x86, Paolo Bonzini, Andy Lutomirski, Thomas Gleixner,
	Borislav Petkov, H. Peter Anvin, Wanpeng Li, Sean Christopherson,
	Jim Mattson, Gavin Shan, Peter Zijlstra, linux-kernel

Vivek Goyal <vgoyal@redhat.com> writes:

> On Mon, May 11, 2020 at 06:47:46PM +0200, Vitaly Kuznetsov wrote:
>> Currently, APF mechanism relies on the #PF abuse where the token is being
>> passed through CR2. If we switch to using interrupts to deliver page-ready
>> notifications we need a different way to pass the data. Extent the existing
>> 'struct kvm_vcpu_pv_apf_data' with token information for page-ready
>> notifications.
>> 
>> The newly introduced apf_put_user_ready() temporary puts both reason
>> and token information, this will be changed to put token only when we
>> switch to interrupt based notifications.
>> 
>> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
>> ---
>>  arch/x86/include/uapi/asm/kvm_para.h |  3 ++-
>>  arch/x86/kvm/x86.c                   | 17 +++++++++++++----
>>  2 files changed, 15 insertions(+), 5 deletions(-)
>> 
>> diff --git a/arch/x86/include/uapi/asm/kvm_para.h b/arch/x86/include/uapi/asm/kvm_para.h
>> index 2a8e0b6b9805..e3602a1de136 100644
>> --- a/arch/x86/include/uapi/asm/kvm_para.h
>> +++ b/arch/x86/include/uapi/asm/kvm_para.h
>> @@ -113,7 +113,8 @@ struct kvm_mmu_op_release_pt {
>>  
>>  struct kvm_vcpu_pv_apf_data {
>>  	__u32 reason;
>> -	__u8 pad[60];
>> +	__u32 pageready_token;
>
> How about naming this just "token". That will allow me to deliver error
> as well. pageready_token name seems to imply that this will always be
> successful with page being ready.
>
> And reason will tell whether page could successfully be ready or
> it was an error. And token will help us identify the task which
> is waiting for the event.

I added 'pageready_' prefix to make it clear this is not used for 'page
not present' notifications where we pass token through CR2. (BTW
'reason' also becomes a misnomer because we can only see
'KVM_PV_REASON_PAGE_NOT_PRESENT' there.)

I have no strong opinion, can definitely rename this to 'token' and add
a line to the documentation to re-state that this is not used for type 1
events.

-- 
Vitaly


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 4/8] KVM: x86: interrupt based APF page-ready event delivery
  2020-05-12 14:24   ` Vivek Goyal
@ 2020-05-12 15:50     ` Vitaly Kuznetsov
  2020-05-12 18:07       ` Vivek Goyal
  0 siblings, 1 reply; 43+ messages in thread
From: Vitaly Kuznetsov @ 2020-05-12 15:50 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: kvm, x86, Paolo Bonzini, Andy Lutomirski, Thomas Gleixner,
	Borislav Petkov, H. Peter Anvin, Wanpeng Li, Sean Christopherson,
	Jim Mattson, Gavin Shan, Peter Zijlstra, linux-kernel

Vivek Goyal <vgoyal@redhat.com> writes:

> On Mon, May 11, 2020 at 06:47:48PM +0200, Vitaly Kuznetsov wrote:
>> Concerns were expressed around APF delivery via synthetic #PF exception as
>> in some cases such delivery may collide with real page fault. For type 2
>> (page ready) notifications we can easily switch to using an interrupt
>> instead. Introduce new MSR_KVM_ASYNC_PF_INT mechanism and deprecate the
>> legacy one.
>> 
>> One notable difference between the two mechanisms is that interrupt may not
>> get handled immediately so whenever we would like to deliver next event
>> (regardless of its type) we must be sure the guest had read and cleared
>> previous event in the slot.
>> 
>> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
>> ---
>>  Documentation/virt/kvm/msr.rst       | 91 +++++++++++++++++---------
>>  arch/x86/include/asm/kvm_host.h      |  4 +-
>>  arch/x86/include/uapi/asm/kvm_para.h |  6 ++
>>  arch/x86/kvm/x86.c                   | 95 ++++++++++++++++++++--------
>>  4 files changed, 140 insertions(+), 56 deletions(-)
>> 
>> diff --git a/Documentation/virt/kvm/msr.rst b/Documentation/virt/kvm/msr.rst
>> index 33892036672d..f988a36f226a 100644
>> --- a/Documentation/virt/kvm/msr.rst
>> +++ b/Documentation/virt/kvm/msr.rst
>> @@ -190,35 +190,54 @@ MSR_KVM_ASYNC_PF_EN:
>>  	0x4b564d02
>>  
>>  data:
>> -	Bits 63-6 hold 64-byte aligned physical address of a
>> -	64 byte memory area which must be in guest RAM and must be
>> -	zeroed. Bits 5-3 are reserved and should be zero. Bit 0 is 1
>> -	when asynchronous page faults are enabled on the vcpu 0 when
>> -	disabled. Bit 1 is 1 if asynchronous page faults can be injected
>> -	when vcpu is in cpl == 0. Bit 2 is 1 if asynchronous page faults
>> -	are delivered to L1 as #PF vmexits.  Bit 2 can be set only if
>> -	KVM_FEATURE_ASYNC_PF_VMEXIT is present in CPUID.
>> -
>> -	First 4 byte of 64 byte memory location will be written to by
>> -	the hypervisor at the time of asynchronous page fault (APF)
>> -	injection to indicate type of asynchronous page fault. Value
>> -	of 1 means that the page referred to by the page fault is not
>> -	present. Value 2 means that the page is now available. Disabling
>> -	interrupt inhibits APFs. Guest must not enable interrupt
>> -	before the reason is read, or it may be overwritten by another
>> -	APF. Since APF uses the same exception vector as regular page
>> -	fault guest must reset the reason to 0 before it does
>> -	something that can generate normal page fault.  If during page
>> -	fault APF reason is 0 it means that this is regular page
>> -	fault.
>> -
>> -	During delivery of type 1 APF cr2 contains a token that will
>> -	be used to notify a guest when missing page becomes
>> -	available. When page becomes available type 2 APF is sent with
>> -	cr2 set to the token associated with the page. There is special
>> -	kind of token 0xffffffff which tells vcpu that it should wake
>> -	up all processes waiting for APFs and no individual type 2 APFs
>> -	will be sent.
>> +	Asynchronous page fault (APF) control MSR.
>> +
>> +	Bits 63-6 hold 64-byte aligned physical address of a 64 byte memory area
>> +	which must be in guest RAM and must be zeroed. This memory is expected
>> +	to hold a copy of the following structure::
>> +
>> +	  struct kvm_vcpu_pv_apf_data {
>> +		__u32 reason;
>> +		__u32 pageready_token;
>> +		__u8 pad[56];
>> +		__u32 enabled;
>> +	  };
>> +
>> +	Bits 5-4 of the MSR are reserved and should be zero. Bit 0 is set to 1
>> +	when asynchronous page faults are enabled on the vcpu, 0 when disabled.
>> +	Bit 1 is 1 if asynchronous page faults can be injected when vcpu is in
>> +	cpl == 0. Bit 2 is 1 if asynchronous page faults are delivered to L1 as
>> +	#PF vmexits.  Bit 2 can be set only if KVM_FEATURE_ASYNC_PF_VMEXIT is
>> +	present in CPUID. Bit 3 enables interrupt based delivery of type 2
>> +	(page present) events.
>
> Hi Vitaly,
>
> "Bit 3 enables interrupt based delivery of type 2 events". So one has to
> opt in to enable it. If this bit is 0, we will continue to deliver
> page ready events using #PF? This probably will be needed to ensure
> backward compatibility also.
>

No, as Paolo suggested we don't enable the mechanism at all if bit3 is
0. Legacy (unaware) guests will think that they've enabled the mechanism
but it won't work, they won't see any APF notifications.

>> +
>> +	First 4 byte of 64 byte memory location ('reason') will be written to
>> +	by the hypervisor at the time APF type 1 (page not present) injection.
>> +	The only possible values are '0' and '1'.
>
> What do "reason" values "0" and "1" signify?
>
> Previously this value could be 1 for PAGE_NOT_PRESENT and 2 for
> PAGE_READY. So looks like we took away reason "PAGE_READY" because it will
> be delivered using interrupts.
>
> But that seems like an opt in. If that's the case, then we should still
> retain PAGE_READY reason. If we are getting rid of page_ready using
> #PF, then interrupt based deliver should not be optional. What am I
> missing.

It is not optional now :-)

>
> Also previous text had following line.
>
> "Guest must not enable interrupt before the reason is read, or it may be
>  overwritten by another APF".
>
> So this is not a requirement anymore?
>

It still stands for type 1 (page not present) events.

>> Type 1 events are currently
>> +	always delivered as synthetic #PF exception. During delivery of type 1
>> +	APF CR2 register contains a token that will be used to notify the guest
>> +	when missing page becomes available. Guest is supposed to write '0' to
>> +	the location when it is done handling type 1 event so the next one can
>> +	be delivered.
>> +
>> +	Note, since APF type 1 uses the same exception vector as regular page
>> +	fault, guest must reset the reason to '0' before it does something that
>> +	can generate normal page fault. If during a page fault APF reason is '0'
>> +	it means that this is regular page fault.
>> +
>> +	Bytes 5-7 of 64 byte memory location ('pageready_token') will be written
>> +	to by the hypervisor at the time of type 2 (page ready) event injection.
>> +	The content of these bytes is a token which was previously delivered as
>> +	type 1 event. The event indicates the page in now available. Guest is
>> +	supposed to write '0' to the location when it is done handling type 2
>> +	event so the next one can be delivered. MSR_KVM_ASYNC_PF_INT MSR
>> +	specifying the interrupt vector for type 2 APF delivery needs to be
>> +	written to before enabling APF mechanism in MSR_KVM_ASYNC_PF_EN.
>
> What is supposed to be value of "reason" field for type2 events. I
> had liked previous values "KVM_PV_REASON_PAGE_READY" and
> "KVM_PV_REASON_PAGE_NOT_PRESENT". Name itself made it plenty clear, what
> it means. Also it allowed for easy extension where this protocol could
> be extended to deliver other "reasons", like error.
>
> So if we are using a common structure "kvm_vcpu_pv_apf_data" to deliver
> type1 and type2 events, to me it makes sense to retain existing
> KVM_PV_REASON_PAGE_READY and KVM_PV_REASON_PAGE_NOT_PRESENT. Just that
> in new scheme of things, KVM_PV_REASON_PAGE_NOT_PRESENT will be delivered
> using #PF (and later possibly using #VE) and KVM_PV_REASON_PAGE_READY
> will be delivered using interrupt.

We use different fields for page-not-present and page-ready events so
there is no intersection. If we start setting KVM_PV_REASON_PAGE_READY
to 'reason' we may accidentally destroy a 'page-not-present' event.

With this patchset we have two completely separate channels:
1) Page-not-present goes through #PF and 'reason' in struct
kvm_vcpu_pv_apf_data.
2) Page-ready goes through interrupt and 'pageready_token' in the same
kvm_vcpu_pv_apf_data.

>
>> +
>> +	Note, previously, type 2 (page present) events were delivered via the
>> +	same #PF exception as type 1 (page not present) events but this is
>> +	now deprecated.
>
>> If bit 3 (interrupt based delivery) is not set APF events are not delivered.
>
> So all the old guests which were getting async pf will suddenly find
> that async pf does not work anymore (after hypervisor update). And
> some of them might report it as performance issue (if there were any
> performance benefits to be had with async pf).

We still do APF_HALT but generally yes, there might be some performance
implications. My RFC was preserving #PF path but the suggestion was to
retire it completely. (and I kinda like it because it makes a lot of
code go away)

>
> [..]
>>  
>>  bool kvm_arch_can_inject_async_page_present(struct kvm_vcpu *vcpu)
>>  {
>> -	if (!(vcpu->arch.apf.msr_val & KVM_ASYNC_PF_ENABLED))
>> +	if (!kvm_pv_async_pf_enabled(vcpu))
>>  		return true;
>
> What does above mean. If async pf is not enabled, then it returns true,
> implying one can inject async page present. But if async pf is not
> enabled, there is no need to inject these events.

AFAIU this is a protection agains guest suddenly disabling APF
mechanism. What do we do with all the 'page ready' events after, we
can't deliver them anymore. So we just eat them (hoping guest will
unfreeze all processes on its own before disabling the mechanism).

It is the existing logic, my patch doesn't change it.

Thanks for the review!

-- 
Vitaly


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 2/8] KVM: x86: extend struct kvm_vcpu_pv_apf_data with token info
  2020-05-12 15:40     ` Vitaly Kuznetsov
@ 2020-05-12 15:53       ` Vivek Goyal
  2020-05-12 17:50         ` Sean Christopherson
  2020-05-12 21:15       ` Vivek Goyal
  1 sibling, 1 reply; 43+ messages in thread
From: Vivek Goyal @ 2020-05-12 15:53 UTC (permalink / raw)
  To: Vitaly Kuznetsov
  Cc: kvm, x86, Paolo Bonzini, Andy Lutomirski, Thomas Gleixner,
	Borislav Petkov, H. Peter Anvin, Wanpeng Li, Sean Christopherson,
	Jim Mattson, Gavin Shan, Peter Zijlstra, linux-kernel

On Tue, May 12, 2020 at 05:40:10PM +0200, Vitaly Kuznetsov wrote:
> Vivek Goyal <vgoyal@redhat.com> writes:
> 
> > On Mon, May 11, 2020 at 06:47:46PM +0200, Vitaly Kuznetsov wrote:
> >> Currently, APF mechanism relies on the #PF abuse where the token is being
> >> passed through CR2. If we switch to using interrupts to deliver page-ready
> >> notifications we need a different way to pass the data. Extent the existing
> >> 'struct kvm_vcpu_pv_apf_data' with token information for page-ready
> >> notifications.
> >> 
> >> The newly introduced apf_put_user_ready() temporary puts both reason
> >> and token information, this will be changed to put token only when we
> >> switch to interrupt based notifications.
> >> 
> >> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
> >> ---
> >>  arch/x86/include/uapi/asm/kvm_para.h |  3 ++-
> >>  arch/x86/kvm/x86.c                   | 17 +++++++++++++----
> >>  2 files changed, 15 insertions(+), 5 deletions(-)
> >> 
> >> diff --git a/arch/x86/include/uapi/asm/kvm_para.h b/arch/x86/include/uapi/asm/kvm_para.h
> >> index 2a8e0b6b9805..e3602a1de136 100644
> >> --- a/arch/x86/include/uapi/asm/kvm_para.h
> >> +++ b/arch/x86/include/uapi/asm/kvm_para.h
> >> @@ -113,7 +113,8 @@ struct kvm_mmu_op_release_pt {
> >>  
> >>  struct kvm_vcpu_pv_apf_data {
> >>  	__u32 reason;
> >> -	__u8 pad[60];
> >> +	__u32 pageready_token;
> >
> > How about naming this just "token". That will allow me to deliver error
> > as well. pageready_token name seems to imply that this will always be
> > successful with page being ready.
> >
> > And reason will tell whether page could successfully be ready or
> > it was an error. And token will help us identify the task which
> > is waiting for the event.
> 
> I added 'pageready_' prefix to make it clear this is not used for 'page
> not present' notifications where we pass token through CR2. (BTW
> 'reason' also becomes a misnomer because we can only see
> 'KVM_PV_REASON_PAGE_NOT_PRESENT' there.)

Sure. I am just trying to keep names in such a way so that we could
deliver more events and not keep it too tightly coupled with only
two events (page not present, page ready).

> 
> I have no strong opinion, can definitely rename this to 'token' and add
> a line to the documentation to re-state that this is not used for type 1
> events.

I don't even know why are we calling "type 1" and "type 2" event. Calling
it KVM_PV_REASON_PAGE_NOT_PRESENT  and KVM_PV_REASON_PAGE_READY event
is much more intuitive. If somebody is confused about how event will
be delivered, that could be part of documentation. And "type1" and "type2"
does not say anything about delivery method anyway.

Also, type of event should not necessarily be tied to delivery method.
For example if we end up introducing say, "KVM_PV_REASON_PAGE_ERROR", then
I would think that event can be injected both using exception (#PF or #VE)
as well as interrupt (depending on state of system).

Thanks
Vivek


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 0/8] KVM: x86: Interrupt-based mechanism for async_pf 'page present' notifications
  2020-05-12 15:32 ` [PATCH 0/8] KVM: x86: Interrupt-based mechanism for async_pf 'page present' notifications Vivek Goyal
@ 2020-05-12 16:12   ` Vitaly Kuznetsov
  0 siblings, 0 replies; 43+ messages in thread
From: Vitaly Kuznetsov @ 2020-05-12 16:12 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: kvm, x86, Paolo Bonzini, Andy Lutomirski, Thomas Gleixner,
	Borislav Petkov, H. Peter Anvin, Wanpeng Li, Sean Christopherson,
	Jim Mattson, Gavin Shan, Peter Zijlstra, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 394 bytes --]

Vivek Goyal <vgoyal@redhat.com> writes:

> Hi Vitaly,
>
> Are there any corresponding qemu patches as well to enable new
> functionality. Wanted to test it.
>

Yes, right you are, I forgot to even mention this in the blurb.
Please find patches against current 'master' attached. With '-cpu host'
the feature gets enabled automatically.

Note, guest kernel needs to be updated too.

-- 
Vitaly


[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: 0001-linux-headers-KVM_FEATURE_ASYNC_PF_INT-update.patch --]
[-- Type: text/x-patch, Size: 2162 bytes --]

From 24d78c031f5348764f880698b01b574ca8748ea4 Mon Sep 17 00:00:00 2001
From: Vitaly Kuznetsov <vkuznets@redhat.com>
Date: Tue, 12 May 2020 18:03:53 +0200
Subject: [PATCH 1/2] linux headers: KVM_FEATURE_ASYNC_PF_INT update

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
---
 include/standard-headers/asm-x86/kvm_para.h | 11 ++++++++++-
 linux-headers/linux/kvm.h                   |  1 +
 2 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/include/standard-headers/asm-x86/kvm_para.h b/include/standard-headers/asm-x86/kvm_para.h
index 90604a8fb77b..77183ded2cca 100644
--- a/include/standard-headers/asm-x86/kvm_para.h
+++ b/include/standard-headers/asm-x86/kvm_para.h
@@ -31,6 +31,7 @@
 #define KVM_FEATURE_PV_SEND_IPI	11
 #define KVM_FEATURE_POLL_CONTROL	12
 #define KVM_FEATURE_PV_SCHED_YIELD	13
+#define KVM_FEATURE_ASYNC_PF_INT	14
 
 #define KVM_HINTS_REALTIME      0
 
@@ -50,6 +51,8 @@
 #define MSR_KVM_STEAL_TIME  0x4b564d03
 #define MSR_KVM_PV_EOI_EN      0x4b564d04
 #define MSR_KVM_POLL_CONTROL	0x4b564d05
+#define MSR_KVM_ASYNC_PF_INT	0x4b564d06
+#define MSR_KVM_ASYNC_PF_ACK	0x4b564d07
 
 struct kvm_steal_time {
 	uint64_t steal;
@@ -81,6 +84,11 @@ struct kvm_clock_pairing {
 #define KVM_ASYNC_PF_ENABLED			(1 << 0)
 #define KVM_ASYNC_PF_SEND_ALWAYS		(1 << 1)
 #define KVM_ASYNC_PF_DELIVERY_AS_PF_VMEXIT	(1 << 2)
+#define KVM_ASYNC_PF_DELIVERY_AS_INT		(1 << 3)
+
+/* MSR_KVM_ASYNC_PF_INT */
+#define KVM_ASYNC_PF_VEC_MASK			GENMASK(7, 0)
+
 
 /* Operations for KVM_HC_MMU_OP */
 #define KVM_MMU_OP_WRITE_PTE            1
@@ -113,7 +121,8 @@ struct kvm_mmu_op_release_pt {
 
 struct kvm_vcpu_pv_apf_data {
 	uint32_t reason;
-	uint8_t pad[60];
+	uint32_t pageready_token;
+	uint8_t pad[56];
 	uint32_t enabled;
 };
 
diff --git a/linux-headers/linux/kvm.h b/linux-headers/linux/kvm.h
index 9804495a46c5..3d0216d1c73f 100644
--- a/linux-headers/linux/kvm.h
+++ b/linux-headers/linux/kvm.h
@@ -1017,6 +1017,7 @@ struct kvm_ppc_resize_hpt {
 #define KVM_CAP_S390_VCPU_RESETS 179
 #define KVM_CAP_S390_PROTECTED 180
 #define KVM_CAP_PPC_SECURE_GUEST 181
+#define KVM_CAP_ASYNC_PF_INT 182
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
-- 
2.25.4


[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #3: 0002-WIP-ASYNC_PF_INT.patch --]
[-- Type: text/x-patch, Size: 4687 bytes --]

From f1db3c1cb5d40140028ddae812bfa924ca247556 Mon Sep 17 00:00:00 2001
From: Vitaly Kuznetsov <vkuznets@redhat.com>
Date: Fri, 24 Apr 2020 15:09:45 +0200
Subject: [PATCH 2/2] WIP: ASYNC_PF_INT

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
---
 target/i386/cpu.c     |  2 +-
 target/i386/cpu.h     |  1 +
 target/i386/kvm.c     | 10 ++++++++++
 target/i386/machine.c | 19 +++++++++++++++++++
 4 files changed, 31 insertions(+), 1 deletion(-)

diff --git a/target/i386/cpu.c b/target/i386/cpu.c
index 9c256ab15910..ea794f3f52b5 100644
--- a/target/i386/cpu.c
+++ b/target/i386/cpu.c
@@ -810,7 +810,7 @@ static FeatureWordInfo feature_word_info[FEATURE_WORDS] = {
             "kvmclock", "kvm-nopiodelay", "kvm-mmu", "kvmclock",
             "kvm-asyncpf", "kvm-steal-time", "kvm-pv-eoi", "kvm-pv-unhalt",
             NULL, "kvm-pv-tlb-flush", NULL, "kvm-pv-ipi",
-            "kvm-poll-control", "kvm-pv-sched-yield", NULL, NULL,
+            "kvm-poll-control", "kvm-pv-sched-yield", "kvm-asyncpf-int", NULL,
             NULL, NULL, NULL, NULL,
             NULL, NULL, NULL, NULL,
             "kvmclock-stable-bit", NULL, NULL, NULL,
diff --git a/target/i386/cpu.h b/target/i386/cpu.h
index e818fc712aca..673e31e191cc 100644
--- a/target/i386/cpu.h
+++ b/target/i386/cpu.h
@@ -1474,6 +1474,7 @@ typedef struct CPUX86State {
     uint64_t wall_clock_msr;
     uint64_t steal_time_msr;
     uint64_t async_pf_en_msr;
+    uint64_t async_pf_int_msr;
     uint64_t pv_eoi_en_msr;
     uint64_t poll_control_msr;
 
diff --git a/target/i386/kvm.c b/target/i386/kvm.c
index 4901c6dd747d..d2b286f9dcb9 100644
--- a/target/i386/kvm.c
+++ b/target/i386/kvm.c
@@ -283,6 +283,7 @@ static const struct kvm_para_features {
     { KVM_CAP_NOP_IO_DELAY, KVM_FEATURE_NOP_IO_DELAY },
     { KVM_CAP_PV_MMU, KVM_FEATURE_MMU_OP },
     { KVM_CAP_ASYNC_PF, KVM_FEATURE_ASYNC_PF },
+    { KVM_CAP_ASYNC_PF_INT, KVM_FEATURE_ASYNC_PF_INT },
 };
 
 static int get_para_features(KVMState *s)
@@ -2798,6 +2799,9 @@ static int kvm_put_msrs(X86CPU *cpu, int level)
         if (env->features[FEAT_KVM] & (1 << KVM_FEATURE_ASYNC_PF)) {
             kvm_msr_entry_add(cpu, MSR_KVM_ASYNC_PF_EN, env->async_pf_en_msr);
         }
+        if (env->features[FEAT_KVM] & (1 << KVM_FEATURE_ASYNC_PF_INT)) {
+            kvm_msr_entry_add(cpu, MSR_KVM_ASYNC_PF_EN, env->async_pf_int_msr);
+        }
         if (env->features[FEAT_KVM] & (1 << KVM_FEATURE_PV_EOI)) {
             kvm_msr_entry_add(cpu, MSR_KVM_PV_EOI_EN, env->pv_eoi_en_msr);
         }
@@ -3183,6 +3187,9 @@ static int kvm_get_msrs(X86CPU *cpu)
     if (env->features[FEAT_KVM] & (1 << KVM_FEATURE_ASYNC_PF)) {
         kvm_msr_entry_add(cpu, MSR_KVM_ASYNC_PF_EN, 0);
     }
+    if (env->features[FEAT_KVM] & (1 << KVM_FEATURE_ASYNC_PF_INT)) {
+        kvm_msr_entry_add(cpu, MSR_KVM_ASYNC_PF_INT, 0);
+    }
     if (env->features[FEAT_KVM] & (1 << KVM_FEATURE_PV_EOI)) {
         kvm_msr_entry_add(cpu, MSR_KVM_PV_EOI_EN, 0);
     }
@@ -3423,6 +3430,9 @@ static int kvm_get_msrs(X86CPU *cpu)
         case MSR_KVM_ASYNC_PF_EN:
             env->async_pf_en_msr = msrs[i].data;
             break;
+        case MSR_KVM_ASYNC_PF_INT:
+            env->async_pf_int_msr = msrs[i].data;
+            break;
         case MSR_KVM_PV_EOI_EN:
             env->pv_eoi_en_msr = msrs[i].data;
             break;
diff --git a/target/i386/machine.c b/target/i386/machine.c
index 0c96531a56f0..dc124cf57de7 100644
--- a/target/i386/machine.c
+++ b/target/i386/machine.c
@@ -394,6 +394,13 @@ static bool async_pf_msr_needed(void *opaque)
     return cpu->env.async_pf_en_msr != 0;
 }
 
+static bool async_pf_int_msr_needed(void *opaque)
+{
+    X86CPU *cpu = opaque;
+
+    return cpu->env.async_pf_int_msr != 0;
+}
+
 static bool pv_eoi_msr_needed(void *opaque)
 {
     X86CPU *cpu = opaque;
@@ -467,6 +474,17 @@ static const VMStateDescription vmstate_async_pf_msr = {
     }
 };
 
+static const VMStateDescription vmstate_async_pf_int_msr = {
+    .name = "cpu/async_pf_int_msr",
+    .version_id = 1,
+    .minimum_version_id = 1,
+    .needed = async_pf_int_msr_needed,
+    .fields = (VMStateField[]) {
+        VMSTATE_UINT64(env.async_pf_int_msr, X86CPU),
+        VMSTATE_END_OF_LIST()
+    }
+};
+
 static const VMStateDescription vmstate_pv_eoi_msr = {
     .name = "cpu/async_pv_eoi_msr",
     .version_id = 1,
@@ -1409,6 +1427,7 @@ VMStateDescription vmstate_x86_cpu = {
     .subsections = (const VMStateDescription*[]) {
         &vmstate_exception_info,
         &vmstate_async_pf_msr,
+        &vmstate_async_pf_int_msr,
         &vmstate_pv_eoi_msr,
         &vmstate_steal_time_msr,
         &vmstate_poll_control_msr,
-- 
2.25.4


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* Re: [PATCH 2/8] KVM: x86: extend struct kvm_vcpu_pv_apf_data with token info
  2020-05-12 15:53       ` Vivek Goyal
@ 2020-05-12 17:50         ` Sean Christopherson
  2020-05-13  9:09           ` Vitaly Kuznetsov
  2020-05-13 12:52           ` Vivek Goyal
  0 siblings, 2 replies; 43+ messages in thread
From: Sean Christopherson @ 2020-05-12 17:50 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Vitaly Kuznetsov, kvm, x86, Paolo Bonzini, Andy Lutomirski,
	Thomas Gleixner, Borislav Petkov, H. Peter Anvin, Wanpeng Li,
	Jim Mattson, Gavin Shan, Peter Zijlstra, linux-kernel

On Tue, May 12, 2020 at 11:53:39AM -0400, Vivek Goyal wrote:
> On Tue, May 12, 2020 at 05:40:10PM +0200, Vitaly Kuznetsov wrote:
> > Vivek Goyal <vgoyal@redhat.com> writes:
> > 
> > > On Mon, May 11, 2020 at 06:47:46PM +0200, Vitaly Kuznetsov wrote:
> > >> Currently, APF mechanism relies on the #PF abuse where the token is being
> > >> passed through CR2. If we switch to using interrupts to deliver page-ready
> > >> notifications we need a different way to pass the data. Extent the existing
> > >> 'struct kvm_vcpu_pv_apf_data' with token information for page-ready
> > >> notifications.
> > >> 
> > >> The newly introduced apf_put_user_ready() temporary puts both reason
> > >> and token information, this will be changed to put token only when we
> > >> switch to interrupt based notifications.
> > >> 
> > >> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
> > >> ---
> > >>  arch/x86/include/uapi/asm/kvm_para.h |  3 ++-
> > >>  arch/x86/kvm/x86.c                   | 17 +++++++++++++----
> > >>  2 files changed, 15 insertions(+), 5 deletions(-)
> > >> 
> > >> diff --git a/arch/x86/include/uapi/asm/kvm_para.h b/arch/x86/include/uapi/asm/kvm_para.h
> > >> index 2a8e0b6b9805..e3602a1de136 100644
> > >> --- a/arch/x86/include/uapi/asm/kvm_para.h
> > >> +++ b/arch/x86/include/uapi/asm/kvm_para.h
> > >> @@ -113,7 +113,8 @@ struct kvm_mmu_op_release_pt {
> > >>  
> > >>  struct kvm_vcpu_pv_apf_data {
> > >>  	__u32 reason;
> > >> -	__u8 pad[60];
> > >> +	__u32 pageready_token;
> > >
> > > How about naming this just "token". That will allow me to deliver error
> > > as well. pageready_token name seems to imply that this will always be
> > > successful with page being ready.
> > >
> > > And reason will tell whether page could successfully be ready or
> > > it was an error. And token will help us identify the task which
> > > is waiting for the event.
> > 
> > I added 'pageready_' prefix to make it clear this is not used for 'page
> > not present' notifications where we pass token through CR2. (BTW
> > 'reason' also becomes a misnomer because we can only see
> > 'KVM_PV_REASON_PAGE_NOT_PRESENT' there.)
> 
> Sure. I am just trying to keep names in such a way so that we could
> deliver more events and not keep it too tightly coupled with only
> two events (page not present, page ready).
> 
> > 
> > I have no strong opinion, can definitely rename this to 'token' and add
> > a line to the documentation to re-state that this is not used for type 1
> > events.
> 
> I don't even know why are we calling "type 1" and "type 2" event. Calling
> it KVM_PV_REASON_PAGE_NOT_PRESENT  and KVM_PV_REASON_PAGE_READY event
> is much more intuitive. If somebody is confused about how event will
> be delivered, that could be part of documentation. And "type1" and "type2"
> does not say anything about delivery method anyway.
> 
> Also, type of event should not necessarily be tied to delivery method.
> For example if we end up introducing say, "KVM_PV_REASON_PAGE_ERROR", then
> I would think that event can be injected both using exception (#PF or #VE)
> as well as interrupt (depending on state of system).

Why bother preserving backwards compatibility?  AIUI, both KVM and guest
will support async #PF iff interrupt delivery is enabled.  Why not make
the interrupt delivery approach KVM_ASYNC_PF_V2 and completely redefine the
ABI?  E.g. to make it compatible with reflecting !PRESENT faults without a
VM-Exit via Intel's EPT Violation #VE?

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 4/8] KVM: x86: interrupt based APF page-ready event delivery
  2020-05-12 15:50     ` Vitaly Kuznetsov
@ 2020-05-12 18:07       ` Vivek Goyal
  2020-05-13  9:03         ` Vitaly Kuznetsov
  0 siblings, 1 reply; 43+ messages in thread
From: Vivek Goyal @ 2020-05-12 18:07 UTC (permalink / raw)
  To: Vitaly Kuznetsov
  Cc: kvm, x86, Paolo Bonzini, Andy Lutomirski, Thomas Gleixner,
	Borislav Petkov, H. Peter Anvin, Wanpeng Li, Sean Christopherson,
	Jim Mattson, Gavin Shan, Peter Zijlstra, linux-kernel

On Tue, May 12, 2020 at 05:50:53PM +0200, Vitaly Kuznetsov wrote:
> Vivek Goyal <vgoyal@redhat.com> writes:
> 
> > On Mon, May 11, 2020 at 06:47:48PM +0200, Vitaly Kuznetsov wrote:
> >> Concerns were expressed around APF delivery via synthetic #PF exception as
> >> in some cases such delivery may collide with real page fault. For type 2
> >> (page ready) notifications we can easily switch to using an interrupt
> >> instead. Introduce new MSR_KVM_ASYNC_PF_INT mechanism and deprecate the
> >> legacy one.
> >> 
> >> One notable difference between the two mechanisms is that interrupt may not
> >> get handled immediately so whenever we would like to deliver next event
> >> (regardless of its type) we must be sure the guest had read and cleared
> >> previous event in the slot.
> >> 
> >> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
> >> ---
> >>  Documentation/virt/kvm/msr.rst       | 91 +++++++++++++++++---------
> >>  arch/x86/include/asm/kvm_host.h      |  4 +-
> >>  arch/x86/include/uapi/asm/kvm_para.h |  6 ++
> >>  arch/x86/kvm/x86.c                   | 95 ++++++++++++++++++++--------
> >>  4 files changed, 140 insertions(+), 56 deletions(-)
> >> 
> >> diff --git a/Documentation/virt/kvm/msr.rst b/Documentation/virt/kvm/msr.rst
> >> index 33892036672d..f988a36f226a 100644
> >> --- a/Documentation/virt/kvm/msr.rst
> >> +++ b/Documentation/virt/kvm/msr.rst
> >> @@ -190,35 +190,54 @@ MSR_KVM_ASYNC_PF_EN:
> >>  	0x4b564d02
> >>  
> >>  data:
> >> -	Bits 63-6 hold 64-byte aligned physical address of a
> >> -	64 byte memory area which must be in guest RAM and must be
> >> -	zeroed. Bits 5-3 are reserved and should be zero. Bit 0 is 1
> >> -	when asynchronous page faults are enabled on the vcpu 0 when
> >> -	disabled. Bit 1 is 1 if asynchronous page faults can be injected
> >> -	when vcpu is in cpl == 0. Bit 2 is 1 if asynchronous page faults
> >> -	are delivered to L1 as #PF vmexits.  Bit 2 can be set only if
> >> -	KVM_FEATURE_ASYNC_PF_VMEXIT is present in CPUID.
> >> -
> >> -	First 4 byte of 64 byte memory location will be written to by
> >> -	the hypervisor at the time of asynchronous page fault (APF)
> >> -	injection to indicate type of asynchronous page fault. Value
> >> -	of 1 means that the page referred to by the page fault is not
> >> -	present. Value 2 means that the page is now available. Disabling
> >> -	interrupt inhibits APFs. Guest must not enable interrupt
> >> -	before the reason is read, or it may be overwritten by another
> >> -	APF. Since APF uses the same exception vector as regular page
> >> -	fault guest must reset the reason to 0 before it does
> >> -	something that can generate normal page fault.  If during page
> >> -	fault APF reason is 0 it means that this is regular page
> >> -	fault.
> >> -
> >> -	During delivery of type 1 APF cr2 contains a token that will
> >> -	be used to notify a guest when missing page becomes
> >> -	available. When page becomes available type 2 APF is sent with
> >> -	cr2 set to the token associated with the page. There is special
> >> -	kind of token 0xffffffff which tells vcpu that it should wake
> >> -	up all processes waiting for APFs and no individual type 2 APFs
> >> -	will be sent.
> >> +	Asynchronous page fault (APF) control MSR.
> >> +
> >> +	Bits 63-6 hold 64-byte aligned physical address of a 64 byte memory area
> >> +	which must be in guest RAM and must be zeroed. This memory is expected
> >> +	to hold a copy of the following structure::
> >> +
> >> +	  struct kvm_vcpu_pv_apf_data {
> >> +		__u32 reason;
> >> +		__u32 pageready_token;
> >> +		__u8 pad[56];
> >> +		__u32 enabled;
> >> +	  };
> >> +
> >> +	Bits 5-4 of the MSR are reserved and should be zero. Bit 0 is set to 1
> >> +	when asynchronous page faults are enabled on the vcpu, 0 when disabled.
> >> +	Bit 1 is 1 if asynchronous page faults can be injected when vcpu is in
> >> +	cpl == 0. Bit 2 is 1 if asynchronous page faults are delivered to L1 as
> >> +	#PF vmexits.  Bit 2 can be set only if KVM_FEATURE_ASYNC_PF_VMEXIT is
> >> +	present in CPUID. Bit 3 enables interrupt based delivery of type 2
> >> +	(page present) events.
> >
> > Hi Vitaly,
> >
> > "Bit 3 enables interrupt based delivery of type 2 events". So one has to
> > opt in to enable it. If this bit is 0, we will continue to deliver
> > page ready events using #PF? This probably will be needed to ensure
> > backward compatibility also.
> >
> 
> No, as Paolo suggested we don't enable the mechanism at all if bit3 is
> 0. Legacy (unaware) guests will think that they've enabled the mechanism
> but it won't work, they won't see any APF notifications.
> 
> >> +
> >> +	First 4 byte of 64 byte memory location ('reason') will be written to
> >> +	by the hypervisor at the time APF type 1 (page not present) injection.
> >> +	The only possible values are '0' and '1'.
> >
> > What do "reason" values "0" and "1" signify?
> >
> > Previously this value could be 1 for PAGE_NOT_PRESENT and 2 for
> > PAGE_READY. So looks like we took away reason "PAGE_READY" because it will
> > be delivered using interrupts.
> >
> > But that seems like an opt in. If that's the case, then we should still
> > retain PAGE_READY reason. If we are getting rid of page_ready using
> > #PF, then interrupt based deliver should not be optional. What am I
> > missing.
> 
> It is not optional now :-)
> 
> >
> > Also previous text had following line.
> >
> > "Guest must not enable interrupt before the reason is read, or it may be
> >  overwritten by another APF".
> >
> > So this is not a requirement anymore?
> >
> 
> It still stands for type 1 (page not present) events.
> 
> >> Type 1 events are currently
> >> +	always delivered as synthetic #PF exception. During delivery of type 1
> >> +	APF CR2 register contains a token that will be used to notify the guest
> >> +	when missing page becomes available. Guest is supposed to write '0' to
> >> +	the location when it is done handling type 1 event so the next one can
> >> +	be delivered.
> >> +
> >> +	Note, since APF type 1 uses the same exception vector as regular page
> >> +	fault, guest must reset the reason to '0' before it does something that
> >> +	can generate normal page fault. If during a page fault APF reason is '0'
> >> +	it means that this is regular page fault.
> >> +
> >> +	Bytes 5-7 of 64 byte memory location ('pageready_token') will be written
> >> +	to by the hypervisor at the time of type 2 (page ready) event injection.
> >> +	The content of these bytes is a token which was previously delivered as
> >> +	type 1 event. The event indicates the page in now available. Guest is
> >> +	supposed to write '0' to the location when it is done handling type 2
> >> +	event so the next one can be delivered. MSR_KVM_ASYNC_PF_INT MSR
> >> +	specifying the interrupt vector for type 2 APF delivery needs to be
> >> +	written to before enabling APF mechanism in MSR_KVM_ASYNC_PF_EN.
> >
> > What is supposed to be value of "reason" field for type2 events. I
> > had liked previous values "KVM_PV_REASON_PAGE_READY" and
> > "KVM_PV_REASON_PAGE_NOT_PRESENT". Name itself made it plenty clear, what
> > it means. Also it allowed for easy extension where this protocol could
> > be extended to deliver other "reasons", like error.
> >
> > So if we are using a common structure "kvm_vcpu_pv_apf_data" to deliver
> > type1 and type2 events, to me it makes sense to retain existing
> > KVM_PV_REASON_PAGE_READY and KVM_PV_REASON_PAGE_NOT_PRESENT. Just that
> > in new scheme of things, KVM_PV_REASON_PAGE_NOT_PRESENT will be delivered
> > using #PF (and later possibly using #VE) and KVM_PV_REASON_PAGE_READY
> > will be delivered using interrupt.
> 
> We use different fields for page-not-present and page-ready events so
> there is no intersection. If we start setting KVM_PV_REASON_PAGE_READY
> to 'reason' we may accidentally destroy a 'page-not-present' event.

This is confusing. So you mean at one point of time we might be using
same shared data structure for two events.

- ->reason will be set to 1 and you will inject page_not_present
  execption.

- If some page gets ready, you will now set ->token and queue 
  page ready exception. 

Its very confusing. Can't we serialize the delivery of these events. So
that only one is in progress so that this structure is used by one event
at a time.

Also how do I extend it now to do error delivery. Please keep that in
mind. We don't want to be redesigning this stuff again. Its already
very complicated.

I really need ->reason field to be usable in both the paths so that
error can be delivered.

And this notion of same structure being shared across multiple events
at the same time is just going to create more confusion, IMHO. If we
can decouple it by serializing it, that definitely feels simpler to
understand.

> 
> With this patchset we have two completely separate channels:
> 1) Page-not-present goes through #PF and 'reason' in struct
> kvm_vcpu_pv_apf_data.
> 2) Page-ready goes through interrupt and 'pageready_token' in the same
> kvm_vcpu_pv_apf_data.
> 
> >
> >> +
> >> +	Note, previously, type 2 (page present) events were delivered via the
> >> +	same #PF exception as type 1 (page not present) events but this is
> >> +	now deprecated.
> >
> >> If bit 3 (interrupt based delivery) is not set APF events are not delivered.
> >
> > So all the old guests which were getting async pf will suddenly find
> > that async pf does not work anymore (after hypervisor update). And
> > some of them might report it as performance issue (if there were any
> > performance benefits to be had with async pf).
> 
> We still do APF_HALT but generally yes, there might be some performance
> implications. My RFC was preserving #PF path but the suggestion was to
> retire it completely. (and I kinda like it because it makes a lot of
> code go away)

Ok. I don't have strong opinion here. If paolo likes it this way, so be
it. :-)

> 
> >
> > [..]
> >>  
> >>  bool kvm_arch_can_inject_async_page_present(struct kvm_vcpu *vcpu)
> >>  {
> >> -	if (!(vcpu->arch.apf.msr_val & KVM_ASYNC_PF_ENABLED))
> >> +	if (!kvm_pv_async_pf_enabled(vcpu))
> >>  		return true;
> >
> > What does above mean. If async pf is not enabled, then it returns true,
> > implying one can inject async page present. But if async pf is not
> > enabled, there is no need to inject these events.
> 
> AFAIU this is a protection agains guest suddenly disabling APF
> mechanism.

Can we provide that protection in MSR implementation. That is once APF
is enabled, it can't be disabled. Or it is a feature that we allow
guest to disable APF and want it that way?

> What do we do with all the 'page ready' events after, we
> can't deliver them anymore. So we just eat them (hoping guest will
> unfreeze all processes on its own before disabling the mechanism).
> 
> It is the existing logic, my patch doesn't change it.

I see its existing logic. Just it is very confusing and will be good
if we can atleast explain it with some comments.

I don't know what to make out of this.

bool kvm_arch_can_inject_async_page_present(struct kvm_vcpu *vcpu)
{
        if (!(vcpu->arch.apf.msr_val & KVM_ASYNC_PF_ENABLED))
                return true;
        else
                return kvm_can_do_async_pf(vcpu);
}

If feature is disabled, then do inject async pf page present. If feature
is enabled and check whether we can inject async pf right now or not.

It probably will help if this check if feature being enabled/disabled
is outside kvm_arch_can_inject_async_page_present() at the callsite
of kvm_arch_can_inject_async_page_present() and there we explain that
why it is important to inject page ready events despite the fact
that feature is disabled.

Thanks
Vivek


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 2/8] KVM: x86: extend struct kvm_vcpu_pv_apf_data with token info
  2020-05-12 15:40     ` Vitaly Kuznetsov
  2020-05-12 15:53       ` Vivek Goyal
@ 2020-05-12 21:15       ` Vivek Goyal
  1 sibling, 0 replies; 43+ messages in thread
From: Vivek Goyal @ 2020-05-12 21:15 UTC (permalink / raw)
  To: Vitaly Kuznetsov
  Cc: kvm, x86, Paolo Bonzini, Andy Lutomirski, Thomas Gleixner,
	Borislav Petkov, H. Peter Anvin, Wanpeng Li, Sean Christopherson,
	Jim Mattson, Gavin Shan, Peter Zijlstra, linux-kernel

On Tue, May 12, 2020 at 05:40:10PM +0200, Vitaly Kuznetsov wrote:
> Vivek Goyal <vgoyal@redhat.com> writes:
> 
> > On Mon, May 11, 2020 at 06:47:46PM +0200, Vitaly Kuznetsov wrote:
> >> Currently, APF mechanism relies on the #PF abuse where the token is being
> >> passed through CR2. If we switch to using interrupts to deliver page-ready
> >> notifications we need a different way to pass the data. Extent the existing
> >> 'struct kvm_vcpu_pv_apf_data' with token information for page-ready
> >> notifications.
> >> 
> >> The newly introduced apf_put_user_ready() temporary puts both reason
> >> and token information, this will be changed to put token only when we
> >> switch to interrupt based notifications.
> >> 
> >> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
> >> ---
> >>  arch/x86/include/uapi/asm/kvm_para.h |  3 ++-
> >>  arch/x86/kvm/x86.c                   | 17 +++++++++++++----
> >>  2 files changed, 15 insertions(+), 5 deletions(-)
> >> 
> >> diff --git a/arch/x86/include/uapi/asm/kvm_para.h b/arch/x86/include/uapi/asm/kvm_para.h
> >> index 2a8e0b6b9805..e3602a1de136 100644
> >> --- a/arch/x86/include/uapi/asm/kvm_para.h
> >> +++ b/arch/x86/include/uapi/asm/kvm_para.h
> >> @@ -113,7 +113,8 @@ struct kvm_mmu_op_release_pt {
> >>  
> >>  struct kvm_vcpu_pv_apf_data {
> >>  	__u32 reason;
> >> -	__u8 pad[60];
> >> +	__u32 pageready_token;
> >
> > How about naming this just "token". That will allow me to deliver error
> > as well. pageready_token name seems to imply that this will always be
> > successful with page being ready.
> >
> > And reason will tell whether page could successfully be ready or
> > it was an error. And token will help us identify the task which
> > is waiting for the event.
> 
> I added 'pageready_' prefix to make it clear this is not used for 'page
> not present' notifications where we pass token through CR2. (BTW
> 'reason' also becomes a misnomer because we can only see
> 'KVM_PV_REASON_PAGE_NOT_PRESENT' there.)

"kvm_vcpu_pv_apf_data" being shared between two events at the same
time is little concerning. At least there should be clear demarkation
that which events will use which fields.

I guess I could extend "reason" to also report KVM_PV_REASON_ERROR as
long as I make error reporting opt in. That way new code is able to
handle more values and old code will not receive it.

For reporting errors with page ready events, I probably will have to
use more padding bytes to report errors as I can't use reason field anymore.

In your previous posting in one of the emails Paolo mentioned that data
structures for #VE will be separate. If that's the case, then we will end
up changing this protocol one more time. To me it feels that both #VE
changes and these changes should go in together as part of async page fault
redesign V2.

> 
> I have no strong opinion, can definitely rename this to 'token' and add
> a line to the documentation to re-state that this is not used for type 1
> events.

Now I understand that both events could use this shared data at the same
time. So prefixing toke with pageready makes it clear that it has to be
used only with pageready event. So sounds better that way.

Thanks
Vivek


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 4/8] KVM: x86: interrupt based APF page-ready event delivery
  2020-05-12 18:07       ` Vivek Goyal
@ 2020-05-13  9:03         ` Vitaly Kuznetsov
  2020-05-13 13:53           ` Vivek Goyal
  0 siblings, 1 reply; 43+ messages in thread
From: Vitaly Kuznetsov @ 2020-05-13  9:03 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: kvm, x86, Paolo Bonzini, Andy Lutomirski, Thomas Gleixner,
	Borislav Petkov, H. Peter Anvin, Wanpeng Li, Sean Christopherson,
	Jim Mattson, Gavin Shan, Peter Zijlstra, linux-kernel

Vivek Goyal <vgoyal@redhat.com> writes:

> On Tue, May 12, 2020 at 05:50:53PM +0200, Vitaly Kuznetsov wrote:
>> Vivek Goyal <vgoyal@redhat.com> writes:
>> 
>> >
>> > So if we are using a common structure "kvm_vcpu_pv_apf_data" to deliver
>> > type1 and type2 events, to me it makes sense to retain existing
>> > KVM_PV_REASON_PAGE_READY and KVM_PV_REASON_PAGE_NOT_PRESENT. Just that
>> > in new scheme of things, KVM_PV_REASON_PAGE_NOT_PRESENT will be delivered
>> > using #PF (and later possibly using #VE) and KVM_PV_REASON_PAGE_READY
>> > will be delivered using interrupt.
>> 
>> We use different fields for page-not-present and page-ready events so
>> there is no intersection. If we start setting KVM_PV_REASON_PAGE_READY
>> to 'reason' we may accidentally destroy a 'page-not-present' event.
>
> This is confusing. So you mean at one point of time we might be using
> same shared data structure for two events.
>
> - ->reason will be set to 1 and you will inject page_not_present
>   execption.
>
> - If some page gets ready, you will now set ->token and queue 
>   page ready exception. 
>
> Its very confusing. Can't we serialize the delivery of these events. So
> that only one is in progress so that this structure is used by one event
> at a time.

This is not how the mechanism (currently) works:
- A process accesses a page which is swapped out

- We deliver synchronious APF (#PF) to the guest, it freezes the process
and switches to another one.

- Another process accesses a swapped out page, APF is delivered and it
also got frozen

...

- At some point one of the previously unavailable pages become available
(not necessarily the last or the first one) and we deliver this info via
asynchronous APF (now interrupt).

- Note, after we deliver the interrupt and before it is actually
consumed we may have another synchronous APF (#PF) event.

So we really need to separate memory locations for synchronous (type-1,
#PF,...) and asynchronous (type-2, interrupt, ...) data.

The naming is unfortunate and misleading, I agree. What is currently
named 'reason' should be something like 'apf_flag_for_pf' and it just
means to distinguish real #PF from APF. This is going away in the
future, we won't be abusing #PF anymore so I'd keep it as it is now,
maybe add another comment somewhere. The other location is
'pageready_token' and it actually contains the token. This is to stay
long term so any suggestions for better naming are welcome.

We could've separated these two memory locations completely and
e.g. used the remaining 56 bits of MSR_KVM_ASYNC_PF_INT as the new
location information. Maybe we should do that just to avoid the
confusion.

>
> Also how do I extend it now to do error delivery. Please keep that in
> mind. We don't want to be redesigning this stuff again. Its already
> very complicated.
>
> I really need ->reason field to be usable in both the paths so that
> error can be delivered.

If we want to use 'reason' for both we'll get into a weird scenario when
exception is blocking interrupt and, what's more confusing, vice
versa. I'd like to avoid this complexity in KVM code. My suggestion
would be to rename 'reason' to something like 'pf_abuse_flag' so it
doesn't cause the confusion and add new 'reason' after 'token'.

>
> And this notion of same structure being shared across multiple events
> at the same time is just going to create more confusion, IMHO. If we
> can decouple it by serializing it, that definitely feels simpler to
> understand.

What if we just add sub-structures to the structure, e.g. 

struct kvm_vcpu_pv_apf_data {
        struct {
            __u32 apf_flag;
        } legacy_apf_data;
        struct {
            __u32 token;
        } apf_interrupt_data;
        ....
        __u8 pad[56];                                                                                  |
        __u32 enabled;                                                                                 |
};    

would it make it more obvious?

>
>> 
>> With this patchset we have two completely separate channels:
>> 1) Page-not-present goes through #PF and 'reason' in struct
>> kvm_vcpu_pv_apf_data.
>> 2) Page-ready goes through interrupt and 'pageready_token' in the same
>> kvm_vcpu_pv_apf_data.
>> 
>> >
>> >> +
>> >> +	Note, previously, type 2 (page present) events were delivered via the
>> >> +	same #PF exception as type 1 (page not present) events but this is
>> >> +	now deprecated.
>> >
>> >> If bit 3 (interrupt based delivery) is not set APF events are not delivered.
>> >
>> > So all the old guests which were getting async pf will suddenly find
>> > that async pf does not work anymore (after hypervisor update). And
>> > some of them might report it as performance issue (if there were any
>> > performance benefits to be had with async pf).
>> 
>> We still do APF_HALT but generally yes, there might be some performance
>> implications. My RFC was preserving #PF path but the suggestion was to
>> retire it completely. (and I kinda like it because it makes a lot of
>> code go away)
>
> Ok. I don't have strong opinion here. If paolo likes it this way, so be
> it. :-)

APF is a slowpath for overcommited scenarios and when we switch to
APF_HALT we allow the host to run some other guest while PF is
processed. This is not the same from guest's perspective but from host's
we're fine as we're not wasting cycles.

>
>> 
>> >
>> > [..]
>> >>  
>> >>  bool kvm_arch_can_inject_async_page_present(struct kvm_vcpu *vcpu)
>> >>  {
>> >> -	if (!(vcpu->arch.apf.msr_val & KVM_ASYNC_PF_ENABLED))
>> >> +	if (!kvm_pv_async_pf_enabled(vcpu))
>> >>  		return true;
>> >
>> > What does above mean. If async pf is not enabled, then it returns true,
>> > implying one can inject async page present. But if async pf is not
>> > enabled, there is no need to inject these events.
>> 
>> AFAIU this is a protection agains guest suddenly disabling APF
>> mechanism.
>
> Can we provide that protection in MSR implementation. That is once APF
> is enabled, it can't be disabled. Or it is a feature that we allow
> guest to disable APF and want it that way?

We need to allow to disable the feature. E.g. think about kdump
scenario, for example. Before we switch to kdump kernel we need to make
sure there's no more 'magic' memory which can suggenly change. Also,
kdump kernel may not even support APF so it will get very confused when
APF events get delivered.

>
>> What do we do with all the 'page ready' events after, we
>> can't deliver them anymore. So we just eat them (hoping guest will
>> unfreeze all processes on its own before disabling the mechanism).
>> 
>> It is the existing logic, my patch doesn't change it.
>
> I see its existing logic. Just it is very confusing and will be good
> if we can atleast explain it with some comments.
>
> I don't know what to make out of this.
>
> bool kvm_arch_can_inject_async_page_present(struct kvm_vcpu *vcpu)
> {
>         if (!(vcpu->arch.apf.msr_val & KVM_ASYNC_PF_ENABLED))
>                 return true;
>         else
>                 return kvm_can_do_async_pf(vcpu);
> }
>
> If feature is disabled, then do inject async pf page present. If feature
> is enabled and check whether we can inject async pf right now or not.
>
> It probably will help if this check if feature being enabled/disabled
> is outside kvm_arch_can_inject_async_page_present() at the callsite
> of kvm_arch_can_inject_async_page_present() and there we explain that
> why it is important to inject page ready events despite the fact
> that feature is disabled.

This code would definitely love some comments added, will do in the next
version. And I'll also think how to improve the readability, thanks for
the feedback!

-- 
Vitaly


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 2/8] KVM: x86: extend struct kvm_vcpu_pv_apf_data with token info
  2020-05-12 17:50         ` Sean Christopherson
@ 2020-05-13  9:09           ` Vitaly Kuznetsov
  2020-05-13 12:52           ` Vivek Goyal
  1 sibling, 0 replies; 43+ messages in thread
From: Vitaly Kuznetsov @ 2020-05-13  9:09 UTC (permalink / raw)
  To: Sean Christopherson, Vivek Goyal
  Cc: kvm, x86, Paolo Bonzini, Andy Lutomirski, Thomas Gleixner,
	Borislav Petkov, H. Peter Anvin, Wanpeng Li, Jim Mattson,
	Gavin Shan, Peter Zijlstra, linux-kernel

Sean Christopherson <sean.j.christopherson@intel.com> writes:

>
> Why bother preserving backwards compatibility?  AIUI, both KVM and guest
> will support async #PF iff interrupt delivery is enabled.  Why not make
> the interrupt delivery approach KVM_ASYNC_PF_V2 and completely redefine the
> ABI?  E.g. to make it compatible with reflecting !PRESENT faults without a
> VM-Exit via Intel's EPT Violation #VE?
>

That's the plan, actually. 'page not present' becomes #VE and 'page
present' becomes an interrupt (we don't need to inject them
synchronously). These two parts are fairly independent but can be merged
to reuse the same new capability/CPUID.

-- 
Vitaly


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 2/8] KVM: x86: extend struct kvm_vcpu_pv_apf_data with token info
  2020-05-12 17:50         ` Sean Christopherson
  2020-05-13  9:09           ` Vitaly Kuznetsov
@ 2020-05-13 12:52           ` Vivek Goyal
  2020-05-15 15:59             ` Paolo Bonzini
  1 sibling, 1 reply; 43+ messages in thread
From: Vivek Goyal @ 2020-05-13 12:52 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Vitaly Kuznetsov, kvm, x86, Paolo Bonzini, Andy Lutomirski,
	Thomas Gleixner, Borislav Petkov, H. Peter Anvin, Wanpeng Li,
	Jim Mattson, Gavin Shan, Peter Zijlstra, linux-kernel

On Tue, May 12, 2020 at 10:50:17AM -0700, Sean Christopherson wrote:
> On Tue, May 12, 2020 at 11:53:39AM -0400, Vivek Goyal wrote:
> > On Tue, May 12, 2020 at 05:40:10PM +0200, Vitaly Kuznetsov wrote:
> > > Vivek Goyal <vgoyal@redhat.com> writes:
> > > 
> > > > On Mon, May 11, 2020 at 06:47:46PM +0200, Vitaly Kuznetsov wrote:
> > > >> Currently, APF mechanism relies on the #PF abuse where the token is being
> > > >> passed through CR2. If we switch to using interrupts to deliver page-ready
> > > >> notifications we need a different way to pass the data. Extent the existing
> > > >> 'struct kvm_vcpu_pv_apf_data' with token information for page-ready
> > > >> notifications.
> > > >> 
> > > >> The newly introduced apf_put_user_ready() temporary puts both reason
> > > >> and token information, this will be changed to put token only when we
> > > >> switch to interrupt based notifications.
> > > >> 
> > > >> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
> > > >> ---
> > > >>  arch/x86/include/uapi/asm/kvm_para.h |  3 ++-
> > > >>  arch/x86/kvm/x86.c                   | 17 +++++++++++++----
> > > >>  2 files changed, 15 insertions(+), 5 deletions(-)
> > > >> 
> > > >> diff --git a/arch/x86/include/uapi/asm/kvm_para.h b/arch/x86/include/uapi/asm/kvm_para.h
> > > >> index 2a8e0b6b9805..e3602a1de136 100644
> > > >> --- a/arch/x86/include/uapi/asm/kvm_para.h
> > > >> +++ b/arch/x86/include/uapi/asm/kvm_para.h
> > > >> @@ -113,7 +113,8 @@ struct kvm_mmu_op_release_pt {
> > > >>  
> > > >>  struct kvm_vcpu_pv_apf_data {
> > > >>  	__u32 reason;
> > > >> -	__u8 pad[60];
> > > >> +	__u32 pageready_token;
> > > >
> > > > How about naming this just "token". That will allow me to deliver error
> > > > as well. pageready_token name seems to imply that this will always be
> > > > successful with page being ready.
> > > >
> > > > And reason will tell whether page could successfully be ready or
> > > > it was an error. And token will help us identify the task which
> > > > is waiting for the event.
> > > 
> > > I added 'pageready_' prefix to make it clear this is not used for 'page
> > > not present' notifications where we pass token through CR2. (BTW
> > > 'reason' also becomes a misnomer because we can only see
> > > 'KVM_PV_REASON_PAGE_NOT_PRESENT' there.)
> > 
> > Sure. I am just trying to keep names in such a way so that we could
> > deliver more events and not keep it too tightly coupled with only
> > two events (page not present, page ready).
> > 
> > > 
> > > I have no strong opinion, can definitely rename this to 'token' and add
> > > a line to the documentation to re-state that this is not used for type 1
> > > events.
> > 
> > I don't even know why are we calling "type 1" and "type 2" event. Calling
> > it KVM_PV_REASON_PAGE_NOT_PRESENT  and KVM_PV_REASON_PAGE_READY event
> > is much more intuitive. If somebody is confused about how event will
> > be delivered, that could be part of documentation. And "type1" and "type2"
> > does not say anything about delivery method anyway.
> > 
> > Also, type of event should not necessarily be tied to delivery method.
> > For example if we end up introducing say, "KVM_PV_REASON_PAGE_ERROR", then
> > I would think that event can be injected both using exception (#PF or #VE)
> > as well as interrupt (depending on state of system).
> 
> Why bother preserving backwards compatibility?

New machanism does not have to support old guests but old mechanism
should probably continue to work and deprecated slowly, IMHO. Otherwise
guests which were receiving async page faults will suddenly stop getting
it over hypervisor upgrade and possibly see drop in performance.

> AIUI, both KVM and guest
> will support async #PF iff interrupt delivery is enabled.  Why not make
> the interrupt delivery approach KVM_ASYNC_PF_V2 and completely redefine the
> ABI?

That makes sense to me. Probably leave existing ABI untouched and
deprecate it over a period of time and define V2 of ABI and new guests
use it.

> E.g. to make it compatible with reflecting !PRESENT faults without a
> VM-Exit via Intel's EPT Violation #VE?

IIUC, that's what paolo is planning, that is use #VE to inform guest
of page not present. It probably will be good if both #VE notification
and interrupt based page ready notifications happen at the same time
under V2 of ABI, IMHO.

Thanks
Vivek


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 4/8] KVM: x86: interrupt based APF page-ready event delivery
  2020-05-13  9:03         ` Vitaly Kuznetsov
@ 2020-05-13 13:53           ` Vivek Goyal
  2020-05-13 14:03             ` Vivek Goyal
  2020-05-13 14:23             ` Vitaly Kuznetsov
  0 siblings, 2 replies; 43+ messages in thread
From: Vivek Goyal @ 2020-05-13 13:53 UTC (permalink / raw)
  To: Vitaly Kuznetsov
  Cc: kvm, x86, Paolo Bonzini, Andy Lutomirski, Thomas Gleixner,
	Borislav Petkov, H. Peter Anvin, Wanpeng Li, Sean Christopherson,
	Jim Mattson, Gavin Shan, Peter Zijlstra, linux-kernel

On Wed, May 13, 2020 at 11:03:48AM +0200, Vitaly Kuznetsov wrote:
> Vivek Goyal <vgoyal@redhat.com> writes:
> 
> > On Tue, May 12, 2020 at 05:50:53PM +0200, Vitaly Kuznetsov wrote:
> >> Vivek Goyal <vgoyal@redhat.com> writes:
> >> 
> >> >
> >> > So if we are using a common structure "kvm_vcpu_pv_apf_data" to deliver
> >> > type1 and type2 events, to me it makes sense to retain existing
> >> > KVM_PV_REASON_PAGE_READY and KVM_PV_REASON_PAGE_NOT_PRESENT. Just that
> >> > in new scheme of things, KVM_PV_REASON_PAGE_NOT_PRESENT will be delivered
> >> > using #PF (and later possibly using #VE) and KVM_PV_REASON_PAGE_READY
> >> > will be delivered using interrupt.
> >> 
> >> We use different fields for page-not-present and page-ready events so
> >> there is no intersection. If we start setting KVM_PV_REASON_PAGE_READY
> >> to 'reason' we may accidentally destroy a 'page-not-present' event.
> >
> > This is confusing. So you mean at one point of time we might be using
> > same shared data structure for two events.
> >
> > - ->reason will be set to 1 and you will inject page_not_present
> >   execption.
> >
> > - If some page gets ready, you will now set ->token and queue 
> >   page ready exception. 
> >
> > Its very confusing. Can't we serialize the delivery of these events. So
> > that only one is in progress so that this structure is used by one event
> > at a time.
> 
> This is not how the mechanism (currently) works:
> - A process accesses a page which is swapped out
> 
> - We deliver synchronious APF (#PF) to the guest, it freezes the process
> and switches to another one.
> 
> - Another process accesses a swapped out page, APF is delivered and it
> also got frozen
> 
> ...
> 
> - At some point one of the previously unavailable pages become available
> (not necessarily the last or the first one) and we deliver this info via
> asynchronous APF (now interrupt).
> 
> - Note, after we deliver the interrupt and before it is actually
> consumed we may have another synchronous APF (#PF) event.
> 
> So we really need to separate memory locations for synchronous (type-1,
> #PF,...) and asynchronous (type-2, interrupt, ...) data.
> 
> The naming is unfortunate and misleading, I agree. What is currently
> named 'reason' should be something like 'apf_flag_for_pf' and it just
> means to distinguish real #PF from APF. This is going away in the
> future, we won't be abusing #PF anymore so I'd keep it as it is now,
> maybe add another comment somewhere.

Shouldn't we do #VE changes also at the same time. Otherwise we are
again creating this intermediate mode where synchronous notifications
happen using #PF. Once we move to #VE, old guests will again stop
getting async #PF? 

Also there is this issue of who is using this common shared area
between host and guest (struct kvm_vcpu_pv_apf_data) and how to 
use fields in this struct without breaking existing guests.

For now I see you have modified meaning of fields of this structure
(->reason) beacuse you are not concerned about older guets because
they will not receive async pf to begin with. If that's the case,
renaming even reason makes sense. Havind said that, as we are
planning to not use this data structure for synchronous notifications,
creating this intermediate mode is just going to create more
confusion. So in new ABI, we should not even try to make use of
->reason and design #VE changes at the same time. Core problem with
current ABI is racy access to ->reason and this patch set does not
get rid of that race anyway.

> The other location is
> 'pageready_token' and it actually contains the token. This is to stay
> long term so any suggestions for better naming are welcome.

pageready_token makes sense if this structure is being used both
types of notifications otherwise just ->token might be enough.

> 
> We could've separated these two memory locations completely and
> e.g. used the remaining 56 bits of MSR_KVM_ASYNC_PF_INT as the new
> location information. Maybe we should do that just to avoid the
> confusion.

If this is V2 of ABI, why not do "struct struct kvm_vcpu_pv_apf_data_v2"
instead? Only reason you seem to be sticking to existing structure and
->reason fields because of #PF based sync notifications which is planned
to go away soon. So if we do both the changes together, there is no
need to keep this confusion w.r.t ->reason and existing structure.

> 
> >
> > Also how do I extend it now to do error delivery. Please keep that in
> > mind. We don't want to be redesigning this stuff again. Its already
> > very complicated.
> >
> > I really need ->reason field to be usable in both the paths so that
> > error can be delivered.
> 
> If we want to use 'reason' for both we'll get into a weird scenario when
> exception is blocking interrupt and, what's more confusing, vice
> versa. I'd like to avoid this complexity in KVM code. My suggestion
> would be to rename 'reason' to something like 'pf_abuse_flag' so it
> doesn't cause the confusion and add new 'reason' after 'token'.
> 
> >
> > And this notion of same structure being shared across multiple events
> > at the same time is just going to create more confusion, IMHO. If we
> > can decouple it by serializing it, that definitely feels simpler to
> > understand.
> 
> What if we just add sub-structures to the structure, e.g. 
> 
> struct kvm_vcpu_pv_apf_data {
>         struct {
>             __u32 apf_flag;
>         } legacy_apf_data;
>         struct {
>             __u32 token;
>         } apf_interrupt_data;
>         ....
>         __u8 pad[56];                                                                                  |
>         __u32 enabled;                                                                                 |
> };    
> 
> would it make it more obvious?

To me this seems like an improvement. Just looking at it, it is clear
to me who is using what. But soon legacy_apf_data will be unused. Its
not clear to me when will be able to get rid of apf_flag and reuse it
for something else without having to worry about older guests.

This will be second best option, if for some reason we don't want to
do #VE changes at the same time. To make new design cleaner and more
understandable, doing #VE changes at the same time will be good.

> 
> >
> >> 
> >> With this patchset we have two completely separate channels:
> >> 1) Page-not-present goes through #PF and 'reason' in struct
> >> kvm_vcpu_pv_apf_data.
> >> 2) Page-ready goes through interrupt and 'pageready_token' in the same
> >> kvm_vcpu_pv_apf_data.
> >> 
> >> >
> >> >> +
> >> >> +	Note, previously, type 2 (page present) events were delivered via the
> >> >> +	same #PF exception as type 1 (page not present) events but this is
> >> >> +	now deprecated.
> >> >
> >> >> If bit 3 (interrupt based delivery) is not set APF events are not delivered.
> >> >
> >> > So all the old guests which were getting async pf will suddenly find
> >> > that async pf does not work anymore (after hypervisor update). And
> >> > some of them might report it as performance issue (if there were any
> >> > performance benefits to be had with async pf).
> >> 
> >> We still do APF_HALT but generally yes, there might be some performance
> >> implications. My RFC was preserving #PF path but the suggestion was to
> >> retire it completely. (and I kinda like it because it makes a lot of
> >> code go away)
> >
> > Ok. I don't have strong opinion here. If paolo likes it this way, so be
> > it. :-)
> 
> APF is a slowpath for overcommited scenarios and when we switch to
> APF_HALT we allow the host to run some other guest while PF is
> processed. This is not the same from guest's perspective but from host's
> we're fine as we're not wasting cycles.
> 
> >
> >> 
> >> >
> >> > [..]
> >> >>  
> >> >>  bool kvm_arch_can_inject_async_page_present(struct kvm_vcpu *vcpu)
> >> >>  {
> >> >> -	if (!(vcpu->arch.apf.msr_val & KVM_ASYNC_PF_ENABLED))
> >> >> +	if (!kvm_pv_async_pf_enabled(vcpu))
> >> >>  		return true;
> >> >
> >> > What does above mean. If async pf is not enabled, then it returns true,
> >> > implying one can inject async page present. But if async pf is not
> >> > enabled, there is no need to inject these events.
> >> 
> >> AFAIU this is a protection agains guest suddenly disabling APF
> >> mechanism.
> >
> > Can we provide that protection in MSR implementation. That is once APF
> > is enabled, it can't be disabled. Or it is a feature that we allow
> > guest to disable APF and want it that way?
> 
> We need to allow to disable the feature. E.g. think about kdump
> scenario, for example. Before we switch to kdump kernel we need to make
> sure there's no more 'magic' memory which can suggenly change.

So are we waiting to receive all page ready events before we switch
to kdump kernel? If yes, that's not good. Existing kernel is corrupted
and nothing meaningful can be done. So we should switch to next
kernel asap. Previous kernel is dying anyway, so we don't care to
receive page ready events.

> Also,
> kdump kernel may not even support APF so it will get very confused when
> APF events get delivered.

New kernel can just ignore these events if it does not support async
pf? 

This is somewhat similar to devices still doing interrupts in new
kernel. And solution for that seemed to be doing a "reset" of devices
in new kernel. We probably need similar logic where in new kernel
we simply disable "async pf" so that we don't get new notifications.

IOW, waiting in crashed kernel for all page ready events does not
sound like a good option. It reduces the reliability of kdump
operation. 

> 
> >
> >> What do we do with all the 'page ready' events after, we
> >> can't deliver them anymore. So we just eat them (hoping guest will
> >> unfreeze all processes on its own before disabling the mechanism).
> >> 
> >> It is the existing logic, my patch doesn't change it.
> >
> > I see its existing logic. Just it is very confusing and will be good
> > if we can atleast explain it with some comments.
> >
> > I don't know what to make out of this.
> >
> > bool kvm_arch_can_inject_async_page_present(struct kvm_vcpu *vcpu)
> > {
> >         if (!(vcpu->arch.apf.msr_val & KVM_ASYNC_PF_ENABLED))
> >                 return true;
> >         else
> >                 return kvm_can_do_async_pf(vcpu);
> > }
> >
> > If feature is disabled, then do inject async pf page present. If feature
> > is enabled and check whether we can inject async pf right now or not.
> >
> > It probably will help if this check if feature being enabled/disabled
> > is outside kvm_arch_can_inject_async_page_present() at the callsite
> > of kvm_arch_can_inject_async_page_present() and there we explain that
> > why it is important to inject page ready events despite the fact
> > that feature is disabled.
> 
> This code would definitely love some comments added, will do in the next
> version. And I'll also think how to improve the readability, thanks for
> the feedback!

It definitely needs a comment (if it makes sense to still send page
ready events to guest).

Thanks
Vivek


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 4/8] KVM: x86: interrupt based APF page-ready event delivery
  2020-05-13 13:53           ` Vivek Goyal
@ 2020-05-13 14:03             ` Vivek Goyal
  2020-05-13 14:23             ` Vitaly Kuznetsov
  1 sibling, 0 replies; 43+ messages in thread
From: Vivek Goyal @ 2020-05-13 14:03 UTC (permalink / raw)
  To: Vitaly Kuznetsov
  Cc: kvm, x86, Paolo Bonzini, Andy Lutomirski, Thomas Gleixner,
	Borislav Petkov, H. Peter Anvin, Wanpeng Li, Sean Christopherson,
	Jim Mattson, Gavin Shan, Peter Zijlstra, linux-kernel

On Wed, May 13, 2020 at 09:53:50AM -0400, Vivek Goyal wrote:

[..]
> > > And this notion of same structure being shared across multiple events
> > > at the same time is just going to create more confusion, IMHO. If we
> > > can decouple it by serializing it, that definitely feels simpler to
> > > understand.
> > 
> > What if we just add sub-structures to the structure, e.g. 
> > 
> > struct kvm_vcpu_pv_apf_data {
> >         struct {
> >             __u32 apf_flag;
> >         } legacy_apf_data;
> >         struct {
> >             __u32 token;
> >         } apf_interrupt_data;
> >         ....
> >         __u8 pad[56];                                                                                  |
> >         __u32 enabled;                                                                                 |
> > };    
> > 
> > would it make it more obvious?

On a second thought, given we are not planning to use
this structure for synchrous events anymore, I think defining
struct might be overkill. May be a simple comment will do.

struct kvm_vcpu_pv_apf_data {
	/* Used by page fault based page not present notifications. Soon
	 * it will be legacy
	 */
	__u32 apf_flag;
	/* Used for interrupt based page ready notifications */
	__u32 token;
	...
	...
}

Thanks
Vivek


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 0/8] KVM: x86: Interrupt-based mechanism for async_pf 'page present' notifications
  2020-05-11 16:47 [PATCH 0/8] KVM: x86: Interrupt-based mechanism for async_pf 'page present' notifications Vitaly Kuznetsov
                   ` (8 preceding siblings ...)
  2020-05-12 15:32 ` [PATCH 0/8] KVM: x86: Interrupt-based mechanism for async_pf 'page present' notifications Vivek Goyal
@ 2020-05-13 14:16 ` Vivek Goyal
  2020-05-14 18:14   ` Vitaly Kuznetsov
  9 siblings, 1 reply; 43+ messages in thread
From: Vivek Goyal @ 2020-05-13 14:16 UTC (permalink / raw)
  To: Vitaly Kuznetsov
  Cc: kvm, x86, Paolo Bonzini, Andy Lutomirski, Thomas Gleixner,
	Borislav Petkov, H. Peter Anvin, Wanpeng Li, Sean Christopherson,
	Jim Mattson, Gavin Shan, Peter Zijlstra, linux-kernel

On Mon, May 11, 2020 at 06:47:44PM +0200, Vitaly Kuznetsov wrote:
> Concerns were expressed around (ab)using #PF for KVM's async_pf mechanism,
> it seems that re-using #PF exception for a PV mechanism wasn't a great
> idea after all. The Grand Plan is to switch to using e.g. #VE for 'page
> not present' events and normal APIC interrupts for 'page ready' events.
> This series does the later.

Hi Vitaly,

How does any of this impact nested virtualization code (if any).

I have tried understanding that logic, but I have to admit, I could
never get it.

arch/x86/kvm/mmu/mmu.c

int kvm_handle_page_fault(struct kvm_vcpu *vcpu, u64 error_code,
                                u64 fault_address, char *insn, int insn_len)
{
        switch (vcpu->arch.apf.host_apf_reason) {
		case KVM_PV_REASON_PAGE_NOT_PRESENT:
			kvm_async_pf_task_wait(fault_address, 0);
		case KVM_PV_REASON_PAGE_READY:
			kvm_async_pf_task_wake(fault_address);
	}
}

Vivek

> 
> Changes since RFC:
> - Using #PF for 'page ready' is deprecated and removed [Paolo Bonzini]
> - 'reason' field in 'struct kvm_vcpu_pv_apf_data' is not used for 'page ready'
>   notifications and 'pageready_token' is not used for 'page not present' events
>   [Paolo Bonzini]
> - Renamed MSR_KVM_ASYNC_PF2 -> MSR_KVM_ASYNC_PF_INT [Peter Xu]
> - Drop 'enabled' field from MSR_KVM_ASYNC_PF_INT [Peter Xu]
> - Other minor changes supporting the above.
> 
> Vitaly Kuznetsov (8):
>   Revert "KVM: async_pf: Fix #DF due to inject "Page not Present" and
>     "Page Ready" exceptions simultaneously"
>   KVM: x86: extend struct kvm_vcpu_pv_apf_data with token info
>   KVM: introduce kvm_read_guest_offset_cached()
>   KVM: x86: interrupt based APF page-ready event delivery
>   KVM: x86: acknowledgment mechanism for async pf page ready
>     notifications
>   KVM: x86: announce KVM_FEATURE_ASYNC_PF_INT
>   KVM: x86: Switch KVM guest to using interrupts for page ready APF
>     delivery
>   KVM: x86: drop KVM_PV_REASON_PAGE_READY case from
>     kvm_handle_page_fault()
> 
>  Documentation/virt/kvm/cpuid.rst     |   6 ++
>  Documentation/virt/kvm/msr.rst       | 106 ++++++++++++++------
>  arch/s390/include/asm/kvm_host.h     |   2 +
>  arch/x86/entry/entry_32.S            |   5 +
>  arch/x86/entry/entry_64.S            |   5 +
>  arch/x86/include/asm/hardirq.h       |   3 +
>  arch/x86/include/asm/irq_vectors.h   |   6 +-
>  arch/x86/include/asm/kvm_host.h      |   7 +-
>  arch/x86/include/asm/kvm_para.h      |   6 ++
>  arch/x86/include/uapi/asm/kvm_para.h |  11 ++-
>  arch/x86/kernel/irq.c                |   9 ++
>  arch/x86/kernel/kvm.c                |  42 ++++++--
>  arch/x86/kvm/cpuid.c                 |   3 +-
>  arch/x86/kvm/mmu/mmu.c               |  10 +-
>  arch/x86/kvm/x86.c                   | 142 ++++++++++++++++++---------
>  include/linux/kvm_host.h             |   3 +
>  include/uapi/linux/kvm.h             |   1 +
>  virt/kvm/async_pf.c                  |  10 ++
>  virt/kvm/kvm_main.c                  |  19 +++-
>  19 files changed, 295 insertions(+), 101 deletions(-)
> 
> -- 
> 2.25.4
> 


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 4/8] KVM: x86: interrupt based APF page-ready event delivery
  2020-05-13 13:53           ` Vivek Goyal
  2020-05-13 14:03             ` Vivek Goyal
@ 2020-05-13 14:23             ` Vitaly Kuznetsov
  2020-05-13 18:46               ` Vivek Goyal
  1 sibling, 1 reply; 43+ messages in thread
From: Vitaly Kuznetsov @ 2020-05-13 14:23 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: kvm, x86, Paolo Bonzini, Andy Lutomirski, Thomas Gleixner,
	Borislav Petkov, H. Peter Anvin, Wanpeng Li, Sean Christopherson,
	Jim Mattson, Gavin Shan, Peter Zijlstra, linux-kernel

Vivek Goyal <vgoyal@redhat.com> writes:

> On Wed, May 13, 2020 at 11:03:48AM +0200, Vitaly Kuznetsov wrote:
>> Vivek Goyal <vgoyal@redhat.com> writes:
>> 
>> > On Tue, May 12, 2020 at 05:50:53PM +0200, Vitaly Kuznetsov wrote:
>> >> Vivek Goyal <vgoyal@redhat.com> writes:
>> >> 
>> >> >
>> >> > So if we are using a common structure "kvm_vcpu_pv_apf_data" to deliver
>> >> > type1 and type2 events, to me it makes sense to retain existing
>> >> > KVM_PV_REASON_PAGE_READY and KVM_PV_REASON_PAGE_NOT_PRESENT. Just that
>> >> > in new scheme of things, KVM_PV_REASON_PAGE_NOT_PRESENT will be delivered
>> >> > using #PF (and later possibly using #VE) and KVM_PV_REASON_PAGE_READY
>> >> > will be delivered using interrupt.
>> >> 
>> >> We use different fields for page-not-present and page-ready events so
>> >> there is no intersection. If we start setting KVM_PV_REASON_PAGE_READY
>> >> to 'reason' we may accidentally destroy a 'page-not-present' event.
>> >
>> > This is confusing. So you mean at one point of time we might be using
>> > same shared data structure for two events.
>> >
>> > - ->reason will be set to 1 and you will inject page_not_present
>> >   execption.
>> >
>> > - If some page gets ready, you will now set ->token and queue 
>> >   page ready exception. 
>> >
>> > Its very confusing. Can't we serialize the delivery of these events. So
>> > that only one is in progress so that this structure is used by one event
>> > at a time.
>> 
>> This is not how the mechanism (currently) works:
>> - A process accesses a page which is swapped out
>> 
>> - We deliver synchronious APF (#PF) to the guest, it freezes the process
>> and switches to another one.
>> 
>> - Another process accesses a swapped out page, APF is delivered and it
>> also got frozen
>> 
>> ...
>> 
>> - At some point one of the previously unavailable pages become available
>> (not necessarily the last or the first one) and we deliver this info via
>> asynchronous APF (now interrupt).
>> 
>> - Note, after we deliver the interrupt and before it is actually
>> consumed we may have another synchronous APF (#PF) event.
>> 
>> So we really need to separate memory locations for synchronous (type-1,
>> #PF,...) and asynchronous (type-2, interrupt, ...) data.
>> 
>> The naming is unfortunate and misleading, I agree. What is currently
>> named 'reason' should be something like 'apf_flag_for_pf' and it just
>> means to distinguish real #PF from APF. This is going away in the
>> future, we won't be abusing #PF anymore so I'd keep it as it is now,
>> maybe add another comment somewhere.
>
> Shouldn't we do #VE changes also at the same time. Otherwise we are
> again creating this intermediate mode where synchronous notifications
> happen using #PF. Once we move to #VE, old guests will again stop
> getting async #PF? 
>
> Also there is this issue of who is using this common shared area
> between host and guest (struct kvm_vcpu_pv_apf_data) and how to 
> use fields in this struct without breaking existing guests.
>
> For now I see you have modified meaning of fields of this structure
> (->reason) beacuse you are not concerned about older guets because
> they will not receive async pf to begin with. If that's the case,
> renaming even reason makes sense. Havind said that, as we are
> planning to not use this data structure for synchronous notifications,
> creating this intermediate mode is just going to create more
> confusion. So in new ABI, we should not even try to make use of
> ->reason and design #VE changes at the same time. Core problem with
> current ABI is racy access to ->reason and this patch set does not
> get rid of that race anyway.

Well, at least 'page ready' notifications don't access reason and can't
collide with real #PF which is already an improvement :-)

>
>> The other location is
>> 'pageready_token' and it actually contains the token. This is to stay
>> long term so any suggestions for better naming are welcome.
>
> pageready_token makes sense if this structure is being used both
> types of notifications otherwise just ->token might be enough.

Yes, sure, when we stop using this structure for type-1 APF 'pageready_'
prefix becomes superfluous.

>
>> 
>> We could've separated these two memory locations completely and
>> e.g. used the remaining 56 bits of MSR_KVM_ASYNC_PF_INT as the new
>> location information. Maybe we should do that just to avoid the
>> confusion.
>
> If this is V2 of ABI, why not do "struct struct kvm_vcpu_pv_apf_data_v2"
> instead? Only reason you seem to be sticking to existing structure and
> ->reason fields because of #PF based sync notifications which is planned
> to go away soon. So if we do both the changes together, there is no
> need to keep this confusion w.r.t ->reason and existing structure.
>

Yes, sure. I don't know if Paolo already has #VE patches in his stash
but we can definitely keep this on the list untill the second half is
done, no rush.

>> 
>> >
>> > Also how do I extend it now to do error delivery. Please keep that in
>> > mind. We don't want to be redesigning this stuff again. Its already
>> > very complicated.
>> >
>> > I really need ->reason field to be usable in both the paths so that
>> > error can be delivered.
>> 
>> If we want to use 'reason' for both we'll get into a weird scenario when
>> exception is blocking interrupt and, what's more confusing, vice
>> versa. I'd like to avoid this complexity in KVM code. My suggestion
>> would be to rename 'reason' to something like 'pf_abuse_flag' so it
>> doesn't cause the confusion and add new 'reason' after 'token'.
>> 
>> >
>> > And this notion of same structure being shared across multiple events
>> > at the same time is just going to create more confusion, IMHO. If we
>> > can decouple it by serializing it, that definitely feels simpler to
>> > understand.
>> 
>> What if we just add sub-structures to the structure, e.g. 
>> 
>> struct kvm_vcpu_pv_apf_data {
>>         struct {
>>             __u32 apf_flag;
>>         } legacy_apf_data;
>>         struct {
>>             __u32 token;
>>         } apf_interrupt_data;
>>         ....
>>         __u8 pad[56];                                                                                  |
>>         __u32 enabled;                                                                                 |
>> };    
>> 
>> would it make it more obvious?
>
> To me this seems like an improvement. Just looking at it, it is clear
> to me who is using what. But soon legacy_apf_data will be unused. Its
> not clear to me when will be able to get rid of apf_flag and reuse it
> for something else without having to worry about older guests.
>
> This will be second best option, if for some reason we don't want to
> do #VE changes at the same time. To make new design cleaner and more
> understandable, doing #VE changes at the same time will be good.

No objections :-)

>
>> 
>> >
>> >> 
>> >> With this patchset we have two completely separate channels:
>> >> 1) Page-not-present goes through #PF and 'reason' in struct
>> >> kvm_vcpu_pv_apf_data.
>> >> 2) Page-ready goes through interrupt and 'pageready_token' in the same
>> >> kvm_vcpu_pv_apf_data.
>> >> 
>> >> >
>> >> >> +
>> >> >> +	Note, previously, type 2 (page present) events were delivered via the
>> >> >> +	same #PF exception as type 1 (page not present) events but this is
>> >> >> +	now deprecated.
>> >> >
>> >> >> If bit 3 (interrupt based delivery) is not set APF events are not delivered.
>> >> >
>> >> > So all the old guests which were getting async pf will suddenly find
>> >> > that async pf does not work anymore (after hypervisor update). And
>> >> > some of them might report it as performance issue (if there were any
>> >> > performance benefits to be had with async pf).
>> >> 
>> >> We still do APF_HALT but generally yes, there might be some performance
>> >> implications. My RFC was preserving #PF path but the suggestion was to
>> >> retire it completely. (and I kinda like it because it makes a lot of
>> >> code go away)
>> >
>> > Ok. I don't have strong opinion here. If paolo likes it this way, so be
>> > it. :-)
>> 
>> APF is a slowpath for overcommited scenarios and when we switch to
>> APF_HALT we allow the host to run some other guest while PF is
>> processed. This is not the same from guest's perspective but from host's
>> we're fine as we're not wasting cycles.
>> 
>> >
>> >> 
>> >> >
>> >> > [..]
>> >> >>  
>> >> >>  bool kvm_arch_can_inject_async_page_present(struct kvm_vcpu *vcpu)
>> >> >>  {
>> >> >> -	if (!(vcpu->arch.apf.msr_val & KVM_ASYNC_PF_ENABLED))
>> >> >> +	if (!kvm_pv_async_pf_enabled(vcpu))
>> >> >>  		return true;
>> >> >
>> >> > What does above mean. If async pf is not enabled, then it returns true,
>> >> > implying one can inject async page present. But if async pf is not
>> >> > enabled, there is no need to inject these events.
>> >> 
>> >> AFAIU this is a protection agains guest suddenly disabling APF
>> >> mechanism.
>> >
>> > Can we provide that protection in MSR implementation. That is once APF
>> > is enabled, it can't be disabled. Or it is a feature that we allow
>> > guest to disable APF and want it that way?
>> 
>> We need to allow to disable the feature. E.g. think about kdump
>> scenario, for example. Before we switch to kdump kernel we need to make
>> sure there's no more 'magic' memory which can suggenly change.
>
> So are we waiting to receive all page ready events before we switch
> to kdump kernel? If yes, that's not good. Existing kernel is corrupted
> and nothing meaningful can be done. So we should switch to next
> kernel asap. Previous kernel is dying anyway, so we don't care to
> receive page ready events.

No, we don't wait. We just disable the feature so no new notifications
will get delivered (if they are the new kernel will get confused). If we
happen to access one of the previously-unavailable pages again the host
will just freeze us untill the page is ready, no harm done.

>
>> Also,
>> kdump kernel may not even support APF so it will get very confused when
>> APF events get delivered.
>
> New kernel can just ignore these events if it does not support async
> pf? 
>
> This is somewhat similar to devices still doing interrupts in new
> kernel. And solution for that seemed to be doing a "reset" of devices
> in new kernel. We probably need similar logic where in new kernel
> we simply disable "async pf" so that we don't get new notifications.

Right and that's what we're doing - just disabling new notifications.

>
> IOW, waiting in crashed kernel for all page ready events does not
> sound like a good option. It reduces the reliability of kdump
> operation. 

AFAIK we don't wait (kvm_pv_guest_cpu_reboot() -> kvm_pv_disable_apf()
-> wrmsrl()).

>
>> 
>> >
>> >> What do we do with all the 'page ready' events after, we
>> >> can't deliver them anymore. So we just eat them (hoping guest will
>> >> unfreeze all processes on its own before disabling the mechanism).
>> >> 
>> >> It is the existing logic, my patch doesn't change it.
>> >
>> > I see its existing logic. Just it is very confusing and will be good
>> > if we can atleast explain it with some comments.
>> >
>> > I don't know what to make out of this.
>> >
>> > bool kvm_arch_can_inject_async_page_present(struct kvm_vcpu *vcpu)
>> > {
>> >         if (!(vcpu->arch.apf.msr_val & KVM_ASYNC_PF_ENABLED))
>> >                 return true;
>> >         else
>> >                 return kvm_can_do_async_pf(vcpu);
>> > }
>> >
>> > If feature is disabled, then do inject async pf page present. If feature
>> > is enabled and check whether we can inject async pf right now or not.
>> >
>> > It probably will help if this check if feature being enabled/disabled
>> > is outside kvm_arch_can_inject_async_page_present() at the callsite
>> > of kvm_arch_can_inject_async_page_present() and there we explain that
>> > why it is important to inject page ready events despite the fact
>> > that feature is disabled.
>> 
>> This code would definitely love some comments added, will do in the next
>> version. And I'll also think how to improve the readability, thanks for
>> the feedback!
>
> It definitely needs a comment (if it makes sense to still send page
> ready events to guest).
>

-- 
Vitaly


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 4/8] KVM: x86: interrupt based APF page-ready event delivery
  2020-05-13 14:23             ` Vitaly Kuznetsov
@ 2020-05-13 18:46               ` Vivek Goyal
  2020-05-14  8:08                 ` Vitaly Kuznetsov
  0 siblings, 1 reply; 43+ messages in thread
From: Vivek Goyal @ 2020-05-13 18:46 UTC (permalink / raw)
  To: Vitaly Kuznetsov
  Cc: kvm, x86, Paolo Bonzini, Andy Lutomirski, Thomas Gleixner,
	Borislav Petkov, H. Peter Anvin, Wanpeng Li, Sean Christopherson,
	Jim Mattson, Gavin Shan, Peter Zijlstra, linux-kernel

On Wed, May 13, 2020 at 04:23:55PM +0200, Vitaly Kuznetsov wrote:

[..]
> >> Also,
> >> kdump kernel may not even support APF so it will get very confused when
> >> APF events get delivered.
> >
> > New kernel can just ignore these events if it does not support async
> > pf? 
> >
> > This is somewhat similar to devices still doing interrupts in new
> > kernel. And solution for that seemed to be doing a "reset" of devices
> > in new kernel. We probably need similar logic where in new kernel
> > we simply disable "async pf" so that we don't get new notifications.
> 
> Right and that's what we're doing - just disabling new notifications.

Nice.

So why there is a need to deliver "page ready" notifications
to guest after guest has disabled async pf. Atleast kdump does not
seem to need it. It will boot into second kernel anyway, irrespective
of the fact whether it receives page ready or not.

Thanks
Vivek


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 4/8] KVM: x86: interrupt based APF page-ready event delivery
  2020-05-13 18:46               ` Vivek Goyal
@ 2020-05-14  8:08                 ` Vitaly Kuznetsov
  2020-05-14 13:31                   ` Vivek Goyal
  0 siblings, 1 reply; 43+ messages in thread
From: Vitaly Kuznetsov @ 2020-05-14  8:08 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: kvm, x86, Paolo Bonzini, Andy Lutomirski, Thomas Gleixner,
	Borislav Petkov, H. Peter Anvin, Wanpeng Li, Sean Christopherson,
	Jim Mattson, Gavin Shan, Peter Zijlstra, linux-kernel

Vivek Goyal <vgoyal@redhat.com> writes:

> On Wed, May 13, 2020 at 04:23:55PM +0200, Vitaly Kuznetsov wrote:
>
> [..]
>> >> Also,
>> >> kdump kernel may not even support APF so it will get very confused when
>> >> APF events get delivered.
>> >
>> > New kernel can just ignore these events if it does not support async
>> > pf? 
>> >
>> > This is somewhat similar to devices still doing interrupts in new
>> > kernel. And solution for that seemed to be doing a "reset" of devices
>> > in new kernel. We probably need similar logic where in new kernel
>> > we simply disable "async pf" so that we don't get new notifications.
>> 
>> Right and that's what we're doing - just disabling new notifications.
>
> Nice.
>
> So why there is a need to deliver "page ready" notifications
> to guest after guest has disabled async pf. Atleast kdump does not
> seem to need it. It will boot into second kernel anyway, irrespective
> of the fact whether it receives page ready or not.

We don't deliver anything to the guest after it disables APF (neither
'page ready' for what was previously missing, nor 'page not ready' for
new faults), kvm_arch_can_inject_async_page_present() is just another
misnomer, it should be named something like
'kvm_arch_can_unqueue_async_page_present()' meaning that 'page ready'
notification can be 'unqueued' from internal KVM queue. We will either
deliver it (when guest has APF enabled) or just drop it (when guest has
APF disabled). The only case when it has to stay in the queue is when
guest has APF enabled and the slot is still busy (so it didn't get to
process a previously delivered notification). We will try to deliver it
again after guest writes to MSR_KVM_ASYNC_PF_ACK.

-- 
Vitaly


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 4/8] KVM: x86: interrupt based APF page-ready event delivery
  2020-05-14  8:08                 ` Vitaly Kuznetsov
@ 2020-05-14 13:31                   ` Vivek Goyal
  0 siblings, 0 replies; 43+ messages in thread
From: Vivek Goyal @ 2020-05-14 13:31 UTC (permalink / raw)
  To: Vitaly Kuznetsov
  Cc: kvm, x86, Paolo Bonzini, Andy Lutomirski, Thomas Gleixner,
	Borislav Petkov, H. Peter Anvin, Wanpeng Li, Sean Christopherson,
	Jim Mattson, Gavin Shan, Peter Zijlstra, linux-kernel

On Thu, May 14, 2020 at 10:08:37AM +0200, Vitaly Kuznetsov wrote:
> Vivek Goyal <vgoyal@redhat.com> writes:
> 
> > On Wed, May 13, 2020 at 04:23:55PM +0200, Vitaly Kuznetsov wrote:
> >
> > [..]
> >> >> Also,
> >> >> kdump kernel may not even support APF so it will get very confused when
> >> >> APF events get delivered.
> >> >
> >> > New kernel can just ignore these events if it does not support async
> >> > pf? 
> >> >
> >> > This is somewhat similar to devices still doing interrupts in new
> >> > kernel. And solution for that seemed to be doing a "reset" of devices
> >> > in new kernel. We probably need similar logic where in new kernel
> >> > we simply disable "async pf" so that we don't get new notifications.
> >> 
> >> Right and that's what we're doing - just disabling new notifications.
> >
> > Nice.
> >
> > So why there is a need to deliver "page ready" notifications
> > to guest after guest has disabled async pf. Atleast kdump does not
> > seem to need it. It will boot into second kernel anyway, irrespective
> > of the fact whether it receives page ready or not.
> 
> We don't deliver anything to the guest after it disables APF (neither
> 'page ready' for what was previously missing, nor 'page not ready' for
> new faults), kvm_arch_can_inject_async_page_present() is just another
> misnomer, it should be named something like
> 'kvm_arch_can_unqueue_async_page_present()' meaning that 'page ready'
> notification can be 'unqueued' from internal KVM queue. We will either
> deliver it (when guest has APF enabled) or just drop it (when guest has
> APF disabled). The only case when it has to stay in the queue is when
> guest has APF enabled and the slot is still busy (so it didn't get to
> process a previously delivered notification). We will try to deliver it
> again after guest writes to MSR_KVM_ASYNC_PF_ACK.

This makes sense. Renaming this function to make it more clear will
help understanding code better.

Vivek


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 0/8] KVM: x86: Interrupt-based mechanism for async_pf 'page present' notifications
  2020-05-13 14:16 ` Vivek Goyal
@ 2020-05-14 18:14   ` Vitaly Kuznetsov
  0 siblings, 0 replies; 43+ messages in thread
From: Vitaly Kuznetsov @ 2020-05-14 18:14 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: kvm, x86, Paolo Bonzini, Andy Lutomirski, Thomas Gleixner,
	Borislav Petkov, H. Peter Anvin, Wanpeng Li, Sean Christopherson,
	Jim Mattson, Gavin Shan, Peter Zijlstra, linux-kernel

Vivek Goyal <vgoyal@redhat.com> writes:

> On Mon, May 11, 2020 at 06:47:44PM +0200, Vitaly Kuznetsov wrote:
>> Concerns were expressed around (ab)using #PF for KVM's async_pf mechanism,
>> it seems that re-using #PF exception for a PV mechanism wasn't a great
>> idea after all. The Grand Plan is to switch to using e.g. #VE for 'page
>> not present' events and normal APIC interrupts for 'page ready' events.
>> This series does the later.
>
> Hi Vitaly,
>
> How does any of this impact nested virtualization code (if any).
>
> I have tried understanding that logic, but I have to admit, I could
> never get it.
>
> arch/x86/kvm/mmu/mmu.c
>
> int kvm_handle_page_fault(struct kvm_vcpu *vcpu, u64 error_code,
>                                 u64 fault_address, char *insn, int insn_len)
> {
>         switch (vcpu->arch.apf.host_apf_reason) {
> 		case KVM_PV_REASON_PAGE_NOT_PRESENT:
> 			kvm_async_pf_task_wait(fault_address, 0);
> 		case KVM_PV_REASON_PAGE_READY:
> 			kvm_async_pf_task_wake(fault_address);
> 	}
> }
>

"[PATCH 8/8] KVM: x86: drop KVM_PV_REASON_PAGE_READY case from
kvm_handle_page_fault()" modifies this a little bit.

Basically (and if I understand this correctly) we have the following APF
related feature (bit 2 in MSR_KVM_ASYNC_PF_EN): "asynchronous page faults
are delivered to L1 as #PF vmexits.". When enabled, it allows L0 to
inject #PF when L2 guest is running. L1 will see this as '#PF vmexit'
and the code you cite will do exactly what do_async_page_fault() is
doing.

When we switch to interrupt based delivery for 'page ready' events we
don't need a special handling for them in L1 (as we don't need any
special handling for all interrupts from devices in kernel when KVM
guest is running).

I have to admit I haven't tested nested scenario yet, "what could go
wrong?" :-)

-- 
Vitaly


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 2/8] KVM: x86: extend struct kvm_vcpu_pv_apf_data with token info
  2020-05-13 12:52           ` Vivek Goyal
@ 2020-05-15 15:59             ` Paolo Bonzini
  2020-05-15 18:46               ` Sean Christopherson
  0 siblings, 1 reply; 43+ messages in thread
From: Paolo Bonzini @ 2020-05-15 15:59 UTC (permalink / raw)
  To: Vivek Goyal, Sean Christopherson
  Cc: Vitaly Kuznetsov, kvm, x86, Andy Lutomirski, Thomas Gleixner,
	Borislav Petkov, H. Peter Anvin, Wanpeng Li, Jim Mattson,
	Gavin Shan, Peter Zijlstra, linux-kernel

On 13/05/20 14:52, Vivek Goyal wrote:
>>> Also, type of event should not necessarily be tied to delivery method.
>>> For example if we end up introducing say, "KVM_PV_REASON_PAGE_ERROR", then
>>> I would think that event can be injected both using exception (#PF or #VE)
>>> as well as interrupt (depending on state of system).
>> Why bother preserving backwards compatibility?
> New machanism does not have to support old guests but old mechanism
> should probably continue to work and deprecated slowly, IMHO. Otherwise
> guests which were receiving async page faults will suddenly stop getting
> it over hypervisor upgrade and possibly see drop in performance.

Unfortunately, the old mechanism was broken enough, and in enough
different ways, that it's better to just drop it.

The new one using #VE is not coming very soon (we need to emulate it for
<Broadwell and AMD processors, so it's not entirely trivial) so we are
going to keep "page not ready" delivery using #PF for some time or even
forever.  However, page ready notification as #PF is going away for good.

That said, type1/type2 is quite bad. :)  Let's change that to page not
present / page ready.

Thanks,

Paolo


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 2/8] KVM: x86: extend struct kvm_vcpu_pv_apf_data with token info
  2020-05-15 15:59             ` Paolo Bonzini
@ 2020-05-15 18:46               ` Sean Christopherson
  2020-05-15 19:18                 ` Paolo Bonzini
  0 siblings, 1 reply; 43+ messages in thread
From: Sean Christopherson @ 2020-05-15 18:46 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Vivek Goyal, Vitaly Kuznetsov, kvm, x86, Andy Lutomirski,
	Thomas Gleixner, Borislav Petkov, H. Peter Anvin, Wanpeng Li,
	Jim Mattson, Gavin Shan, Peter Zijlstra, linux-kernel

On Fri, May 15, 2020 at 05:59:43PM +0200, Paolo Bonzini wrote:
> On 13/05/20 14:52, Vivek Goyal wrote:
> >>> Also, type of event should not necessarily be tied to delivery method.
> >>> For example if we end up introducing say, "KVM_PV_REASON_PAGE_ERROR", then
> >>> I would think that event can be injected both using exception (#PF or #VE)
> >>> as well as interrupt (depending on state of system).
> >> Why bother preserving backwards compatibility?
> > New machanism does not have to support old guests but old mechanism
> > should probably continue to work and deprecated slowly, IMHO. Otherwise
> > guests which were receiving async page faults will suddenly stop getting
> > it over hypervisor upgrade and possibly see drop in performance.
> 
> Unfortunately, the old mechanism was broken enough, and in enough
> different ways, that it's better to just drop it.
> 
> The new one using #VE is not coming very soon (we need to emulate it for
> <Broadwell and AMD processors, so it's not entirely trivial) so we are
> going to keep "page not ready" delivery using #PF for some time or even
> forever.  However, page ready notification as #PF is going away for good.

And isn't hardware based EPT Violation #VE going to require a completely
different protocol than what is implemented today?  For hardware based #VE,
KVM won't intercept the fault, i.e. the guest will need to make an explicit
hypercall to request the page.  That seems like it'll be as time consuming
to implement as emulating EPT Violation #VE in KVM.

> That said, type1/type2 is quite bad. :)  Let's change that to page not
> present / page ready.

Why even bother using 'struct kvm_vcpu_pv_apf_data' for the #PF case?  VMX
only requires error_code[31:16]==0 and SVM doesn't vet it at all, i.e. we
can (ab)use the error code to indicate an async #PF by setting it to an
impossible value, e.g. 0xaaaa (a is for async!).  That partciular error code
is even enforced by the SDM, which states:

  [SGX] this flag is set only if the P flag (bit 0) is 1 and the RSVD flag
  (bit 3) and the PK flag (bit 5) are both 0.

I.e. we've got bigger problems if hardware generates a !PRESENT, WRITE, RSVD,
PK, SGX page fault :-)

Then the page ready becomes the only guest-side consumer of the in-memory
struct, e.g. it can be renamed to something like kvm_vcpu_pv_apf_ready and
doesn't need a reason field (though it still needs a "busy" bit) as written.
It'd also eliminate the apf_put_user() in kvm_arch_async_page_not_present().

I believe it would also allow implementing (in the future) "async #PF ready"
as a ring buffer, i.e. allow kvm_check_async_pf_completion() to coalesce all
ready pages into a single injected interrupt.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 2/8] KVM: x86: extend struct kvm_vcpu_pv_apf_data with token info
  2020-05-15 18:46               ` Sean Christopherson
@ 2020-05-15 19:18                 ` Paolo Bonzini
  2020-05-15 20:33                   ` Vivek Goyal
  2020-05-15 20:43                   ` Sean Christopherson
  0 siblings, 2 replies; 43+ messages in thread
From: Paolo Bonzini @ 2020-05-15 19:18 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Vivek Goyal, Vitaly Kuznetsov, kvm, x86, Andy Lutomirski,
	Thomas Gleixner, Borislav Petkov, H. Peter Anvin, Wanpeng Li,
	Jim Mattson, Gavin Shan, Peter Zijlstra, linux-kernel

On 15/05/20 20:46, Sean Christopherson wrote:
>> The new one using #VE is not coming very soon (we need to emulate it for
>> <Broadwell and AMD processors, so it's not entirely trivial) so we are
>> going to keep "page not ready" delivery using #PF for some time or even
>> forever.  However, page ready notification as #PF is going away for good.
> 
> And isn't hardware based EPT Violation #VE going to require a completely
> different protocol than what is implemented today?  For hardware based #VE,
> KVM won't intercept the fault, i.e. the guest will need to make an explicit
> hypercall to request the page.

Yes, but it's a fairly simple hypercall to implement.

>> That said, type1/type2 is quite bad. :)  Let's change that to page not
>> present / page ready.
> 
> Why even bother using 'struct kvm_vcpu_pv_apf_data' for the #PF case?  VMX
> only requires error_code[31:16]==0 and SVM doesn't vet it at all, i.e. we
> can (ab)use the error code to indicate an async #PF by setting it to an
> impossible value, e.g. 0xaaaa (a is for async!).  That partciular error code
> is even enforced by the SDM, which states:

Possibly, but it's water under the bridge now.  And the #PF mechanism
also has the problem with NMIs that happen before the error code is read
and page faults happening in the handler (you may connect some dots now).

For #VE, the virtualization exception data area is enough to hold all
the data that is needed.

Paolo


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 2/8] KVM: x86: extend struct kvm_vcpu_pv_apf_data with token info
  2020-05-15 19:18                 ` Paolo Bonzini
@ 2020-05-15 20:33                   ` Vivek Goyal
  2020-05-15 20:53                     ` Sean Christopherson
  2020-05-15 20:43                   ` Sean Christopherson
  1 sibling, 1 reply; 43+ messages in thread
From: Vivek Goyal @ 2020-05-15 20:33 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sean Christopherson, Vitaly Kuznetsov, kvm, x86, Andy Lutomirski,
	Thomas Gleixner, Borislav Petkov, H. Peter Anvin, Wanpeng Li,
	Jim Mattson, Gavin Shan, Peter Zijlstra, linux-kernel

On Fri, May 15, 2020 at 09:18:07PM +0200, Paolo Bonzini wrote:
> On 15/05/20 20:46, Sean Christopherson wrote:
> >> The new one using #VE is not coming very soon (we need to emulate it for
> >> <Broadwell and AMD processors, so it's not entirely trivial) so we are
> >> going to keep "page not ready" delivery using #PF for some time or even
> >> forever.  However, page ready notification as #PF is going away for good.
> > 
> > And isn't hardware based EPT Violation #VE going to require a completely
> > different protocol than what is implemented today?  For hardware based #VE,
> > KVM won't intercept the fault, i.e. the guest will need to make an explicit
> > hypercall to request the page.
> 
> Yes, but it's a fairly simple hypercall to implement.
> 
> >> That said, type1/type2 is quite bad. :)  Let's change that to page not
> >> present / page ready.
> > 
> > Why even bother using 'struct kvm_vcpu_pv_apf_data' for the #PF case?  VMX
> > only requires error_code[31:16]==0 and SVM doesn't vet it at all, i.e. we
> > can (ab)use the error code to indicate an async #PF by setting it to an
> > impossible value, e.g. 0xaaaa (a is for async!).  That partciular error code
> > is even enforced by the SDM, which states:
> 
> Possibly, but it's water under the bridge now.
> And the #PF mechanism also has the problem with NMIs that happen before
> the error code is read
> and page faults happening in the handler (you may connect some dots now).

I understood that following was racy.

do_async_page_fault <--- kvm injected async page fault
  NMI happens (Before kvm_read_and_reset_pf_reason() is done)
   ->do_async_page_fault() (This is regular page fault but it will read
   			    reason from shared area and will treat itself
			    as async page fault)

So this is racy.

But if we get rid of the notion of reading from shared region in page
fault handler, will we not get rid of this race.

I am assuming that error_code is not racy as it is pushed on stack.
What am I missing.

Thanks
Vivek


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 2/8] KVM: x86: extend struct kvm_vcpu_pv_apf_data with token info
  2020-05-15 19:18                 ` Paolo Bonzini
  2020-05-15 20:33                   ` Vivek Goyal
@ 2020-05-15 20:43                   ` Sean Christopherson
  2020-05-15 22:23                     ` Paolo Bonzini
  1 sibling, 1 reply; 43+ messages in thread
From: Sean Christopherson @ 2020-05-15 20:43 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Vivek Goyal, Vitaly Kuznetsov, kvm, x86, Andy Lutomirski,
	Thomas Gleixner, Borislav Petkov, H. Peter Anvin, Wanpeng Li,
	Jim Mattson, Gavin Shan, Peter Zijlstra, linux-kernel

On Fri, May 15, 2020 at 09:18:07PM +0200, Paolo Bonzini wrote:
> On 15/05/20 20:46, Sean Christopherson wrote:
> > Why even bother using 'struct kvm_vcpu_pv_apf_data' for the #PF case?  VMX
> > only requires error_code[31:16]==0 and SVM doesn't vet it at all, i.e. we
> > can (ab)use the error code to indicate an async #PF by setting it to an
> > impossible value, e.g. 0xaaaa (a is for async!).  That partciular error code
> > is even enforced by the SDM, which states:
> 
> Possibly, but it's water under the bridge now.

Why is that?  I thought we were redoing the entire thing because the current
ABI is unfixably broken?  In other words, since the guest needs to change,
why are we keeping any of the current async #PF pieces?  E.g. why keep using
#PF instead of usurping something like #NP?

> And the #PF mechanism also has the problem with NMIs that happen before the
> error code is read and page faults happening in the handler.

Hrm, I think there's no unfixable problem except for a pathological
#PF->NMI->#DB->#PF scenario.  But it is a problem :-(

FWIW, the error code isn't a problem, CR2 is the killer.  The issue Andy
originally pointed out is

  #PF: async page fault (KVM_PV_REASON_PAGE_NOT_PRESENT)
     NMI: before CR2 or KVM_PF_REASON_PAGE_NOT_PRESENT
       #PF: normal page fault (NMI handler accesses user memory, e.g. perf)

With current async #PF, the problem is that CR2 and apf_reason are out of
sync, not that CR2 or the error code are lost.  E.g. the above could also
happen with a regular #PF on both ends, and obviously that works just fine.

In other words, the error code doesn't suffer the same problem because it's
pushed on the stack, not shoved into a static memory location.

CR2 is the real problem, even though it's saved by the NMI handler.  The
simple case where the NMI handler hits an async #PF before it can save CR2
is avoidable by KVM not injecting #PF if NMIs are blocked.  The pathological
case would be if there's a #DB at the beginning of the NMI handler; the IRET
from the #DB would unblock NMIs and then open up the guest to hitting an
async #PF on the NMI handler before CR2 is saved by the guest. :-(

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 2/8] KVM: x86: extend struct kvm_vcpu_pv_apf_data with token info
  2020-05-15 20:33                   ` Vivek Goyal
@ 2020-05-15 20:53                     ` Sean Christopherson
  0 siblings, 0 replies; 43+ messages in thread
From: Sean Christopherson @ 2020-05-15 20:53 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Paolo Bonzini, Vitaly Kuznetsov, kvm, x86, Andy Lutomirski,
	Thomas Gleixner, Borislav Petkov, H. Peter Anvin, Wanpeng Li,
	Jim Mattson, Gavin Shan, Peter Zijlstra, linux-kernel

On Fri, May 15, 2020 at 04:33:52PM -0400, Vivek Goyal wrote:
> On Fri, May 15, 2020 at 09:18:07PM +0200, Paolo Bonzini wrote:
> > On 15/05/20 20:46, Sean Christopherson wrote:
> > >> The new one using #VE is not coming very soon (we need to emulate it for
> > >> <Broadwell and AMD processors, so it's not entirely trivial) so we are
> > >> going to keep "page not ready" delivery using #PF for some time or even
> > >> forever.  However, page ready notification as #PF is going away for good.
> > > 
> > > And isn't hardware based EPT Violation #VE going to require a completely
> > > different protocol than what is implemented today?  For hardware based #VE,
> > > KVM won't intercept the fault, i.e. the guest will need to make an explicit
> > > hypercall to request the page.
> > 
> > Yes, but it's a fairly simple hypercall to implement.
> > 
> > >> That said, type1/type2 is quite bad. :)  Let's change that to page not
> > >> present / page ready.
> > > 
> > > Why even bother using 'struct kvm_vcpu_pv_apf_data' for the #PF case?  VMX
> > > only requires error_code[31:16]==0 and SVM doesn't vet it at all, i.e. we
> > > can (ab)use the error code to indicate an async #PF by setting it to an
> > > impossible value, e.g. 0xaaaa (a is for async!).  That partciular error code
> > > is even enforced by the SDM, which states:
> > 
> > Possibly, but it's water under the bridge now.
> > And the #PF mechanism also has the problem with NMIs that happen before
> > the error code is read
> > and page faults happening in the handler (you may connect some dots now).
> 
> I understood that following was racy.
> 
> do_async_page_fault <--- kvm injected async page fault
>   NMI happens (Before kvm_read_and_reset_pf_reason() is done)
>    ->do_async_page_fault() (This is regular page fault but it will read
>    			    reason from shared area and will treat itself
> 			    as async page fault)
> 
> So this is racy.
> 
> But if we get rid of the notion of reading from shared region in page
> fault handler, will we not get rid of this race.
> 
> I am assuming that error_code is not racy as it is pushed on stack.
> What am I missing.

Nothing, AFAICT.  As I mentioned in a different mail, CR2 can be squished,
but I don't see how error code can be lost.

But, because CR2 can be squished, there still needs to be an in-memory busy
flag even if error code is used as the host #PF indicator, otherwise the
guest could lose one of the tokens.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 2/8] KVM: x86: extend struct kvm_vcpu_pv_apf_data with token info
  2020-05-15 20:43                   ` Sean Christopherson
@ 2020-05-15 22:23                     ` Paolo Bonzini
  2020-05-15 23:16                       ` Sean Christopherson
  2020-05-21 14:59                       ` Vitaly Kuznetsov
  0 siblings, 2 replies; 43+ messages in thread
From: Paolo Bonzini @ 2020-05-15 22:23 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Vivek Goyal, Vitaly Kuznetsov, kvm, x86, Andy Lutomirski,
	Thomas Gleixner, Borislav Petkov, H. Peter Anvin, Wanpeng Li,
	Jim Mattson, Gavin Shan, Peter Zijlstra, linux-kernel,
	Andrew Cooper

On 15/05/20 22:43, Sean Christopherson wrote:
> On Fri, May 15, 2020 at 09:18:07PM +0200, Paolo Bonzini wrote:
>> On 15/05/20 20:46, Sean Christopherson wrote:
>>> Why even bother using 'struct kvm_vcpu_pv_apf_data' for the #PF case?  VMX
>>> only requires error_code[31:16]==0 and SVM doesn't vet it at all, i.e. we
>>> can (ab)use the error code to indicate an async #PF by setting it to an
>>> impossible value, e.g. 0xaaaa (a is for async!).  That partciular error code
>>> is even enforced by the SDM, which states:
>>
>> Possibly, but it's water under the bridge now.
> 
> Why is that?  I thought we were redoing the entire thing because the current
> ABI is unfixably broken?  In other words, since the guest needs to change,
> why are we keeping any of the current async #PF pieces?  E.g. why keep using
> #PF instead of usurping something like #NP?

Because that would be 3 ABIs to support instead of 2.  The #PF solution
is only broken as long as you allow async PF from ring 0 (which wasn't
even true except for preemptable kernels) _and_ have NMIs that can
generate page faults.  We also have the #PF vmexit part for nested
virtualization.  This adds up and makes a quick fix for 'page not ready'
notifications not that quick.

However, interrupts for 'page ready' do have a bunch of advantages (more
control on what can be preempted by the notification, a saner check for
new page faults which is effectively a bug fix) so it makes sense to get
them in more quickly (probably 5.9 at this point due to the massive
cleanups that are being done around interrupt vectors).

>> And the #PF mechanism also has the problem with NMIs that happen before the
>> error code is read and page faults happening in the handler.
> 
> Hrm, I think there's no unfixable problem except for a pathological
> #PF->NMI->#DB->#PF scenario.  But it is a problem :-(

Yeah, that made no sense.  But still I'm not sure the x86 maintainers
would like it.

The only possible isue with #VE is the re-entrancy at the end.  Andy
proposed re-enabling it from an interrupt, but here is one solution that
can be done almost entirely from C.  The idea is to split the IST in two
halves, and flip between them in the TSS with an XOR operation on entry
to the interrupt handler.  This is possible because there won't ever be
more than two handlers active at the same time.  Unlike if you used
SUB/ADD, with XOR you don't have to restore the IST on exit: the two
halves will take turns as the current IST and there's no problematic
window between the ADD and the IRET.

The pseudocode would be:

- on #VE entry
   xor 512 with the IST address in the TSS
   check if saved RSP comes from the IST
   if so:
     overwrite the saved flags/CS/SS in the "other" IST half with the
       current value of the registers
     overwrite the saved RSP in the "other" IST half with the address
       of the top of the IST itself
     overwrite the saved RIP in the "other" IST half with the address
       of a trampoline (see below)
   else:
     save the top 5 words of the IST somewhere
     do normal stuff

- the trampoline restores the 5 words at the top of the IST with five
  push instructions, and jumps back to the first instruction of the
  handler

Everything except the first step can even be done in C.

Here is an example.

Assuming that on entry to the outer #VE the IST is the "even" half, the
outer #VE moves the IST to the "odd" half and the return
flags/CS/SS/RSP/RIP are saved.

After the reentrancy flag has been cleared, a nested #VE arrives and
runs within the "odd" half of the IST.  The IST is moved back to the
"even" half and the return flags/CS/SS/RSP/RIP in the "even" half are
patched to point to the trampoline.

When we get back to the outer handler the reentrancy flag not zero, so
even though the IST points to the current stack, reentrancy is
impossible and we go just fine through the few final instructions of the
handler.

On outer #VE exit, the IRET instruction jumps to the trampoline, with
RSP pointing at the top of the "even" half.  The return
flags/CS/SS/RSP/RIP are restored, and everything restarts from the
beginning: the outer #VE moves the IST to the "odd" half, the return
flags/CS/SS/RSP/RIP are saved, the data for the nested #VE is fished out
of the virtualization exception area and so on.

Paolo


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 2/8] KVM: x86: extend struct kvm_vcpu_pv_apf_data with token info
  2020-05-15 22:23                     ` Paolo Bonzini
@ 2020-05-15 23:16                       ` Sean Christopherson
  2020-05-21 14:59                       ` Vitaly Kuznetsov
  1 sibling, 0 replies; 43+ messages in thread
From: Sean Christopherson @ 2020-05-15 23:16 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Vivek Goyal, Vitaly Kuznetsov, kvm, x86, Andy Lutomirski,
	Thomas Gleixner, Borislav Petkov, H. Peter Anvin, Wanpeng Li,
	Jim Mattson, Gavin Shan, Peter Zijlstra, linux-kernel,
	Andrew Cooper

On Sat, May 16, 2020 at 12:23:31AM +0200, Paolo Bonzini wrote:
> On 15/05/20 22:43, Sean Christopherson wrote:
> > On Fri, May 15, 2020 at 09:18:07PM +0200, Paolo Bonzini wrote:
> >> On 15/05/20 20:46, Sean Christopherson wrote:
> >>> Why even bother using 'struct kvm_vcpu_pv_apf_data' for the #PF case?  VMX
> >>> only requires error_code[31:16]==0 and SVM doesn't vet it at all, i.e. we
> >>> can (ab)use the error code to indicate an async #PF by setting it to an
> >>> impossible value, e.g. 0xaaaa (a is for async!).  That partciular error code
> >>> is even enforced by the SDM, which states:
> >>
> >> Possibly, but it's water under the bridge now.
> > 
> > Why is that?  I thought we were redoing the entire thing because the current
> > ABI is unfixably broken?  In other words, since the guest needs to change,
> > why are we keeping any of the current async #PF pieces?  E.g. why keep using
> > #PF instead of usurping something like #NP?
> 
> Because that would be 3 ABIs to support instead of 2.  The #PF solution
> is only broken as long as you allow async PF from ring 0 (which wasn't
> even true except for preemptable kernels) _and_ have NMIs that can
> generate page faults.  We also have the #PF vmexit part for nested
> virtualization.  This adds up and makes a quick fix for 'page not ready'
> notifications not that quick.
> 
> However, interrupts for 'page ready' do have a bunch of advantages (more
> control on what can be preempted by the notification, a saner check for
> new page faults which is effectively a bug fix) so it makes sense to get
> them in more quickly (probably 5.9 at this point due to the massive
> cleanups that are being done around interrupt vectors).

Ah, so the plan is to fix 'page ready' for the current ABI, but keep the
existing 'page not ready' part because it's not thaaaat broken.  Correct?

In that case, is Andy's patch to kill KVM_ASYNC_PF_SEND_ALWAYS in the guest
being picked up?

I'll read through your #VE stuff on Monday :-).

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 2/8] KVM: x86: extend struct kvm_vcpu_pv_apf_data with token info
  2020-05-15 22:23                     ` Paolo Bonzini
  2020-05-15 23:16                       ` Sean Christopherson
@ 2020-05-21 14:59                       ` Vitaly Kuznetsov
  2020-05-22  7:33                         ` Paolo Bonzini
  1 sibling, 1 reply; 43+ messages in thread
From: Vitaly Kuznetsov @ 2020-05-21 14:59 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Vivek Goyal, kvm, x86, Andy Lutomirski, Thomas Gleixner,
	Borislav Petkov, H. Peter Anvin, Wanpeng Li, Jim Mattson,
	Gavin Shan, Peter Zijlstra, linux-kernel, Andrew Cooper,
	Sean Christopherson

Paolo Bonzini <pbonzini@redhat.com> writes:

> However, interrupts for 'page ready' do have a bunch of advantages (more
> control on what can be preempted by the notification, a saner check for
> new page faults which is effectively a bug fix) so it makes sense to get
> them in more quickly (probably 5.9 at this point due to the massive
> cleanups that are being done around interrupt vectors).
>

Actually, I have almost no feedback to address in v2 :-) Almost all
discussion are happening around #VE. Don't mean to rush or anything but
if the 'cleanups' are finalized I can hopefully rebase and retest very
quickly as it's only the KVM guest part which intersects with them, the
rest should be KVM-only. But 5.9 is good too)

-- 
Vitaly


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 2/8] KVM: x86: extend struct kvm_vcpu_pv_apf_data with token info
  2020-05-11 16:47 ` [PATCH 2/8] KVM: x86: extend struct kvm_vcpu_pv_apf_data with token info Vitaly Kuznetsov
  2020-05-12 15:27   ` Vivek Goyal
@ 2020-05-21 18:38   ` Vivek Goyal
  2020-05-23 16:34     ` Vitaly Kuznetsov
  1 sibling, 1 reply; 43+ messages in thread
From: Vivek Goyal @ 2020-05-21 18:38 UTC (permalink / raw)
  To: Vitaly Kuznetsov
  Cc: kvm, x86, Paolo Bonzini, Andy Lutomirski, Thomas Gleixner,
	Borislav Petkov, H. Peter Anvin, Wanpeng Li, Sean Christopherson,
	Jim Mattson, Gavin Shan, Peter Zijlstra, linux-kernel

On Mon, May 11, 2020 at 06:47:46PM +0200, Vitaly Kuznetsov wrote:
> Currently, APF mechanism relies on the #PF abuse where the token is being
> passed through CR2. If we switch to using interrupts to deliver page-ready
> notifications we need a different way to pass the data. Extent the existing
> 'struct kvm_vcpu_pv_apf_data' with token information for page-ready
> notifications.
> 
> The newly introduced apf_put_user_ready() temporary puts both reason
> and token information, this will be changed to put token only when we
> switch to interrupt based notifications.
> 
> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
> ---
>  arch/x86/include/uapi/asm/kvm_para.h |  3 ++-
>  arch/x86/kvm/x86.c                   | 17 +++++++++++++----
>  2 files changed, 15 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/x86/include/uapi/asm/kvm_para.h b/arch/x86/include/uapi/asm/kvm_para.h
> index 2a8e0b6b9805..e3602a1de136 100644
> --- a/arch/x86/include/uapi/asm/kvm_para.h
> +++ b/arch/x86/include/uapi/asm/kvm_para.h
> @@ -113,7 +113,8 @@ struct kvm_mmu_op_release_pt {
>  
>  struct kvm_vcpu_pv_apf_data {
>  	__u32 reason;

Hi Vitaly,

Given we are redoing it, can we convert "reason" into a flag instead
and use bit 0 for signalling "page not present" Then rest of the 31
bits can be used for other purposes. I potentially want to use one bit to
signal error (if it is known at the time of injecting #PF).

> -	__u8 pad[60];
> +	__u32 pageready_token;
> +	__u8 pad[56];

Given token is 32 bit, for returning error in "page ready" type messages,
I will probably use padding bytes and create pagready_flag and use one
of the bits to signal error.

Thanks
Vivek


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 2/8] KVM: x86: extend struct kvm_vcpu_pv_apf_data with token info
  2020-05-21 14:59                       ` Vitaly Kuznetsov
@ 2020-05-22  7:33                         ` Paolo Bonzini
  0 siblings, 0 replies; 43+ messages in thread
From: Paolo Bonzini @ 2020-05-22  7:33 UTC (permalink / raw)
  To: Vitaly Kuznetsov
  Cc: Vivek Goyal, kvm, x86, Andy Lutomirski, Thomas Gleixner,
	Borislav Petkov, H. Peter Anvin, Wanpeng Li, Jim Mattson,
	Gavin Shan, Peter Zijlstra, linux-kernel, Andrew Cooper,
	Sean Christopherson

On 21/05/20 16:59, Vitaly Kuznetsov wrote:
>> However, interrupts for 'page ready' do have a bunch of advantages (more
>> control on what can be preempted by the notification, a saner check for
>> new page faults which is effectively a bug fix) so it makes sense to get
>> them in more quickly (probably 5.9 at this point due to the massive
>> cleanups that are being done around interrupt vectors).
>>
> Actually, I have almost no feedback to address in v2 :-) Almost all
> discussion are happening around #VE. Don't mean to rush or anything but
> if the 'cleanups' are finalized I can hopefully rebase and retest very
> quickly as it's only the KVM guest part which intersects with them, the
> rest should be KVM-only. But 5.9 is good too)

Yeah, going for 5.9 would only be due to the conflicts.  Do send v2
anyway now, so that we can use a merge commit to convert the interrupt
vector to the 5.8 style.

Paolo


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 2/8] KVM: x86: extend struct kvm_vcpu_pv_apf_data with token info
  2020-05-21 18:38   ` Vivek Goyal
@ 2020-05-23 16:34     ` Vitaly Kuznetsov
  2020-05-26 12:50       ` Vivek Goyal
  0 siblings, 1 reply; 43+ messages in thread
From: Vitaly Kuznetsov @ 2020-05-23 16:34 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: kvm, x86, Paolo Bonzini, Andy Lutomirski, Thomas Gleixner,
	Borislav Petkov, H. Peter Anvin, Wanpeng Li, Sean Christopherson,
	Jim Mattson, Gavin Shan, Peter Zijlstra, linux-kernel

Vivek Goyal <vgoyal@redhat.com> writes:

> On Mon, May 11, 2020 at 06:47:46PM +0200, Vitaly Kuznetsov wrote:
>> Currently, APF mechanism relies on the #PF abuse where the token is being
>> passed through CR2. If we switch to using interrupts to deliver page-ready
>> notifications we need a different way to pass the data. Extent the existing
>> 'struct kvm_vcpu_pv_apf_data' with token information for page-ready
>> notifications.
>> 
>> The newly introduced apf_put_user_ready() temporary puts both reason
>> and token information, this will be changed to put token only when we
>> switch to interrupt based notifications.
>> 
>> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
>> ---
>>  arch/x86/include/uapi/asm/kvm_para.h |  3 ++-
>>  arch/x86/kvm/x86.c                   | 17 +++++++++++++----
>>  2 files changed, 15 insertions(+), 5 deletions(-)
>> 
>> diff --git a/arch/x86/include/uapi/asm/kvm_para.h b/arch/x86/include/uapi/asm/kvm_para.h
>> index 2a8e0b6b9805..e3602a1de136 100644
>> --- a/arch/x86/include/uapi/asm/kvm_para.h
>> +++ b/arch/x86/include/uapi/asm/kvm_para.h
>> @@ -113,7 +113,8 @@ struct kvm_mmu_op_release_pt {
>>  
>>  struct kvm_vcpu_pv_apf_data {
>>  	__u32 reason;
>
> Hi Vitaly,
>
> Given we are redoing it, can we convert "reason" into a flag instead
> and use bit 0 for signalling "page not present" Then rest of the 31
> bits can be used for other purposes. I potentially want to use one bit to
> signal error (if it is known at the time of injecting #PF).

Yes, I think we can do that. The existing KVM_PV_REASON_PAGE_READY and
KVM_PV_REASON_PAGE_NOT_PRESENT are mutually exclusive and can be
converted to flags (we'll only have KVM_PV_REASON_PAGE_NOT_PRESENT in
use when this series is merged).

>
>> -	__u8 pad[60];
>> +	__u32 pageready_token;
>> +	__u8 pad[56];
>
> Given token is 32 bit, for returning error in "page ready" type messages,
> I will probably use padding bytes and create pagready_flag and use one
> of the bits to signal error.

In case we're intended to pass more data in synchronous notifications,
shall we leave some blank space after 'flags' ('reason' previously) and
before 'token'?

-- 
Vitaly


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 2/8] KVM: x86: extend struct kvm_vcpu_pv_apf_data with token info
  2020-05-23 16:34     ` Vitaly Kuznetsov
@ 2020-05-26 12:50       ` Vivek Goyal
  0 siblings, 0 replies; 43+ messages in thread
From: Vivek Goyal @ 2020-05-26 12:50 UTC (permalink / raw)
  To: Vitaly Kuznetsov
  Cc: kvm, x86, Paolo Bonzini, Andy Lutomirski, Thomas Gleixner,
	Borislav Petkov, H. Peter Anvin, Wanpeng Li, Sean Christopherson,
	Jim Mattson, Gavin Shan, Peter Zijlstra, linux-kernel

On Sat, May 23, 2020 at 06:34:17PM +0200, Vitaly Kuznetsov wrote:
> Vivek Goyal <vgoyal@redhat.com> writes:
> 
> > On Mon, May 11, 2020 at 06:47:46PM +0200, Vitaly Kuznetsov wrote:
> >> Currently, APF mechanism relies on the #PF abuse where the token is being
> >> passed through CR2. If we switch to using interrupts to deliver page-ready
> >> notifications we need a different way to pass the data. Extent the existing
> >> 'struct kvm_vcpu_pv_apf_data' with token information for page-ready
> >> notifications.
> >> 
> >> The newly introduced apf_put_user_ready() temporary puts both reason
> >> and token information, this will be changed to put token only when we
> >> switch to interrupt based notifications.
> >> 
> >> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
> >> ---
> >>  arch/x86/include/uapi/asm/kvm_para.h |  3 ++-
> >>  arch/x86/kvm/x86.c                   | 17 +++++++++++++----
> >>  2 files changed, 15 insertions(+), 5 deletions(-)
> >> 
> >> diff --git a/arch/x86/include/uapi/asm/kvm_para.h b/arch/x86/include/uapi/asm/kvm_para.h
> >> index 2a8e0b6b9805..e3602a1de136 100644
> >> --- a/arch/x86/include/uapi/asm/kvm_para.h
> >> +++ b/arch/x86/include/uapi/asm/kvm_para.h
> >> @@ -113,7 +113,8 @@ struct kvm_mmu_op_release_pt {
> >>  
> >>  struct kvm_vcpu_pv_apf_data {
> >>  	__u32 reason;
> >
> > Hi Vitaly,
> >
> > Given we are redoing it, can we convert "reason" into a flag instead
> > and use bit 0 for signalling "page not present" Then rest of the 31
> > bits can be used for other purposes. I potentially want to use one bit to
> > signal error (if it is known at the time of injecting #PF).
> 
> Yes, I think we can do that. The existing KVM_PV_REASON_PAGE_READY and
> KVM_PV_REASON_PAGE_NOT_PRESENT are mutually exclusive and can be
> converted to flags (we'll only have KVM_PV_REASON_PAGE_NOT_PRESENT in
> use when this series is merged).
> 
> >
> >> -	__u8 pad[60];
> >> +	__u32 pageready_token;
> >> +	__u8 pad[56];
> >
> > Given token is 32 bit, for returning error in "page ready" type messages,
> > I will probably use padding bytes and create pagready_flag and use one
> > of the bits to signal error.
> 
> In case we're intended to pass more data in synchronous notifications,
> shall we leave some blank space after 'flags' ('reason' previously) and
> before 'token'?

Given you are planning to move away from using kvm_vcpu_pv_apf_data
for synchronous notifications, I will not be too concerned about it.

Thanks
Vivek


^ permalink raw reply	[flat|nested] 43+ messages in thread

end of thread, other threads:[~2020-05-26 12:50 UTC | newest]

Thread overview: 43+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-05-11 16:47 [PATCH 0/8] KVM: x86: Interrupt-based mechanism for async_pf 'page present' notifications Vitaly Kuznetsov
2020-05-11 16:47 ` [PATCH 1/8] Revert "KVM: async_pf: Fix #DF due to inject "Page not Present" and "Page Ready" exceptions simultaneously" Vitaly Kuznetsov
2020-05-11 16:47 ` [PATCH 2/8] KVM: x86: extend struct kvm_vcpu_pv_apf_data with token info Vitaly Kuznetsov
2020-05-12 15:27   ` Vivek Goyal
2020-05-12 15:40     ` Vitaly Kuznetsov
2020-05-12 15:53       ` Vivek Goyal
2020-05-12 17:50         ` Sean Christopherson
2020-05-13  9:09           ` Vitaly Kuznetsov
2020-05-13 12:52           ` Vivek Goyal
2020-05-15 15:59             ` Paolo Bonzini
2020-05-15 18:46               ` Sean Christopherson
2020-05-15 19:18                 ` Paolo Bonzini
2020-05-15 20:33                   ` Vivek Goyal
2020-05-15 20:53                     ` Sean Christopherson
2020-05-15 20:43                   ` Sean Christopherson
2020-05-15 22:23                     ` Paolo Bonzini
2020-05-15 23:16                       ` Sean Christopherson
2020-05-21 14:59                       ` Vitaly Kuznetsov
2020-05-22  7:33                         ` Paolo Bonzini
2020-05-12 21:15       ` Vivek Goyal
2020-05-21 18:38   ` Vivek Goyal
2020-05-23 16:34     ` Vitaly Kuznetsov
2020-05-26 12:50       ` Vivek Goyal
2020-05-11 16:47 ` [PATCH 3/8] KVM: introduce kvm_read_guest_offset_cached() Vitaly Kuznetsov
2020-05-11 16:47 ` [PATCH 4/8] KVM: x86: interrupt based APF page-ready event delivery Vitaly Kuznetsov
2020-05-12 14:24   ` Vivek Goyal
2020-05-12 15:50     ` Vitaly Kuznetsov
2020-05-12 18:07       ` Vivek Goyal
2020-05-13  9:03         ` Vitaly Kuznetsov
2020-05-13 13:53           ` Vivek Goyal
2020-05-13 14:03             ` Vivek Goyal
2020-05-13 14:23             ` Vitaly Kuznetsov
2020-05-13 18:46               ` Vivek Goyal
2020-05-14  8:08                 ` Vitaly Kuznetsov
2020-05-14 13:31                   ` Vivek Goyal
2020-05-11 16:47 ` [PATCH 5/8] KVM: x86: acknowledgment mechanism for async pf page ready notifications Vitaly Kuznetsov
2020-05-11 16:47 ` [PATCH 6/8] KVM: x86: announce KVM_FEATURE_ASYNC_PF_INT Vitaly Kuznetsov
2020-05-11 16:47 ` [PATCH 7/8] KVM: x86: Switch KVM guest to using interrupts for page ready APF delivery Vitaly Kuznetsov
2020-05-11 16:47 ` [PATCH 8/8] KVM: x86: drop KVM_PV_REASON_PAGE_READY case from kvm_handle_page_fault() Vitaly Kuznetsov
2020-05-12 15:32 ` [PATCH 0/8] KVM: x86: Interrupt-based mechanism for async_pf 'page present' notifications Vivek Goyal
2020-05-12 16:12   ` Vitaly Kuznetsov
2020-05-13 14:16 ` Vivek Goyal
2020-05-14 18:14   ` Vitaly Kuznetsov

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).