* [RFC PATCH 0/3] kvm,x86: Improve kvm page fault error handling
@ 2020-06-16 21:48 ` Vivek Goyal
  0 siblings, 0 replies; 26+ messages in thread
From: Vivek Goyal @ 2020-06-16 21:48 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: virtio-fs, miklos, stefanha, dgilbert, vgoyal, vkuznets,
	pbonzini, wanpengli, sean.j.christopherson

Hi,

This is an RFC patch series to improve kvm page fault error handling. It
has been compiled and tested only on x86. I have not tested or thought
about nested configurations yet.

This series is built on top of Vitaly's patches that deliver "page ready"
events using interrupts, but it has not been rebased on top of the recent
interrupt rework yet. The patches are also available here:

https://github.com/rhvgoyal/linux/commits/asyncpf-error-v1

Problem
=======
Currently kvm page fault error handling is rather unpredictable. If
a page fault fails and kvm decided not to do an async page fault, then
upon error we exit to user space, qemu prints
"error: kvm run failed Bad address" along with the associated cpu state,
and the VM freezes.

But if kvm decided to do an async page fault, then async_pf_execute()
simply ignores the error code (-EFAULT) returned by
get_user_pages_remote() and injects a "page ready" event into the guest.
The guest retries the faulting instruction, takes an exit again, kvm
retries the async page fault again, and this cycle repeats, forming an
infinite loop.

I can reproduce this -EFAULT situation easily. Create a file
(nvdimm.img) and export it to the guest as an nvdimm device. Inside the
guest, create an ext4 filesystem on the device and mount it with dax
enabled. Now mmap a file (/mnt/pmem0/foo.txt) and load from it one page
at a time, and truncate nvdimm.img on the host. When the guest tries to
load from the nvdimm, the backing page is not mapped in the page tables
any more (due to the truncation), so we take an exit and try to fault
the page in. Depending on whether the guest was in kernel mode or user
mode at the time of the exit, we then either exit to user space with a
bad address or get into the infinite loop.
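
As an illustration, a minimal guest-side reproducer could look like the
sketch below. The mount point and file name are the ones used in this
description, and the file size is just an assumption; none of this code
is part of the series. Running it inside the guest while truncating
nvdimm.img on the host triggers the behavior described above:

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	long pagesz = sysconf(_SC_PAGESIZE);
	size_t len = 16UL * 1024 * 1024;	/* assumed size of foo.txt */
	size_t off;
	volatile char *p;
	char c;
	int fd = open("/mnt/pmem0/foo.txt", O_RDONLY);

	if (fd < 0) {
		perror("open");
		return 1;
	}
	p = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	/*
	 * Load one byte from every page. While this loop sleeps,
	 * truncate nvdimm.img on the host so the backing pages
	 * disappear and the next load has to be faulted in.
	 */
	for (off = 0; off < len; off += pagesz) {
		c = p[off];
		(void)c;
		usleep(100 * 1000);
	}
	munmap((void *)p, len);
	close(fd);
	return 0;
}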

I am implementing DAX support in virtiofs (which is very close to what
nvdimm does), and I have scenarios where a DAX mapped file in the guest
can get truncated on the host and page fault errors can happen. I need
to do better error handling instead of having the guest and host spin
infinitely. Otherwise this creates an attack vector: a kata container
only has to mount virtiofs with DAX, mmap a file, truncate that file on
the host, and then access it inside the guest, and it can hog kvm on the
host in this infinite loop of trying to fault in the page.

Proposed Solution
=================
The first idea is to make the error behavior uniform: when an error is
encountered, we exit to qemu, which prints the error message, and the VM
freezes. That ends the discrepancy between the sync and async page
fault behavior. The first patch of the series does that.

The second idea is that if we are doing an async page fault and the
guest is in a state where we can inject "page not present" and
"page ready" events, then instead of exiting to user space we send the
error back to the guest as part of the "page ready" event. This allows
the guest to do finer grained error handling, for example sending
SIGBUS to the offending process, so the whole VM does not have to go
down. The second patch implements this.
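
To make "finer grained error handling" concrete, the sketch below shows
what a guest user space process could do with the SIGBUS it would
receive for a truncated DAX mapping. This is only an illustration of the
intended guest-visible semantics, not code from this series, and the
helper names are hypothetical:

#include <setjmp.h>
#include <signal.h>
#include <string.h>

static sigjmp_buf dax_fault_env;

static void dax_sigbus_handler(int sig, siginfo_t *info, void *ctx)
{
	/* info->si_addr would point into the lost DAX mapping */
	(void)sig;
	(void)info;
	(void)ctx;
	siglongjmp(dax_fault_env, 1);
}

/* Returns 0 on success, -1 if the mapped page is gone on the host. */
static int read_dax_byte(const volatile char *addr, char *out)
{
	struct sigaction sa;

	memset(&sa, 0, sizeof(sa));
	sa.sa_sigaction = dax_sigbus_handler;
	sa.sa_flags = SA_SIGINFO;
	sigaction(SIGBUS, &sa, NULL);

	if (sigsetjmp(dax_fault_env, 1))
		return -1;
	*out = *addr;
	return 0;
}

The important point is that only the touching process is affected; the
rest of the guest, and the VM as a whole, keeps running.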

The third idea is to find a way to inject the error even when an async
page fault can't be injected. Right now we have disabled any kind of
async page fault delivery when the guest is in kernel mode, because
that was racy. Once we figure out a race-free way to inject a page
fault into the guest (using #VE?), we can use it to report errors back
to the guest even when it is in kernel mode. That would allow the guest
to call fixup_exception() and possibly recover from the situation
instead of calling panic(). This can only be implemented once we have
such a race-free way to inject an async page fault event into the
guest, so it is a future TBD item. For now, if we take an exit while
the guest is in kernel mode and an error happens, vcpu_run() fails and
we exit to user space.
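
The kind of kernel-mode recovery meant here would, once such a
race-free injection mechanism exists, look roughly like the hedged
sketch below: a hypothetical guest-side helper (the name is made up)
that reads through an exception-table protected accessor, modeled on
probe_kernel_read() as it existed at the time, so that
fixup_exception() can turn the fault into an error return instead of a
panic. None of this is implemented by the series; it only illustrates
the future TBD item:

#include <linux/uaccess.h>

/*
 * Hypothetical guest helper for reading from a DAX range whose host
 * backing may have been truncated.  The access goes through an
 * exception-table protected path, so if the hypervisor can deliver
 * the async PF error as a fault while we are in kernel mode,
 * fixup_exception() resolves it and we return -EFAULT to the caller
 * (e.g. virtio-fs) instead of panicking.
 */
static long virtiofs_dax_read_safe(void *dst, const void *dax_addr,
				   size_t len)
{
	return probe_kernel_read(dst, dax_addr, len);
}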

I have only compiled and tested this series on x86. Before refining it
further, I wanted to post it for feedback and see whether this is the
right direction.

Any feedback or comments are welcome.

Thanks
Vivek 

Vivek Goyal (3):
  kvm,x86: Force sync fault if previous attempts failed
  kvm: Add capability to be able to report async pf error to guest
  kvm, async_pf: Use FOLL_WRITE only for write faults

 Documentation/virt/kvm/cpuid.rst     |  4 +++
 Documentation/virt/kvm/msr.rst       | 10 +++---
 arch/x86/include/asm/kvm_host.h      |  4 +++
 arch/x86/include/asm/kvm_para.h      |  8 ++---
 arch/x86/include/uapi/asm/kvm_para.h | 10 ++++--
 arch/x86/kernel/kvm.c                | 34 +++++++++++++++----
 arch/x86/kvm/cpuid.c                 |  3 +-
 arch/x86/kvm/mmu.h                   |  2 +-
 arch/x86/kvm/mmu/mmu.c               | 11 ++++---
 arch/x86/kvm/x86.c                   | 49 +++++++++++++++++++++++-----
 include/linux/kvm_host.h             |  5 ++-
 virt/kvm/async_pf.c                  | 15 +++++++--
 12 files changed, 119 insertions(+), 36 deletions(-)

-- 
2.25.4



* [PATCH 1/3] kvm,x86: Force sync fault if previous attempts failed
  2020-06-16 21:48 ` [Virtio-fs] [RFC PATCH 0/3] kvm, x86: " Vivek Goyal
@ 2020-06-16 21:48   ` Vivek Goyal
  -1 siblings, 0 replies; 26+ messages in thread
From: Vivek Goyal @ 2020-06-16 21:48 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: virtio-fs, miklos, stefanha, dgilbert, vgoyal, vkuznets,
	pbonzini, wanpengli, sean.j.christopherson

Page fault error handling behavior in kvm is a little inconsistent when
a page fault reports an error. If we are doing the fault synchronously,
then we capture the error (-EFAULT) returned by __gfn_to_pfn_memslot(),
exit to user space, and qemu reports "error: kvm run failed Bad address".

But if we are doing an async page fault, then async_pf_execute() will
simply ignore the error reported by get_user_pages_remote(). It is
assumed that the page fault was successful, and either a page ready
event is injected into the guest or the guest is brought out of the
artificial halt state and run again. In both cases, when the guest
retries the instruction, it takes an exit again because the page fault
was not successful in the previous attempt, and this infinite loop
continues forever.

This patch tries to make the behavior consistent: instead of getting
into an infinite loop of retrying the page fault, exit to user space
and stop the VM if a page fault error happens. This can be improved by
injecting errors into the guest when that is allowed. Later patches can
inject the error when a process in the guest triggered the page fault,
in which case the guest process will receive SIGBUS. Currently we don't
have a way to inject errors when the guest is in kernel mode. Once we
have a race-free method to do so, we should be able to inject errors
there as well, and the guest can use fixup_exception() if the caller
set it up (virtio-fs).

When an async pf encounters an error, save that gfn, and the next time
the guest retries, do a sync fault instead of an async fault, so that
when the error is encountered again we exit to qemu and avoid the
infinite loop.

As of now only one error gfn is stored, which means it could be
overwritten before the next retry from the guest happens. But this is
just a hint, and if we miss it, we will catch it some other time. If
this becomes an issue, we could later maintain an array of error gfns
to help ease it.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 arch/x86/include/asm/kvm_host.h |  1 +
 arch/x86/kvm/mmu.h              |  2 +-
 arch/x86/kvm/mmu/mmu.c          |  2 +-
 arch/x86/kvm/x86.c              | 19 +++++++++++++++++--
 include/linux/kvm_host.h        |  1 +
 virt/kvm/async_pf.c             |  8 ++++++--
 6 files changed, 27 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 938497a6ebd7..348a73106556 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -778,6 +778,7 @@ struct kvm_vcpu_arch {
 		unsigned long nested_apf_token;
 		bool delivery_as_pf_vmexit;
 		bool pageready_pending;
+		gfn_t error_gfn;
 	} apf;
 
 	/* OSVW MSRs (AMD only) */
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 0ad06bfe2c2c..6fa085ff07a3 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -60,7 +60,7 @@ void kvm_init_mmu(struct kvm_vcpu *vcpu, bool reset_roots);
 void kvm_init_shadow_mmu(struct kvm_vcpu *vcpu, u32 cr0, u32 cr4, u32 efer);
 void kvm_init_shadow_ept_mmu(struct kvm_vcpu *vcpu, bool execonly,
 			     bool accessed_dirty, gpa_t new_eptp);
-bool kvm_can_do_async_pf(struct kvm_vcpu *vcpu);
+bool kvm_can_do_async_pf(struct kvm_vcpu *vcpu, gfn_t gfn);
 int kvm_handle_page_fault(struct kvm_vcpu *vcpu, u64 error_code,
 				u64 fault_address, char *insn, int insn_len);
 
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 15984dfde427..e207900ab303 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4078,7 +4078,7 @@ static bool try_async_pf(struct kvm_vcpu *vcpu, bool prefault, gfn_t gfn,
 	if (!async)
 		return false; /* *pfn has correct page already */
 
-	if (!prefault && kvm_can_do_async_pf(vcpu)) {
+	if (!prefault && kvm_can_do_async_pf(vcpu, cr2_or_gpa >> PAGE_SHIFT)) {
 		trace_kvm_try_async_get_page(cr2_or_gpa, gfn);
 		if (kvm_find_async_pf_gfn(vcpu, gfn)) {
 			trace_kvm_async_pf_doublefault(cr2_or_gpa, gfn);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 697d1b273a2f..4c5205434b9c 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -10513,7 +10513,7 @@ static bool kvm_can_deliver_async_pf(struct kvm_vcpu *vcpu)
 	return true;
 }
 
-bool kvm_can_do_async_pf(struct kvm_vcpu *vcpu)
+bool kvm_can_do_async_pf(struct kvm_vcpu *vcpu, gfn_t gfn)
 {
 	if (unlikely(!lapic_in_kernel(vcpu) ||
 		     kvm_event_needs_reinjection(vcpu) ||
@@ -10527,7 +10527,13 @@ bool kvm_can_do_async_pf(struct kvm_vcpu *vcpu)
 	 * If interrupts are off we cannot even use an artificial
 	 * halt state.
 	 */
-	return kvm_arch_interrupt_allowed(vcpu);
+	if (!kvm_arch_interrupt_allowed(vcpu))
+		return false;
+
+	if (vcpu->arch.apf.error_gfn == gfn)
+		return false;
+
+	return true;
 }
 
 bool kvm_arch_async_page_not_present(struct kvm_vcpu *vcpu,
@@ -10583,6 +10589,15 @@ void kvm_arch_async_page_present(struct kvm_vcpu *vcpu,
 		kvm_apic_set_irq(vcpu, &irq, NULL);
 	}
 
+	if (work->error_code) {
+		/*
+		 * Page fault failed and we don't have a way to report
+		 * errors back to guest yet. So save this pfn and force
+		 * sync page fault next time.
+		 */
+		vcpu->arch.apf.error_gfn = work->cr2_or_gpa >> PAGE_SHIFT;
+	}
+
 	vcpu->arch.apf.halted = false;
 	vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;
 }
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 216cdb6581e2..b8558334b184 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -207,6 +207,7 @@ struct kvm_async_pf {
 	struct kvm_arch_async_pf arch;
 	bool   wakeup_all;
 	bool notpresent_injected;
+	int error_code;
 };
 
 void kvm_clear_async_pf_completion_queue(struct kvm_vcpu *vcpu);
diff --git a/virt/kvm/async_pf.c b/virt/kvm/async_pf.c
index a36828fbf40a..6b30374a4de1 100644
--- a/virt/kvm/async_pf.c
+++ b/virt/kvm/async_pf.c
@@ -52,6 +52,7 @@ static void async_pf_execute(struct work_struct *work)
 	gpa_t cr2_or_gpa = apf->cr2_or_gpa;
 	int locked = 1;
 	bool first;
+	long ret;
 
 	might_sleep();
 
@@ -61,11 +62,14 @@ static void async_pf_execute(struct work_struct *work)
 	 * access remotely.
 	 */
 	down_read(&mm->mmap_sem);
-	get_user_pages_remote(NULL, mm, addr, 1, FOLL_WRITE, NULL, NULL,
-			&locked);
+	ret = get_user_pages_remote(NULL, mm, addr, 1, FOLL_WRITE, NULL, NULL,
+				    &locked);
 	if (locked)
 		up_read(&mm->mmap_sem);
 
+	if (ret < 0)
+		apf->error_code = ret;
+
 	if (IS_ENABLED(CONFIG_KVM_ASYNC_PF_SYNC))
 		kvm_arch_async_page_present(vcpu, apf);
 
-- 
2.25.4



* [PATCH 2/3] kvm: Add capability to be able to report async pf error to guest
  2020-06-16 21:48 ` [Virtio-fs] [RFC PATCH 0/3] kvm, x86: " Vivek Goyal
@ 2020-06-16 21:48   ` Vivek Goyal
  -1 siblings, 0 replies; 26+ messages in thread
From: Vivek Goyal @ 2020-06-16 21:48 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: virtio-fs, miklos, stefanha, dgilbert, vgoyal, vkuznets,
	pbonzini, wanpengli, sean.j.christopherson

As of now the asynchronous page fault mechanism assumes the host will
always be successful in resolving the page fault, so there are only two
states: page not present and page ready.

If a page is backed by a file and that file has been truncated (as
can be the case with virtio-fs), then the page fault handler on the
host returns -EFAULT.

As of now the async page fault logic does not look at the error code
(-EFAULT) returned by get_user_pages_remote() and returns PAGE_READY to
the guest. The guest tries to access the page and the page fault
happens again, which gets kvm into an infinite loop. (Killing the host
process gets kvm out of this loop though.)

This patch adds another state to the async page fault logic which
allows the host to return an error to the guest. Once the guest knows
that the async page fault can't be resolved, it can send SIGBUS to the
guest process (if user space was accessing the page in question).

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 Documentation/virt/kvm/cpuid.rst     |  4 +++
 Documentation/virt/kvm/msr.rst       | 10 +++++---
 arch/x86/include/asm/kvm_host.h      |  3 +++
 arch/x86/include/asm/kvm_para.h      |  8 +++---
 arch/x86/include/uapi/asm/kvm_para.h | 10 ++++++--
 arch/x86/kernel/kvm.c                | 34 ++++++++++++++++++++-----
 arch/x86/kvm/cpuid.c                 |  3 ++-
 arch/x86/kvm/mmu/mmu.c               |  2 +-
 arch/x86/kvm/x86.c                   | 38 ++++++++++++++++++++--------
 9 files changed, 83 insertions(+), 29 deletions(-)

diff --git a/Documentation/virt/kvm/cpuid.rst b/Documentation/virt/kvm/cpuid.rst
index a7dff9186bed..a4497f3a6e7f 100644
--- a/Documentation/virt/kvm/cpuid.rst
+++ b/Documentation/virt/kvm/cpuid.rst
@@ -92,6 +92,10 @@ KVM_FEATURE_ASYNC_PF_INT          14          guest checks this feature bit
                                               async pf acknowledgment msr
                                               0x4b564d07.
 
+KVM_FEATURE_ASYNC_PF_ERROR        15          paravirtualized async PF error
+                                              can be enabled by setting bit 4
+                                              when writing to msr 0x4b564d02
+
 KVM_FEATURE_CLOCSOURCE_STABLE_BIT 24          host will warn if no guest-side
                                               per-cpu warps are expeced in
                                               kvmclock
diff --git a/Documentation/virt/kvm/msr.rst b/Documentation/virt/kvm/msr.rst
index e37a14c323d2..513eb8e0b0eb 100644
--- a/Documentation/virt/kvm/msr.rst
+++ b/Documentation/virt/kvm/msr.rst
@@ -202,19 +202,21 @@ data:
 
 		/* Used for 'page ready' events delivered via interrupt notification */
 		__u32 token;
-
-		__u8 pad[56];
+                __u32 ready_flags;
+		__u8 pad[52];
 		__u32 enabled;
 	  };
 
-	Bits 5-4 of the MSR are reserved and should be zero. Bit 0 is set to 1
+	Bits 5 of the MSR is reserved and should be zero. Bit 0 is set to 1
 	when asynchronous page faults are enabled on the vcpu, 0 when disabled.
 	Bit 1 is 1 if asynchronous page faults can be injected when vcpu is in
 	cpl == 0. Bit 2 is 1 if asynchronous page faults are delivered to L1 as
 	#PF vmexits.  Bit 2 can be set only if KVM_FEATURE_ASYNC_PF_VMEXIT is
 	present in CPUID. Bit 3 enables interrupt based delivery of 'page ready'
 	events. Bit 3 can only be set if KVM_FEATURE_ASYNC_PF_INT is present in
-	CPUID.
+	CPUID. Bit 4 enables reporting of page fault errors if hypervisor
+        encounters errors while faulting in page. Bit 4 can only be set if
+        KVM_FEATURE_ASYNC_PF_ERROR is present in CPUID.
 
 	'Page not present' events are currently always delivered as synthetic
 	#PF exception. During delivery of these events APF CR2 register contains
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 348a73106556..ff8dbc604dbe 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -779,6 +779,7 @@ struct kvm_vcpu_arch {
 		bool delivery_as_pf_vmexit;
 		bool pageready_pending;
 		gfn_t error_gfn;
+		bool send_pf_error;
 	} apf;
 
 	/* OSVW MSRs (AMD only) */
@@ -1680,6 +1681,8 @@ void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu,
 			       struct kvm_async_pf *work);
 void kvm_arch_async_page_present_queued(struct kvm_vcpu *vcpu);
 bool kvm_arch_can_dequeue_async_page_present(struct kvm_vcpu *vcpu);
+void kvm_arch_async_page_fault_error(struct kvm_vcpu *vcpu,
+				     struct kvm_async_pf *work);
 extern bool kvm_find_async_pf_gfn(struct kvm_vcpu *vcpu, gfn_t gfn);
 
 int kvm_skip_emulated_instruction(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index bbc43e5411d9..6c738e91ca2c 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -89,8 +89,8 @@ static inline long kvm_hypercall4(unsigned int nr, unsigned long p1,
 bool kvm_para_available(void);
 unsigned int kvm_arch_para_features(void);
 unsigned int kvm_arch_para_hints(void);
-void kvm_async_pf_task_wait_schedule(u32 token);
-void kvm_async_pf_task_wake(u32 token);
+void kvm_async_pf_task_wait_schedule(u32 token, bool user_mode);
+void kvm_async_pf_task_wake(u32 token, bool is_err);
 u32 kvm_read_and_reset_apf_flags(void);
 void kvm_disable_steal_time(void);
 bool __kvm_handle_async_pf(struct pt_regs *regs, u32 token);
@@ -120,8 +120,8 @@ static inline void kvm_spinlock_init(void)
 #endif /* CONFIG_PARAVIRT_SPINLOCKS */
 
 #else /* CONFIG_KVM_GUEST */
-#define kvm_async_pf_task_wait_schedule(T) do {} while(0)
-#define kvm_async_pf_task_wake(T) do {} while(0)
+#define kvm_async_pf_task_wait_schedule(T, U) do {} while(0)
+#define kvm_async_pf_task_wake(T, I) do {} while(0)
 
 static inline bool kvm_para_available(void)
 {
diff --git a/arch/x86/include/uapi/asm/kvm_para.h b/arch/x86/include/uapi/asm/kvm_para.h
index 812e9b4c1114..b43fd2ddc949 100644
--- a/arch/x86/include/uapi/asm/kvm_para.h
+++ b/arch/x86/include/uapi/asm/kvm_para.h
@@ -32,6 +32,7 @@
 #define KVM_FEATURE_POLL_CONTROL	12
 #define KVM_FEATURE_PV_SCHED_YIELD	13
 #define KVM_FEATURE_ASYNC_PF_INT	14
+#define KVM_FEATURE_ASYNC_PF_ERROR	15
 
 #define KVM_HINTS_REALTIME      0
 
@@ -85,6 +86,7 @@ struct kvm_clock_pairing {
 #define KVM_ASYNC_PF_SEND_ALWAYS		(1 << 1)
 #define KVM_ASYNC_PF_DELIVERY_AS_PF_VMEXIT	(1 << 2)
 #define KVM_ASYNC_PF_DELIVERY_AS_INT		(1 << 3)
+#define KVM_ASYNC_PF_SEND_ERROR			(1 << 4)
 
 /* MSR_KVM_ASYNC_PF_INT */
 #define KVM_ASYNC_PF_VEC_MASK			GENMASK(7, 0)
@@ -119,14 +121,18 @@ struct kvm_mmu_op_release_pt {
 #define KVM_PV_REASON_PAGE_NOT_PRESENT 1
 #define KVM_PV_REASON_PAGE_READY 2
 
+
+/* Bit flags for ready_flags field */
+#define KVM_PV_REASON_PAGE_ERROR 1
+
 struct kvm_vcpu_pv_apf_data {
 	/* Used for 'page not present' events delivered via #PF */
 	__u32 flags;
 
 	/* Used for 'page ready' events delivered via interrupt notification */
 	__u32 token;
-
-	__u8 pad[56];
+	__u32 ready_flags;
+	__u8 pad[52];
 	__u32 enabled;
 };
 
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index ff95429d206b..2ee9c9d971ae 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -75,6 +75,7 @@ struct kvm_task_sleep_node {
 	struct swait_queue_head wq;
 	u32 token;
 	int cpu;
+	bool is_err;
 };
 
 static struct kvm_task_sleep_head {
@@ -97,7 +98,14 @@ static struct kvm_task_sleep_node *_find_apf_task(struct kvm_task_sleep_head *b,
 	return NULL;
 }
 
-static bool kvm_async_pf_queue_task(u32 token, struct kvm_task_sleep_node *n)
+static void handle_async_pf_error(bool user_mode)
+{
+	if (user_mode)
+		send_sig_info(SIGBUS, SEND_SIG_PRIV, current);
+}
+
+static bool kvm_async_pf_queue_task(u32 token, struct kvm_task_sleep_node *n,
+				    bool user_mode)
 {
 	u32 key = hash_32(token, KVM_TASK_SLEEP_HASHBITS);
 	struct kvm_task_sleep_head *b = &async_pf_sleepers[key];
@@ -107,6 +115,8 @@ static bool kvm_async_pf_queue_task(u32 token, struct kvm_task_sleep_node *n)
 	e = _find_apf_task(b, token);
 	if (e) {
 		/* dummy entry exist -> wake up was delivered ahead of PF */
+		if (e->is_err)
+			handle_async_pf_error(user_mode);
 		hlist_del(&e->link);
 		raw_spin_unlock(&b->lock);
 		kfree(e);
@@ -128,14 +138,14 @@ static bool kvm_async_pf_queue_task(u32 token, struct kvm_task_sleep_node *n)
  * Invoked from the async pagefault handling code or from the VM exit page
  * fault handler. In both cases RCU is watching.
  */
-void kvm_async_pf_task_wait_schedule(u32 token)
+void kvm_async_pf_task_wait_schedule(u32 token, bool user_mode)
 {
 	struct kvm_task_sleep_node n;
 	DECLARE_SWAITQUEUE(wait);
 
 	lockdep_assert_irqs_disabled();
 
-	if (!kvm_async_pf_queue_task(token, &n))
+	if (!kvm_async_pf_queue_task(token, &n, user_mode))
 		return;
 
 	for (;;) {
@@ -148,6 +158,8 @@ void kvm_async_pf_task_wait_schedule(u32 token)
 		local_irq_disable();
 	}
 	finish_swait(&n.wq, &wait);
+	if (n.is_err)
+		handle_async_pf_error(user_mode);
 }
 EXPORT_SYMBOL_GPL(kvm_async_pf_task_wait_schedule);
 
@@ -177,7 +189,7 @@ static void apf_task_wake_all(void)
 	}
 }
 
-void kvm_async_pf_task_wake(u32 token)
+void kvm_async_pf_task_wake(u32 token, bool is_err)
 {
 	u32 key = hash_32(token, KVM_TASK_SLEEP_HASHBITS);
 	struct kvm_task_sleep_head *b = &async_pf_sleepers[key];
@@ -208,9 +220,11 @@ void kvm_async_pf_task_wake(u32 token)
 		}
 		n->token = token;
 		n->cpu = smp_processor_id();
+		n->is_err = is_err;
 		init_swait_queue_head(&n->wq);
 		hlist_add_head(&n->link, &b->list);
 	} else {
+		n->is_err = is_err;
 		apf_task_wake_one(n);
 	}
 	raw_spin_unlock(&b->lock);
@@ -256,14 +270,15 @@ bool __kvm_handle_async_pf(struct pt_regs *regs, u32 token)
 		panic("Host injected async #PF in kernel mode\n");
 
 	/* Page is swapped out by the host. */
-	kvm_async_pf_task_wait_schedule(token);
+	kvm_async_pf_task_wait_schedule(token, user_mode(regs));
 	return true;
 }
 NOKPROBE_SYMBOL(__kvm_handle_async_pf);
 
 __visible void __irq_entry kvm_async_pf_intr(struct pt_regs *regs)
 {
-	u32 token;
+	u32 token, ready_flags;
+	bool is_err;
 
 	entering_ack_irq();
 
@@ -271,7 +286,9 @@ __visible void __irq_entry kvm_async_pf_intr(struct pt_regs *regs)
 
 	if (__this_cpu_read(apf_reason.enabled)) {
 		token = __this_cpu_read(apf_reason.token);
-		kvm_async_pf_task_wake(token);
+		ready_flags = __this_cpu_read(apf_reason.ready_flags);
+		is_err = ready_flags & KVM_PV_REASON_PAGE_ERROR;
+		kvm_async_pf_task_wake(token, is_err);
 		__this_cpu_write(apf_reason.token, 0);
 		wrmsrl(MSR_KVM_ASYNC_PF_ACK, 1);
 	}
@@ -335,6 +352,9 @@ static void kvm_guest_cpu_init(void)
 
 		wrmsrl(MSR_KVM_ASYNC_PF_INT, KVM_ASYNC_PF_VECTOR);
 
+		if (kvm_para_has_feature(KVM_FEATURE_ASYNC_PF_ERROR))
+			pa |= KVM_ASYNC_PF_SEND_ERROR;
+
 		wrmsrl(MSR_KVM_ASYNC_PF_EN, pa);
 		__this_cpu_write(apf_reason.enabled, 1);
 		pr_info("KVM setup async PF for cpu %d\n", smp_processor_id());
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index 253b8e875ccd..f811f36e0c24 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -712,7 +712,8 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function)
 			     (1 << KVM_FEATURE_PV_SEND_IPI) |
 			     (1 << KVM_FEATURE_POLL_CONTROL) |
 			     (1 << KVM_FEATURE_PV_SCHED_YIELD) |
-			     (1 << KVM_FEATURE_ASYNC_PF_INT);
+			     (1 << KVM_FEATURE_ASYNC_PF_INT) |
+			     (1 << KVM_FEATURE_ASYNC_PF_ERROR);
 
 		if (sched_info_on())
 			entry->eax |= (1 << KVM_FEATURE_STEAL_TIME);
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index e207900ab303..634182bb07c7 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4175,7 +4175,7 @@ int kvm_handle_page_fault(struct kvm_vcpu *vcpu, u64 error_code,
 	} else if (flags & KVM_PV_REASON_PAGE_NOT_PRESENT) {
 		vcpu->arch.apf.host_apf_flags = 0;
 		local_irq_disable();
-		kvm_async_pf_task_wait_schedule(fault_address);
+		kvm_async_pf_task_wait_schedule(fault_address, 0);
 		local_irq_enable();
 	} else {
 		WARN_ONCE(1, "Unexpected host async PF flags: %x\n", flags);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 4c5205434b9c..c3b2097f2eaa 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2690,8 +2690,8 @@ static int kvm_pv_enable_async_pf(struct kvm_vcpu *vcpu, u64 data)
 {
 	gpa_t gpa = data & ~0x3f;
 
-	/* Bits 4:5 are reserved, Should be zero */
-	if (data & 0x30)
+	/* Bits 5 is reserved, Should be zero */
+	if (data & 0x20)
 		return 1;
 
 	vcpu->arch.apf.msr_en_val = data;
@@ -2703,11 +2703,12 @@ static int kvm_pv_enable_async_pf(struct kvm_vcpu *vcpu, u64 data)
 	}
 
 	if (kvm_gfn_to_hva_cache_init(vcpu->kvm, &vcpu->arch.apf.data, gpa,
-					sizeof(u64)))
+					sizeof(u64) + sizeof(u32)))
 		return 1;
 
 	vcpu->arch.apf.send_user_only = !(data & KVM_ASYNC_PF_SEND_ALWAYS);
 	vcpu->arch.apf.delivery_as_pf_vmexit = data & KVM_ASYNC_PF_DELIVERY_AS_PF_VMEXIT;
+	vcpu->arch.apf.send_pf_error = data & KVM_ASYNC_PF_SEND_ERROR;
 
 	kvm_async_pf_wakeup_all(vcpu);
 
@@ -10481,12 +10482,25 @@ static inline int apf_put_user_notpresent(struct kvm_vcpu *vcpu)
 				      sizeof(reason));
 }
 
-static inline int apf_put_user_ready(struct kvm_vcpu *vcpu, u32 token)
+static inline int apf_put_user_ready(struct kvm_vcpu *vcpu, u32 token,
+				     bool is_err)
 {
 	unsigned int offset = offsetof(struct kvm_vcpu_pv_apf_data, token);
+	int ret;
+	u32 ready_flags = 0;
+
+	if (is_err && vcpu->arch.apf.send_pf_error)
+		ready_flags = KVM_PV_REASON_PAGE_ERROR;
+
+	ret = kvm_write_guest_offset_cached(vcpu->kvm, &vcpu->arch.apf.data,
+					    &token, offset, sizeof(token));
+	if (ret < 0)
+		return ret;
 
+	offset = offsetof(struct kvm_vcpu_pv_apf_data, ready_flags);
 	return kvm_write_guest_offset_cached(vcpu->kvm, &vcpu->arch.apf.data,
-					     &token, offset, sizeof(token));
+					    &ready_flags, offset,
+					    sizeof(ready_flags));
 }
 
 static inline bool apf_pageready_slot_free(struct kvm_vcpu *vcpu)
@@ -10571,6 +10585,8 @@ bool kvm_arch_async_page_not_present(struct kvm_vcpu *vcpu,
 void kvm_arch_async_page_present(struct kvm_vcpu *vcpu,
 				 struct kvm_async_pf *work)
 {
+	bool present_injected = false;
+
 	struct kvm_lapic_irq irq = {
 		.delivery_mode = APIC_DM_FIXED,
 		.vector = vcpu->arch.apf.vec
@@ -10584,16 +10600,18 @@ void kvm_arch_async_page_present(struct kvm_vcpu *vcpu,
 
 	if ((work->wakeup_all || work->notpresent_injected) &&
 	    kvm_pv_async_pf_enabled(vcpu) &&
-	    !apf_put_user_ready(vcpu, work->arch.token)) {
+	    !apf_put_user_ready(vcpu, work->arch.token, work->error_code)) {
 		vcpu->arch.apf.pageready_pending = true;
 		kvm_apic_set_irq(vcpu, &irq, NULL);
+		present_injected = true;
 	}
 
-	if (work->error_code) {
+	if (work->error_code &&
+	    (!present_injected || !vcpu->arch.apf.send_pf_error)) {
 		/*
-		 * Page fault failed and we don't have a way to report
-		 * errors back to guest yet. So save this pfn and force
-		 * sync page fault next time.
+		 * Page fault failed. If we did not report error back
+		 * to guest, save this pfn and force sync page fault next
+		 * time.
 		 */
 		vcpu->arch.apf.error_gfn = work->cr2_or_gpa >> PAGE_SHIFT;
 	}
-- 
2.25.4



* [PATCH 3/3] kvm, async_pf: Use FOLL_WRITE only for write faults
  2020-06-16 21:48 ` [Virtio-fs] [RFC PATCH 0/3] kvm, x86: " Vivek Goyal
@ 2020-06-16 21:48   ` Vivek Goyal
  -1 siblings, 0 replies; 26+ messages in thread
From: Vivek Goyal @ 2020-06-16 21:48 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: virtio-fs, miklos, stefanha, dgilbert, vgoyal, vkuznets,
	pbonzini, wanpengli, sean.j.christopherson

async_pf_execute() calls get_user_pages_remote() and always uses
FOLL_WRITE, regardless of whether the fault happened due to a read or a
write.

This creates issues if the vma is mapped for reading and does not have
VM_WRITE set: check_vma_flags() fails in __get_user_pages() and
get_user_pages_remote() returns -EFAULT.

So far this was not an issue, as we did not care about the return code
from get_user_pages_remote(). But soon we want to look at this error
code, and if the file got truncated and the page can't be faulted in,
we want to inject an error into the guest and let the guest take
appropriate action (either send SIGBUS to the guest process, do
exception table handling, or possibly die).

Hence, we don't want to get -EFAULT erroneously. Pass FOLL_WRITE only
if it is a write fault.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 arch/x86/kvm/mmu/mmu.c   | 7 ++++---
 include/linux/kvm_host.h | 4 +++-
 virt/kvm/async_pf.c      | 9 +++++++--
 3 files changed, 14 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 634182bb07c7..15969c4c8925 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4046,7 +4046,7 @@ static void shadow_page_table_clear_flood(struct kvm_vcpu *vcpu, gva_t addr)
 }
 
 static int kvm_arch_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
-				   gfn_t gfn)
+				   gfn_t gfn, bool write)
 {
 	struct kvm_arch_async_pf arch;
 
@@ -4056,7 +4056,8 @@ static int kvm_arch_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
 	arch.cr3 = vcpu->arch.mmu->get_guest_pgd(vcpu);
 
 	return kvm_setup_async_pf(vcpu, cr2_or_gpa,
-				  kvm_vcpu_gfn_to_hva(vcpu, gfn), &arch);
+				  kvm_vcpu_gfn_to_hva(vcpu, gfn), &arch,
+				  write);
 }
 
 static bool try_async_pf(struct kvm_vcpu *vcpu, bool prefault, gfn_t gfn,
@@ -4084,7 +4085,7 @@ static bool try_async_pf(struct kvm_vcpu *vcpu, bool prefault, gfn_t gfn,
 			trace_kvm_async_pf_doublefault(cr2_or_gpa, gfn);
 			kvm_make_request(KVM_REQ_APF_HALT, vcpu);
 			return true;
-		} else if (kvm_arch_setup_async_pf(vcpu, cr2_or_gpa, gfn))
+		} else if (kvm_arch_setup_async_pf(vcpu, cr2_or_gpa, gfn, write))
 			return true;
 	}
 
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index b8558334b184..a7c3999a7374 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -208,12 +208,14 @@ struct kvm_async_pf {
 	bool   wakeup_all;
 	bool notpresent_injected;
 	int error_code;
+	bool write;
 };
 
 void kvm_clear_async_pf_completion_queue(struct kvm_vcpu *vcpu);
 void kvm_check_async_pf_completion(struct kvm_vcpu *vcpu);
 int kvm_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
-		       unsigned long hva, struct kvm_arch_async_pf *arch);
+		       unsigned long hva, struct kvm_arch_async_pf *arch,
+		       bool write);
 int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu);
 #endif
 
diff --git a/virt/kvm/async_pf.c b/virt/kvm/async_pf.c
index 6b30374a4de1..1d2dc267f9b8 100644
--- a/virt/kvm/async_pf.c
+++ b/virt/kvm/async_pf.c
@@ -53,6 +53,7 @@ static void async_pf_execute(struct work_struct *work)
 	int locked = 1;
 	bool first;
 	long ret;
+	unsigned int gup_flags = 0;
 
 	might_sleep();
 
@@ -61,8 +62,10 @@ static void async_pf_execute(struct work_struct *work)
 	 * mm and might be done in another context, so we must
 	 * access remotely.
 	 */
+	if (apf->write)
+		gup_flags = FOLL_WRITE;
 	down_read(&mm->mmap_sem);
-	ret = get_user_pages_remote(NULL, mm, addr, 1, FOLL_WRITE, NULL, NULL,
+	ret = get_user_pages_remote(NULL, mm, addr, 1, gup_flags, NULL, NULL,
 				    &locked);
 	if (locked)
 		up_read(&mm->mmap_sem);
@@ -161,7 +164,8 @@ void kvm_check_async_pf_completion(struct kvm_vcpu *vcpu)
 }
 
 int kvm_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
-		       unsigned long hva, struct kvm_arch_async_pf *arch)
+		       unsigned long hva, struct kvm_arch_async_pf *arch,
+		       bool write)
 {
 	struct kvm_async_pf *work;
 
@@ -186,6 +190,7 @@ int kvm_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
 	work->addr = hva;
 	work->arch = *arch;
 	work->mm = current->mm;
+	work->write = write;
 	mmget(work->mm);
 	kvm_get_kvm(work->vcpu->kvm);
 
-- 
2.25.4


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH 1/3] kvm,x86: Force sync fault if previous attempts failed
  2020-06-16 21:48   ` [Virtio-fs] [PATCH 1/3] kvm, x86: " Vivek Goyal
@ 2020-06-17 13:02     ` Vitaly Kuznetsov
  -1 siblings, 0 replies; 26+ messages in thread
From: Vitaly Kuznetsov @ 2020-06-17 13:02 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: virtio-fs, miklos, stefanha, dgilbert, pbonzini, wanpengli,
	sean.j.christopherson, kvm, linux-kernel

Vivek Goyal <vgoyal@redhat.com> writes:

> Page fault error handling behavior in kvm seems little inconsistent when
> page fault reports error. If we are doing fault synchronously
> then we capture error (-EFAULT) returned by __gfn_to_pfn_memslot() and
> exit to user space and qemu reports error, "error: kvm run failed Bad address".
>
> But if we are doing async page fault, then async_pf_execute() will simply
> ignore the error reported by get_user_pages_remote(). It is assumed that
> page fault was successful and either a page ready event is injected in
> guest or guest is brought out of artificial halt state and run again.
> In both the cases when guest retries the instruction, it takes exit
> again as page fault was not successful in previous attempt. And then
> this infinite loop continues forever.
>

Indeed, we always assume get_user_pages_remote() will fetch the page we
need. 

> This patch tries to make this behavior consistent. That is instead of
> getting into infinite loop of retrying page fault, exit to user space
> and stop VM if page fault error happens. This can be improved by
> injecting errors in guest when it is allowed. Later patches can
> inject error when a process in guest triggered page fault and
> in that case guest process will receive SIGBUS. Currently we don't
> have a way to inject errors when guest is in kernel mode. Once we
> have race free method to do so, we should be able to inject errors
> and guest can do fixup_exception() if caller set it up so (virtio-fs).
>
> When async pf encounters error then save that pfn and when next time
> guest retries, do a sync fault instead of async fault. So that if error
> is encountered, we exit to qemu and avoid infinite loop.
>
> As of now only one error pfn is stored and that means it could be
> overwritten before next a retry from guest happens. But this is
> just a hint and if we miss it, some other time we will catch it.
> If this becomes an issue, we could maintain an array of error
> gfn later to help ease the issue.

With a single GFN stored we can probably get into the same infinite loop
you describe when several processes try to access different unavailable
GFNs. Also, is the condition "GFN is unavailable" temporary or
permanent? I guess it's the former, but 'error_gfn' is set permanently
(assuming no other GFN will overwrite it).
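
Just to illustrate the "array of error gfns" alternative Vivek mentions
above (a rough sketch only; the names are made up for illustration, and
it assumes a zero-initialized structure and the usual KVM gfn_t type):

	#define APF_ERROR_GFN_SLOTS	8

	struct apf_error_gfns {
		gfn_t gfn[APF_ERROR_GFN_SLOTS];
		unsigned int nr;	/* valid entries so far */
		unsigned int next;	/* slot to overwrite next */
	};

	static void apf_record_error_gfn(struct apf_error_gfns *e, gfn_t gfn)
	{
		e->gfn[e->next] = gfn;
		e->next = (e->next + 1) % APF_ERROR_GFN_SLOTS;
		if (e->nr < APF_ERROR_GFN_SLOTS)
			e->nr++;
	}

	static bool apf_is_error_gfn(struct apf_error_gfns *e, gfn_t gfn)
	{
		unsigned int i;

		for (i = 0; i < e->nr; i++)
			if (e->gfn[i] == gfn)
				return true;
		return false;
	}

That bounds the memory cost while making it much harder for several
faulting processes to keep evicting each other's entry.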

What if we do the following instead: add one more field to APF data
indicating an error. The guest will kill the corresponding process
instead of resuming it then.

Also, with KVM_ASYNC_PF_SEND_ALWAYS being deprecated and assuming we
switch to #VE, under which circumstances will we be unable to handle a
page fault asynchronously? Apart from overflowing the async PF queue,
are there any other scenarios where it still makes sense to keep the
guest alive?

>
> Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
> ---
>  arch/x86/include/asm/kvm_host.h |  1 +
>  arch/x86/kvm/mmu.h              |  2 +-
>  arch/x86/kvm/mmu/mmu.c          |  2 +-
>  arch/x86/kvm/x86.c              | 19 +++++++++++++++++--
>  include/linux/kvm_host.h        |  1 +
>  virt/kvm/async_pf.c             |  8 ++++++--
>  6 files changed, 27 insertions(+), 6 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 938497a6ebd7..348a73106556 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -778,6 +778,7 @@ struct kvm_vcpu_arch {
>  		unsigned long nested_apf_token;
>  		bool delivery_as_pf_vmexit;
>  		bool pageready_pending;
> +		gfn_t error_gfn;
>  	} apf;
>  
>  	/* OSVW MSRs (AMD only) */
> diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> index 0ad06bfe2c2c..6fa085ff07a3 100644
> --- a/arch/x86/kvm/mmu.h
> +++ b/arch/x86/kvm/mmu.h
> @@ -60,7 +60,7 @@ void kvm_init_mmu(struct kvm_vcpu *vcpu, bool reset_roots);
>  void kvm_init_shadow_mmu(struct kvm_vcpu *vcpu, u32 cr0, u32 cr4, u32 efer);
>  void kvm_init_shadow_ept_mmu(struct kvm_vcpu *vcpu, bool execonly,
>  			     bool accessed_dirty, gpa_t new_eptp);
> -bool kvm_can_do_async_pf(struct kvm_vcpu *vcpu);
> +bool kvm_can_do_async_pf(struct kvm_vcpu *vcpu, gfn_t gfn);
>  int kvm_handle_page_fault(struct kvm_vcpu *vcpu, u64 error_code,
>  				u64 fault_address, char *insn, int insn_len);
>  
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 15984dfde427..e207900ab303 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -4078,7 +4078,7 @@ static bool try_async_pf(struct kvm_vcpu *vcpu, bool prefault, gfn_t gfn,
>  	if (!async)
>  		return false; /* *pfn has correct page already */
>  
> -	if (!prefault && kvm_can_do_async_pf(vcpu)) {
> +	if (!prefault && kvm_can_do_async_pf(vcpu, cr2_or_gpa >> PAGE_SHIFT)) {
>  		trace_kvm_try_async_get_page(cr2_or_gpa, gfn);
>  		if (kvm_find_async_pf_gfn(vcpu, gfn)) {
>  			trace_kvm_async_pf_doublefault(cr2_or_gpa, gfn);
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 697d1b273a2f..4c5205434b9c 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -10513,7 +10513,7 @@ static bool kvm_can_deliver_async_pf(struct kvm_vcpu *vcpu)
>  	return true;
>  }
>  
> -bool kvm_can_do_async_pf(struct kvm_vcpu *vcpu)
> +bool kvm_can_do_async_pf(struct kvm_vcpu *vcpu, gfn_t gfn)
>  {
>  	if (unlikely(!lapic_in_kernel(vcpu) ||
>  		     kvm_event_needs_reinjection(vcpu) ||
> @@ -10527,7 +10527,13 @@ bool kvm_can_do_async_pf(struct kvm_vcpu *vcpu)
>  	 * If interrupts are off we cannot even use an artificial
>  	 * halt state.
>  	 */
> -	return kvm_arch_interrupt_allowed(vcpu);
> +	if (!kvm_arch_interrupt_allowed(vcpu))
> +		return false;
> +
> +	if (vcpu->arch.apf.error_gfn == gfn)
> +		return false;
> +
> +	return true;
>  }
>  
>  bool kvm_arch_async_page_not_present(struct kvm_vcpu *vcpu,
> @@ -10583,6 +10589,15 @@ void kvm_arch_async_page_present(struct kvm_vcpu *vcpu,
>  		kvm_apic_set_irq(vcpu, &irq, NULL);
>  	}
>  
> +	if (work->error_code) {
> +		/*
> +		 * Page fault failed and we don't have a way to report
> +		 * errors back to guest yet. So save this pfn and force
> +		 * sync page fault next time.
> +		 */
> +		vcpu->arch.apf.error_gfn = work->cr2_or_gpa >> PAGE_SHIFT;
> +	}
> +
>  	vcpu->arch.apf.halted = false;
>  	vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;
>  }
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 216cdb6581e2..b8558334b184 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -207,6 +207,7 @@ struct kvm_async_pf {
>  	struct kvm_arch_async_pf arch;
>  	bool   wakeup_all;
>  	bool notpresent_injected;
> +	int error_code;
>  };
>  
>  void kvm_clear_async_pf_completion_queue(struct kvm_vcpu *vcpu);
> diff --git a/virt/kvm/async_pf.c b/virt/kvm/async_pf.c
> index a36828fbf40a..6b30374a4de1 100644
> --- a/virt/kvm/async_pf.c
> +++ b/virt/kvm/async_pf.c
> @@ -52,6 +52,7 @@ static void async_pf_execute(struct work_struct *work)
>  	gpa_t cr2_or_gpa = apf->cr2_or_gpa;
>  	int locked = 1;
>  	bool first;
> +	long ret;
>  
>  	might_sleep();
>  
> @@ -61,11 +62,14 @@ static void async_pf_execute(struct work_struct *work)
>  	 * access remotely.
>  	 */
>  	down_read(&mm->mmap_sem);
> -	get_user_pages_remote(NULL, mm, addr, 1, FOLL_WRITE, NULL, NULL,
> -			&locked);
> +	ret = get_user_pages_remote(NULL, mm, addr, 1, FOLL_WRITE, NULL, NULL,
> +				    &locked);
>  	if (locked)
>  		up_read(&mm->mmap_sem);
>  
> +	if (ret < 0)
> +		apf->error_code = ret;
> +
>  	if (IS_ENABLED(CONFIG_KVM_ASYNC_PF_SYNC))
>  		kvm_arch_async_page_present(vcpu, apf);

-- 
Vitaly


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 2/3] kvm: Add capability to be able to report async pf error to guest
  2020-06-16 21:48   ` [Virtio-fs] " Vivek Goyal
@ 2020-06-17 13:12     ` Vitaly Kuznetsov
  -1 siblings, 0 replies; 26+ messages in thread
From: Vitaly Kuznetsov @ 2020-06-17 13:12 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: virtio-fs, miklos, stefanha, dgilbert, pbonzini, wanpengli,
	sean.j.christopherson, kvm, linux-kernel

Vivek Goyal <vgoyal@redhat.com> writes:

> As of now asynchronous page fault mecahanism assumes host will always be
> successful in resolving page fault. So there are only two states, that
> is page is not present and page is ready.
>
> If a page is backed by a file and that file has been truncated (as
> can be the case with virtio-fs), then page fault handler on host returns
> -EFAULT.
>
> As of now async page fault logic does not look at error code (-EFAULT)
> returned by get_user_pages_remote() and returns PAGE_READY to guest.
> Guest tries to access page and page fault happnes again. And this
> gets kvm into an infinite loop. (Killing host process gets kvm out of
> this loop though).
>
> This patch adds another state to async page fault logic which allows
> host to return error to guest. Once guest knows that async page fault
> can't be resolved, it can send SIGBUS to host process (if user space
> was accessing the page in question).
>
> Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
> ---
>  Documentation/virt/kvm/cpuid.rst     |  4 +++
>  Documentation/virt/kvm/msr.rst       | 10 +++++---
>  arch/x86/include/asm/kvm_host.h      |  3 +++
>  arch/x86/include/asm/kvm_para.h      |  8 +++---
>  arch/x86/include/uapi/asm/kvm_para.h | 10 ++++++--
>  arch/x86/kernel/kvm.c                | 34 ++++++++++++++++++++-----
>  arch/x86/kvm/cpuid.c                 |  3 ++-
>  arch/x86/kvm/mmu/mmu.c               |  2 +-
>  arch/x86/kvm/x86.c                   | 38 ++++++++++++++++++++--------
>  9 files changed, 83 insertions(+), 29 deletions(-)
>
> diff --git a/Documentation/virt/kvm/cpuid.rst b/Documentation/virt/kvm/cpuid.rst
> index a7dff9186bed..a4497f3a6e7f 100644
> --- a/Documentation/virt/kvm/cpuid.rst
> +++ b/Documentation/virt/kvm/cpuid.rst
> @@ -92,6 +92,10 @@ KVM_FEATURE_ASYNC_PF_INT          14          guest checks this feature bit
>                                                async pf acknowledgment msr
>                                                0x4b564d07.
>  
> +KVM_FEATURE_ASYNC_PF_ERROR        15          paravirtualized async PF error
> +                                              can be enabled by setting bit 4
> +                                              when writing to msr 0x4b564d02
> +
>  KVM_FEATURE_CLOCSOURCE_STABLE_BIT 24          host will warn if no guest-side
>                                                per-cpu warps are expeced in
>                                                kvmclock
> diff --git a/Documentation/virt/kvm/msr.rst b/Documentation/virt/kvm/msr.rst
> index e37a14c323d2..513eb8e0b0eb 100644
> --- a/Documentation/virt/kvm/msr.rst
> +++ b/Documentation/virt/kvm/msr.rst
> @@ -202,19 +202,21 @@ data:
>  
>  		/* Used for 'page ready' events delivered via interrupt notification */
>  		__u32 token;
> -
> -		__u8 pad[56];
> +                __u32 ready_flags;
> +		__u8 pad[52];
>  		__u32 enabled;
>  	  };
>  
> -	Bits 5-4 of the MSR are reserved and should be zero. Bit 0 is set to 1
> +	Bits 5 of the MSR is reserved and should be zero. Bit 0 is set to 1
>  	when asynchronous page faults are enabled on the vcpu, 0 when disabled.
>  	Bit 1 is 1 if asynchronous page faults can be injected when vcpu is in
>  	cpl == 0. Bit 2 is 1 if asynchronous page faults are delivered to L1 as
>  	#PF vmexits.  Bit 2 can be set only if KVM_FEATURE_ASYNC_PF_VMEXIT is
>  	present in CPUID. Bit 3 enables interrupt based delivery of 'page ready'
>  	events. Bit 3 can only be set if KVM_FEATURE_ASYNC_PF_INT is present in
> -	CPUID.
> +	CPUID. Bit 4 enables reporting of page fault errors if hypervisor
> +        encounters errors while faulting in page. Bit 4 can only be set if
> +        KVM_FEATURE_ASYNC_PF_ERROR is present in CPUID.

Would it actually make sense to deliver the exact error from
get_user_pages() and not a binary failure/no-failure? Is the guest
interested in how exactly we failed?
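
If we did want that, one purely hypothetical encoding (not what this
patch implements) would be to carry the negative errno in the upper
bits of ready_flags, next to the error bit this patch defines:

	/* hypothetical additions, not part of this series */
	#define KVM_PV_REASON_ERRNO_SHIFT	8
	#define KVM_PV_REASON_ERRNO_MASK	(0xffu << KVM_PV_REASON_ERRNO_SHIFT)

	static u32 apf_encode_ready_flags(int error_code)
	{
		u32 ready_flags = 0;

		if (error_code)
			ready_flags = KVM_PV_REASON_PAGE_ERROR |
				      (((u32)-error_code & 0xff) <<
				       KVM_PV_REASON_ERRNO_SHIFT);
		return ready_flags;
	}

The guest would then recover the errno as
-(int)((ready_flags & KVM_PV_REASON_ERRNO_MASK) >> KVM_PV_REASON_ERRNO_SHIFT).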

>  
>  	'Page not present' events are currently always delivered as synthetic
>  	#PF exception. During delivery of these events APF CR2 register contains
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 348a73106556..ff8dbc604dbe 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -779,6 +779,7 @@ struct kvm_vcpu_arch {
>  		bool delivery_as_pf_vmexit;
>  		bool pageready_pending;
>  		gfn_t error_gfn;
> +		bool send_pf_error;
>  	} apf;
>  
>  	/* OSVW MSRs (AMD only) */
> @@ -1680,6 +1681,8 @@ void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu,
>  			       struct kvm_async_pf *work);
>  void kvm_arch_async_page_present_queued(struct kvm_vcpu *vcpu);
>  bool kvm_arch_can_dequeue_async_page_present(struct kvm_vcpu *vcpu);
> +void kvm_arch_async_page_fault_error(struct kvm_vcpu *vcpu,
> +				     struct kvm_async_pf *work);
>  extern bool kvm_find_async_pf_gfn(struct kvm_vcpu *vcpu, gfn_t gfn);
>  
>  int kvm_skip_emulated_instruction(struct kvm_vcpu *vcpu);
> diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
> index bbc43e5411d9..6c738e91ca2c 100644
> --- a/arch/x86/include/asm/kvm_para.h
> +++ b/arch/x86/include/asm/kvm_para.h
> @@ -89,8 +89,8 @@ static inline long kvm_hypercall4(unsigned int nr, unsigned long p1,
>  bool kvm_para_available(void);
>  unsigned int kvm_arch_para_features(void);
>  unsigned int kvm_arch_para_hints(void);
> -void kvm_async_pf_task_wait_schedule(u32 token);
> -void kvm_async_pf_task_wake(u32 token);
> +void kvm_async_pf_task_wait_schedule(u32 token, bool user_mode);
> +void kvm_async_pf_task_wake(u32 token, bool is_err);
>  u32 kvm_read_and_reset_apf_flags(void);
>  void kvm_disable_steal_time(void);
>  bool __kvm_handle_async_pf(struct pt_regs *regs, u32 token);
> @@ -120,8 +120,8 @@ static inline void kvm_spinlock_init(void)
>  #endif /* CONFIG_PARAVIRT_SPINLOCKS */
>  
>  #else /* CONFIG_KVM_GUEST */
> -#define kvm_async_pf_task_wait_schedule(T) do {} while(0)
> -#define kvm_async_pf_task_wake(T) do {} while(0)
> +#define kvm_async_pf_task_wait_schedule(T, U) do {} while(0)
> +#define kvm_async_pf_task_wake(T, I) do {} while(0)
>  
>  static inline bool kvm_para_available(void)
>  {
> diff --git a/arch/x86/include/uapi/asm/kvm_para.h b/arch/x86/include/uapi/asm/kvm_para.h
> index 812e9b4c1114..b43fd2ddc949 100644
> --- a/arch/x86/include/uapi/asm/kvm_para.h
> +++ b/arch/x86/include/uapi/asm/kvm_para.h
> @@ -32,6 +32,7 @@
>  #define KVM_FEATURE_POLL_CONTROL	12
>  #define KVM_FEATURE_PV_SCHED_YIELD	13
>  #define KVM_FEATURE_ASYNC_PF_INT	14
> +#define KVM_FEATURE_ASYNC_PF_ERROR	15
>  
>  #define KVM_HINTS_REALTIME      0
>  
> @@ -85,6 +86,7 @@ struct kvm_clock_pairing {
>  #define KVM_ASYNC_PF_SEND_ALWAYS		(1 << 1)
>  #define KVM_ASYNC_PF_DELIVERY_AS_PF_VMEXIT	(1 << 2)
>  #define KVM_ASYNC_PF_DELIVERY_AS_INT		(1 << 3)
> +#define KVM_ASYNC_PF_SEND_ERROR			(1 << 4)
>  
>  /* MSR_KVM_ASYNC_PF_INT */
>  #define KVM_ASYNC_PF_VEC_MASK			GENMASK(7, 0)
> @@ -119,14 +121,18 @@ struct kvm_mmu_op_release_pt {
>  #define KVM_PV_REASON_PAGE_NOT_PRESENT 1
>  #define KVM_PV_REASON_PAGE_READY 2
>  
> +
> +/* Bit flags for ready_flags field */
> +#define KVM_PV_REASON_PAGE_ERROR 1
> +
>  struct kvm_vcpu_pv_apf_data {
>  	/* Used for 'page not present' events delivered via #PF */
>  	__u32 flags;
>  
>  	/* Used for 'page ready' events delivered via interrupt notification */
>  	__u32 token;
> -
> -	__u8 pad[56];
> +	__u32 ready_flags;
> +	__u8 pad[52];
>  	__u32 enabled;
>  };
>  
> diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
> index ff95429d206b..2ee9c9d971ae 100644
> --- a/arch/x86/kernel/kvm.c
> +++ b/arch/x86/kernel/kvm.c
> @@ -75,6 +75,7 @@ struct kvm_task_sleep_node {
>  	struct swait_queue_head wq;
>  	u32 token;
>  	int cpu;
> +	bool is_err;
>  };
>  
>  static struct kvm_task_sleep_head {
> @@ -97,7 +98,14 @@ static struct kvm_task_sleep_node *_find_apf_task(struct kvm_task_sleep_head *b,
>  	return NULL;
>  }
>  
> -static bool kvm_async_pf_queue_task(u32 token, struct kvm_task_sleep_node *n)
> +static void handle_async_pf_error(bool user_mode)
> +{
> +	if (user_mode)
> +		send_sig_info(SIGBUS, SEND_SIG_PRIV, current);
> +}
> +
> +static bool kvm_async_pf_queue_task(u32 token, struct kvm_task_sleep_node *n,
> +				    bool user_mode)
>  {
>  	u32 key = hash_32(token, KVM_TASK_SLEEP_HASHBITS);
>  	struct kvm_task_sleep_head *b = &async_pf_sleepers[key];
> @@ -107,6 +115,8 @@ static bool kvm_async_pf_queue_task(u32 token, struct kvm_task_sleep_node *n)
>  	e = _find_apf_task(b, token);
>  	if (e) {
>  		/* dummy entry exist -> wake up was delivered ahead of PF */
> +		if (e->is_err)
> +			handle_async_pf_error(user_mode);
>  		hlist_del(&e->link);
>  		raw_spin_unlock(&b->lock);
>  		kfree(e);
> @@ -128,14 +138,14 @@ static bool kvm_async_pf_queue_task(u32 token, struct kvm_task_sleep_node *n)
>   * Invoked from the async pagefault handling code or from the VM exit page
>   * fault handler. In both cases RCU is watching.
>   */
> -void kvm_async_pf_task_wait_schedule(u32 token)
> +void kvm_async_pf_task_wait_schedule(u32 token, bool user_mode)
>  {
>  	struct kvm_task_sleep_node n;
>  	DECLARE_SWAITQUEUE(wait);
>  
>  	lockdep_assert_irqs_disabled();
>  
> -	if (!kvm_async_pf_queue_task(token, &n))
> +	if (!kvm_async_pf_queue_task(token, &n, user_mode))
>  		return;
>  
>  	for (;;) {
> @@ -148,6 +158,8 @@ void kvm_async_pf_task_wait_schedule(u32 token)
>  		local_irq_disable();
>  	}
>  	finish_swait(&n.wq, &wait);
> +	if (n.is_err)
> +		handle_async_pf_error(user_mode);
>  }
>  EXPORT_SYMBOL_GPL(kvm_async_pf_task_wait_schedule);
>  
> @@ -177,7 +189,7 @@ static void apf_task_wake_all(void)
>  	}
>  }
>  
> -void kvm_async_pf_task_wake(u32 token)
> +void kvm_async_pf_task_wake(u32 token, bool is_err)
>  {
>  	u32 key = hash_32(token, KVM_TASK_SLEEP_HASHBITS);
>  	struct kvm_task_sleep_head *b = &async_pf_sleepers[key];
> @@ -208,9 +220,11 @@ void kvm_async_pf_task_wake(u32 token)
>  		}
>  		n->token = token;
>  		n->cpu = smp_processor_id();
> +		n->is_err = is_err;
>  		init_swait_queue_head(&n->wq);
>  		hlist_add_head(&n->link, &b->list);
>  	} else {
> +		n->is_err = is_err;
>  		apf_task_wake_one(n);
>  	}
>  	raw_spin_unlock(&b->lock);
> @@ -256,14 +270,15 @@ bool __kvm_handle_async_pf(struct pt_regs *regs, u32 token)
>  		panic("Host injected async #PF in kernel mode\n");
>  
>  	/* Page is swapped out by the host. */
> -	kvm_async_pf_task_wait_schedule(token);
> +	kvm_async_pf_task_wait_schedule(token, user_mode(regs));
>  	return true;
>  }
>  NOKPROBE_SYMBOL(__kvm_handle_async_pf);
>  
>  __visible void __irq_entry kvm_async_pf_intr(struct pt_regs *regs)
>  {
> -	u32 token;
> +	u32 token, ready_flags;
> +	bool is_err;
>  
>  	entering_ack_irq();
>  
> @@ -271,7 +286,9 @@ __visible void __irq_entry kvm_async_pf_intr(struct pt_regs *regs)
>  
>  	if (__this_cpu_read(apf_reason.enabled)) {
>  		token = __this_cpu_read(apf_reason.token);
> -		kvm_async_pf_task_wake(token);
> +		ready_flags = __this_cpu_read(apf_reason.ready_flags);
> +		is_err = ready_flags & KVM_PV_REASON_PAGE_ERROR;
> +		kvm_async_pf_task_wake(token, is_err);
>  		__this_cpu_write(apf_reason.token, 0);
>  		wrmsrl(MSR_KVM_ASYNC_PF_ACK, 1);
>  	}
> @@ -335,6 +352,9 @@ static void kvm_guest_cpu_init(void)
>  
>  		wrmsrl(MSR_KVM_ASYNC_PF_INT, KVM_ASYNC_PF_VECTOR);
>  
> +		if (kvm_para_has_feature(KVM_FEATURE_ASYNC_PF_ERROR))
> +			pa |= KVM_ASYNC_PF_SEND_ERROR;
> +
>  		wrmsrl(MSR_KVM_ASYNC_PF_EN, pa);
>  		__this_cpu_write(apf_reason.enabled, 1);
>  		pr_info("KVM setup async PF for cpu %d\n", smp_processor_id());
> diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
> index 253b8e875ccd..f811f36e0c24 100644
> --- a/arch/x86/kvm/cpuid.c
> +++ b/arch/x86/kvm/cpuid.c
> @@ -712,7 +712,8 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function)
>  			     (1 << KVM_FEATURE_PV_SEND_IPI) |
>  			     (1 << KVM_FEATURE_POLL_CONTROL) |
>  			     (1 << KVM_FEATURE_PV_SCHED_YIELD) |
> -			     (1 << KVM_FEATURE_ASYNC_PF_INT);
> +			     (1 << KVM_FEATURE_ASYNC_PF_INT) |
> +			     (1 << KVM_FEATURE_ASYNC_PF_ERROR);
>  
>  		if (sched_info_on())
>  			entry->eax |= (1 << KVM_FEATURE_STEAL_TIME);
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index e207900ab303..634182bb07c7 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -4175,7 +4175,7 @@ int kvm_handle_page_fault(struct kvm_vcpu *vcpu, u64 error_code,
>  	} else if (flags & KVM_PV_REASON_PAGE_NOT_PRESENT) {
>  		vcpu->arch.apf.host_apf_flags = 0;
>  		local_irq_disable();
> -		kvm_async_pf_task_wait_schedule(fault_address);
> +		kvm_async_pf_task_wait_schedule(fault_address, 0);
>  		local_irq_enable();
>  	} else {
>  		WARN_ONCE(1, "Unexpected host async PF flags: %x\n", flags);
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 4c5205434b9c..c3b2097f2eaa 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -2690,8 +2690,8 @@ static int kvm_pv_enable_async_pf(struct kvm_vcpu *vcpu, u64 data)
>  {
>  	gpa_t gpa = data & ~0x3f;
>  
> -	/* Bits 4:5 are reserved, Should be zero */
> -	if (data & 0x30)
> +	/* Bits 5 is reserved, Should be zero */
> +	if (data & 0x20)
>  		return 1;
>  
>  	vcpu->arch.apf.msr_en_val = data;
> @@ -2703,11 +2703,12 @@ static int kvm_pv_enable_async_pf(struct kvm_vcpu *vcpu, u64 data)
>  	}
>  
>  	if (kvm_gfn_to_hva_cache_init(vcpu->kvm, &vcpu->arch.apf.data, gpa,
> -					sizeof(u64)))
> +					sizeof(u64) + sizeof(u32)))
>  		return 1;
>  
>  	vcpu->arch.apf.send_user_only = !(data & KVM_ASYNC_PF_SEND_ALWAYS);
>  	vcpu->arch.apf.delivery_as_pf_vmexit = data & KVM_ASYNC_PF_DELIVERY_AS_PF_VMEXIT;
> +	vcpu->arch.apf.send_pf_error = data & KVM_ASYNC_PF_SEND_ERROR;
>  
>  	kvm_async_pf_wakeup_all(vcpu);
>  
> @@ -10481,12 +10482,25 @@ static inline int apf_put_user_notpresent(struct kvm_vcpu *vcpu)
>  				      sizeof(reason));
>  }
>  
> -static inline int apf_put_user_ready(struct kvm_vcpu *vcpu, u32 token)
> +static inline int apf_put_user_ready(struct kvm_vcpu *vcpu, u32 token,
> +				     bool is_err)
>  {
>  	unsigned int offset = offsetof(struct kvm_vcpu_pv_apf_data, token);
> +	int ret;
> +	u32 ready_flags = 0;
> +
> +	if (is_err && vcpu->arch.apf.send_pf_error)
> +		ready_flags = KVM_PV_REASON_PAGE_ERROR;
> +
> +	ret = kvm_write_guest_offset_cached(vcpu->kvm, &vcpu->arch.apf.data,
> +					    &token, offset, sizeof(token));
> +	if (ret < 0)
> +		return ret;
>  
> +	offset = offsetof(struct kvm_vcpu_pv_apf_data, ready_flags);
>  	return kvm_write_guest_offset_cached(vcpu->kvm, &vcpu->arch.apf.data,
> -					     &token, offset, sizeof(token));
> +					    &ready_flags, offset,
> +					    sizeof(ready_flags));
>  }
>  
>  static inline bool apf_pageready_slot_free(struct kvm_vcpu *vcpu)
> @@ -10571,6 +10585,8 @@ bool kvm_arch_async_page_not_present(struct kvm_vcpu *vcpu,
>  void kvm_arch_async_page_present(struct kvm_vcpu *vcpu,
>  				 struct kvm_async_pf *work)
>  {
> +	bool present_injected = false;
> +
>  	struct kvm_lapic_irq irq = {
>  		.delivery_mode = APIC_DM_FIXED,
>  		.vector = vcpu->arch.apf.vec
> @@ -10584,16 +10600,18 @@ void kvm_arch_async_page_present(struct kvm_vcpu *vcpu,
>  
>  	if ((work->wakeup_all || work->notpresent_injected) &&
>  	    kvm_pv_async_pf_enabled(vcpu) &&
> -	    !apf_put_user_ready(vcpu, work->arch.token)) {
> +	    !apf_put_user_ready(vcpu, work->arch.token, work->error_code)) {
>  		vcpu->arch.apf.pageready_pending = true;
>  		kvm_apic_set_irq(vcpu, &irq, NULL);
> +		present_injected = true;
>  	}
>  
> -	if (work->error_code) {
> +	if (work->error_code &&
> +	    (!present_injected || !vcpu->arch.apf.send_pf_error)) {
>  		/*
> -		 * Page fault failed and we don't have a way to report
> -		 * errors back to guest yet. So save this pfn and force
> -		 * sync page fault next time.
> +		 * Page fault failed. If we did not report error back
> +		 * to guest, save this pfn and force sync page fault next
> +		 * time.
>  		 */
>  		vcpu->arch.apf.error_gfn = work->cr2_or_gpa >> PAGE_SHIFT;
>  	}

-- 
Vitaly


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 2/3] kvm: Add capability to be able to report async pf error to guest
  2020-06-17 13:12     ` [Virtio-fs] " Vitaly Kuznetsov
@ 2020-06-17 18:32       ` Sean Christopherson
  -1 siblings, 0 replies; 26+ messages in thread
From: Sean Christopherson @ 2020-06-17 18:32 UTC (permalink / raw)
  To: Vitaly Kuznetsov
  Cc: Vivek Goyal, virtio-fs, miklos, stefanha, dgilbert, pbonzini,
	wanpengli, kvm, linux-kernel

On Wed, Jun 17, 2020 at 03:12:03PM +0200, Vitaly Kuznetsov wrote:
> Vivek Goyal <vgoyal@redhat.com> writes:
> 
> > As of now asynchronous page fault mechanism assumes host will always be
> > successful in resolving page fault. So there are only two states, that
> > is page is not present and page is ready.
> >
> > If a page is backed by a file and that file has been truncated (as
> > can be the case with virtio-fs), then page fault handler on host returns
> > -EFAULT.
> >
> > As of now async page fault logic does not look at error code (-EFAULT)
> > returned by get_user_pages_remote() and returns PAGE_READY to guest.
> > Guest tries to access page and page fault happens again. And this
> > gets kvm into an infinite loop. (Killing host process gets kvm out of
> > this loop though).

Isn't this already fixed by patch 1/3 "kvm,x86: Force sync fault if previous
attempts failed"?  If it isn't, it should be, i.e. we should fix KVM before
adding what are effectively optimizations on top.   And, it's not clear that
the optimizations are necessary, e.g. I assume the virtio-fs truncation
scenario is relatively uncommon, i.e. not performance sensitive?

> >
> > This patch adds another state to async page fault logic which allows
> > host to return error to guest. Once guest knows that async page fault
> > can't be resolved, it can send SIGBUS to host process (if user space

I assume this is supposed to be "it can send SIGBUS to guest process"?
Otherwise none of this makes sense (to me).

> > was accessing the page in question).

Allowing the guest to opt-in to intercepting host page allocation failures
feels wrong, and fragile.  KVM can't possibly know whether an allocation
failure is something that should be forwarded to the guest, as KVM doesn't
know the physical backing for any given hva/gfn, e.g. the error could be
due to a physical device failure or a configuration issue.  Relying on the
async #PF mechanism to prevent allocation failures from crashing the guest
is fragile because there is no guarantee that a #PF can be async.

IMO, the virtio-fs truncation use case should first be addressed in a way
that requires explicit userspace intervention, e.g. by enhancing
kvm_handle_bad_page() to provide the necessary information to userspace so
that userspace can reflect select errors into the guest.  The reflection
could piggyback whatever vector is used by async page faults (#PF or #VE),
but would not be an async page fault per se.  If an async #PF happens to
encounter an allocation failure, it would naturally fall back to the
synchronous path (provided by patch 1/3) and the synchronous path would
automagically handle the error as above.

In other words, I think the guest should be able to enable "error handling"
support without first enabling async #PF.  From a functional perspective it
probably won't change a whole lot, but it would hopefully force us to
concoct an overall "paravirt page fault" design as opposed to simply async
#PF v2.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 1/3] kvm,x86: Force sync fault if previous attempts failed
  2020-06-17 13:02     ` [Virtio-fs] [PATCH 1/3] kvm, x86: " Vitaly Kuznetsov
@ 2020-06-17 19:58       ` Vivek Goyal
  -1 siblings, 0 replies; 26+ messages in thread
From: Vivek Goyal @ 2020-06-17 19:58 UTC (permalink / raw)
  To: Vitaly Kuznetsov
  Cc: virtio-fs, miklos, stefanha, dgilbert, pbonzini, wanpengli,
	sean.j.christopherson, kvm, linux-kernel

On Wed, Jun 17, 2020 at 03:02:10PM +0200, Vitaly Kuznetsov wrote:
> Vivek Goyal <vgoyal@redhat.com> writes:
> 
> > Page fault error handling behavior in kvm seems little inconsistent when
> > page fault reports error. If we are doing fault synchronously
> > then we capture error (-EFAULT) returned by __gfn_to_pfn_memslot() and
> > exit to user space and qemu reports error, "error: kvm run failed Bad address".
> >
> > But if we are doing async page fault, then async_pf_execute() will simply
> > ignore the error reported by get_user_pages_remote(). It is assumed that
> > page fault was successful and either a page ready event is injected in
> > guest or guest is brought out of artificial halt state and run again.
> > In both the cases when guest retries the instruction, it takes exit
> > again as page fault was not successful in previous attempt. And then
> > this infinite loop continues forever.
> >
> 
> Indeed, we always assume get_user_pages_remote() will fetch the page we
> need. 
> 
> > This patch tries to make this behavior consistent. That is instead of
> > getting into infinite loop of retrying page fault, exit to user space
> > and stop VM if page fault error happens. This can be improved by
> > injecting errors in guest when it is allowed. Later patches can
> > inject error when a process in guest triggered page fault and
> > in that case guest process will receive SIGBUS. Currently we don't
> > have a way to inject errors when guest is in kernel mode. Once we
> > have race free method to do so, we should be able to inject errors
> > and guest can do fixup_exception() if caller set it up so (virtio-fs).
> >
> > When async pf encounters error then save that pfn and when next time
> > guest retries, do a sync fault instead of async fault. So that if error
> > is encountered, we exit to qemu and avoid infinite loop.
> >
> > As of now only one error pfn is stored and that means it could be
> > overwritten before next a retry from guest happens. But this is
> > just a hint and if we miss it, some other time we will catch it.
> > If this becomes an issue, we could maintain an array of error
> > gfn later to help ease the issue.
> 
> With a single GFN stored we can probably get in the same infinite loop
> you describe when several processes try to access different unavailable
> GFNs.

We could probably have an array and store a bunch of error GFNs, but I
am not seeing much value in adding that complexity yet (until and
unless there are easy ways to reproduce this). Following is my thought
process.

- error gfn is just a hint to force a sync fault. Even if we miss it,
  the guest will retry and at some point it will be forced to do a
  sync fault.

- If the page fault happened in guest user space, we will send the
  error back and will not save or rely on error gfn.

- If the page fault happened in the guest kernel, then we have blocked
  async page faults and that means no scheduling (except if an interrupt
  happened in between and resulted in some scheduling).

I will test it with multiple processes doing I/O to the same file,
then truncate the file and see whether qemu still exits reasonably
soon.
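
For the record, if we ever do need to remember more than one error
GFN, I was imagining something as simple as a small ring replacing
error_gfn (completely untested sketch, all names made up):

	#define ASYNC_PF_NR_ERROR_GFNS	8

	/* in struct kvm_vcpu_arch.apf, instead of the single error_gfn */
	gfn_t error_gfns[ASYNC_PF_NR_ERROR_GFNS];
	unsigned int error_gfn_head;

	static void kvm_record_error_gfn(struct kvm_vcpu *vcpu, gfn_t gfn)
	{
		unsigned int i;

		i = vcpu->arch.apf.error_gfn_head++ % ASYNC_PF_NR_ERROR_GFNS;
		vcpu->arch.apf.error_gfns[i] = gfn;
	}

	static bool kvm_is_error_gfn(struct kvm_vcpu *vcpu, gfn_t gfn)
	{
		unsigned int i;

		for (i = 0; i < ASYNC_PF_NR_ERROR_GFNS; i++) {
			if (vcpu->arch.apf.error_gfns[i] == gfn)
				return true;
		}
		return false;
	}

kvm_can_do_async_pf() would then check kvm_is_error_gfn() instead of
comparing against a single error_gfn (the entries would want to be
initialised to an impossible GFN so that gfn 0 does not match by
accident). Again, just a sketch; I would rather not add this until the
single-slot hint is shown to be a problem in practice.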

> Also, is the condition "GFN is unavailable" temporary or
> permanent?

I think it depends on usage. For the nvdimm use case, if the file
backing that GFN gets truncated, then it is sort of a permanent
failure. But in the case of virtio-fs, it is possible to reuse
that GFN to map something else dynamically, and once it maps
something else, the fault will succeed again.

> I guess it's the former but 'error_gfn' is set permanently
> (assuming no other GFN will overwrite it).

Another GFN can overwrite it; nothing blocks that. So it is not
permanent.

Also, if we have set error_gfn, qemu will fail on the next retry, so
leaving it behind should not be a problem. Even if the failure was
temporary for some reason and the page fault succeeds on the next
retry, that's fine too. All we did was a sync fault instead of an
async one.

I could look into clearing error_gfn on a successful fault. There
are a bunch of success paths, which is why I was avoiding going in
that direction.
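
If we do decide to clear it, I was thinking of something as trivial
as the following (completely untested; the helper name is made up),
called from the success path once the fault for that gfn has actually
been resolved:

	static void kvm_clear_async_pf_error_gfn(struct kvm_vcpu *vcpu, gfn_t gfn)
	{
		/* 0 doubles as "no error recorded", same as the initial value */
		if (vcpu->arch.apf.error_gfn == gfn)
			vcpu->arch.apf.error_gfn = 0;
	}

Keeping it in one small helper would at least make it easy to call
from however many success paths we end up caring about.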

> 
> What if we do the following instead: add one more field to APF data
> indicating an error. The guest will kill the corresponding process
> instead of resuming it then.

That's what the second patch is doing. I added a ready_flags field
and used bit 0 to indicate an error. But that works only if the fault
was triggered by user space; otherwise we are not following the
async pf protocol.

If the fault was triggered by the kernel, then the error needs to be
injected synchronously so that the kernel can call fixup_exception().
That's more complicated and a future TODO item. For now, I think
stopping the VM is probably a better option than looping infinitely.

I wanted to do this first patch to make the error behavior uniform
and to agree that it makes sense. That is, if a page fault error
happens, then either we should inject the error back into the guest
(if it is allowed) or exit to user space. Retrying infinitely is not
a good option.

> 
> Also, with KVM_ASYNC_PF_SEND_ALWAYS being deprecated and assuming we
> switch to #VE, under which circumstances we'll be unable to handle page
> fault asynchronously? Apart from overflowing the async PF queue, are
> there any other scenarios when it still makes sense to keep the guest
> alive? 

I think we should be able to handle it in most cases and inject the
error back into the guest.

> 
> >
> > Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
> > ---
> >  arch/x86/include/asm/kvm_host.h |  1 +
> >  arch/x86/kvm/mmu.h              |  2 +-
> >  arch/x86/kvm/mmu/mmu.c          |  2 +-
> >  arch/x86/kvm/x86.c              | 19 +++++++++++++++++--
> >  include/linux/kvm_host.h        |  1 +
> >  virt/kvm/async_pf.c             |  8 ++++++--
> >  6 files changed, 27 insertions(+), 6 deletions(-)
> >
> > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > index 938497a6ebd7..348a73106556 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -778,6 +778,7 @@ struct kvm_vcpu_arch {
> >  		unsigned long nested_apf_token;
> >  		bool delivery_as_pf_vmexit;
> >  		bool pageready_pending;
> > +		gfn_t error_gfn;
> >  	} apf;
> >  
> >  	/* OSVW MSRs (AMD only) */
> > diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> > index 0ad06bfe2c2c..6fa085ff07a3 100644
> > --- a/arch/x86/kvm/mmu.h
> > +++ b/arch/x86/kvm/mmu.h
> > @@ -60,7 +60,7 @@ void kvm_init_mmu(struct kvm_vcpu *vcpu, bool reset_roots);
> >  void kvm_init_shadow_mmu(struct kvm_vcpu *vcpu, u32 cr0, u32 cr4, u32 efer);
> >  void kvm_init_shadow_ept_mmu(struct kvm_vcpu *vcpu, bool execonly,
> >  			     bool accessed_dirty, gpa_t new_eptp);
> > -bool kvm_can_do_async_pf(struct kvm_vcpu *vcpu);
> > +bool kvm_can_do_async_pf(struct kvm_vcpu *vcpu, gfn_t gfn);
> >  int kvm_handle_page_fault(struct kvm_vcpu *vcpu, u64 error_code,
> >  				u64 fault_address, char *insn, int insn_len);
> >  
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 15984dfde427..e207900ab303 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -4078,7 +4078,7 @@ static bool try_async_pf(struct kvm_vcpu *vcpu, bool prefault, gfn_t gfn,
> >  	if (!async)
> >  		return false; /* *pfn has correct page already */
> >  
> > -	if (!prefault && kvm_can_do_async_pf(vcpu)) {
> > +	if (!prefault && kvm_can_do_async_pf(vcpu, cr2_or_gpa >> PAGE_SHIFT)) {
> >  		trace_kvm_try_async_get_page(cr2_or_gpa, gfn);
> >  		if (kvm_find_async_pf_gfn(vcpu, gfn)) {
> >  			trace_kvm_async_pf_doublefault(cr2_or_gpa, gfn);
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index 697d1b273a2f..4c5205434b9c 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -10513,7 +10513,7 @@ static bool kvm_can_deliver_async_pf(struct kvm_vcpu *vcpu)
> >  	return true;
> >  }
> >  
> > -bool kvm_can_do_async_pf(struct kvm_vcpu *vcpu)
> > +bool kvm_can_do_async_pf(struct kvm_vcpu *vcpu, gfn_t gfn)
> >  {
> >  	if (unlikely(!lapic_in_kernel(vcpu) ||
> >  		     kvm_event_needs_reinjection(vcpu) ||
> > @@ -10527,7 +10527,13 @@ bool kvm_can_do_async_pf(struct kvm_vcpu *vcpu)
> >  	 * If interrupts are off we cannot even use an artificial
> >  	 * halt state.
> >  	 */
> > -	return kvm_arch_interrupt_allowed(vcpu);
> > +	if (!kvm_arch_interrupt_allowed(vcpu))
> > +		return false;
> > +
> > +	if (vcpu->arch.apf.error_gfn == gfn)
> > +		return false;
> > +
> > +	return true;
> >  }
> >  
> >  bool kvm_arch_async_page_not_present(struct kvm_vcpu *vcpu,
> > @@ -10583,6 +10589,15 @@ void kvm_arch_async_page_present(struct kvm_vcpu *vcpu,
> >  		kvm_apic_set_irq(vcpu, &irq, NULL);
> >  	}
> >  
> > +	if (work->error_code) {
> > +		/*
> > +		 * Page fault failed and we don't have a way to report
> > +		 * errors back to guest yet. So save this pfn and force
> > +		 * sync page fault next time.
> > +		 */
> > +		vcpu->arch.apf.error_gfn = work->cr2_or_gpa >> PAGE_SHIFT;
> > +	}
> > +
> >  	vcpu->arch.apf.halted = false;
> >  	vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;
> >  }
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index 216cdb6581e2..b8558334b184 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -207,6 +207,7 @@ struct kvm_async_pf {
> >  	struct kvm_arch_async_pf arch;
> >  	bool   wakeup_all;
> >  	bool notpresent_injected;
> > +	int error_code;
> >  };
> >  
> >  void kvm_clear_async_pf_completion_queue(struct kvm_vcpu *vcpu);
> > diff --git a/virt/kvm/async_pf.c b/virt/kvm/async_pf.c
> > index a36828fbf40a..6b30374a4de1 100644
> > --- a/virt/kvm/async_pf.c
> > +++ b/virt/kvm/async_pf.c
> > @@ -52,6 +52,7 @@ static void async_pf_execute(struct work_struct *work)
> >  	gpa_t cr2_or_gpa = apf->cr2_or_gpa;
> >  	int locked = 1;
> >  	bool first;
> > +	long ret;
> >  
> >  	might_sleep();
> >  
> > @@ -61,11 +62,14 @@ static void async_pf_execute(struct work_struct *work)
> >  	 * access remotely.
> >  	 */
> >  	down_read(&mm->mmap_sem);
> > -	get_user_pages_remote(NULL, mm, addr, 1, FOLL_WRITE, NULL, NULL,
> > -			&locked);
> > +	ret = get_user_pages_remote(NULL, mm, addr, 1, FOLL_WRITE, NULL, NULL,
> > +				    &locked);
> >  	if (locked)
> >  		up_read(&mm->mmap_sem);
> >  
> > +	if (ret < 0)
> > +		apf->error_code = ret;
> > +
> >  	if (IS_ENABLED(CONFIG_KVM_ASYNC_PF_SYNC))
> >  		kvm_arch_async_page_present(vcpu, apf);
> 
> -- 
> Vitaly
> 


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 2/3] kvm: Add capability to be able to report async pf error to guest
  2020-06-17 13:12     ` [Virtio-fs] " Vitaly Kuznetsov
@ 2020-06-17 20:33       ` Vivek Goyal
  -1 siblings, 0 replies; 26+ messages in thread
From: Vivek Goyal @ 2020-06-17 20:33 UTC (permalink / raw)
  To: Vitaly Kuznetsov
  Cc: virtio-fs, miklos, stefanha, dgilbert, pbonzini, wanpengli,
	sean.j.christopherson, kvm, linux-kernel

On Wed, Jun 17, 2020 at 03:12:03PM +0200, Vitaly Kuznetsov wrote:
> Vivek Goyal <vgoyal@redhat.com> writes:
> 
> > As of now asynchronous page fault mechanism assumes host will always be
> > successful in resolving page fault. So there are only two states, that
> > is page is not present and page is ready.
> >
> > If a page is backed by a file and that file has been truncated (as
> > can be the case with virtio-fs), then page fault handler on host returns
> > -EFAULT.
> >
> > As of now async page fault logic does not look at error code (-EFAULT)
> > returned by get_user_pages_remote() and returns PAGE_READY to guest.
> > Guest tries to access page and page fault happens again. And this
> > gets kvm into an infinite loop. (Killing host process gets kvm out of
> > this loop though).
> >
> > This patch adds another state to async page fault logic which allows
> > host to return error to guest. Once guest knows that async page fault
> > can't be resolved, it can send SIGBUS to host process (if user space
> > was accessing the page in question).
> >
> > Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
> > ---
> >  Documentation/virt/kvm/cpuid.rst     |  4 +++
> >  Documentation/virt/kvm/msr.rst       | 10 +++++---
> >  arch/x86/include/asm/kvm_host.h      |  3 +++
> >  arch/x86/include/asm/kvm_para.h      |  8 +++---
> >  arch/x86/include/uapi/asm/kvm_para.h | 10 ++++++--
> >  arch/x86/kernel/kvm.c                | 34 ++++++++++++++++++++-----
> >  arch/x86/kvm/cpuid.c                 |  3 ++-
> >  arch/x86/kvm/mmu/mmu.c               |  2 +-
> >  arch/x86/kvm/x86.c                   | 38 ++++++++++++++++++++--------
> >  9 files changed, 83 insertions(+), 29 deletions(-)
> >
> > diff --git a/Documentation/virt/kvm/cpuid.rst b/Documentation/virt/kvm/cpuid.rst
> > index a7dff9186bed..a4497f3a6e7f 100644
> > --- a/Documentation/virt/kvm/cpuid.rst
> > +++ b/Documentation/virt/kvm/cpuid.rst
> > @@ -92,6 +92,10 @@ KVM_FEATURE_ASYNC_PF_INT          14          guest checks this feature bit
> >                                                async pf acknowledgment msr
> >                                                0x4b564d07.
> >  
> > +KVM_FEATURE_ASYNC_PF_ERROR        15          paravirtualized async PF error
> > +                                              can be enabled by setting bit 4
> > +                                              when writing to msr 0x4b564d02
> > +
> >  KVM_FEATURE_CLOCSOURCE_STABLE_BIT 24          host will warn if no guest-side
> >                                                per-cpu warps are expeced in
> >                                                kvmclock
> > diff --git a/Documentation/virt/kvm/msr.rst b/Documentation/virt/kvm/msr.rst
> > index e37a14c323d2..513eb8e0b0eb 100644
> > --- a/Documentation/virt/kvm/msr.rst
> > +++ b/Documentation/virt/kvm/msr.rst
> > @@ -202,19 +202,21 @@ data:
> >  
> >  		/* Used for 'page ready' events delivered via interrupt notification */
> >  		__u32 token;
> > -
> > -		__u8 pad[56];
> > +                __u32 ready_flags;
> > +		__u8 pad[52];
> >  		__u32 enabled;
> >  	  };
> >  
> > -	Bits 5-4 of the MSR are reserved and should be zero. Bit 0 is set to 1
> > +	Bit 5 of the MSR is reserved and should be zero. Bit 0 is set to 1
> >  	when asynchronous page faults are enabled on the vcpu, 0 when disabled.
> >  	Bit 1 is 1 if asynchronous page faults can be injected when vcpu is in
> >  	cpl == 0. Bit 2 is 1 if asynchronous page faults are delivered to L1 as
> >  	#PF vmexits.  Bit 2 can be set only if KVM_FEATURE_ASYNC_PF_VMEXIT is
> >  	present in CPUID. Bit 3 enables interrupt based delivery of 'page ready'
> >  	events. Bit 3 can only be set if KVM_FEATURE_ASYNC_PF_INT is present in
> > -	CPUID.
> > +	CPUID. Bit 4 enables reporting of page fault errors if hypervisor
> > +        encounters errors while faulting in page. Bit 4 can only be set if
> > +        KVM_FEATURE_ASYNC_PF_ERROR is present in CPUID.
> 
> Would it actually make sense to deliver the exact error from
> get_user_pages() and not a binary failure/no-failure? Is the guest
> interested in how exactly we failed?

I was thinking about it but could not find a use case. I am not sure
what the guest would do with this error code.

If the error happened in user space because mmap() tried to access a
portion of the file which got truncated on the host, then we should
probably treat it the same way as a process on the host getting
SIGBUS. I think such a process gets si_code BUS_ADRERR and also the
faulting address. As of now this patch does not send the faulting gva
because I am not sure how to get it. In the past I had proposed a
patch where I was sending the gva as part of the error as well.

https://lore.kernel.org/kvm/20200331194011.24834-3-vgoyal@redhat.com/

I am not sure if that's the right way to get the GVA. Is it true
that the following will give me the guest virtual address? It seemed
to work for me in the past.

+	if (exit_qualification & EPT_VIOLATION_GLA_VALID) {
+		gva = vmcs_readl(GUEST_LINEAR_ADDRESS);
+	}

If the error happened in the kernel, we will simply call
fixup_exception() or panic().
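
To illustrate, if the guest did get the faulting gva along with the
error, the guest side handling could look roughly like this (untested
sketch; the function name and the error code passed to
fixup_exception() are only for illustration):

	static void kvm_handle_async_pf_error(struct pt_regs *regs,
					      unsigned long gva)
	{
		if (user_mode(regs)) {
			/* behave like a regular SIGBUS on a truncated mapping */
			force_sig_fault(SIGBUS, BUS_ADRERR, (void __user *)gva);
			return;
		}

		/* kernel mode: give the exception table a chance first */
		if (!fixup_exception(regs, X86_TRAP_PF, 0, gva))
			panic("Unrecoverable async #PF error in kernel mode\n");
	}

Without the gva we cannot fill in si_addr for the user space case or
hand a meaningful fault address to fixup_exception() for the kernel
case, which is why the earlier patch tried to pass it along.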

So while sending the exact error back might not hurt, I don't have a
good use case in mind yet, and thought of leaving it as a future item
for when somebody cares about it.

> 
> >  
> >  	'Page not present' events are currently always delivered as synthetic
> >  	#PF exception. During delivery of these events APF CR2 register contains
> > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > index 348a73106556..ff8dbc604dbe 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -779,6 +779,7 @@ struct kvm_vcpu_arch {
> >  		bool delivery_as_pf_vmexit;
> >  		bool pageready_pending;
> >  		gfn_t error_gfn;
> > +		bool send_pf_error;
> >  	} apf;
> >  
> >  	/* OSVW MSRs (AMD only) */
> > @@ -1680,6 +1681,8 @@ void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu,
> >  			       struct kvm_async_pf *work);
> >  void kvm_arch_async_page_present_queued(struct kvm_vcpu *vcpu);
> >  bool kvm_arch_can_dequeue_async_page_present(struct kvm_vcpu *vcpu);
> > +void kvm_arch_async_page_fault_error(struct kvm_vcpu *vcpu,
> > +				     struct kvm_async_pf *work);
> >  extern bool kvm_find_async_pf_gfn(struct kvm_vcpu *vcpu, gfn_t gfn);
> >  
> >  int kvm_skip_emulated_instruction(struct kvm_vcpu *vcpu);
> > diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
> > index bbc43e5411d9..6c738e91ca2c 100644
> > --- a/arch/x86/include/asm/kvm_para.h
> > +++ b/arch/x86/include/asm/kvm_para.h
> > @@ -89,8 +89,8 @@ static inline long kvm_hypercall4(unsigned int nr, unsigned long p1,
> >  bool kvm_para_available(void);
> >  unsigned int kvm_arch_para_features(void);
> >  unsigned int kvm_arch_para_hints(void);
> > -void kvm_async_pf_task_wait_schedule(u32 token);
> > -void kvm_async_pf_task_wake(u32 token);
> > +void kvm_async_pf_task_wait_schedule(u32 token, bool user_mode);
> > +void kvm_async_pf_task_wake(u32 token, bool is_err);
> >  u32 kvm_read_and_reset_apf_flags(void);
> >  void kvm_disable_steal_time(void);
> >  bool __kvm_handle_async_pf(struct pt_regs *regs, u32 token);
> > @@ -120,8 +120,8 @@ static inline void kvm_spinlock_init(void)
> >  #endif /* CONFIG_PARAVIRT_SPINLOCKS */
> >  
> >  #else /* CONFIG_KVM_GUEST */
> > -#define kvm_async_pf_task_wait_schedule(T) do {} while(0)
> > -#define kvm_async_pf_task_wake(T) do {} while(0)
> > +#define kvm_async_pf_task_wait_schedule(T, U) do {} while(0)
> > +#define kvm_async_pf_task_wake(T, I) do {} while(0)
> >  
> >  static inline bool kvm_para_available(void)
> >  {
> > diff --git a/arch/x86/include/uapi/asm/kvm_para.h b/arch/x86/include/uapi/asm/kvm_para.h
> > index 812e9b4c1114..b43fd2ddc949 100644
> > --- a/arch/x86/include/uapi/asm/kvm_para.h
> > +++ b/arch/x86/include/uapi/asm/kvm_para.h
> > @@ -32,6 +32,7 @@
> >  #define KVM_FEATURE_POLL_CONTROL	12
> >  #define KVM_FEATURE_PV_SCHED_YIELD	13
> >  #define KVM_FEATURE_ASYNC_PF_INT	14
> > +#define KVM_FEATURE_ASYNC_PF_ERROR	15
> >  
> >  #define KVM_HINTS_REALTIME      0
> >  
> > @@ -85,6 +86,7 @@ struct kvm_clock_pairing {
> >  #define KVM_ASYNC_PF_SEND_ALWAYS		(1 << 1)
> >  #define KVM_ASYNC_PF_DELIVERY_AS_PF_VMEXIT	(1 << 2)
> >  #define KVM_ASYNC_PF_DELIVERY_AS_INT		(1 << 3)
> > +#define KVM_ASYNC_PF_SEND_ERROR			(1 << 4)
> >  
> >  /* MSR_KVM_ASYNC_PF_INT */
> >  #define KVM_ASYNC_PF_VEC_MASK			GENMASK(7, 0)
> > @@ -119,14 +121,18 @@ struct kvm_mmu_op_release_pt {
> >  #define KVM_PV_REASON_PAGE_NOT_PRESENT 1
> >  #define KVM_PV_REASON_PAGE_READY 2
> >  
> > +
> > +/* Bit flags for ready_flags field */
> > +#define KVM_PV_REASON_PAGE_ERROR 1
> > +
> >  struct kvm_vcpu_pv_apf_data {
> >  	/* Used for 'page not present' events delivered via #PF */
> >  	__u32 flags;
> >  
> >  	/* Used for 'page ready' events delivered via interrupt notification */
> >  	__u32 token;
> > -
> > -	__u8 pad[56];
> > +	__u32 ready_flags;
> > +	__u8 pad[52];
> >  	__u32 enabled;
> >  };
> >  
> > diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
> > index ff95429d206b..2ee9c9d971ae 100644
> > --- a/arch/x86/kernel/kvm.c
> > +++ b/arch/x86/kernel/kvm.c
> > @@ -75,6 +75,7 @@ struct kvm_task_sleep_node {
> >  	struct swait_queue_head wq;
> >  	u32 token;
> >  	int cpu;
> > +	bool is_err;
> >  };
> >  
> >  static struct kvm_task_sleep_head {
> > @@ -97,7 +98,14 @@ static struct kvm_task_sleep_node *_find_apf_task(struct kvm_task_sleep_head *b,
> >  	return NULL;
> >  }
> >  
> > -static bool kvm_async_pf_queue_task(u32 token, struct kvm_task_sleep_node *n)
> > +static void handle_async_pf_error(bool user_mode)
> > +{
> > +	if (user_mode)
> > +		send_sig_info(SIGBUS, SEND_SIG_PRIV, current);
> > +}
> > +
> > +static bool kvm_async_pf_queue_task(u32 token, struct kvm_task_sleep_node *n,
> > +				    bool user_mode)
> >  {
> >  	u32 key = hash_32(token, KVM_TASK_SLEEP_HASHBITS);
> >  	struct kvm_task_sleep_head *b = &async_pf_sleepers[key];
> > @@ -107,6 +115,8 @@ static bool kvm_async_pf_queue_task(u32 token, struct kvm_task_sleep_node *n)
> >  	e = _find_apf_task(b, token);
> >  	if (e) {
> >  		/* dummy entry exist -> wake up was delivered ahead of PF */
> > +		if (e->is_err)
> > +			handle_async_pf_error(user_mode);
> >  		hlist_del(&e->link);
> >  		raw_spin_unlock(&b->lock);
> >  		kfree(e);
> > @@ -128,14 +138,14 @@ static bool kvm_async_pf_queue_task(u32 token, struct kvm_task_sleep_node *n)
> >   * Invoked from the async pagefault handling code or from the VM exit page
> >   * fault handler. In both cases RCU is watching.
> >   */
> > -void kvm_async_pf_task_wait_schedule(u32 token)
> > +void kvm_async_pf_task_wait_schedule(u32 token, bool user_mode)
> >  {
> >  	struct kvm_task_sleep_node n;
> >  	DECLARE_SWAITQUEUE(wait);
> >  
> >  	lockdep_assert_irqs_disabled();
> >  
> > -	if (!kvm_async_pf_queue_task(token, &n))
> > +	if (!kvm_async_pf_queue_task(token, &n, user_mode))
> >  		return;
> >  
> >  	for (;;) {
> > @@ -148,6 +158,8 @@ void kvm_async_pf_task_wait_schedule(u32 token)
> >  		local_irq_disable();
> >  	}
> >  	finish_swait(&n.wq, &wait);
> > +	if (n.is_err)
> > +		handle_async_pf_error(user_mode);
> >  }
> >  EXPORT_SYMBOL_GPL(kvm_async_pf_task_wait_schedule);
> >  
> > @@ -177,7 +189,7 @@ static void apf_task_wake_all(void)
> >  	}
> >  }
> >  
> > -void kvm_async_pf_task_wake(u32 token)
> > +void kvm_async_pf_task_wake(u32 token, bool is_err)
> >  {
> >  	u32 key = hash_32(token, KVM_TASK_SLEEP_HASHBITS);
> >  	struct kvm_task_sleep_head *b = &async_pf_sleepers[key];
> > @@ -208,9 +220,11 @@ void kvm_async_pf_task_wake(u32 token)
> >  		}
> >  		n->token = token;
> >  		n->cpu = smp_processor_id();
> > +		n->is_err = is_err;
> >  		init_swait_queue_head(&n->wq);
> >  		hlist_add_head(&n->link, &b->list);
> >  	} else {
> > +		n->is_err = is_err;
> >  		apf_task_wake_one(n);
> >  	}
> >  	raw_spin_unlock(&b->lock);
> > @@ -256,14 +270,15 @@ bool __kvm_handle_async_pf(struct pt_regs *regs, u32 token)
> >  		panic("Host injected async #PF in kernel mode\n");
> >  
> >  	/* Page is swapped out by the host. */
> > -	kvm_async_pf_task_wait_schedule(token);
> > +	kvm_async_pf_task_wait_schedule(token, user_mode(regs));
> >  	return true;
> >  }
> >  NOKPROBE_SYMBOL(__kvm_handle_async_pf);
> >  
> >  __visible void __irq_entry kvm_async_pf_intr(struct pt_regs *regs)
> >  {
> > -	u32 token;
> > +	u32 token, ready_flags;
> > +	bool is_err;
> >  
> >  	entering_ack_irq();
> >  
> > @@ -271,7 +286,9 @@ __visible void __irq_entry kvm_async_pf_intr(struct pt_regs *regs)
> >  
> >  	if (__this_cpu_read(apf_reason.enabled)) {
> >  		token = __this_cpu_read(apf_reason.token);
> > -		kvm_async_pf_task_wake(token);
> > +		ready_flags = __this_cpu_read(apf_reason.ready_flags);
> > +		is_err = ready_flags & KVM_PV_REASON_PAGE_ERROR;
> > +		kvm_async_pf_task_wake(token, is_err);
> >  		__this_cpu_write(apf_reason.token, 0);
> >  		wrmsrl(MSR_KVM_ASYNC_PF_ACK, 1);
> >  	}
> > @@ -335,6 +352,9 @@ static void kvm_guest_cpu_init(void)
> >  
> >  		wrmsrl(MSR_KVM_ASYNC_PF_INT, KVM_ASYNC_PF_VECTOR);
> >  
> > +		if (kvm_para_has_feature(KVM_FEATURE_ASYNC_PF_ERROR))
> > +			pa |= KVM_ASYNC_PF_SEND_ERROR;
> > +
> >  		wrmsrl(MSR_KVM_ASYNC_PF_EN, pa);
> >  		__this_cpu_write(apf_reason.enabled, 1);
> >  		pr_info("KVM setup async PF for cpu %d\n", smp_processor_id());
> > diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
> > index 253b8e875ccd..f811f36e0c24 100644
> > --- a/arch/x86/kvm/cpuid.c
> > +++ b/arch/x86/kvm/cpuid.c
> > @@ -712,7 +712,8 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function)
> >  			     (1 << KVM_FEATURE_PV_SEND_IPI) |
> >  			     (1 << KVM_FEATURE_POLL_CONTROL) |
> >  			     (1 << KVM_FEATURE_PV_SCHED_YIELD) |
> > -			     (1 << KVM_FEATURE_ASYNC_PF_INT);
> > +			     (1 << KVM_FEATURE_ASYNC_PF_INT) |
> > +			     (1 << KVM_FEATURE_ASYNC_PF_ERROR);
> >  
> >  		if (sched_info_on())
> >  			entry->eax |= (1 << KVM_FEATURE_STEAL_TIME);
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index e207900ab303..634182bb07c7 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -4175,7 +4175,7 @@ int kvm_handle_page_fault(struct kvm_vcpu *vcpu, u64 error_code,
> >  	} else if (flags & KVM_PV_REASON_PAGE_NOT_PRESENT) {
> >  		vcpu->arch.apf.host_apf_flags = 0;
> >  		local_irq_disable();
> > -		kvm_async_pf_task_wait_schedule(fault_address);
> > +		kvm_async_pf_task_wait_schedule(fault_address, 0);
> >  		local_irq_enable();
> >  	} else {
> >  		WARN_ONCE(1, "Unexpected host async PF flags: %x\n", flags);
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index 4c5205434b9c..c3b2097f2eaa 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -2690,8 +2690,8 @@ static int kvm_pv_enable_async_pf(struct kvm_vcpu *vcpu, u64 data)
> >  {
> >  	gpa_t gpa = data & ~0x3f;
> >  
> > -	/* Bits 4:5 are reserved, Should be zero */
> > -	if (data & 0x30)
> > +	/* Bit 5 is reserved, should be zero */
> > +	if (data & 0x20)
> >  		return 1;
> >  
> >  	vcpu->arch.apf.msr_en_val = data;
> > @@ -2703,11 +2703,12 @@ static int kvm_pv_enable_async_pf(struct kvm_vcpu *vcpu, u64 data)
> >  	}
> >  
> >  	if (kvm_gfn_to_hva_cache_init(vcpu->kvm, &vcpu->arch.apf.data, gpa,
> > -					sizeof(u64)))
> > +					sizeof(u64) + sizeof(u32)))
> >  		return 1;
> >  
> >  	vcpu->arch.apf.send_user_only = !(data & KVM_ASYNC_PF_SEND_ALWAYS);
> >  	vcpu->arch.apf.delivery_as_pf_vmexit = data & KVM_ASYNC_PF_DELIVERY_AS_PF_VMEXIT;
> > +	vcpu->arch.apf.send_pf_error = data & KVM_ASYNC_PF_SEND_ERROR;
> >  
> >  	kvm_async_pf_wakeup_all(vcpu);
> >  
> > @@ -10481,12 +10482,25 @@ static inline int apf_put_user_notpresent(struct kvm_vcpu *vcpu)
> >  				      sizeof(reason));
> >  }
> >  
> > -static inline int apf_put_user_ready(struct kvm_vcpu *vcpu, u32 token)
> > +static inline int apf_put_user_ready(struct kvm_vcpu *vcpu, u32 token,
> > +				     bool is_err)
> >  {
> >  	unsigned int offset = offsetof(struct kvm_vcpu_pv_apf_data, token);
> > +	int ret;
> > +	u32 ready_flags = 0;
> > +
> > +	if (is_err && vcpu->arch.apf.send_pf_error)
> > +		ready_flags = KVM_PV_REASON_PAGE_ERROR;
> > +
> > +	ret = kvm_write_guest_offset_cached(vcpu->kvm, &vcpu->arch.apf.data,
> > +					    &token, offset, sizeof(token));
> > +	if (ret < 0)
> > +		return ret;
> >  
> > +	offset = offsetof(struct kvm_vcpu_pv_apf_data, ready_flags);
> >  	return kvm_write_guest_offset_cached(vcpu->kvm, &vcpu->arch.apf.data,
> > -					     &token, offset, sizeof(token));
> > +					    &ready_flags, offset,
> > +					    sizeof(ready_flags));
> >  }
> >  
> >  static inline bool apf_pageready_slot_free(struct kvm_vcpu *vcpu)
> > @@ -10571,6 +10585,8 @@ bool kvm_arch_async_page_not_present(struct kvm_vcpu *vcpu,
> >  void kvm_arch_async_page_present(struct kvm_vcpu *vcpu,
> >  				 struct kvm_async_pf *work)
> >  {
> > +	bool present_injected = false;
> > +
> >  	struct kvm_lapic_irq irq = {
> >  		.delivery_mode = APIC_DM_FIXED,
> >  		.vector = vcpu->arch.apf.vec
> > @@ -10584,16 +10600,18 @@ void kvm_arch_async_page_present(struct kvm_vcpu *vcpu,
> >  
> >  	if ((work->wakeup_all || work->notpresent_injected) &&
> >  	    kvm_pv_async_pf_enabled(vcpu) &&
> > -	    !apf_put_user_ready(vcpu, work->arch.token)) {
> > +	    !apf_put_user_ready(vcpu, work->arch.token, work->error_code)) {
> >  		vcpu->arch.apf.pageready_pending = true;
> >  		kvm_apic_set_irq(vcpu, &irq, NULL);
> > +		present_injected = true;
> >  	}
> >  
> > -	if (work->error_code) {
> > +	if (work->error_code &&
> > +	    (!present_injected || !vcpu->arch.apf.send_pf_error)) {
> >  		/*
> > -		 * Page fault failed and we don't have a way to report
> > -		 * errors back to guest yet. So save this pfn and force
> > -		 * sync page fault next time.
> > +		 * Page fault failed. If we did not report error back
> > +		 * to guest, save this pfn and force sync page fault next
> > +		 * time.
> >  		 */
> >  		vcpu->arch.apf.error_gfn = work->cr2_or_gpa >> PAGE_SHIFT;
> >  	}
> 
> -- 
> Vitaly
> 


^ permalink raw reply	[flat|nested] 26+ messages in thread

> > index 253b8e875ccd..f811f36e0c24 100644
> > --- a/arch/x86/kvm/cpuid.c
> > +++ b/arch/x86/kvm/cpuid.c
> > @@ -712,7 +712,8 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function)
> >  			     (1 << KVM_FEATURE_PV_SEND_IPI) |
> >  			     (1 << KVM_FEATURE_POLL_CONTROL) |
> >  			     (1 << KVM_FEATURE_PV_SCHED_YIELD) |
> > -			     (1 << KVM_FEATURE_ASYNC_PF_INT);
> > +			     (1 << KVM_FEATURE_ASYNC_PF_INT) |
> > +			     (1 << KVM_FEATURE_ASYNC_PF_ERROR);
> >  
> >  		if (sched_info_on())
> >  			entry->eax |= (1 << KVM_FEATURE_STEAL_TIME);
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index e207900ab303..634182bb07c7 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -4175,7 +4175,7 @@ int kvm_handle_page_fault(struct kvm_vcpu *vcpu, u64 error_code,
> >  	} else if (flags & KVM_PV_REASON_PAGE_NOT_PRESENT) {
> >  		vcpu->arch.apf.host_apf_flags = 0;
> >  		local_irq_disable();
> > -		kvm_async_pf_task_wait_schedule(fault_address);
> > +		kvm_async_pf_task_wait_schedule(fault_address, 0);
> >  		local_irq_enable();
> >  	} else {
> >  		WARN_ONCE(1, "Unexpected host async PF flags: %x\n", flags);
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index 4c5205434b9c..c3b2097f2eaa 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -2690,8 +2690,8 @@ static int kvm_pv_enable_async_pf(struct kvm_vcpu *vcpu, u64 data)
> >  {
> >  	gpa_t gpa = data & ~0x3f;
> >  
> > -	/* Bits 4:5 are reserved, Should be zero */
> > -	if (data & 0x30)
> > +	/* Bit 5 is reserved, should be zero */
> > +	if (data & 0x20)
> >  		return 1;
> >  
> >  	vcpu->arch.apf.msr_en_val = data;
> > @@ -2703,11 +2703,12 @@ static int kvm_pv_enable_async_pf(struct kvm_vcpu *vcpu, u64 data)
> >  	}
> >  
> >  	if (kvm_gfn_to_hva_cache_init(vcpu->kvm, &vcpu->arch.apf.data, gpa,
> > -					sizeof(u64)))
> > +					sizeof(u64) + sizeof(u32)))
> >  		return 1;
> >  
> >  	vcpu->arch.apf.send_user_only = !(data & KVM_ASYNC_PF_SEND_ALWAYS);
> >  	vcpu->arch.apf.delivery_as_pf_vmexit = data & KVM_ASYNC_PF_DELIVERY_AS_PF_VMEXIT;
> > +	vcpu->arch.apf.send_pf_error = data & KVM_ASYNC_PF_SEND_ERROR;
> >  
> >  	kvm_async_pf_wakeup_all(vcpu);
> >  
> > @@ -10481,12 +10482,25 @@ static inline int apf_put_user_notpresent(struct kvm_vcpu *vcpu)
> >  				      sizeof(reason));
> >  }
> >  
> > -static inline int apf_put_user_ready(struct kvm_vcpu *vcpu, u32 token)
> > +static inline int apf_put_user_ready(struct kvm_vcpu *vcpu, u32 token,
> > +				     bool is_err)
> >  {
> >  	unsigned int offset = offsetof(struct kvm_vcpu_pv_apf_data, token);
> > +	int ret;
> > +	u32 ready_flags = 0;
> > +
> > +	if (is_err && vcpu->arch.apf.send_pf_error)
> > +		ready_flags = KVM_PV_REASON_PAGE_ERROR;
> > +
> > +	ret = kvm_write_guest_offset_cached(vcpu->kvm, &vcpu->arch.apf.data,
> > +					    &token, offset, sizeof(token));
> > +	if (ret < 0)
> > +		return ret;
> >  
> > +	offset = offsetof(struct kvm_vcpu_pv_apf_data, ready_flags);
> >  	return kvm_write_guest_offset_cached(vcpu->kvm, &vcpu->arch.apf.data,
> > -					     &token, offset, sizeof(token));
> > +					    &ready_flags, offset,
> > +					    sizeof(ready_flags));
> >  }
> >  
> >  static inline bool apf_pageready_slot_free(struct kvm_vcpu *vcpu)
> > @@ -10571,6 +10585,8 @@ bool kvm_arch_async_page_not_present(struct kvm_vcpu *vcpu,
> >  void kvm_arch_async_page_present(struct kvm_vcpu *vcpu,
> >  				 struct kvm_async_pf *work)
> >  {
> > +	bool present_injected = false;
> > +
> >  	struct kvm_lapic_irq irq = {
> >  		.delivery_mode = APIC_DM_FIXED,
> >  		.vector = vcpu->arch.apf.vec
> > @@ -10584,16 +10600,18 @@ void kvm_arch_async_page_present(struct kvm_vcpu *vcpu,
> >  
> >  	if ((work->wakeup_all || work->notpresent_injected) &&
> >  	    kvm_pv_async_pf_enabled(vcpu) &&
> > -	    !apf_put_user_ready(vcpu, work->arch.token)) {
> > +	    !apf_put_user_ready(vcpu, work->arch.token, work->error_code)) {
> >  		vcpu->arch.apf.pageready_pending = true;
> >  		kvm_apic_set_irq(vcpu, &irq, NULL);
> > +		present_injected = true;
> >  	}
> >  
> > -	if (work->error_code) {
> > +	if (work->error_code &&
> > +	    (!present_injected || !vcpu->arch.apf.send_pf_error)) {
> >  		/*
> > -		 * Page fault failed and we don't have a way to report
> > -		 * errors back to guest yet. So save this pfn and force
> > -		 * sync page fault next time.
> > +		 * Page fault failed. If we did not report error back
> > +		 * to guest, save this pfn and force sync page fault next
> > +		 * time.
> >  		 */
> >  		vcpu->arch.apf.error_gfn = work->cr2_or_gpa >> PAGE_SHIFT;
> >  	}
> 
> -- 
> Vitaly
> 


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 2/3] kvm: Add capability to be able to report async pf error to guest
  2020-06-17 18:32       ` [Virtio-fs] " Sean Christopherson
@ 2020-06-17 21:51         ` Vivek Goyal
  -1 siblings, 0 replies; 26+ messages in thread
From: Vivek Goyal @ 2020-06-17 21:51 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Vitaly Kuznetsov, virtio-fs, miklos, stefanha, dgilbert,
	pbonzini, wanpengli, kvm, linux-kernel

On Wed, Jun 17, 2020 at 11:32:24AM -0700, Sean Christopherson wrote:
> On Wed, Jun 17, 2020 at 03:12:03PM +0200, Vitaly Kuznetsov wrote:
> > Vivek Goyal <vgoyal@redhat.com> writes:
> > 
> > > As of now asynchronous page fault mecahanism assumes host will always be
> > > successful in resolving page fault. So there are only two states, that
> > > is page is not present and page is ready.
> > >
> > > If a page is backed by a file and that file has been truncated (as
> > > can be the case with virtio-fs), then page fault handler on host returns
> > > -EFAULT.
> > >
> > > As of now async page fault logic does not look at error code (-EFAULT)
> > > returned by get_user_pages_remote() and returns PAGE_READY to guest.
> > > Guest tries to access page and page fault happnes again. And this
> > > gets kvm into an infinite loop. (Killing host process gets kvm out of
> > > this loop though).
> 
> Isn't this already fixed by patch 1/3 "kvm,x86: Force sync fault if previous
> attempts failed"?  If it isn't, it should be, i.e. we should fix KVM before
> adding what are effectively optimizations on top.

Yes, you are right. It is now fixed by patch 1. I wrote the changelog
first and later changed the order of the patches and forgot to update
the changelog.

So yes, with the previous patch we will not exit to qemu with -EFAULT or
any other error code.

This patch improves upon it and, when possible, injects the error into
the guest so that the guest can handle it.

> And, it's not clear that
> the optimizations are necessary, e.g. I assume the virtio-fs truncation
> scenario is relatively uncommon, i.e. not performance sensitive?

As of now this scenario is less common, but with virtio-fs such truncation
can become more common. A file can be truncated on the host or by another
guest. It's a full filesystem, and most of the time the filesystem will
try to cope with it, but invariably there will be some cases where
user space/kernel space will try to load/store to a page which got
truncated. And in that case, the guest can do error handling much
better, as long as we can inject the error back into the guest.

For a process, we can send SIGBUS, and for kernel accesses, a special
memcpy_mcsafe() equivalent will be used which simply returns -EIO to
user space.
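
Roughly, the guest-side handling for a kernel-mode access could look
like this (sketch only; dax_read_page() is a made-up helper, and the
real code would use whatever memcpy_mcsafe()-style primitive the
series settles on):

	/*
	 * Sketch: an exception-protected copy turns a failed access into
	 * -EIO instead of killing the task. memcpy_mcsafe() returns the
	 * number of bytes it could not copy (0 on success).
	 */
	static ssize_t dax_read_page(void *dst, void *dax_addr, size_t len)
	{
		unsigned long rem;

		rem = memcpy_mcsafe(dst, dax_addr, len);
		if (rem)
			return -EIO;	/* surfaced to user space */
		return len;
	}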

So these optimizations make sense to me because the guest can continue
to work instead of dying when an error occurs.

> 
> > >
> > > This patch adds another state to async page fault logic which allows
> > > host to return error to guest. Once guest knows that async page fault
> > > can't be resolved, it can send SIGBUS to host process (if user space
> 
> I assume this is supposed to be "it can send SIGBUS to guest process"?
> Otherwise none of this makes sense (to me).

Yes, I meant SIGBUS to the guest process (and not the host process). It's a typo.

> 
> > > was accessing the page in question).
> 
> Allowing the guest to opt-in to intercepting host page allocation failures
> feels wrong, and fragile.  KVM can't possibly know whether an allocation
> failure is something that should be forwarded to the guest, as KVM doesn't
> know the physical backing for any given hva/gfn, e.g. the error could be
> due to a physical device failure or a configuration issue.  Relying on the
> async #PF mechanism to prevent allocation failures from crashing the guest
> is fragile because there is no guarantee that a #PF can be async.
> 

Actually, it does not necessarily have to be async #PF. What we
are missing right now is a synchronous method to report errors back.
Once that is available (#VE), we can use that instead. So if KVM
decides to do a synchronous fault, we can inject the error using #VE;
otherwise use the interrupt-based APF READY.

IOW, we are not necessarily relying on #PF being async. For now, if
the page fault is sync and an error happens, we will exit to qemu and
the VM will freeze/die.
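
In pseudocode, the host-side dispatch I have in mind is roughly this
(all helper names except kvm_can_do_async_pf() are hypothetical, and
the #VE path does not exist yet):

	/* On page fault error -- pseudocode, not actual KVM code */
	if (kvm_can_do_async_pf(vcpu))
		/* async: report the error in the "page ready" event */
		apf_send_page_ready(vcpu, token, /* is_err */ true);
	else if (guest_has_sync_error_vector)
		/* sync: inject a race-free exception such as #VE */
		inject_sync_error(vcpu, gpa);
	else
		/* today: exit to qemu and the VM freezes */
		exit_to_userspace(vcpu);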

> IMO, the virtio-fs truncation use case should first be addressed in a way
> that requires explicit userspace intervention, e.g. by enhancing
> kvm_handle_bad_page() to provide the necessary information to userspace so
> that userspace can reflect select errors into the guest.

Which errors do you want to filter out? And this does not take away
the filtering capability. When we figure out what needs to be filtered
out, we can do something like kvm_send_hwpoison_signal() to signal
user space and not exit to user space with an error.

Having said that, I see that, the way the code is currently written, we
capture the error and will inject it into the guest if the guest opted
in.

kvm_check_async_pf_completion() {
	kvm_arch_async_page_ready();
	kvm_arch_async_page_present();
}

I really don't understand the significance of
kvm_arch_async_page_ready(). Why does it go all the way into
try_async_pf(), and why does it set prefault=1?

Anyway, so we executed the async page fault, which did get_user_pages_remote()
and populated the page tables. Now we check for completion and call
kvm_arch_async_page_ready(), which calls into direct_page_fault(),
try_async_pf() and handle_abnormal_pfn(). I think we could
capture the error from kvm_arch_async_page_ready() also. That way
handle_abnormal_pfn() will be able to filter it out and take out-of-band
action (like poisoned pages), and page_present() will not
inject the error into the guest.
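
Something along these lines (a heavily simplified sketch of the idea,
reusing the error_code field this series adds to struct kvm_async_pf;
the real kvm_arch_async_page_ready() has more checks and currently
ignores the result of the retried fault):

	void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu,
				       struct kvm_async_pf *work)
	{
		int r;

		...
		r = kvm_mmu_do_page_fault(vcpu, work->cr2_or_gpa, 0, true);
		if (r < 0)
			/* remember the failure so page_present() can act on it */
			work->error_code = r;
	}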

> The reflection
> could piggyback whatever vector is used by async page faults (#PF or #VE),
> but would not be an async page fault per se.  If an async #PF happens to
> encounter an allocation failure, it would naturally fall back to the
> synchronous path (provided by patch 1/3) and the synchronous path would
> automagically handle the error as above.

We could do that as well. So you are suggesting that, instead of reporting
the error in the async PF PAGE_READY event, we send PAGE_READY, force a
sync page fault on the retry path, and then report the error using a new
race-free vector (#VE).

We don't have that vector yet, though, and given the conversation it might
not be simple to implement.

I had played with using MCE to inject errors, as Andy had suggested,
but this error can happen on the store path too, and MCEs are not generated
on the store path. So that is not suitable either.

> 
> In other words, I think the guest should be able to enable "error handling"
> support without first enabling async #PF.  From a functional perspective it
> probably won't change a whole lot, but it would hopefully force us to
> concoct an overall "paravirt page fault" design as opposed to simply async
> #PF v2.

I am not sure how independent these two can be. Even if there was
a separate way to enable error handling, one would have to opt in to
async PF V2 to report errors, because the shared data between host
and guest will change accordingly. So we will probably end up
electing for error reporting in two places: once via a generic method
and then also at the async PF level.

Or maybe we can choose to never report errors using the async PF
protocol: always choose the suboptimal implementation of retrying
the fault and inject errors *always* using that new vector (#VE).

Thanks
Vivek


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 2/3] kvm: Add capability to be able to report async pf error to guest
  2020-06-17 21:51         ` [Virtio-fs] " Vivek Goyal
@ 2020-06-17 23:00           ` Sean Christopherson
  -1 siblings, 0 replies; 26+ messages in thread
From: Sean Christopherson @ 2020-06-17 23:00 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Vitaly Kuznetsov, virtio-fs, miklos, stefanha, dgilbert,
	pbonzini, wanpengli, kvm, linux-kernel

On Wed, Jun 17, 2020 at 05:51:52PM -0400, Vivek Goyal wrote:
> > And, it's not clear that the optimizations are necessary, e.g. I assume the
> > virtio-fs truncation scenario is relatively uncommon, i.e. not performance
> > sensitive?
> 
> As of now this scenario is less common but with virtio-fs this truncation
> can become more common. A file can be truncated on host or by another
> guest. Its a full filesystem. And most of the time filesystem will
> try to cope with it but invariably there will be some cases where
> user space/kernel space will try to load/store to a page which got
> truncated.  And in that case, guest can do error handling much 
> better as long as we can inject error back into guest.
> 
> For process, we can send SIGBUS and for kernel access, a special
> version of memcpy_mcsafe() equivalent will be used which simply
> return -EIO to user space. 
> 
> So these optimizations make sense to me because guest can continue
> to work instead of dying when error occurs.

By "optimization" I meant avoiding an exit to host userspace.  Not killing
the guest is not an optimization, it's a full on new paravirt feature.

> > > > This patch adds another state to async page fault logic which allows
> > > > host to return error to guest. Once guest knows that async page fault
> > > > can't be resolved, it can send SIGBUS to host process (if user space
> > 
> > I assume this is supposed to be "it can send SIGBUS to guest process"?
> > Otherwise none of this makes sense (to me).
> 
> Yes, I meant SIGBUS to guest process (and not host process). Its a typo.
> 
> > 
> > > > was accessing the page in question).
> > 
> > Allowing the guest to opt-in to intercepting host page allocation failures
> > feels wrong, and fragile.  KVM can't possibly know whether an allocation
> > failure is something that should be forwarded to the guest, as KVM doesn't
> > know the physical backing for any given hva/gfn, e.g. the error could be
> > due to a physical device failure or a configuration issue.  Relying on the
> > async #PF mechanism to prevent allocation failures from crashing the guest
> > is fragile because there is no guarantee that a #PF can be async.
> > 
> 
> Actually it is it does not necessarily have to be async #PF. What we
> are missing right now, is a synchronous method to report errors back.
> And once that is available (#VE), then we can use that instead. So
> if KVM decides to do synchrous fault, we can inject error using #VE
> otherwise use interrupt based APF READY.
> 
> IOW, we are not necessarily relying on #PF being async. For now if
> page fault is sync and error happens, we will exit to qemu and
> VM will freeze/die.
> 
> > IMO, the virtio-fs truncation use case should first be addressed in a way
> > that requires explicit userspace intervention, e.g. by enhancing
> > kvm_handle_bad_page() to provide the necessary information to userspace so
> > that userspace can reflect select errors into the guest.
> 
> Which errors do you want to filter out. And this does not take away
> filtering capability. When we figure out what needs to be filtered
> out, we can do something like kvm_send_hwpoison_signal() to signal
> to user space and not exit to user space with error.

What I'm saying is that KVM cannot do the filtering.  KVM, by design, does
not know what lies behind any given hva, or what the associated gpa maps to
in the guest.  As is, userspace can't even opt out of this behavior, e.g.
it can't even "filter" on a per-VM granularity, since kvm_pv_enable_async_pf()
unconditionally allows the guest to enable the behavior[*].

Doing something similar to kvm_send_hwpoison_signal() would be fine, just
so long as userspace has full control over what to do in response to a
failure.

[*] Not technically true, as kvm_pv_enable_async_pf() explicitly prevents
setting bit 4, which is used in this RFC.  But, assuming that unrelated
code didn't exist, host userspace couldn't prevent this behavior.
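
For example, the enable path could refuse the new bit unless host
userspace has explicitly opted in (capability and field names below are
hypothetical, sketch only):

	/* In kvm_pv_enable_async_pf(): gate the guest opt-in on userspace */
	if ((data & KVM_ASYNC_PF_SEND_ERROR) &&
	    !vcpu->kvm->arch.apf_error_reporting_allowed)	/* set via a new KVM_CAP_* */
		return 1;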

> Anway, so we executed async page fault which did get_user_pages_remote()
> and populated page tables. Now we check for completion can call
> kvm_arch_async_page_ready() which will call into direct_page_fault()
> and try_async_pf() and handle_abnormal_pfn(). I think we could
> capture the error from kvm_arch_async_page_ready() also. That way
> handle_abnormal_pfn() will be able to filter it out and do out
> of band action (like poisoned pages) and page_present() will not
> inject error in guest.
> 
> > The reflection
> > could piggyback whatever vector is used by async page faults (#PF or #VE),
> > but would not be an async page fault per se.  If an async #PF happens to
> > encounter an allocation failure, it would naturally fall back to the
> > synchronous path (provided by patch 1/3) and the synchronous path would
> > automagically handle the error as above.
> 
> We could do that as well. So you are suggesting that instead of reporting
> error in async PF PAGE_READY event. Instead send PAGE_READY and upon retry
> path force sync page fault and then report error using new race free
> vector (#VE). 

Yes.  Though what I'm really saying is that the "send PAGE_READY and upon
retry path force sync page fault" should be the _only_ required behavior
from an async #PF perspective.

> We don't have that vector yet though and given the conversation it might
> not be simple to implement that. 
> 
> I had played with using MCE also to inject errors as Andy had suggested
> but this error can happen on store path too and MCEs are not generated
> on store path. So this is not suitable either.
> 
> > 
> > In other words, I think the guest should be able to enable "error handling"
> > support without first enabling async #PF.  From a functional perspective it
> > probably won't change a whole lot, but it would hopefully force us to
> > concoct an overall "paravirt page fault" design as opposed to simply async
> > #PF v2.
> 
> I am not sure how independent these two can be. Even if there was
> a separate way to enable error handling, one will have to opt in
> in async pf V2 to report errors because shared data between host
> and guest will change accordingly. So we probably will end up
> electing for error reporting at two places. One a generic method
> and then also at async PF level.
> 
> Or maybe we we can choose to never report errors using async PF
> protocol. Always choose suboptimal implementation of retrying
> the fault and inject errors *always* using that new vector( #VE).

Yes, this latter idea is the point I'm trying to make.  Async #PF could be
enhanced to support error injection, but the error injection should be
fully functional with or without async #PF being enabled.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 2/3] kvm: Add capability to be able to report async pf error to guest
  2020-06-17 23:00           ` [Virtio-fs] " Sean Christopherson
@ 2020-06-17 23:05             ` Sean Christopherson
  -1 siblings, 0 replies; 26+ messages in thread
From: Sean Christopherson @ 2020-06-17 23:05 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Vitaly Kuznetsov, virtio-fs, miklos, stefanha, dgilbert,
	pbonzini, wanpengli, kvm, linux-kernel

On Wed, Jun 17, 2020 at 04:00:52PM -0700, Sean Christopherson wrote:
> On Wed, Jun 17, 2020 at 05:51:52PM -0400, Vivek Goyal wrote:
> What I'm saying is that KVM cannot do the filtering.  KVM, by design, does
> not know what lies behind any given hva, or what the associated gpa maps to
> in the guest.  As is, userspace can't even opt out of this behavior, e.g.
> it can't even "filter" on a per-VM granularity, since kvm_pv_enable_async_pf()
> unconditionally allows the guest to enable the behavior[*].

Let me rephrase that slightly.  KVM can do the filtering, but it cannot make
the decision on what to filter.  E.g. if the use case is compatible with doing
this at a memslot level, then a memslot flag could be added to control the
behavior.
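
E.g. something like this (flag name and helpers hypothetical, sketch
only):

	#define KVM_MEM_REPORT_ERRORS	(1UL << 2)	/* hypothetical memslot flag */

	/*
	 * In the error path: only reflect the failure into the guest if
	 * the memslot opted in; otherwise exit to userspace as today.
	 */
	if (slot->flags & KVM_MEM_REPORT_ERRORS)
		report_error_to_guest(vcpu, gpa);
	else
		exit_to_userspace(vcpu);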

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 2/3] kvm: Add capability to be able to report async pf error to guest
  2020-06-17 23:05             ` [Virtio-fs] " Sean Christopherson
@ 2020-06-18 12:47               ` Vivek Goyal
  -1 siblings, 0 replies; 26+ messages in thread
From: Vivek Goyal @ 2020-06-18 12:47 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Vitaly Kuznetsov, virtio-fs, miklos, stefanha, dgilbert,
	pbonzini, wanpengli, kvm, linux-kernel

On Wed, Jun 17, 2020 at 04:05:48PM -0700, Sean Christopherson wrote:
> On Wed, Jun 17, 2020 at 04:00:52PM -0700, Sean Christopherson wrote:
> > On Wed, Jun 17, 2020 at 05:51:52PM -0400, Vivek Goyal wrote:
> > What I'm saying is that KVM cannot do the filtering.  KVM, by design, does
> > not know what lies behind any given hva, or what the associated gpa maps to
> > in the guest.  As is, userspace can't even opt out of this behavior, e.g.
> > it can't even "filter" on a per-VM granularity, since kvm_pv_enable_async_pf()
> > unconditionally allows the guest to enable the behavior[*].
> 
> Let me rephrase that slightly.  KVM can do the filtering, but it cannot make
> the decision on what to filter.  E.g. if the use case is compatible with doing
> this at a memslot level, then a memslot flag could be added to control the
> behavior.

Ok, maybe. But what is the thing you want to filter out? Just
creating a framework for filtering selective regions without any specific
use case is hard.

Right now we have one switch to enable/disable error reporting, and
it can be turned off both at the qemu level and at the guest level.

If the desire is for this to be more finely grained, I need
some examples showing cases where we don't want
to report page fault errors.

Anyway, it seems that at least the first patch is less contentious and
can be done relatively easily. That is, exit to user space if a
page fault error happens instead of getting into an infinite loop.
I will post that separately.

Thanks
Vivek


^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2020-06-18 12:47 UTC | newest]

Thread overview: 26+ messages
2020-06-16 21:48 [RFC PATCH 0/3] kvm,x86: Improve kvm page fault error handling Vivek Goyal
2020-06-16 21:48 ` [Virtio-fs] [RFC PATCH 0/3] kvm, x86: " Vivek Goyal
2020-06-16 21:48 ` [PATCH 1/3] kvm,x86: Force sync fault if previous attempts failed Vivek Goyal
2020-06-16 21:48   ` [Virtio-fs] [PATCH 1/3] kvm, x86: " Vivek Goyal
2020-06-17 13:02   ` [PATCH 1/3] kvm,x86: " Vitaly Kuznetsov
2020-06-17 13:02     ` [Virtio-fs] [PATCH 1/3] kvm, x86: " Vitaly Kuznetsov
2020-06-17 19:58     ` [PATCH 1/3] kvm,x86: " Vivek Goyal
2020-06-17 19:58       ` [Virtio-fs] [PATCH 1/3] kvm, x86: " Vivek Goyal
2020-06-16 21:48 ` [PATCH 2/3] kvm: Add capability to be able to report async pf error to guest Vivek Goyal
2020-06-16 21:48   ` [Virtio-fs] " Vivek Goyal
2020-06-17 13:12   ` Vitaly Kuznetsov
2020-06-17 13:12     ` [Virtio-fs] " Vitaly Kuznetsov
2020-06-17 18:32     ` Sean Christopherson
2020-06-17 18:32       ` [Virtio-fs] " Sean Christopherson
2020-06-17 21:51       ` Vivek Goyal
2020-06-17 21:51         ` [Virtio-fs] " Vivek Goyal
2020-06-17 23:00         ` Sean Christopherson
2020-06-17 23:00           ` [Virtio-fs] " Sean Christopherson
2020-06-17 23:05           ` Sean Christopherson
2020-06-17 23:05             ` [Virtio-fs] " Sean Christopherson
2020-06-18 12:47             ` Vivek Goyal
2020-06-18 12:47               ` [Virtio-fs] " Vivek Goyal
2020-06-17 20:33     ` Vivek Goyal
2020-06-17 20:33       ` [Virtio-fs] " Vivek Goyal
2020-06-16 21:48 ` [PATCH 3/3] kvm, async_pf: Use FOLL_WRITE only for write faults Vivek Goyal
2020-06-16 21:48   ` [Virtio-fs] " Vivek Goyal
