linux-kernel.vger.kernel.org archive mirror
* [RFC RESEND PATCH] kvm: arm64: export memory error recovery capability to user space
@ 2018-12-14 10:15 Dongjiu Geng
  2018-12-14 13:55 ` James Morse
  0 siblings, 1 reply; 14+ messages in thread
From: Dongjiu Geng @ 2018-12-14 10:15 UTC (permalink / raw)
  To: peter.maydell, rkrcmar, corbet, christoffer.dall, marc.zyngier,
	catalin.marinas, will.deacon, kvm, linux-doc, linux-kernel,
	linux-arm-kernel

When user space does memory error recovery, it first checks whether
both KVM and the guest support it; only when both do will user space
perform the recovery. This patch exports this capability of KVM to
user space.

Cc: Peter Maydell <peter.maydell@linaro.org>
Signed-off-by: Dongjiu Geng <gengdongjiu@huawei.com>
---
Peter suggested that user space needs to check this capability of KVM[1].
This patch carries an RFC tag because the user space patches are still under
review, so this kernel patch is sent out first for review.

[1]: https://patchwork.codeaurora.org/patch/652261/
---
 Documentation/virtual/kvm/api.txt | 9 +++++++++
 arch/arm64/kvm/reset.c            | 1 +
 include/uapi/linux/kvm.h          | 1 +
 3 files changed, 11 insertions(+)

diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
index cd209f7..241e2e2 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -4895,3 +4895,12 @@ Architectures: x86
 This capability indicates that KVM supports paravirtualized Hyper-V IPI send
 hypercalls:
 HvCallSendSyntheticClusterIpi, HvCallSendSyntheticClusterIpiEx.
+
+8.21 KVM_CAP_ARM_MEMORY_ERROR_RECOVERY
+
+Architectures: arm, arm64
+
+This capability indicates that guest memory errors can be detected by KVM, which
+supports error recovery. When user space (such as QEMU) does the recovery, it
+checks whether both KVM and the guest support memory error recovery; only when
+both do will user space perform the recovery.
diff --git a/arch/arm64/kvm/reset.c b/arch/arm64/kvm/reset.c
index b72a3dd..90d1d9a 100644
--- a/arch/arm64/kvm/reset.c
+++ b/arch/arm64/kvm/reset.c
@@ -82,6 +82,7 @@ int kvm_arch_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 		r = kvm_arm_support_pmu_v3();
 		break;
 	case KVM_CAP_ARM_INJECT_SERROR_ESR:
+	case KVM_CAP_ARM_MEMORY_ERROR_RECOVERY:
 		r = cpus_have_const_cap(ARM64_HAS_RAS_EXTN);
 		break;
 	case KVM_CAP_SET_GUEST_DEBUG:
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 2b7a652..3b19580 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -975,6 +975,7 @@ struct kvm_ppc_resize_hpt {
 #define KVM_CAP_HYPERV_ENLIGHTENED_VMCS 163
 #define KVM_CAP_EXCEPTION_PAYLOAD 164
 #define KVM_CAP_ARM_VM_IPA_SIZE 165
+#define KVM_CAP_ARM_MEMORY_ERROR_RECOVERY 166
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [RFC RESEND PATCH] kvm: arm64: export memory error recovery capability to user space
  2018-12-14 10:15 [RFC RESEND PATCH] kvm: arm64: export memory error recovery capability to user space Dongjiu Geng
@ 2018-12-14 13:55 ` James Morse
  2018-12-14 14:33   ` Peter Maydell
  2018-12-14 22:31   ` gengdongjiu
  0 siblings, 2 replies; 14+ messages in thread
From: James Morse @ 2018-12-14 13:55 UTC (permalink / raw)
  To: Dongjiu Geng
  Cc: peter.maydell, rkrcmar, corbet, christoffer.dall, marc.zyngier,
	catalin.marinas, will.deacon, kvm, linux-doc, linux-kernel,
	linux-arm-kernel

Hi Dongjiu Geng,

On 14/12/2018 10:15, Dongjiu Geng wrote:
> When user space does memory error recovery, it first checks whether
> both KVM and the guest support it; only when both do will user space
> perform the recovery. This patch exports this capability of KVM to
> user space.

I can understand user-space only wanting to do the work if host and guest
support the feature. But 'error recovery' isn't a KVM feature, it's a Linux
kernel feature.

KVM will send its user-space a SIGBUS with an MCEERR code whenever it tries to
map a page at stage2 and the kernel-mm code refuses because the page is
poisoned (e.g. check_user_page_hwpoison(), get_user_pages() returns -EHWPOISON).

This is exactly the same as happens to a normal user-space process.

I think you really want to know if the host kernel was built with
CONFIG_MEMORY_FAILURE. The not-at-all-portable way to tell this from user-space
is the presence of /proc/sys/vm/memory_failure_* files.
(It looks like the prctl():PR_MCE_KILL/PR_MCE_KILL_GET options silently update
an ignored policy if the kernel isn't built with CONFIG_MEMORY_FAILURE, so they
aren't helpful)
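
For example, a minimal sketch of that (non-portable) user-space probe; the
sysctl file name is real, the helper is made up here:

#include <stdbool.h>
#include <unistd.h>

static bool host_has_memory_failure(void)
{
	/* These sysctl files only exist when the host kernel was
	 * built with CONFIG_MEMORY_FAILURE. */
	return access("/proc/sys/vm/memory_failure_early_kill", F_OK) == 0;
}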


> diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
> index cd209f7..241e2e2 100644
> --- a/Documentation/virtual/kvm/api.txt
> +++ b/Documentation/virtual/kvm/api.txt
> @@ -4895,3 +4895,12 @@ Architectures: x86
>  This capability indicates that KVM supports paravirtualized Hyper-V IPI send
>  hypercalls:
>  HvCallSendSyntheticClusterIpi, HvCallSendSyntheticClusterIpiEx.
> +
> +8.21 KVM_CAP_ARM_MEMORY_ERROR_RECOVERY
> +
> +Architectures: arm, arm64
> +
> +This capability indicates that guest memory errors can be detected by KVM, which
> +supports error recovery.

KVM doesn't detect these errors.
The hardware detects them and notifies the OS via one of a number of mechanisms.
This gets plumbed into memory_failure(), which sets a flag that the mm code uses
to prevent the page being used again.

KVM is only involved when it tries to map a page at stage2 and the mm code
rejects it with -EHWPOISON. This is the same as the architecture's
do_page_fault() checking for (fault & VM_FAULT_HWPOISON) out of
handle_mm_fault(). We don't have a KVM cap for this, nor do we need one.
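
(Paraphrasing that arch fault-handling pattern as a sketch, not a verbatim
copy of any one architecture:)

	vm_fault_t fault;

	fault = handle_mm_fault(vma, addr, flags);
	if (fault & VM_FAULT_HWPOISON) {
		/* the page is poisoned: deliver SIGBUS with
		 * si_code = BUS_MCEERR_AR to the faulting task */
	}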


> diff --git a/arch/arm64/kvm/reset.c b/arch/arm64/kvm/reset.c
> index b72a3dd..90d1d9a 100644
> --- a/arch/arm64/kvm/reset.c
> +++ b/arch/arm64/kvm/reset.c
> @@ -82,6 +82,7 @@ int kvm_arch_vm_ioctl_check_extension(struct kvm *kvm, long ext)
>  		r = kvm_arm_support_pmu_v3();
>  		break;
>  	case KVM_CAP_ARM_INJECT_SERROR_ESR:
> +	case KVM_CAP_ARM_MEMORY_ERROR_RECOVERY:
>  		r = cpus_have_const_cap(ARM64_HAS_RAS_EXTN);
>  		break;

The CPU RAS Extensions are not at all relevant here. It is perfectly possible to
support memory-failure without them, AMD-Seattle and APM-X-Gene do this. These
systems would report not-supported here, but the kernel does support this stuff.
Just because the CPU supports this, doesn't mean the kernel was built with
CONFIG_MEMORY_FAILURE. The CPU reports may be ignored, or upgraded to SIGKILL.



Thanks,

James

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC RESEND PATCH] kvm: arm64: export memory error recovery capability to user space
  2018-12-14 13:55 ` James Morse
@ 2018-12-14 14:33   ` Peter Maydell
  2018-12-17 15:55     ` James Morse
  2018-12-14 22:31   ` gengdongjiu
  1 sibling, 1 reply; 14+ messages in thread
From: Peter Maydell @ 2018-12-14 14:33 UTC (permalink / raw)
  To: James Morse
  Cc: gengdongjiu, Radim Krčmář,
	Jonathan Corbet, Christoffer Dall, Marc Zyngier, Catalin Marinas,
	Will Deacon, kvm-devel, open list:DOCUMENTATION,
	lkml - Kernel Mailing List, arm-mail-list

On Fri, 14 Dec 2018 at 13:56, James Morse <james.morse@arm.com> wrote:
>
> Hi Dongjiu Geng,
>
> On 14/12/2018 10:15, Dongjiu Geng wrote:
> > When user space does memory error recovery, it first checks whether
> > both KVM and the guest support it; only when both do will user space
> > perform the recovery. This patch exports this capability of KVM to
> > user space.
>
> I can understand user-space only wanting to do the work if host and guest
> support the feature. But 'error recovery' isn't a KVM feature, it's a Linux
> kernel feature.
>
> KVM will send its user-space a SIGBUS with an MCEERR code whenever it tries to
> map a page at stage2 and the kernel-mm code refuses because the page is
> poisoned (e.g. check_user_page_hwpoison(), get_user_pages() returns -EHWPOISON).
>
> This is exactly the same as happens to a normal user-space process.
>
> I think you really want to know if the host kernel was built with
> CONFIG_MEMORY_FAILURE.

Does userspace need to care about that? Presumably if the host kernel
wasn't built with that support then it will simply never deliver
any memory failure events to QEMU, which is fine.

The point I was trying to make in the email Dongjiu references
(https://patchwork.codeaurora.org/patch/652261/) is simply that
"QEMU gets memory-failure notifications from the host kernel"
does not imply "the guest is prepared to receive memory
failure notifications", and so the code path which handles
the SIGBUS must do some kind of check for whether the guest
CPU is a type which expects them and that the board code
set up the ACPI tables that it wants to fill in.
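
Schematically, something like this in the SIGBUS code path (a sketch only;
both helper names are invented here, not existing QEMU API):

    if (code == BUS_MCEERR_AR || code == BUS_MCEERR_AO) {
        /* Only try to log CPER records if this board created the
         * APEI/GHES tables and the guest CPU was told about them. */
        if (!virt_acpi_ghes_present() || !guest_expects_memory_errors(cpu)) {
            /* fall back: plain external abort, or stop QEMU */
        }
    }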

thanks
-- PMM

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC RESEND PATCH] kvm: arm64: export memory error recovery capability to user space
  2018-12-14 13:55 ` James Morse
  2018-12-14 14:33   ` Peter Maydell
@ 2018-12-14 22:31   ` gengdongjiu
  1 sibling, 0 replies; 14+ messages in thread
From: gengdongjiu @ 2018-12-14 22:31 UTC (permalink / raw)
  To: James Morse
  Cc: Dongjiu Geng, peter.maydell, rkrcmar, corbet, christoffer.dall,
	marc.zyngier, catalin.marinas, will.deacon, kvm, linux-doc,
	linux-kernel, linux-arm-kernel

Hi James,

      Thanks for the mail and comments; I will reply to you in the next mail.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC RESEND PATCH] kvm: arm64: export memory error recovery capability to user space
  2018-12-14 14:33   ` Peter Maydell
@ 2018-12-17 15:55     ` James Morse
  0 siblings, 0 replies; 14+ messages in thread
From: James Morse @ 2018-12-17 15:55 UTC (permalink / raw)
  To: Peter Maydell
  Cc: gengdongjiu, Radim Krčmář,
	Jonathan Corbet, Christoffer Dall, Marc Zyngier, Catalin Marinas,
	Will Deacon, kvm-devel, open list:DOCUMENTATION,
	lkml - Kernel Mailing List, arm-mail-list

Hi Peter,

On 14/12/2018 14:33, Peter Maydell wrote:
> On Fri, 14 Dec 2018 at 13:56, James Morse <james.morse@arm.com> wrote:
>> On 14/12/2018 10:15, Dongjiu Geng wrote:
>>> When user space does memory error recovery, it first checks whether
>>> both KVM and the guest support it; only when both do will user space
>>> perform the recovery. This patch exports this capability of KVM to
>>> user space.
>>
>> I can understand user-space only wanting to do the work if host and guest
>> support the feature. But 'error recovery' isn't a KVM feature, it's a Linux
>> kernel feature.
>>
>> KVM will send its user-space a SIGBUS with an MCEERR code whenever it tries to
>> map a page at stage2 and the kernel-mm code refuses because the page is
>> poisoned (e.g. check_user_page_hwpoison(), get_user_pages() returns -EHWPOISON).
>>
>> This is exactly the same as happens to a normal user-space process.
>>
>> I think you really want to know if the host kernel was built with
>> CONFIG_MEMORY_FAILURE.
> 
> Does userspace need to care about that? Presumably if the host kernel
> wasn't built with that support then it will simply never deliver
> any memory failure events to QEMU, which is fine.

Aha, I thought this was what you wanted.
Always being prepared to handle the signals is the best choice.


> The point I was trying to make in the email Dongjiu references
> (https://patchwork.codeaurora.org/patch/652261/) is simply that
> "QEMU gets memory-failure notifications from the host kernel"
> does not imply "the guest is prepared to receive memory
> failure notifications", and so the code path which handles
> the SIGBUS must do some kind of check for whether the guest

> CPU is a type which expects them

I don't understand this bit.

The CPU support is just about barriers for containment and reporting a
standardised classification to software. Firmware-first replaces all this. It
doesn't depend on any CPU feature.
APM-X-Gene has firmware-first support, it uses some kind of external processor
that takes the error-interrupt from DRAM and generates CPER records, before
triggering the firmware-first notification.

> and that the board code
> set up the ACPI tables that it wants to fill in.
ACPI has some complex stuff around claiming 'platform-wide capabilities'. Qemu
could use this to know if the guest understands APEI.

Section 6.2.11.2 "Platform-Wide OSPM Capabilities" of ACPI v6.2 describes the
\_SB._OSC method, which has an APEI support bit. This is used in some kind of
handshake.

Linux does this during boot if it's built with APEI GHES support. Linux seems to
think the APEI bit enables firmware-first:
| [   63.804907] GHES: APEI firmware first mode is enabled by APEI bit.

... but it's not clear from the spec. (APEI is more than firmware-first)

(where do these things go? Platform AML in the DSDT)


I don't think this controls anything on a real system (we've seen X-Gene
generate CPER records before Linux started booting), and I don't think it really
matters, as 'what happens if the guest doesn't know' falls out of the way these
SIGBUS codes map back onto the firmware-first notifications:

For 'AO' signals you can dump CPER records in a NOTIFY_POLLed area. If the guest
doesn't care, it can avert its eyes. If you used one of the NOTIFY_$(interrupt)
types, the guest can simply not register the interrupt.

The AR signals map to external-abort. On a firmware-first system EL3 takes
these, generates some extra metadata using CPER records in the agreed location,
and re-injects an emulated external-abort.
If Qemu takes an AR signal, this is effectively an external-abort: the page has
been accessed and the kernel will not map it because the page is poisoned. This
would have been an external-abort on a real system; it's not a problem if the
guest doesn't know about the extra CPER metadata.

Centriq is an example of a system that does this external-abort+CPER-metadata
without the v8.2 CPU extensions.

All v8.0 CPUs have synchronous/asynchronous external abort; there is nothing new
going on here, it's just extra metadata (critically: the physical address of the
fault).


Thanks,

James

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC RESEND PATCH] kvm: arm64: export memory error recovery capability to user space
@ 2019-01-10 15:41 gengdongjiu
  0 siblings, 0 replies; 14+ messages in thread
From: gengdongjiu @ 2019-01-10 15:41 UTC (permalink / raw)
  To: Peter Maydell
  Cc: James Morse, Radim Krčmář,
	Jonathan Corbet, Christoffer Dall, Marc Zyngier, Catalin Marinas,
	Will Deacon, kvm-devel, open list:DOCUMENTATION,
	lkml - Kernel Mailing List, arm-mail-list

> 
> On Thu, 10 Jan 2019 at 12:09, gengdongjiu <gengdongjiu@huawei.com> wrote:
> > Peter, to summarize James's main idea: James thinks QEMU does not
> > need to check *something* if QEMU supports firmware-first.
> > What should we do about your comments?
> 
> Unless I'm missing something, the code in your most recent patchset 
> attempts to update an ACPI table when it gets the SIGBUS from the host 
> kernel without doing anything to check whether it has ever created the ACPI table (and set up the QEMU global variable that tells the code where it is in the guest memory) in the first place.

When the QEMU version is new enough, it creates the APEI table by default. But only when the guest is booted via UEFI can the CPER records be written to guest memory.
In my test I boot the guest using UEFI, so this is not a problem; I will check whether the boot used UEFI before updating the ACPI table.

> I don't see how that can work.	
> 
> > >> I think one question here which it would be good to answer is:
> > >> if we are modelling a guest and we haven't specifically provided 
> > >> it an ACPI table to tell it about memory errors, what do we do 
> > >> when we get a sigbus from the host? We have basically two choices:
> > >>  (1) send the guest an SError (aka asynchronous external abort)
> > >>      anyway (with no further info about what the memory error is)
> > >
> > > For an AR signal an external abort is valid. It's up to the
> > > implementation whether these are synchronous or asynchronous. Qemu
> > > can only take a signal for something that was synchronous, so you can choose between the two.
> > > Synchronous external abort is marginally better, as an unaware OS
> > > knows it affects this thread, and may be able to kill it.
> > > SError with an imp-def ESR is indistinguishable from 'part of the
> > > SoC fell out', and should always result in a panic().
> > >
> > >
> > >>  (2) just stop QEMU (as we would for a memory error in QEMU's
> > >>      own memory)
> > >
> > > This is also valid. A machine may take external-abort to EL3 and 
> > > then reboot/crash/burn.
> 
> We should decide which of these we want to do, and have a comment 
> explaining what we're doing. If I'm reading your current patchset correctly, it does neither -- if it can't record the fault in the ACPI table it just ignores it without either stopping QEMU or delivering an SError.

James may not know my detailed implementation in QEMU. In my patch, I only handle the BUS_MCEERR_AR SIGBUS signal (the synchronous one). When the SIGBUS code is BUS_MCEERR_AR, QEMU delivers a synchronous external abort to the guest.
James said an SError should be delivered for the BUS_MCEERR_AO SIGBUS signal (the asynchronous one), but I do not handle that case because the QEMU main thread masks this asynchronous signal.

If the memory error belongs to QEMU itself, I just print an error log[2]. If you think QEMU should stop in this case, I will change it.

void kvm_arch_on_sigbus_vcpu(CPUState *c, int code, void *addr)
{
	..................
	if (code == BUS_MCEERR_AR) {
		kvm_cpu_synchronize_state(c);
		/* Log a CPER record for the poisoned address, then inject
		 * a synchronous external abort (SEA) into the guest. */
		if (ghes_record_errors(ACPI_HEST_NOTIFY_SEA, paddr)) {
			kvm_inject_arm_sea(c);
		} else {
			fprintf(stderr, "failed to record the error\n");
		}
	}
[2]	fprintf(stderr, "Hardware memory error for memory used by "
		"QEMU itself instead of guest system!\n");
}

> 
> I think I favour option (2) here.
> 
> thanks
> -- PMM

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC RESEND PATCH] kvm: arm64: export memory error recovery capability to user space
  2019-01-10 12:09       ` gengdongjiu
@ 2019-01-10 13:25         ` Peter Maydell
  0 siblings, 0 replies; 14+ messages in thread
From: Peter Maydell @ 2019-01-10 13:25 UTC (permalink / raw)
  To: gengdongjiu
  Cc: James Morse, Radim Krčmář,
	Jonathan Corbet, Christoffer Dall, Marc Zyngier, Catalin Marinas,
	Will Deacon, kvm-devel, open list:DOCUMENTATION,
	lkml - Kernel Mailing List, arm-mail-list

On Thu, 10 Jan 2019 at 12:09, gengdongjiu <gengdongjiu@huawei.com> wrote:
> Peter, to summarize James's main idea: James thinks QEMU does not need
> to check *something* if QEMU supports firmware-first.
> What should we do about your comments?

Unless I'm missing something, the code in your most recent patchset
attempts to update an ACPI table when it gets the SIGBUS from the
host kernel without doing anything to check whether it has ever
created the ACPI table (and set up the QEMU global variable that
tells the code where it is in the guest memory) in the first place.
I don't see how that can work.

> >> I think one question here which it would be good to answer is:
> >> if we are modelling a guest and we haven't specifically provided
> >> it an ACPI table to tell it about memory errors, what do we do
> >> when we get a sigbus from the host? We have basically two choices:
> >>  (1) send the guest an SError (aka asynchronous external abort)
> >>      anyway (with no further info about what the memory error is)
> >
> > For an AR signal an external abort is valid. It's up to the implementation
> > whether these are synchronous or asynchronous. Qemu can only take a signal for
> > something that was synchronous, so you can choose between the two.
> > Synchronous external abort is marginally better, as an unaware OS knows it
> > affects this thread, and may be able to kill it.
> > SError with an imp-def ESR is indistinguishable from 'part of the SoC fell out',
> > and should always result in a panic().
> >
> >
> >>  (2) just stop QEMU (as we would for a memory error in QEMU's
> >>      own memory)
> >
> > This is also valid. A machine may take external-abort to EL3 and then
> > reboot/crash/burn.

We should decide which of these we want to do, and have a comment
explaining what we're doing. If I'm reading your current patchset
correctly, it does neither -- if it can't record the fault in the
ACPI table it just ignores it without either stopping QEMU or
delivering an SError.

I think I favour option (2) here.

thanks
-- PMM

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC RESEND PATCH] kvm: arm64: export memory error recovery capability to user space
  2018-12-21 18:17     ` James Morse
@ 2019-01-10 12:09       ` gengdongjiu
  2019-01-10 13:25         ` Peter Maydell
  0 siblings, 1 reply; 14+ messages in thread
From: gengdongjiu @ 2019-01-10 12:09 UTC (permalink / raw)
  To: James Morse, Peter Maydell
  Cc: Radim Krčmář,
	Jonathan Corbet, Christoffer Dall, Marc Zyngier, Catalin Marinas,
	Will Deacon, kvm-devel, open list:DOCUMENTATION,
	lkml - Kernel Mailing List, arm-mail-list

Hi James/Peter,
   Thanks for this discussion, and sorry for my late response due to vacation.

On 2018/12/22 2:17, James Morse wrote:
> Hi Peter,
> 
> On 19/12/2018 19:02, Peter Maydell wrote:
>> On Mon, 17 Dec 2018 at 15:56, James Morse <james.morse@arm.com> wrote:
> >>> I don't think this really matters. It's only the NMI-like notifications that the
> >>> guest doesn't have to register or poll. The ones we support today extend the
> >>> architecture's existing behaviour: you would have taken an external-abort on a
> >>> real system; whether you know about the additional metadata doesn't matter to Qemu.
>>
>> Consider the case where we booted the guest using a DTB and no ACPI
>> table at all -- we certainly can't just call QEMU code that tries to
>> add entries to a nonexistent table.
> 
> Sure, because you know which of the two sets of firmware tables you're providing.
> 
> I'm taking the behaviour of physical machines as the template for what we should
> do here. I can boot a DT-only kernel on Seattle. Firmware has no idea I did
> this, it will still take DRAM uncorrected-error IRQs in firmware, and generate
> CPER records in the POLLed areas. But the kernel will never look, because it
> booted with DT.
> What happens if the kernel goes on to access the corrupt location? It either
> gets corrupt values back, or an external abort, depending on the design of the
> memory-controller.
> 
> X-gene uses an IRQ for its firmware-first notification. Booted with DT, that
> interrupt can be asserted, but as the OS didn't know to register it, it's
> never taken. We eventually get the same corrupt-values/external-abort behaviour.
> 
> KVM/Linux is acting as the memory controller using stage2. When an error is
> detected by the host, it unmaps the page from stage2, and refuses to map it again
> until it's fixed up in Qemu's memory map (which can happen automatically). If the
> kernel can't fix it itself, the AO signal is like the DRAM-IRQ above, and the AR
> like the external abort.
> We don't have a parallel to the 'gets corrupt values back' behaviour as Linux
> will always unmap hwpoison pages from user-space/guests.
> 
> If the host-kernel wasn't built with CONFIG_MEMORY_FAILURE, it's like the memory
> controller doesn't support any of the above. I think knowing this is the closest
> to what you want.
> 
> 
>> My main point is that there
>> needs to be logic in Dongjiu's QEMU patches that checks more than
>> just "does this KVM feature exist". I'm not sufficiently familiar
>> with all this RAS stuff to be certain what those checks should
>> be and what the right choices are; I just know we need to check
>> *something*...
> 
> I think this is the crux of where we don't see this the same way.
> The v8.2 RAS stuff is new, RAS support on arm64 is not. Kernel support arrived
> at roughly the same time, but not CPU support. There are v8.0 systems that
> support RAS. There are DT systems that can do the same with edac drivers.
> The physical v8.0 systems that do this are doing it without any extra CPU support.
> 
> I think x86's behaviour here includes some history, which we don't have.
> From the order of the HEST entries, it looks like the machine-check stuff came
> first, then firmware-first using a 'GHES' entry in that table.
> I think Qemu on x86 only supports the emulated machine check stuff, so it needs
> to check KVM has the widget to do this.
> If Qemu on x86 supported firmware-first, I don't think there would be anything
> to check. (details below)

Peter, to summarize James's main idea: James thinks QEMU does not need to
check *something* if QEMU supports firmware-first. What should we do about
your comments?


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC RESEND PATCH] kvm: arm64: export memory error recovery capability to user space
  2018-12-19 19:02   ` Peter Maydell
@ 2018-12-21 18:17     ` James Morse
  2019-01-10 12:09       ` gengdongjiu
  0 siblings, 1 reply; 14+ messages in thread
From: James Morse @ 2018-12-21 18:17 UTC (permalink / raw)
  To: Peter Maydell
  Cc: gengdongjiu, Radim Krčmář,
	Jonathan Corbet, Christoffer Dall, Marc Zyngier, Catalin Marinas,
	Will Deacon, kvm-devel, open list:DOCUMENTATION,
	lkml - Kernel Mailing List, arm-mail-list

Hi Peter,

On 19/12/2018 19:02, Peter Maydell wrote:
> On Mon, 17 Dec 2018 at 15:56, James Morse <james.morse@arm.com> wrote:
>> I don't think this really matters. It's only the NMI-like notifications that the
>> guest doesn't have to register or poll. The ones we support today extend the
>> architecture's existing behaviour: you would have taken an external-abort on a
>> real system; whether you know about the additional metadata doesn't matter to Qemu.
> 
> Consider the case where we booted the guest using a DTB and no ACPI
> table at all -- we certainly can't just call QEMU code that tries to
> add entries to a nonexistent table.

Sure, because you know which of the two sets of firmware tables you're providing.

I'm taking the behaviour of physical machines as the template for what we should
do here. I can boot a DT-only kernel on Seattle. Firmware has no idea I did
this, it will still take DRAM uncorrected-error IRQs in firmware, and generate
CPER records in the POLLed areas. But the kernel will never look, because it
booted with DT.
What happens if the kernel goes on to access the corrupt location? It either
gets corrupt values back, or an external abort, depending on the design of the
memory-controller.

X-gene uses an IRQ for its firmware-first notification. Booted with DT, that
interrupt can be asserted, but as the OS didn't know to register it, it's
never taken. We eventually get the same corrupt-values/external-abort behaviour.

KVM/Linux is acting as the memory controller using stage2. When an error is
detected by the host, it unmaps the page from stage2, and refuses to map it again
until it's fixed up in Qemu's memory map (which can happen automatically). If the
kernel can't fix it itself, the AO signal is like the DRAM-IRQ above, and the AR
like the external abort.
We don't have a parallel to the 'gets corrupt values back' behaviour as Linux
will always unmap hwpoison pages from user-space/guests.

If the host-kernel wasn't built with CONFIG_MEMORY_FAILURE, it's like the memory
controller doesn't support any of the above. I think knowing this is the closest
to what you want.


> My main point is that there
> needs to be logic in Dongjiu's QEMU patches that checks more than
> just "does this KVM feature exist". I'm not sufficiently familiar
> with all this RAS stuff to be certain what those checks should
> be and what the right choices are; I just know we need to check
> *something*...

I think this is the crux of where we don't see this the same way.
The v8.2 RAS stuff is new, RAS support on arm64 is not. Kernel support arrived
at roughly the same time, but not CPU support. There are v8.0 systems that
support RAS. There are DT systems that can do the same with edac drivers.
The physical v8.0 systems that do this are doing it without any extra CPU support.

I think x86's behaviour here includes some history, which we don't have.
From the order of the HEST entries, it looks like the machine-check stuff came
first, then firmware-first using a 'GHES' entry in that table.
I think Qemu on x86 only supports the emulated machine check stuff, so it needs
to check KVM has the widget to do this.
If Qemu on x86 supported firmware-first, I don't think there would be anything
to check. (details below)


>>> Let us see the x86 QEMU logic:
>>> 1. Before the vCPU is created, QEMU sets a default env->mcg_cap value with the
>>> MCE_CAP_DEF flag; MCG_SER_P means it expects the guest CPU model to support
>>> RAS error recovery.[1]
>>> 2. When the vCPU initializes, QEMU checks whether the host kernel supports
>>> this feature.[2]
>>> 3. Only when both the host kernel and the default env->mcg_cap value expect
>>> this feature does QEMU set up the vCPU to support RAS error recovery.[3]
>>
>> This looks like KVM exposing a CPU capability to Qemu, which then configures the
>> behaviour KVM gives to the guest. This doesn't tell you anything about what the
>> guest supports.
> 
> It tells you what the *guest CPU* supports, which for x86 is a combination
> of (a) what did the user/machine model ask for and (b) what can KVM
> actually implement. I don't much care whether the guest OS supports
> anything or not, that's its business... but it does seem odd to me
> that the equivalent Arm code is not similarly saying "what were we
> asked for, and what can we do?".

The flow is something like:
For AO, generate CPER records, and notify the OS via NOTIFY_POLL (which isn't
really a notification) or some flavour of IRQ.
To do this, Qemu needs to be able to write to its reserved area of guest memory,
and possibly trigger an interrupt.

For AR, generate CPER records and notify the OS via external abort. (the
presence of the CPER records makes this NOTIFY_SEA or NOTIFY_SEI).
To do this, Qemu again needs to be able to write to guest memory, set guest
registers (KVM_SET_ONE_REG()). If it wants to inject an
SError-Interrupt/Asynchronous-external-abort while the guest has it masked, it
needs KVM_SET_VCPU_EVENTS().
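
(For that last one, the user-space side is roughly this -- a sketch using the
arm64 struct kvm_vcpu_events UAPI, with error handling elided and 'esr' left
as whatever ISS you want the guest to see:)

	struct kvm_vcpu_events events = {};

	events.exception.serror_pending = 1;
	events.exception.serror_has_esr = 1;	/* only with the RAS extensions */
	events.exception.serror_esr = esr;	/* ISS for the emulated SError */

	ioctl(vcpu_fd, KVM_SET_VCPU_EVENTS, &events);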

Nothing here depends on the CPU or kernel configuration. This is all ACPI stuff,
so its the same on x86. (The only difference is external-abort becomes NMI,
which is probably done through SET_VCPU_EVENTS())

What were we asked for? Qemu wants to know if it can write to guest memory,
guest registers (for synchronous external abort) and trigger interrupts. It has
always been able to do these things.


> I think one question here which it would be good to answer is:
> if we are modelling a guest and we haven't specifically provided
> it an ACPI table to tell it about memory errors, what do we do
> when we get a sigbus from the host? We have basically two choices:
>  (1) send the guest an SError (aka asynchronous external abort)
>      anyway (with no further info about what the memory error is)

For an AR signal an external abort is valid. It's up to the implementation
whether these are synchronous or asynchronous. Qemu can only take a signal for
something that was synchronous, so you can choose between the two.
Synchronous external abort is marginally better, as an unaware OS knows it
affects this thread, and may be able to kill it.
SError with an imp-def ESR is indistinguishable from 'part of the SoC fell out',
and should always result in a panic().


>  (2) just stop QEMU (as we would for a memory error in QEMU's
>      own memory)

This is also valid. A machine may take external-abort to EL3 and then
reboot/crash/burn.


Just in case this is the deeper issue: I keep picking on memory-errors, but what
about CPU errors?
Linux can't handle these at all, unless they are also memory errors. If we take
an imprecise abort from a guest, KVM can't tell Qemu using signals. We don't have
any mechanism to tell user-space about imprecise exceptions. In this case KVM
throws an imp-def SError back at the affected vcpu; these are allowed to be
imprecise, as this is the closest thing we have.

This does mean that any AO/AR signal Qemu gets is a memory error.


Happy New Year,

James

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC RESEND PATCH] kvm: arm64: export memory error recovery capability to user space
  2018-12-17 15:55 ` James Morse
@ 2018-12-19 19:02   ` Peter Maydell
  2018-12-21 18:17     ` James Morse
  0 siblings, 1 reply; 14+ messages in thread
From: Peter Maydell @ 2018-12-19 19:02 UTC (permalink / raw)
  To: James Morse
  Cc: gengdongjiu, Radim Krčmář,
	Jonathan Corbet, Christoffer Dall, Marc Zyngier, Catalin Marinas,
	Will Deacon, kvm-devel, open list:DOCUMENTATION,
	lkml - Kernel Mailing List, arm-mail-list

On Mon, 17 Dec 2018 at 15:56, James Morse <james.morse@arm.com> wrote:
> I think the root issue here is the name of the cpufeature 'RAS Extensions'; this
> doesn't mean RAS is new, or that RAS requires these features. It's just standardised
> records, classification and a barrier.
> Not only is it possible to build a platform that supports RAS without these
> extensions: there are at least three platforms out there that do!
>
>
> On 15/12/2018 00:12, gengdongjiu wrote:
> >> On Fri, 14 Dec 2018 at 13:56, James Morse <james.morse@arm.com> wrote:
> >>> On 14/12/2018 10:15, Dongjiu Geng wrote:
> >>>> When user space does memory error recovery, it first checks whether
> >>>> both KVM and the guest support it; only when both do will user space
> >>>> perform the recovery. This patch exports this capability of KVM to
> >>>> user space.
> >>>
> >>> I can understand user-space only wanting to do the work if host and
> >>> guest support the feature. But 'error recovery' isn't a KVM feature,
> >>> it's a Linux kernel feature.
>
> [...]
>
> > Thanks for Peter's explanation. Frankly speaking, I agree with Peter's suggestion.
> >
> > To James, let me explain more. As Peter said, QEMU needs to check whether
> > the guest CPU is a type which can handle the error through the guest ACPI table.
>
> I don't think this really matters. It's only the NMI-like notifications that the
> guest doesn't have to register or poll. The ones we support today extend the
> architecture's existing behaviour: you would have taken an external-abort on a
> real system; whether you know about the additional metadata doesn't matter to Qemu.

Consider the case where we booted the guest using a DTB and no ACPI
table at all -- we certainly can't just call QEMU code that tries to
add entries to a nonexistent table. My main point is that there
needs to be logic in Dongjiu's QEMU patches that checks more than
just "does this KVM feature exist". I'm not sufficiently familiar
with all this RAS stuff to be certain what those checks should
be and what the right choices are; I just know we need to check
*something*...

> > Let us see the x86 QEMU logic:
> > 1. Before the vCPU is created, QEMU sets a default env->mcg_cap value with the
> > MCE_CAP_DEF flag; MCG_SER_P means it expects the guest CPU model to support
> > RAS error recovery.[1]
> > 2. When the vCPU initializes, QEMU checks whether the host kernel supports
> > this feature.[2]
> > 3. Only when both the host kernel and the default env->mcg_cap value expect
> > this feature does QEMU set up the vCPU to support RAS error recovery.[3]
>
> This looks like KVM exposing a CPU capability to Qemu, which then configures the
> behaviour KVM gives to the guest. This doesn't tell you anything about what the
> guest supports.

It tells you what the *guest CPU* supports, which for x86 is a combination
of (a) what did the user/machine model ask for and (b) what can KVM
actually implement. I don't much care whether the guest OS supports
anything or not, that's its business... but it does seem odd to me
that the equivalent Arm code is not similarly saying "what were we
asked for, and what can we do?".

I think one question here which it would be good to answer is:
if we are modelling a guest and we haven't specifically provided
it an ACPI table to tell it about memory errors, what do we do
when we get a sigbus from the host? We have basically two choices:
 (1) send the guest an SError (aka asynchronous external abort)
     anyway (with no further info about what the memory error is)
 (2) just stop QEMU (as we would for a memory error in QEMU's
     own memory)

thanks
-- PMM

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC RESEND PATCH] kvm: arm64: export memory error recovery capability to user space
  2018-12-15  0:12 gengdongjiu
@ 2018-12-17 15:55 ` James Morse
  2018-12-19 19:02   ` Peter Maydell
  0 siblings, 1 reply; 14+ messages in thread
From: James Morse @ 2018-12-17 15:55 UTC (permalink / raw)
  To: gengdongjiu, Peter Maydell
  Cc: Radim Krčmář,
	Jonathan Corbet, Christoffer Dall, Marc Zyngier, Catalin Marinas,
	Will Deacon, kvm-devel, open list:DOCUMENTATION,
	lkml - Kernel Mailing List, arm-mail-list

Hi gengdongjiu, Peter,

I think the root issue here is the name of the cpufeature 'RAS Extensions'; this
doesn't mean RAS is new, or that RAS requires these features. It's just standardised
records, classification and a barrier.
Not only is it possible to build a platform that supports RAS without these
extensions: there are at least three platforms out there that do!


On 15/12/2018 00:12, gengdongjiu wrote:
>> On Fri, 14 Dec 2018 at 13:56, James Morse <james.morse@arm.com> wrote:
>>> On 14/12/2018 10:15, Dongjiu Geng wrote:
>>>> When user space does memory error recovery, it first checks whether
>>>> both KVM and the guest support it; only when both do will user space
>>>> perform the recovery. This patch exports this capability of KVM to
>>>> user space.
>>>
>>> I can understand user-space only wanting to do the work if host and
>>> guest support the feature. But 'error recovery' isn't a KVM feature,
>>> it's a Linux kernel feature.

[...]

> Thanks for Peter's explanation. Frankly speaking, I agree with Peter's suggestion.
> 
> To James, let me explain more. As Peter said, QEMU needs to check whether
> the guest CPU is a type which can handle the error through the guest ACPI table.

I don't think this really matters. It's only the NMI-like notifications that the
guest doesn't have to register or poll. The ones we support today extend the
architecture's existing behaviour: you would have taken an external-abort on a
real system; whether you know about the additional metadata doesn't matter to Qemu.


> Let us see the x86 QEMU logic:
> 1. Before the vCPU is created, QEMU sets a default env->mcg_cap value with the
> MCE_CAP_DEF flag; MCG_SER_P means it expects the guest CPU model to support
> RAS error recovery.[1]
> 2. When the vCPU initializes, QEMU checks whether the host kernel supports
> this feature.[2]
> 3. Only when both the host kernel and the default env->mcg_cap value expect
> this feature does QEMU set up the vCPU to support RAS error recovery.[3]

This looks like KVM exposing a CPU capability to Qemu, which then configures the
behaviour KVM gives to the guest. This doesn't tell you anything about what the
guest supports. This doesn't tell you if the host-kernel supports
memory_failure(). You can think of this as being equivalent to the VSESR_EL2
support. Just because the CPU has it doesn't mean the host or guest kernel have
been built to know what to do.

I test NOTIFY_SEA by injecting an address into memory_failure() using
CONFIG_HWPOISON_INJECT. This causes kvmtool to take an AR signal next time the
guest accesses the page, which then gets presented to the guest as an
external-abort, with the CPER records describing the abort created by kvmtool.
This is all on v8.0 hardware, nothing about the CPU is relevant here.
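
(If you want to reproduce this from user-space instead, madvise(MADV_HWPOISON)
also feeds a page into memory_failure() -- it needs CAP_SYS_ADMIN and a
CONFIG_MEMORY_FAILURE kernel; 'guest_page' here is whichever page of guest
memory you want to poison:)

	#include <sys/mman.h>

	/* the next guest access to this page should produce the
	 * BUS_MCEERR_AR signal described above */
	if (madvise(guest_page, page_size, MADV_HWPOISON) != 0)
		perror("madvise(MADV_HWPOISON)");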


> -------------------------------------For James's comments---------------------------------------------------------------------
>> KVM doesn't detect these errors.
>> The hardware detects them and notifies the OS via one of a number of mechanisms.
>> This gets plumbed into memory_failure(), which sets a flag that the mm 
>> code uses to prevent the page being used again.
> 
>> KVM is only involved when it tries to map a page at stage2 and the mm 
>> code rejects it with -EHWPOISON. This is the same as the architecture's
>> do_page_fault() checking for (fault & VM_FAULT_HWPOISON) out of 
>> handle_mm_fault(). We don't have a KVM cap for this, nor do we need one.
> ------------------------------------------------------------------------------------------------------------------------------
> James, regarding your above comments: I completely understand, but KVM also
> delivers the SIGBUS,

kvm_send_hwpoison_signal()? This is just making guest-accesses look like
Qemu-accesses to Linux. It's just plumbing.

You could just as easily take the signal from memory_failure()'s kill_proc() code.


> which means KVM supports guest memory RAS error recovery, so maybe
> we need to tell user space about this capability.

It was merged with ARCH_SUPPORTS_MEMORY_FAILURE. You're really asking if the
host kernel supports CONFIG_MEMORY_FAILURE, and it's plumbed in in all the right
places.

It's not practical for user-space to know this; handling the signal when it
arrives is the best thing to do.
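
(i.e. install the handler unconditionally; a minimal sketch:)

	#include <signal.h>

	static void sigbus_handler(int sig, siginfo_t *si, void *ctx)
	{
		if (si->si_code == BUS_MCEERR_AR) {
			/* action-required: a poisoned page was touched at
			 * si->si_addr; fix the mapping or inject an abort */
		} else if (si->si_code == BUS_MCEERR_AO) {
			/* action-optional: asynchronous notification */
		}
	}

	struct sigaction sa = {
		.sa_sigaction	= sigbus_handler,
		.sa_flags	= SA_SIGINFO,
	};
	sigaction(SIGBUS, &sa, NULL);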


> ---------------------------------------------- For James's comments ---------------------------------------------------
>> The CPU RAS Extensions are not at all relevant here. It is perfectly 
>> possible to support memory-failure without them, AMD-Seattle and 
>> APM-X-Gene do this. These systems would report not-supported here, but the kernel does support this stuff.
>> Just because the CPU supports this, doesn't mean the kernel was built 
>> with CONFIG_MEMORY_FAILURE. The CPU reports may be ignored, or upgraded to SIGKILL.
> --------------------------------------------------------------------------------------------------------------------------------------
> James, for your above comments: if you think we should not check
> cpus_have_const_cap(ARM64_HAS_RAS_EXTN), what do you prefer we check instead?
>
> In the x86 KVM code, a hardcoded value tells user space that the host/KVM
> supports RAS error software recovery[4]. If KVM does not check
> cpus_have_const_cap(ARM64_HAS_RAS_EXTN), we have to use a hardcoded value as
> x86 does.

There is no CPU property that means the platform has RAS support. Platforms can
support RAS for memory errors (which is all we are talking about here) without them.
The guest can't know from a CPU property that the platform supports RAS. If it
finds a HEST with GHES entries it can register interrupts and polling-timers. If
it can probe an edac driver, it can use that.


Thanks,

James

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC RESEND PATCH] kvm: arm64: export memory error recovery capability to user space
@ 2018-12-15  0:12 gengdongjiu
  2018-12-17 15:55 ` James Morse
  0 siblings, 1 reply; 14+ messages in thread
From: gengdongjiu @ 2018-12-15  0:12 UTC (permalink / raw)
  To: Peter Maydell, James Morse
  Cc: Radim Krčmář,
	Jonathan Corbet, Christoffer Dall, Marc Zyngier, Catalin Marinas,
	Will Deacon, kvm-devel, open list:DOCUMENTATION,
	lkml - Kernel Mailing List, arm-mail-list

> 
> On Fri, 14 Dec 2018 at 13:56, James Morse <james.morse@arm.com> wrote:
> >
> > Hi Dongjiu Geng,
> >
> > On 14/12/2018 10:15, Dongjiu Geng wrote:
> > > > When user space does memory error recovery, it first checks whether
> > > > both KVM and the guest support it; only when both do will user space
> > > > perform the recovery. This patch exports this capability of KVM to
> > > > user space.
> >
> > I can understand user-space only wanting to do the work if host and
> > guest support the feature. But 'error recovery' isn't a KVM feature,
> > it's a Linux kernel feature.
> >
> > KVM will send its user-space a SIGBUS with an MCEERR code whenever it
> > tries to map a page at stage2 and the kernel-mm code refuses because the
> > page is poisoned (e.g. check_user_page_hwpoison(), get_user_pages()
> > returns -EHWPOISON).
> >
> > This is exactly the same as happens to a normal user-space process.
> >
> > I think you really want to know if the host kernel was built with 
> > CONFIG_MEMORY_FAILURE.
> 
> Does userspace need to care about that? Presumably if the host kernel 
> wasn't built with that support then it will simply never deliver any memory failure events to QEMU, which is fine.
> 
> The point I was trying to make in the email Dongjiu references
> (https://patchwork.codeaurora.org/patch/652261/) is simply that "QEMU gets memory-failure notifications from the host kernel"
> does not imply "the guest is prepared to receive memory failure 
> notifications", and so the code path which handles the SIGBUS must do 
> some kind of check for whether the guest CPU is a type which expects them and that the board code set up the ACPI tables that it wants to fill in.

Thanks for Peter's explanation. Frankly speaking, I agree with Peter's suggestion.

To James, let me explain more. As Peter said, QEMU needs to check whether the guest CPU is a type which can handle the error through the guest ACPI table. Let us see the x86 QEMU logic:
1. Before the vCPU is created, QEMU sets a default env->mcg_cap value with the MCE_CAP_DEF flag; MCG_SER_P means it expects the guest CPU model to support RAS error recovery.[1]
2. When the vCPU initializes, QEMU checks whether the host kernel supports this feature.[2]
3. Only when both the host kernel and the default env->mcg_cap value expect this feature does QEMU set up the vCPU to support RAS error recovery.[3]
So I add the capability "KVM_CAP_ARM_MEMORY_ERROR_RECOVERY" to let QEMU check whether the host/KVM supports RAS error detection and recovery; only when it does will QEMU do the error recovery for the guest memory.

[1]
#define MCE_CAP_DEF     (MCG_CTL_P|MCG_SER_P)
cenv->mcg_cap = MCE_CAP_DEF | MCE_BANKS_DEF |
                        (cpu->enable_lmce ? MCG_LMCE_P : 0);

[2] ret = kvm_get_mce_cap_supported(cs->kvm_state, &mcg_cap, &banks);

[3]
env->mcg_cap &= mcg_cap | MCG_CAP_BANKS_MASK;
ret = kvm_vcpu_ioctl(cs, KVM_X86_SETUP_MCE, &env->mcg_cap);
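For the arm64 side, the user-space check would be a plain KVM_CHECK_EXTENSION
on the proposed capability -- a minimal sketch, assuming a host kernel carrying
this RFC patch (and therefore defining KVM_CAP_ARM_MEMORY_ERROR_RECOVERY):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

int main(void)
{
	int kvm = open("/dev/kvm", O_RDWR);

	if (kvm < 0)
		return 1;
	/* KVM_CHECK_EXTENSION returns > 0 when the capability is present */
	int r = ioctl(kvm, KVM_CHECK_EXTENSION, KVM_CAP_ARM_MEMORY_ERROR_RECOVERY);
	printf("memory error recovery: %s\n", r > 0 ? "supported" : "not supported");
	close(kvm);
	return 0;
}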

-------------------------------------For James's comments---------------------------------------------------------------------
> KVM doesn't detect these errors.
> The hardware detects them and notifies the OS via one of a number of mechanisms.
> This gets plumbed into memory_failure(), which sets a flag that the mm 
> code uses to prevent the page being used again.

> KVM is only involved when it tries to map a page at stage2 and the mm 
> code rejects it with -EHWPOISON. This is the same as an architecture's
> do_page_fault() checking for (fault & VM_FAULT_HWPOISON) out of 
> handle_mm_fault(). We don't have a KVM cap for this, nor do we need one.
------------------------------------------------------------------------------------------------------------------------------
James, I completely understand your above comments, but KVM also delivers the SIGBUS, which means KVM supports guest memory RAS error recovery, so maybe we need to tell user space about this capability.
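For illustration, a minimal sketch of that SIGBUS path on the user-space side.
The guest-readiness predicate is a hypothetical placeholder for the check Peter
describes (guest CPU type plus board-provided ACPI tables):

#define _GNU_SOURCE	/* for the BUS_MCEERR_* si_code values */
#include <signal.h>

/* Hypothetical placeholder: does the guest expect memory-failure
 * notifications (CPU type + board-provided ACPI/GHES tables)? */
static int guest_expects_memory_errors(void)
{
	return 0;	/* assumption: filled in by the VMM's board code */
}

static void sigbus_handler(int sig, siginfo_t *si, void *ctx)
{
	(void)sig; (void)ctx;
	if (si->si_code != BUS_MCEERR_AR && si->si_code != BUS_MCEERR_AO)
		return;		/* not a memory-failure notification */
	if (!guest_expects_memory_errors())
		return;		/* policy decision: e.g. terminate or ignore */
	/* record si->si_addr and inject the error into the guest,
	 * e.g. through the board's GHES tables */
}

int main(void)
{
	struct sigaction sa = {
		.sa_sigaction	= sigbus_handler,
		.sa_flags	= SA_SIGINFO,
	};

	sigaction(SIGBUS, &sa, NULL);
	/* ... enter the KVM_RUN loop ... */
	return 0;
}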

---------------------------------------------- For James's comments ---------------------------------------------------
> The CPU RAS Extensions are not at all relevant here. It is perfectly 
> possible to support memory-failure without them, AMD-Seattle and 
> APM-X-Gene do this. These systems would report not-supported here, but the kernel does support this stuff.
> Just because the CPU supports this, doesn't mean the kernel was built 
> with CONFIG_MEMORY_FAILURE. The CPU reports may be ignored, or upgraded to SIGKILL.
--------------------------------------------------------------------------------------------------------------------------------------
James, for your above comments, if you think we should not check "cpus_have_const_cap(ARM64_HAS_RAS_EXTN)", what would you prefer we check instead?
In the x86 KVM code, a hardcoded value tells user space that the host/KVM supports RAS error software recovery[4]. If KVM does not check "cpus_have_const_cap(ARM64_HAS_RAS_EXTN)", we would have to expose a hardcoded value as x86 does.

[4]:
u64 __read_mostly kvm_mce_cap_supported = MCG_CTL_P | MCG_SER_P;

long kvm_arch_dev_ioctl(struct file *filp,
			unsigned int ioctl, unsigned long arg)
{
	void __user *argp = (void __user *)arg;
	long r;

	switch (ioctl) {
	/* ... */
	case KVM_X86_GET_MCE_CAP_SUPPORTED: {
		r = -EFAULT;
		if (copy_to_user(argp, &kvm_mce_cap_supported,
				 sizeof(kvm_mce_cap_supported)))
			goto out;
		r = 0;
		break;
	}
	/* ... */
	}
out:
	return r;
}
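The user-space side of that x86 query is just an ioctl on the /dev/kvm fd -- a
sketch, assuming an x86 host; QEMU wraps this in kvm_get_mce_cap_supported(),
and MCG_SER_P is bit 24 of MCG_CAP:

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

#define MCG_SER_P	(1ULL << 24)	/* software error recovery supported */

int main(void)
{
	uint64_t mcg_cap = 0;
	int kvm = open("/dev/kvm", O_RDWR);

	if (kvm < 0)
		return 1;
	if (ioctl(kvm, KVM_X86_GET_MCE_CAP_SUPPORTED, &mcg_cap) == 0)
		printf("software error recovery %ssupported\n",
		       (mcg_cap & MCG_SER_P) ? "" : "not ");
	close(kvm);
	return 0;
}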

> 
> thanks
> -- PMM

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC RESEND PATCH] kvm: arm64: export memory error recovery capability to user space
@ 2018-12-15  0:06 gengdongjiu
  0 siblings, 0 replies; 14+ messages in thread
From: gengdongjiu @ 2018-12-15  0:06 UTC (permalink / raw)
  To: Peter Maydell, James Morse
  Cc: Radim Krčmář,
	Jonathan Corbet, Christoffer Dall, Marc Zyngier, Catalin Marinas,
	Will Deacon, kvm-devel, open list:DOCUMENTATION,
	lkml - Kernel Mailing List, arm-mail-list

> 
> On Fri, 14 Dec 2018 at 13:56, James Morse <james.morse@arm.com> wrote:
> >
> > Hi Dongjiu Geng,
> >
> > On 14/12/2018 10:15, Dongjiu Geng wrote:
> > > When user space do memory recovery, it will check whether KVM and
> > > guest support the error recovery, only when both of them support,
> > > user space will do the error recovery. This patch exports this
> > > capability of KVM to user space.
> >
> > I can understand user-space only wanting to do the work if host and
> > guest support the feature. But 'error recovery' isn't a KVM feature,
> > it's a Linux kernel feature.
> >
> > KVM will send its user space a SIGBUS with an MCEERR code whenever it
> > tries to map a page at stage2 and the kernel-mm code refuses because the page is poisoned.
> > (e.g. check_user_page_hwpoison(), get_user_pages() returns -EHWPOISON)
> >
> > This is exactly the same as happens to a normal user-space process.
> >
> > I think you really want to know if the host kernel was built with
> > CONFIG_MEMORY_FAILURE.
> 
> Does userspace need to care about that? Presumably if the host kernel wasn't built with that support then it will simply never deliver any
> memory failure events to QEMU, which is fine.
> 
> The point I was trying to make in the email Dongjiu references
> (https://patchwork.codeaurora.org/patch/652261/) is simply that "QEMU gets memory-failure notifications from the host kernel"
> does not imply "the guest is prepared to receive memory failure notifications", and so the code path which handles the SIGBUS must do
> some kind of check for whether the guest CPU is a type which expects them and that the board code set up the ACPI tables that it wants to
> fill in.

Thanks for Peter's explanation. Frankly speaking, I agree with Peter's suggestion.

To James: let me explain further. As Peter said, QEMU needs to check whether the guest CPU is a type which can handle the error, through the guest ACPI table. Consider the x86 QEMU logic:
1. Before the vCPU is created, QEMU sets a default env->mcg_cap value with the MCE_CAP_DEF flags; MCG_SER_P means the guest CPU model is expected to support RAS error recovery. [1]
2. When the vCPU is initialized, QEMU checks whether the host kernel supports this feature [2]. Only when both the host kernel and the default env->mcg_cap value expect this feature does QEMU set up the vCPU to support RAS error recovery. [3]
So I added this capability, KVM_CAP_ARM_MEMORY_ERROR_RECOVERY, to let QEMU check whether the host/KVM supports RAS error detection and recovery; only when it does will QEMU do the error recovery for the guest memory.

[1]
#define MCE_CAP_DEF     (MCG_CTL_P|MCG_SER_P)
cenv->mcg_cap = MCE_CAP_DEF | MCE_BANKS_DEF |
                        (cpu->enable_lmce ? MCG_LMCE_P : 0);

[2] ret = kvm_get_mce_cap_supported(cs->kvm_state, &mcg_cap, &banks);

[3]
env->mcg_cap &= mcg_cap | MCG_CAP_BANKS_MASK;
ret = kvm_vcpu_ioctl(cs, KVM_X86_SETUP_MCE, &env->mcg_cap);

-------------------------------------For James's comments---------------------------------------------------------------------
> KVM doesn't detect these errors.
> The hardware detects them and notifies the OS via one of a number of mechanisms.
> This gets plumbed into memory_failure(), which sets a flag that the mm code uses
> to prevent the page being used again.

> KVM is only involved when it tries to map a page at stage2 and the mm code
> rejects it with -EHWPOISON. This is the same as an architecture's
> do_page_fault() checking for (fault & VM_FAULT_HWPOISON) out of
> handle_mm_fault(). We don't have a KVM cap for this, nor do we need one.
------------------------------------------------------------------------------------------------------------------------------
James, I completely understand your above comments, but KVM also delivers the SIGBUS, which means KVM supports guest memory RAS error recovery, so maybe we need to tell user space about this capability.

---------------------------------------------- For James's comments ---------------------------------------------------
> The CPU RAS Extensions are not at all relevant here. It is perfectly possible to
> support memory-failure without them, AMD-Seattle and APM-X-Gene do this. These
> systems would report not-supported here, but the kernel does support this stuff.
> Just because the CPU supports this, doesn't mean the kernel was built with
> CONFIG_MEMORY_FAILURE. The CPU reports may be ignored, or upgraded to SIGKILL.
--------------------------------------------------------------------------------------------------------------------------------------
James, for your above comments, if you think we should not check "cpus_have_const_cap(ARM64_HAS_RAS_EXTN)", what would you prefer we check instead?
In the x86 KVM code, a hardcoded value tells user space that the host/KVM supports RAS error software recovery[4]. If KVM does not check "cpus_have_const_cap(ARM64_HAS_RAS_EXTN)", we would have to expose a hardcoded value as x86 does.

[4]:
u64 __read_mostly kvm_mce_cap_supported = MCG_CTL_P | MCG_SER_P;

long kvm_arch_dev_ioctl(struct file *filp,
			unsigned int ioctl, unsigned long arg)
{
	void __user *argp = (void __user *)arg;
	long r;

	switch (ioctl) {
	/* ... */
	case KVM_X86_GET_MCE_CAP_SUPPORTED: {
		r = -EFAULT;
		if (copy_to_user(argp, &kvm_mce_cap_supported,
				 sizeof(kvm_mce_cap_supported)))
			goto out;
		r = 0;
		break;
	}
	/* ... */
	}
out:
	return r;
}

> 
> thanks
> -- PMM

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2019-01-10 15:42 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-12-14 10:15 [RFC RESEND PATCH] kvm: arm64: export memory error recovery capability to user space Dongjiu Geng
2018-12-14 13:55 ` James Morse
2018-12-14 14:33   ` Peter Maydell
2018-12-17 15:55     ` James Morse
2018-12-14 22:31   ` gengdongjiu
2018-12-15  0:06 gengdongjiu
2018-12-15  0:12 gengdongjiu
2018-12-17 15:55 ` James Morse
2018-12-19 19:02   ` Peter Maydell
2018-12-21 18:17     ` James Morse
2019-01-10 12:09       ` gengdongjiu
2019-01-10 13:25         ` Peter Maydell
2019-01-10 15:30 gengdongjiu
2019-01-10 15:41 gengdongjiu
