linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Re: [RFC RESEND PATCH] kvm: arm64: export memory error recovery capability to user space
@ 2018-12-15  0:12 gengdongjiu
  2018-12-17 15:55 ` James Morse
  0 siblings, 1 reply; 14+ messages in thread
From: gengdongjiu @ 2018-12-15  0:12 UTC (permalink / raw)
  To: Peter Maydell, James Morse
  Cc: Radim Krčmář,
	Jonathan Corbet, Christoffer Dall, Marc Zyngier, Catalin Marinas,
	Will Deacon, kvm-devel, open list:DOCUMENTATION,
	lkml - Kernel Mailing List, arm-mail-list

> 
> On Fri, 14 Dec 2018 at 13:56, James Morse <james.morse@arm.com> wrote:
> >
> > Hi Dongjiu Geng,
> >
> > On 14/12/2018 10:15, Dongjiu Geng wrote:
> > > When user space do memory recovery, it will check whether KVM and 
> > > guest support the error recovery, only when both of them support, 
> > > user space will do the error recovery. This patch exports this 
> > > capability of KVM to user space.
> >
> > I can understand user-space only wanting to do the work if host and 
> > guest support the feature. But 'error recovery' isn't a KVM feature, 
> > its a Linux kernel feature.
> >
> > KVM will send it's user-space a SIGBUS with MCEERR code whenever its 
> > trying to map a page at stage2 that the kernel-mm code refuses this because its poisoned.
> > (e.g. check_user_page_hwpoison(), get_user_pages() returns 
> > -EHWPOISON)
> >
> > This is exactly the same as happens to a normal user-space process.
> >
> > I think you really want to know if the host kernel was built with 
> > CONFIG_MEMORY_FAILURE.
> 
> Does userspace need to care about that? Presumably if the host kernel 
> wasn't built with that support then it will simply never deliver any memory failure events to QEMU, which is fine.
> 
> The point I was trying to make in the email Dongjiu references
> (https://patchwork.codeaurora.org/patch/652261/) is simply that "QEMU gets memory-failure notifications from the host kernel"
> does not imply "the guest is prepared to receive memory failure 
> notifications", and so the code path which handles the SIGBUS must do 
> some kind of check for whether the guest CPU is a type which expects them and that the board code set up the ACPI tables that it wants to fill in.

Thanks Peter's explanation. Frankly speaking, I agree Peter's suggestion.

To James, I explain more to you, as peter said QEMU needs to check whether the guest CPU is a type which can handle the error though guest ACPI table. Let us see the X86's QEMU logic:
1. Before the vCPU created, it will set a default env->mcg_cap value with MCE_CAP_DEF flag, MCG_SER_P means it expected the guest CPU model supports RAS error recovery.[1] 2. when the vCPU initialize, it will check whether host kernel support this feature[2]. Only when host kernel and default env->mcg_cap value all expected this feature, then it will setup vCPU support RAS error recovery[3].
So I add this IOCTL "KVM_CAP_ARM_MEMORY_ERROR_RECOVERY" to Let QEMU check whether host/KVM support RAS error detection and recovery, only when this supports, QEMU will do the error recovery for the guest memory. 

[1]
#define MCE_CAP_DEF     (MCG_CTL_P|MCG_SER_P)
cenv->mcg_cap = MCE_CAP_DEF | MCE_BANKS_DEF |
                        (cpu->enable_lmce ? MCG_LMCE_P : 0);

[2] ret = kvm_get_mce_cap_supported(cs->kvm_state, &mcg_cap, &banks);

[3]
env->mcg_cap &= mcg_cap | MCG_CAP_BANKS_MASK;
ret = kvm_vcpu_ioctl(cs, KVM_X86_SETUP_MCE, &env->mcg_cap);

-------------------------------------For James's comments---------------------------------------------------------------------
> KVM doesn't detect these errors.
> The hardware detects them and notifies the OS via one of a number of mechanisms.
> This gets plumbed into memory_failure(), which sets a flag that the mm 
> code uses to prevent the page being used again.

> KVM is only involved when it tries to map a page at stage2 and the mm 
> code rejects it with -EHWPOISON. This is the same as the architectures
> do_page_fault() checking for (fault & VM_FAULT_HWPOISON) out of 
> handle_mm_fault(). We don't have a KVM cap for this, nor do we need one.
------------------------------------------------------------------------------------------------------------------------------
James, for your above comments, I completed understand, but KVM also delivered the SIGBUS, which means KVM supports guest memory RAS error recovery, so maybe we need to tell user space this capability.

---------------------------------------------- For James's comments ---------------------------------------------------
> The CPU RAS Extensions are not at all relevant here. It is perfectly 
> possible to support memory-failure without them, AMD-Seattle and 
> APM-X-Gene do this. These systems would report not-supported here, but the kernel does support this stuff.
> Just because the CPU supports this, doesn't mean the kernel was built 
> with CONFIG_MEMORY_FAILURE. The CPU reports may be ignored, or upgraded to SIGKILL.
--------------------------------------------------------------------------------------------------------------------------------------
James, for your above comments, if you think we should not check the "cpus_have_const_cap(ARM64_HAS_RAS_EXTN)", which do you prefer we should check?
In the X86 KVM code, it uses hardcode to tell use space the host/KVM support RAS error software recovery[4]. If KVM does not check the " cpus_have_const_cap(ARM64_HAS_RAS_EXTN)", we have to check the hardcode as X86's method.

[4]:
u64 __read_mostly kvm_mce_cap_supported = MCG_CTL_P | MCG_SER_P;

long kvm_arch_dev_ioctl(struct file *filp,
			unsigned int ioctl, unsigned long arg)
{
    ............................................................................
	case KVM_X86_GET_MCE_CAP_SUPPORTED: {
		r = -EFAULT;
		if (copy_to_user(argp, &kvm_mce_cap_supported,
				 sizeof(kvm_mce_cap_supported)))
			goto out;
		r = 0;
		break;
	}
    .......................................................................
}

> 
> thanks
> -- PMM

^ permalink raw reply	[flat|nested] 14+ messages in thread
* Re: [RFC RESEND PATCH] kvm: arm64: export memory error recovery capability to user space
@ 2019-01-10 15:41 gengdongjiu
  0 siblings, 0 replies; 14+ messages in thread
From: gengdongjiu @ 2019-01-10 15:41 UTC (permalink / raw)
  To: Peter Maydell
  Cc: James Morse, Radim Krčmář,
	Jonathan Corbet, Christoffer Dall, Marc Zyngier, Catalin Marinas,
	Will Deacon, kvm-devel, open list:DOCUMENTATION,
	lkml - Kernel Mailing List, arm-mail-list

> 
> On Thu, 10 Jan 2019 at 12:09, gengdongjiu <gengdongjiu@huawei.com> wrote:
> > Peter, I summarize James's main idea, James think QEMU does not 
> > needs to check *something* if Qemu support firmware-first.
> > What do we do for your comments?
> 
> Unless I'm missing something, the code in your most recent patchset 
> attempts to update an ACPI table when it gets the SIGBUS from the host 
> kernel without doing anything to check whether it has ever created the ACPI table (and set up the QEMU global variable that tells the code where it is in the guest memory) in the first place.

when QEMU version is greater than some version, it will default create the APEI table. But only when the guest is booted by UEFI, it will support to record the CPER to guest memory. 
In my test, I boot guest using UEFI, so it is no problem, I will check whether this booting uses UEFI before update the ACPI table.

> I don't see how that can work.	
> 
> > >> I think one question here which it would be good to answer is:
> > >> if we are modelling a guest and we haven't specifically provided 
> > >> it an ACPI table to tell it about memory errors, what do we do 
> > >> when we get a sigbus from the host? We have basically two choices:
> > >>  (1) send the guest an SError (aka asynchronous external abort)
> > >>      anyway (with no further info about what the memory error is)
> > >
> > > For an AR signal an external abort is valid. Its up to the 
> > > implementation whether these are synchronous or asynchronous. Qemu 
> > > can only take a signal for something that was synchronous, so you can choose between the two.
> > > Synchronous external abort is marginally better as an unaware OS 
> > > knows its affects this thread, and may be able to kill it.
> > > SError with an imp-def ESR is indistinguishable from 'part of the 
> > > soc fell out', and should always result in a panic().
> > >
> > >
> > >>  (2) just stop QEMU (as we would for a memory error in QEMU's
> > >>      own memory)
> > >
> > > This is also valid. A machine may take external-abort to EL3 and 
> > > then reboot/crash/burn.
> 
> We should decide which of these we want to do, and have a comment 
> explaining what we're doing. If I'm reading your current patchset correctly, it does neither -- if it can't record the fault in the ACPI table it just ignores it without either stopping QEMU or delivering an SError.

James may not know my detailed implementation in the QEMU. In my patch, I only handle the BUS_MCEERR_AR SIGBUS signal(synchronous signal). when the SIGBUS is BUS_MCEERR_AR, it will deliver a synchronous exception abort.
James said it needs to deliver an SError when the BUS_MCEERR_OR SIGBUS signal(asynchronous signal), but I do not handle the this case because QEMU main thread will mask this asynchronous signal.

If the memory error is belong to QEMU itself, I just print an error log[2]. If you think, it should stop QEMU for this case, I will change it.

void kvm_arch_on_sigbus_vcpu(CPUState *c, int code, void *addr) {
	..................
   	if (code == BUS_MCEERR_AR) {
    	kvm_cpu_synchronize_state(c);
       	if (ghes_record_errors(ACPI_HEST_NOTIFY_SEA, paddr)) {
     		kvm_inject_arm_sea(c);
     	} else {
       		fprintf(stderr, "failed to record the error\n");
       	}
	}
[2]	fprintf(stderr, "Hardware memory error for memory used by "
                "QEMU itself instead of guest system!\n"); 
}

> 
> I think I favour option (2) here.
> 
> thanks
> -- PMM

^ permalink raw reply	[flat|nested] 14+ messages in thread
* Re: [RFC RESEND PATCH] kvm: arm64: export memory error recovery capability to user space
@ 2019-01-10 15:30 gengdongjiu
  0 siblings, 0 replies; 14+ messages in thread
From: gengdongjiu @ 2019-01-10 15:30 UTC (permalink / raw)
  To: Peter Maydell
  Cc: James Morse, Radim Krčmář,
	Jonathan Corbet, Christoffer Dall, Marc Zyngier, Catalin Marinas,
	Will Deacon, kvm-devel, open list:DOCUMENTATION,
	lkml - Kernel Mailing List, arm-mail-list

> 
> On Thu, 10 Jan 2019 at 12:09, gengdongjiu <gengdongjiu@huawei.com> wrote:
> > Peter, I summarize James's main idea, James think QEMU does not needs
> > to check *something* if Qemu support firmware-first.
> > What do we do for your comments?
> 
> Unless I'm missing something, the code in your most recent patchset attempts to update an ACPI table when it gets the SIGBUS from the
> host kernel without doing anything to check whether it has ever created the ACPI table (and set up the QEMU global variable that tells the
> code where it is in the guest memory) in the first place.

when QEMU version is greater than some version, it will default create the APEI table. But only when the guest is booted by UEFI, it will support to record the CPER to guest memory. 
In my test, I boot guest using UEFI, so it is no problem, I will check whether this booting uses UEFI before update the ACPI table.

> I don't see how that can work.	
> 
> > >> I think one question here which it would be good to answer is:
> > >> if we are modelling a guest and we haven't specifically provided it
> > >> an ACPI table to tell it about memory errors, what do we do when we
> > >> get a sigbus from the host? We have basically two choices:
> > >>  (1) send the guest an SError (aka asynchronous external abort)
> > >>      anyway (with no further info about what the memory error is)
> > >
> > > For an AR signal an external abort is valid. Its up to the
> > > implementation whether these are synchronous or asynchronous. Qemu
> > > can only take a signal for something that was synchronous, so you can choose between the two.
> > > Synchronous external abort is marginally better as an unaware OS
> > > knows its affects this thread, and may be able to kill it.
> > > SError with an imp-def ESR is indistinguishable from 'part of the
> > > soc fell out', and should always result in a panic().
> > >
> > >
> > >>  (2) just stop QEMU (as we would for a memory error in QEMU's
> > >>      own memory)
> > >
> > > This is also valid. A machine may take external-abort to EL3 and
> > > then reboot/crash/burn.
> 
> We should decide which of these we want to do, and have a comment explaining what we're doing. If I'm reading your current patchset
> correctly, it does neither -- if it can't record the fault in the ACPI table it just ignores it without either stopping QEMU or delivering an SError.

James may not know my detailed implementation in the QEMU. In my patch, I only handle the BUS_MCEERR_AR SIGBUS signal(synchronous signal). when the SIGBUS is BUS_MCEERR_AR, it will deliver a synchronous exception abort.
James said it needs to deliver an SError when the BUS_MCEERR_OR SIGBUS signal(synchronous signal), but I do not handle the this case because QEMU main thread will mask this asynchronous signal.

If the memory error is belong to QEMU itself, I just print an error log[2]. If you think, it should stop QEMU for this case, I will change it.

void kvm_arch_on_sigbus_vcpu(CPUState *c, int code, void *addr) {
	..................
   	if (code == BUS_MCEERR_AR) {
    	kvm_cpu_synchronize_state(c);
       	if (ghes_record_errors(ACPI_HEST_NOTIFY_SEA, paddr)) {
     		kvm_inject_arm_sea(c);
     	} else {
       		fprintf(stderr, "failed to record the error\n");
       	}
	}
[2]	fprintf(stderr, "Hardware memory error for memory used by "
                "QEMU itself instead of guest system!\n");
}
> 
> I think I favour option (2) here.
> 
> thanks
> -- PMM

^ permalink raw reply	[flat|nested] 14+ messages in thread
* Re: [RFC RESEND PATCH] kvm: arm64: export memory error recovery capability to user space
@ 2018-12-15  0:06 gengdongjiu
  0 siblings, 0 replies; 14+ messages in thread
From: gengdongjiu @ 2018-12-15  0:06 UTC (permalink / raw)
  To: Peter Maydell, James Morse
  Cc: Radim Krčmář,
	Jonathan Corbet, Christoffer Dall, Marc Zyngier, Catalin Marinas,
	Will Deacon, kvm-devel, open list:DOCUMENTATION,
	lkml - Kernel Mailing List, arm-mail-list

> 
> On Fri, 14 Dec 2018 at 13:56, James Morse <james.morse@arm.com> wrote:
> >
> > Hi Dongjiu Geng,
> >
> > On 14/12/2018 10:15, Dongjiu Geng wrote:
> > > When user space do memory recovery, it will check whether KVM and
> > > guest support the error recovery, only when both of them support,
> > > user space will do the error recovery. This patch exports this
> > > capability of KVM to user space.
> >
> > I can understand user-space only wanting to do the work if host and
> > guest support the feature. But 'error recovery' isn't a KVM feature,
> > its a Linux kernel feature.
> >
> > KVM will send it's user-space a SIGBUS with MCEERR code whenever its
> > trying to map a page at stage2 that the kernel-mm code refuses this because its poisoned.
> > (e.g. check_user_page_hwpoison(), get_user_pages() returns -EHWPOISON)
> >
> > This is exactly the same as happens to a normal user-space process.
> >
> > I think you really want to know if the host kernel was built with
> > CONFIG_MEMORY_FAILURE.
> 
> Does userspace need to care about that? Presumably if the host kernel wasn't built with that support then it will simply never deliver any
> memory failure events to QEMU, which is fine.
> 
> The point I was trying to make in the email Dongjiu references
> (https://patchwork.codeaurora.org/patch/652261/) is simply that "QEMU gets memory-failure notifications from the host kernel"
> does not imply "the guest is prepared to receive memory failure notifications", and so the code path which handles the SIGBUS must do
> some kind of check for whether the guest CPU is a type which expects them and that the board code set up the ACPI tables that it wants to
> fill in.

Thanks Peter's explanation. Frankly speaking, I agree Peter's suggestion.

To James, I explain more to you, as peter said QEMU needs to check whether the guest CPU is a type which can handle the error though guest ACPI table. Let us see the X86's QEMU logic:
1. Before the vCPU created, it will set a default env->mcg_cap value with MCE_CAP_DEF flag, MCG_SER_P means it expected the guest CPU model supports RAS error recovery.[1]
2. when the vCPU initialize, it will check whether host kernel support this feature[2]. Only when host kernel and default env->mcg_cap value all expected this feature, then it will setup vCPU support RAS error recovery[3].
So I add this IOCTL "KVM_CAP_ARM_MEMORY_ERROR_RECOVERY" to Let QEMU check whether host/KVM support RAS error detection and recovery, only when this supports, QEMU will do the error recovery for the guest memory. 

[1]
#define MCE_CAP_DEF     (MCG_CTL_P|MCG_SER_P)
cenv->mcg_cap = MCE_CAP_DEF | MCE_BANKS_DEF |
                        (cpu->enable_lmce ? MCG_LMCE_P : 0);

[2] ret = kvm_get_mce_cap_supported(cs->kvm_state, &mcg_cap, &banks);

[3]
env->mcg_cap &= mcg_cap | MCG_CAP_BANKS_MASK;
ret = kvm_vcpu_ioctl(cs, KVM_X86_SETUP_MCE, &env->mcg_cap);

-------------------------------------For James's comments---------------------------------------------------------------------
> KVM doesn't detect these errors.
> The hardware detects them and notifies the OS via one of a number of mechanisms.
> This gets plumbed into memory_failure(), which sets a flag that the mm code uses
> to prevent the page being used again.

> KVM is only involved when it tries to map a page at stage2 and the mm code
> rejects it with -EHWPOISON. This is the same as the architectures
> do_page_fault() checking for (fault & VM_FAULT_HWPOISON) out of
> handle_mm_fault(). We don't have a KVM cap for this, nor do we need one.
------------------------------------------------------------------------------------------------------------------------------
James, for your above comments, I completed understand, but KVM also delivered the SIGBUS, which means KVM supports guest memory RAS error recovery, so maybe we need to tell user space this capability.

---------------------------------------------- For James's comments ---------------------------------------------------
> The CPU RAS Extensions are not at all relevant here. It is perfectly possible to
> support memory-failure without them, AMD-Seattle and APM-X-Gene do this. These
> systems would report not-supported here, but the kernel does support this stuff.
> Just because the CPU supports this, doesn't mean the kernel was built with
> CONFIG_MEMORY_FAILURE. The CPU reports may be ignored, or upgraded to SIGKILL.
--------------------------------------------------------------------------------------------------------------------------------------
James, for your above comments[4], if you think we should not check the "cpus_have_const_cap(ARM64_HAS_RAS_EXTN)", which do you prefer we should check?
In the X86 KVM code, it uses hardcode to tell use space the host/KVM support RAS error software recovery. If KVM does not check the " cpus_have_const_cap(ARM64_HAS_RAS_EXTN)", we have to check the hardcode as X86's method.

[4]:
u64 __read_mostly kvm_mce_cap_supported = MCG_CTL_P | MCG_SER_P;

long kvm_arch_dev_ioctl(struct file *filp,
			unsigned int ioctl, unsigned long arg)
{
	case KVM_X86_GET_MCE_CAP_SUPPORTED: {
		r = -EFAULT;
		if (copy_to_user(argp, &kvm_mce_cap_supported,
				 sizeof(kvm_mce_cap_supported)))
			goto out;
		r = 0;
		break;
	}
}

> 
> thanks
> -- PMM

^ permalink raw reply	[flat|nested] 14+ messages in thread
* [RFC RESEND PATCH] kvm: arm64: export memory error recovery capability to user space
@ 2018-12-14 10:15 Dongjiu Geng
  2018-12-14 13:55 ` James Morse
  0 siblings, 1 reply; 14+ messages in thread
From: Dongjiu Geng @ 2018-12-14 10:15 UTC (permalink / raw)
  To: peter.maydell, rkrcmar, corbet, christoffer.dall, marc.zyngier,
	catalin.marinas, will.deacon, kvm, linux-doc, linux-kernel,
	linux-arm-kernel

When user space do memory recovery, it will check whether KVM and
guest support the error recovery, only when both of them support,
user space will do the error recovery. This patch exports this
capability of KVM to user space.

Cc: Peter Maydell <peter.maydell@linaro.org>
Signed-off-by: Dongjiu Geng <gengdongjiu@huawei.com>
---
User space needs to check this capability of KVM is suggested by Peter[1],
this patch as RFC tag because user space patches are still under review,
so this kernel patch is firstly sent out for review.

[1]: https://patchwork.codeaurora.org/patch/652261/
---
 Documentation/virtual/kvm/api.txt | 9 +++++++++
 arch/arm64/kvm/reset.c            | 1 +
 include/uapi/linux/kvm.h          | 1 +
 3 files changed, 11 insertions(+)

diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
index cd209f7..241e2e2 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -4895,3 +4895,12 @@ Architectures: x86
 This capability indicates that KVM supports paravirtualized Hyper-V IPI send
 hypercalls:
 HvCallSendSyntheticClusterIpi, HvCallSendSyntheticClusterIpiEx.
+
+8.21 KVM_CAP_ARM_MEMORY_ERROR_RECOVERY
+
+Architectures: arm, arm64
+
+This capability indicates that guest memory error can be detected by the KVM which
+supports the error recovery. When user space do recovery, such as QEMU, it will
+check whether KVM and guest support memory error recovery, only when both of them
+support, user space will do the error recovery.
diff --git a/arch/arm64/kvm/reset.c b/arch/arm64/kvm/reset.c
index b72a3dd..90d1d9a 100644
--- a/arch/arm64/kvm/reset.c
+++ b/arch/arm64/kvm/reset.c
@@ -82,6 +82,7 @@ int kvm_arch_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 		r = kvm_arm_support_pmu_v3();
 		break;
 	case KVM_CAP_ARM_INJECT_SERROR_ESR:
+	case KVM_CAP_ARM_MEMORY_ERROR_RECOVERY:
 		r = cpus_have_const_cap(ARM64_HAS_RAS_EXTN);
 		break;
 	case KVM_CAP_SET_GUEST_DEBUG:
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 2b7a652..3b19580 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -975,6 +975,7 @@ struct kvm_ppc_resize_hpt {
 #define KVM_CAP_HYPERV_ENLIGHTENED_VMCS 163
 #define KVM_CAP_EXCEPTION_PAYLOAD 164
 #define KVM_CAP_ARM_VM_IPA_SIZE 165
+#define KVM_CAP_ARM_MEMORY_ERROR_RECOVERY 166
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2019-01-10 15:42 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-12-15  0:12 [RFC RESEND PATCH] kvm: arm64: export memory error recovery capability to user space gengdongjiu
2018-12-17 15:55 ` James Morse
2018-12-19 19:02   ` Peter Maydell
2018-12-21 18:17     ` James Morse
2019-01-10 12:09       ` gengdongjiu
2019-01-10 13:25         ` Peter Maydell
  -- strict thread matches above, loose matches on Subject: below --
2019-01-10 15:41 gengdongjiu
2019-01-10 15:30 gengdongjiu
2018-12-15  0:06 gengdongjiu
2018-12-14 10:15 Dongjiu Geng
2018-12-14 13:55 ` James Morse
2018-12-14 14:33   ` Peter Maydell
2018-12-17 15:55     ` James Morse
2018-12-14 22:31   ` gengdongjiu

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).