Re: [PATCH] Don't do panic for memory Fatal UE in ghes of x86_mce platform

* Re: [PATCH] Don't do panic for memory Fatal UE in ghes of x86_mce platform
       [not found] <20201028073540.70136-1-yaoaili126@163.com>
@ 2020-10-28 19:19 ` James Morse
  0 siblings, 0 replies; 2+ messages in thread
From: James Morse @ 2020-10-28 19:19 UTC (permalink / raw)
  To: yaoaili126
  Cc: rjw, lenb, tony.luck, bp, linux-acpi, yangfeng1, CHENGUOMIN, yaoaili

Hi!

On 28/10/2020 07:35, yaoaili126@163.com wrote:
> From: Aili Yao <yaoaili@kingsoft.com>
> 
> For x86 with mce, when BIOS get its work done for memory UE,it will
> raise MCE exception, In MCE, it will do panic or recovery work there.
> But When BIOS option WHEA memory log is enabled,

This is GHES_ASSIST?

> BIOS also prepared one
> detailed error table which will be polled by ghes_notify_nmi from NMI
> watchdog,

heh, because the NMI notification chain has no idea what triggered it...

> ghes_notify_nmi will check the severity and do panic too, this
> panic action is not necessary and confusing, and may lead to unwanted
> results like core dump fail.
> 
> Downgrade CPER_SEV_FATAL to GHES_SEV_RECOVERABLE before panic is called
> for x86_mce

You can't know whether your platform (will?) also generate an MCE when you build the
kernel. Distros set all the Kconfig options, they aren't tuned to the platform.

This also makes fatal memory errors recoverable, which isn't true for other platforms.

Assuming this is GHES_ASSIST, I think the simplest approach is to skip the panic() for
CPER records found for those GHES. APEI is only providing assistance in this case, so its
unfair for it to take some terminal action. The machine-check handler should have the
final say in in this case.

Section 18.7 of ACPI v6.3a says we're supposed to:
| consume the additional GHES_ASSIST information in the context of an error reported
| by hardware
which would be the MCE - so there is no reason for these GHES entries to be poked whenever
an unrelated NMI occurs.

Fixing this would need a separate list for the machine check handler to poke to dump any
data from GHES_ASSIST CPER, so its more work than just skipping the register call.

(I've no idea how x86 prioritises MCE and NMI... does one block the other?)

Thanks,

James

> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
> index 81bf71b10d44..e5e8a53beb5a 100644
> --- a/drivers/acpi/apei/ghes.c
> +++ b/drivers/acpi/apei/ghes.c
> @@ -272,7 +272,27 @@ static inline int ghes_severity(int severity)
>  	case CPER_SEV_RECOVERABLE:
>  		return GHES_SEV_RECOVERABLE;
>  	case CPER_SEV_FATAL:
> +	{
> +#ifdef CONFIG_X86_MCE
> +		int sev, sec_sev;
> +		struct acpi_hest_generic_data *gdata;
> +		guid_t *sec_type;
> +
> +		if (estatus == NULL)
> +			return GHES_SEV_PANIC;
> +
> +		apei_estatus_for_each_section(estatus, gdata) {
> +			sec_type = (guid_t *)gdata->section_type;
> +			sec_sev = gdata->error_severity;
> +			if (sec_sev == CPER_SEV_FATAL &&
> +			 !guid_equal(sec_type, &CPER_SEC_PLATFORM_MEM))
> +				return GHES_SEV_PANIC;
> +		}
> +		return GHES_SEV_RECOVERABLE;
> +#else
>  		return GHES_SEV_PANIC;
> +#endif
> +	}

^ permalink raw reply	[flat|nested] 2+ messages in thread