[PATCH] acpi: apei: call into AER handling regardless of severity

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH] acpi: apei: call into AER handling regardless of severity
@ 2017-08-28 17:11 Tyler Baicar
  2017-08-28 20:52 ` Rafael J. Wysocki
  2017-08-29  8:20 ` Borislav Petkov
  0 siblings, 2 replies; 16+ messages in thread
From: Tyler Baicar @ 2017-08-28 17:11 UTC (permalink / raw)
  To: rjw, lenb, will.deacon, james.morse, bp, prarit, punit.agrawal,
	shiju.jose, andriy.shevchenko, linux-acpi, linux-kernel
  Cc: Tyler Baicar

Currently the GHES code only calls into the AER driver for
recoverable type errors. This is incorrect because errors of
other severities do not get logged by the AER driver and do not
get exposed to user space via the AER trace event. So, call
into the AER driver for PCIe errors regardless of the severity.

Signed-off-by: Tyler Baicar <tbaicar@codeaurora.org>
---
 drivers/acpi/apei/ghes.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index d661d45..5cab238 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -489,9 +489,7 @@ static void ghes_do_proc(struct ghes *ghes,
 		else if (guid_equal(sec_type, &CPER_SEC_PCIE)) {
 			struct cper_sec_pcie *pcie_err = acpi_hest_get_payload(gdata);
 
-			if (sev == GHES_SEV_RECOVERABLE &&
-			    sec_sev == GHES_SEV_RECOVERABLE &&
-			    pcie_err->validation_bits & CPER_PCIE_VALID_DEVICE_ID &&
+			if (pcie_err->validation_bits & CPER_PCIE_VALID_DEVICE_ID &&
 			    pcie_err->validation_bits & CPER_PCIE_VALID_AER_INFO) {
 				unsigned int devfn;
 				int aer_severity;
-- 
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project.


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH] acpi: apei: call into AER handling regardless of severity
  2017-08-28 17:11 [PATCH] acpi: apei: call into AER handling regardless of severity Tyler Baicar
@ 2017-08-28 20:52 ` Rafael J. Wysocki
  2017-08-29  8:20 ` Borislav Petkov
  1 sibling, 0 replies; 16+ messages in thread
From: Rafael J. Wysocki @ 2017-08-28 20:52 UTC (permalink / raw)
  To: Tyler Baicar, bp
  Cc: lenb, will.deacon, james.morse, prarit, punit.agrawal,
	shiju.jose, andriy.shevchenko, linux-acpi, linux-kernel

On Monday, August 28, 2017 7:11:54 PM CEST Tyler Baicar wrote:
> Currently the GHES code only calls into the AER driver for
> recoverable type errors. This is incorrect because errors of
> other severities do not get logged by the AER driver and do not
> get exposed to user space via the AER trace event. So, call
> into the AER driver for PCIe errors regardless of the severity.
> 
> Signed-off-by: Tyler Baicar <tbaicar@codeaurora.org>

Boris?

> ---
>  drivers/acpi/apei/ghes.c | 4 +---
>  1 file changed, 1 insertion(+), 3 deletions(-)
> 
> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
> index d661d45..5cab238 100644
> --- a/drivers/acpi/apei/ghes.c
> +++ b/drivers/acpi/apei/ghes.c
> @@ -489,9 +489,7 @@ static void ghes_do_proc(struct ghes *ghes,
>  		else if (guid_equal(sec_type, &CPER_SEC_PCIE)) {
>  			struct cper_sec_pcie *pcie_err = acpi_hest_get_payload(gdata);
>  
> -			if (sev == GHES_SEV_RECOVERABLE &&
> -			    sec_sev == GHES_SEV_RECOVERABLE &&
> -			    pcie_err->validation_bits & CPER_PCIE_VALID_DEVICE_ID &&
> +			if (pcie_err->validation_bits & CPER_PCIE_VALID_DEVICE_ID &&
>  			    pcie_err->validation_bits & CPER_PCIE_VALID_AER_INFO) {
>  				unsigned int devfn;
>  				int aer_severity;
> 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] acpi: apei: call into AER handling regardless of severity
  2017-08-28 17:11 [PATCH] acpi: apei: call into AER handling regardless of severity Tyler Baicar
  2017-08-28 20:52 ` Rafael J. Wysocki
@ 2017-08-29  8:20 ` Borislav Petkov
  2017-08-29 21:27   ` Baicar, Tyler
  1 sibling, 1 reply; 16+ messages in thread
From: Borislav Petkov @ 2017-08-29  8:20 UTC (permalink / raw)
  To: Tyler Baicar
  Cc: rjw, lenb, will.deacon, james.morse, prarit, punit.agrawal,
	shiju.jose, andriy.shevchenko, linux-acpi, linux-kernel

On Mon, Aug 28, 2017 at 11:11:54AM -0600, Tyler Baicar wrote:
> Currently the GHES code only calls into the AER driver for
> recoverable type errors. This is incorrect because errors of
> other severities do not get logged by the AER driver and do not
> get exposed to user space via the AER trace event. So, call
> into the AER driver for PCIe errors regardless of the severity.
> 
> Signed-off-by: Tyler Baicar <tbaicar@codeaurora.org>
> ---
>  drivers/acpi/apei/ghes.c | 4 +---
>  1 file changed, 1 insertion(+), 3 deletions(-)
> 
> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
> index d661d45..5cab238 100644
> --- a/drivers/acpi/apei/ghes.c
> +++ b/drivers/acpi/apei/ghes.c
> @@ -489,9 +489,7 @@ static void ghes_do_proc(struct ghes *ghes,
>  		else if (guid_equal(sec_type, &CPER_SEC_PCIE)) {
>  			struct cper_sec_pcie *pcie_err = acpi_hest_get_payload(gdata);
>  
> -			if (sev == GHES_SEV_RECOVERABLE &&
> -			    sec_sev == GHES_SEV_RECOVERABLE &&

Did you make the effort to see which commit added those lines and read
its commit message?

Doesn't look like it...

-- 
Regards/Gruss,
    Boris.

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
-- 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] acpi: apei: call into AER handling regardless of severity
  2017-08-29  8:20 ` Borislav Petkov
@ 2017-08-29 21:27   ` Baicar, Tyler
  2017-08-29 22:19     ` Borislav Petkov
  0 siblings, 1 reply; 16+ messages in thread
From: Baicar, Tyler @ 2017-08-29 21:27 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: rjw, lenb, will.deacon, james.morse, prarit, punit.agrawal,
	shiju.jose, andriy.shevchenko, linux-acpi, linux-kernel

On 8/29/2017 2:20 AM, Borislav Petkov wrote:
> On Mon, Aug 28, 2017 at 11:11:54AM -0600, Tyler Baicar wrote:
>> Currently the GHES code only calls into the AER driver for
>> recoverable type errors. This is incorrect because errors of
>> other severities do not get logged by the AER driver and do not
>> get exposed to user space via the AER trace event. So, call
>> into the AER driver for PCIe errors regardless of the severity.
>>
>> Signed-off-by: Tyler Baicar <tbaicar@codeaurora.org>
>> ---
>>   drivers/acpi/apei/ghes.c | 4 +---
>>   1 file changed, 1 insertion(+), 3 deletions(-)
>>
>> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
>> index d661d45..5cab238 100644
>> --- a/drivers/acpi/apei/ghes.c
>> +++ b/drivers/acpi/apei/ghes.c
>> @@ -489,9 +489,7 @@ static void ghes_do_proc(struct ghes *ghes,
>>   		else if (guid_equal(sec_type, &CPER_SEC_PCIE)) {
>>   			struct cper_sec_pcie *pcie_err = acpi_hest_get_payload(gdata);
>>   
>> -			if (sev == GHES_SEV_RECOVERABLE &&
>> -			    sec_sev == GHES_SEV_RECOVERABLE &&
> Did you make the effort to see which commit added those lines and read
> its commit message?
>
> Doesn't look like it...
Hello Boris,

Here is that commit text:

"ACPI, APEI, GHES: Add PCIe AER recovery support

     aer_recover_queue() is called when recoverable PCIe AER errors are
     notified by firmware to do the recovery work."

The function with the real bulk of the code we need here is 
aer_recover_work_func() which calls into cper_print_aer() and 
do_recovery(). The do_recovery() function is the only function that 
should be specific to recoverable errors. We need cper_print_aer() to 
handle printing of AER specific information and to trigger the aer_event 
to notify user space. Otherwise tools such as RAS Daemon will not be 
notified of correctable type PCIe errors. You can clearly see by looking 
at cper_print_aer() that it expects to be called with correctable errors 
as well. To avoid calling the do_recovery() function for correctable 
errors I created https://patchwork.kernel.org/patch/9925877/

The AER core framework for non-FF systems prints all the AER error 
information for all errors and then only calls do_recovery() for 
non-correctable errors. See aer_process_err_devices() and 
handle_error_source().

Thanks,
Tyler

-- 
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] acpi: apei: call into AER handling regardless of severity
  2017-08-29 21:27   ` Baicar, Tyler
@ 2017-08-29 22:19     ` Borislav Petkov
  2017-08-29 22:34       ` Sinan Kaya
  2017-08-29 23:06         ` Luck, Tony
  0 siblings, 2 replies; 16+ messages in thread
From: Borislav Petkov @ 2017-08-29 22:19 UTC (permalink / raw)
  To: Baicar, Tyler, Tony Luck
  Cc: rjw, lenb, will.deacon, james.morse, prarit, punit.agrawal,
	shiju.jose, andriy.shevchenko, linux-acpi, linux-kernel

On Tue, Aug 29, 2017 at 03:27:42PM -0600, Baicar, Tyler wrote:
> To avoid calling the
> do_recovery() function for correctable errors I created
> https://patchwork.kernel.org/patch/9925877/

enum {
        GHES_SEV_NO = 0x0,
        GHES_SEV_CORRECTED = 0x1,
        GHES_SEV_RECOVERABLE = 0x2,
        GHES_SEV_PANIC = 0x3,
};

>From all those severity types above, you want to do recovery for
GHES_SEV_RECOVERABLE but print *all* severities. Yes? I mean, this is
what makes most sense: you want to dump all errors but try to recover
from those from which you *actually* have the possibility to do so.

Looking at the severities conversion, GHES_SEV_RECOVERABLE is
CPER_SEV_RECOVERABLE. cper_severity_to_aer() converts then
CPER_SEV_RECOVERABLE to AER_NONFATAL.

  [ Btw, this is the dumbest sh*t ever. Three different severities!!!
    Looks like someone has won a contest of how to design something as
    needlessly complex as possible. ]

So it looks to me like you want to do rather:

	if (entry.severity == AER_NONFATAL)
		do_recovery(pdev, entry.severity);

which should correspond to the GHES_SEV_RECOVERABLE. And this would be
the straight-forward thing and that would be fine but...

... that is still not 100% equivalent because the check is:

	if (sev == GHES_SEV_RECOVERABLE && sec_sev == GHES_SEV_RECOVERABLE...

so there's the severity of the estatus block and then the severity of
each section successively.

And I have no idea why we're doing this.

Because if we have to keep this, then the above simplification won't work and
you'll have to pass in a separate argument to aer_recover_queue():

	aer_recover_queue( ..., sev == GHES_SEV_RECOVERABLE &&
				sec_sev == GHES_SEV_RECOVERABLE, ...

which, if true, would mean, do recovery.

So let's find out first why do we have to look at both severities.

Tony, any ideas?

-- 
Regards/Gruss,
    Boris.

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
-- 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] acpi: apei: call into AER handling regardless of severity
  2017-08-29 22:19     ` Borislav Petkov
@ 2017-08-29 22:34       ` Sinan Kaya
  2017-08-30 10:16         ` Borislav Petkov
  2017-08-29 23:06         ` Luck, Tony
  1 sibling, 1 reply; 16+ messages in thread
From: Sinan Kaya @ 2017-08-29 22:34 UTC (permalink / raw)
  To: Borislav Petkov, Baicar, Tyler, Tony Luck
  Cc: rjw, lenb, will.deacon, james.morse, prarit, punit.agrawal,
	shiju.jose, andriy.shevchenko, linux-acpi, linux-kernel,
	Linux PCI

+linux-pci

On 8/29/2017 6:19 PM, Borislav Petkov wrote:
> if (entry.severity == AER_NONFATAL)
> 		do_recovery(pdev, entry.severity);

Providing input from PCI perspective only:

PCIe defines the the following error categories:

1. Correctable error
2. Uncorrectable non-fatal errors
3. Uncorrectable fatal errors

The do_recovery function needs to be called for both uncorrectable error
categories. (#2 and #3 above)

How these map to GHES error categories is out of know-how.

Below are references to some of the PCIe error docs:

https://www.kernel.org/doc/Documentation/PCI/pcieaer-howto.txt
https://www.kernel.org/doc/Documentation/PCI/pci-error-recovery.txt

-- 
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* RE: [PATCH] acpi: apei: call into AER handling regardless of severity
  2017-08-29 22:19     ` Borislav Petkov
@ 2017-08-29 23:06         ` Luck, Tony
  2017-08-29 23:06         ` Luck, Tony
  1 sibling, 0 replies; 16+ messages in thread
From: Luck, Tony @ 2017-08-29 23:06 UTC (permalink / raw)
  To: Borislav Petkov, Baicar, Tyler
  Cc: rjw, lenb, will.deacon, james.morse, prarit, punit.agrawal,
	shiju.jose, andriy.shevchenko, linux-acpi, linux-kernel

> So let's find out first why do we have to look at both severities.
>
> Tony, any ideas?

Sorry, no ideas. I haven't tried to dig through this mess to take action ... just
skimmed through the code a while back when we were just logging the
errors.

-Tony
 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* RE: [PATCH] acpi: apei: call into AER handling regardless of severity
@ 2017-08-29 23:06         ` Luck, Tony
  0 siblings, 0 replies; 16+ messages in thread
From: Luck, Tony @ 2017-08-29 23:06 UTC (permalink / raw)
  To: Borislav Petkov, Baicar, Tyler
  Cc: rjw, lenb, will.deacon, james.morse, prarit, punit.agrawal,
	shiju.jose, andriy.shevchenko, linux-acpi, linux-kernel

> So let's find out first why do we have to look at both severities.
>
> Tony, any ideas?

Sorry, no ideas. I haven't tried to dig through this mess to take action ... just
skimmed through the code a while back when we were just logging the
errors.

-Tony
 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] acpi: apei: call into AER handling regardless of severity
  2017-08-29 22:34       ` Sinan Kaya
@ 2017-08-30 10:16         ` Borislav Petkov
  2017-08-30 14:05           ` Sinan Kaya
  0 siblings, 1 reply; 16+ messages in thread
From: Borislav Petkov @ 2017-08-30 10:16 UTC (permalink / raw)
  To: Sinan Kaya
  Cc: Baicar, Tyler, Tony Luck, rjw, lenb, will.deacon, james.morse,
	prarit, punit.agrawal, shiju.jose, andriy.shevchenko, linux-acpi,
	linux-kernel, Linux PCI, Huang Ying

On Tue, Aug 29, 2017 at 06:34:49PM -0400, Sinan Kaya wrote:
> The do_recovery function needs to be called for both uncorrectable error
> categories. (#2 and #3 above)

Care to share why exactly that needs to happen?

Because I'm reading this in pcieaer-howto.txt:

"If an error message indicates a non-fatal error, performing link reset
at upstream is not required."

and

"If an error message indicates a fatal error, kernel will broadcast
error_detected(dev, pci_channel_io_frozen) to all drivers within a
hierarchy in question. Then, performing link reset at upstream is
necessary."

Now, pci-error-recovery.txt has link reset as step 3 so I'm assuming
recovery means link reset. And thus, non-fatal AER errors are not
required to do recovery but fatal are.

> How these map to GHES error categories is out of know-how.

        case CPER_SEV_INFORMATIONAL:
                return GHES_SEV_NO;
        case CPER_SEV_CORRECTED:
                return GHES_SEV_CORRECTED;
        case CPER_SEV_RECOVERABLE:
                return GHES_SEV_RECOVERABLE;
        case CPER_SEV_FATAL:
                return GHES_SEV_PANIC;

and

        case CPER_SEV_RECOVERABLE:
                return AER_NONFATAL;
        case CPER_SEV_FATAL:
                return AER_FATAL;
        default:
                return AER_CORRECTABLE;

So I see GHES_SEV_RECOVERABLE -> CPER_SEV_RECOVERABLE -> AER_NONFATAL.

Which means, we've never done error recovery for AER_FATAL errors. Which
we should've been doing in the first place! Unless...

... Error recovery for those fatal errors has been happening down the
other, PCI path:

aer_isr->aer_isr_one_error->...->do_recovery()

Which then makes me look at this contraption in the ghes code:

config ACPI_APEI_PCIEAER
        bool "APEI PCIe AER logging/recovering support"
        depends on ACPI_APEI && PCIEAER
        help
          PCIe AER errors may be reported via APEI firmware first mode.
          Turn on this option to enable the corresponding support.

So this says "may be" reported.

Now the question is, what kind of errors are being reported through here
and what exactly are we expected to do about them? Print them? Or do
more?

Hmmm.

-- 
Regards/Gruss,
    Boris.

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
-- 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] acpi: apei: call into AER handling regardless of severity
  2017-08-30 10:16         ` Borislav Petkov
@ 2017-08-30 14:05           ` Sinan Kaya
  2017-08-30 15:16             ` Borislav Petkov
  0 siblings, 1 reply; 16+ messages in thread
From: Sinan Kaya @ 2017-08-30 14:05 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Baicar, Tyler, Tony Luck, rjw, lenb, will.deacon, james.morse,
	prarit, punit.agrawal, shiju.jose, andriy.shevchenko, linux-acpi,
	linux-kernel, Linux PCI, Huang Ying

On 8/30/2017 6:16 AM, Borislav Petkov wrote:
> On Tue, Aug 29, 2017 at 06:34:49PM -0400, Sinan Kaya wrote:
>> The do_recovery function needs to be called for both uncorrectable error
>> categories. (#2 and #3 above)
> 
> Care to share why exactly that needs to happen?

Let me try below:

> 
> Because I'm reading this in pcieaer-howto.txt:
> 
> "If an error message indicates a non-fatal error, performing link reset
> at upstream is not required."
> 

Link reset is not the only recovery mechanism. In the case
of nonfatal errors, it is assumed that the endpoint CSR is still reachable. 
Error is propagated the PCIe endpoint driver. Endpoint driver does a
re-initialization, we are back in business.

> and
> 
> "If an error message indicates a fatal error, kernel will broadcast
> error_detected(dev, pci_channel_io_frozen) to all drivers within a
> hierarchy in question. Then, performing link reset at upstream is
> necessary."
> 
> Now, pci-error-recovery.txt has link reset as step 3 so I'm assuming
> recovery means link reset. And thus, non-fatal AER errors are not
> required to do recovery but fatal are.

Nope, link reset is the second approach. Endpoint recovery won't happen
if the endpoint CSRs are unreachable. We have to recover the link by
doing a link reset in this case. 

Therefore, different behavior in do_recovery function based on the
severity of the error.

> 
>> How these map to GHES error categories is out of know-how.
> 
>         case CPER_SEV_INFORMATIONAL:
>                 return GHES_SEV_NO;
>         case CPER_SEV_CORRECTED:
>                 return GHES_SEV_CORRECTED;
>         case CPER_SEV_RECOVERABLE:
>                 return GHES_SEV_RECOVERABLE;
>         case CPER_SEV_FATAL:
>                 return GHES_SEV_PANIC;
> 
> and
> 
>         case CPER_SEV_RECOVERABLE:
>                 return AER_NONFATAL;
>         case CPER_SEV_FATAL:
>                 return AER_FATAL;
>         default:
>                 return AER_CORRECTABLE;
> 

Thanks for the mapping.

> So I see GHES_SEV_RECOVERABLE -> CPER_SEV_RECOVERABLE -> AER_NONFATAL.
> 
> Which means, we've never done error recovery for AER_FATAL errors. Which
> we should've been doing in the first place! Unless...
That's not true. The GHES code is changing the severity here before posting
to the AER driver in ghes_do_proc().

	if (gdata->flags & CPER_SEC_RESET)
		aer_severity = AER_FATAL;

A PCIe error could be AER_FATAL but it is recoverable. It does not need to
crash the system.

Here is another mapping below based on my understanding.

GHES_SEV_RECOVERABLE -> CPER_SEV_RECOVERABLE -> CPER_SEC_RESET-> AER_FATAL

> 
> ... Error recovery for those fatal errors has been happening down the
> other, PCI path:
> 
> aer_isr->aer_isr_one_error->...->do_recovery()

No, AER ISR is not set up if firmware first is enabled.

> 
> Which then makes me look at this contraption in the ghes code:
> 
> config ACPI_APEI_PCIEAER
>         bool "APEI PCIe AER logging/recovering support"
>         depends on ACPI_APEI && PCIEAER
>         help
>           PCIe AER errors may be reported via APEI firmware first mode.
>           Turn on this option to enable the corresponding support.
> 
> So this says "may be" reported.
> 
> Now the question is, what kind of errors are being reported through here
> and what exactly are we expected to do about them? Print them? Or do
> more?

The behavior should match non firmware-first case ideally.

1. Print all correctable errors.
2. Go to do_recovery for all uncorrectable errors including fatal and
non-fatal. 

This is also what AER driver does in the absence of firmware first via
handle_error_source().

> 
> Hmmm.
> 


-- 
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] acpi: apei: call into AER handling regardless of severity
  2017-08-30 14:05           ` Sinan Kaya
@ 2017-08-30 15:16             ` Borislav Petkov
  2017-08-30 15:31               ` Sinan Kaya
  0 siblings, 1 reply; 16+ messages in thread
From: Borislav Petkov @ 2017-08-30 15:16 UTC (permalink / raw)
  To: Sinan Kaya
  Cc: Baicar, Tyler, Tony Luck, rjw, lenb, will.deacon, james.morse,
	prarit, punit.agrawal, shiju.jose, andriy.shevchenko, linux-acpi,
	linux-kernel, Linux PCI, Huang Ying

On Wed, Aug 30, 2017 at 10:05:44AM -0400, Sinan Kaya wrote:
> Link reset is not the only recovery mechanism. In the case of nonfatal
> errors, it is assumed that the endpoint CSR is still reachable.
> Error is propagated the PCIe endpoint driver. Endpoint driver does a
> re-initialization, we are back in business.

I'm assuming that's broadcast_error_message()'s job.

> That's not true. The GHES code is changing the severity here before posting
> to the AER driver in ghes_do_proc().
> 
> 	if (gdata->flags & CPER_SEC_RESET)
> 		aer_severity = AER_FATAL;

You're missing the point that we would walk into that if branch *only* for

                        if (sev == GHES_SEV_RECOVERABLE &&
                            sec_sev == GHES_SEV_RECOVERABLE

severities. So if you have an AER_FATAL error but ghes severities are
not GHES_SEV_RECOVERABLE, nothing happens.

> No, AER ISR is not set up if firmware first is enabled.

So then this is a major suckage. We do AER recovery on FF systems only
for GHES_SEV_RECOVERABLE severity.

> The behavior should match non firmware-first case ideally.
> 
> 1. Print all correctable errors.
> 2. Go to do_recovery for all uncorrectable errors including fatal and
> non-fatal. 
> 
> This is also what AER driver does in the absence of firmware first via
> handle_error_source().

Yes, that makes sense.

Which would mean that we'd call aer_recover_queue() regardless of GHES
severity but we'd do recovery only if GHES_SEV_RECOVERABLE is set
or CPER_SEC_RESET. I.e., we can communicate all that by setting the
correct AER severity before calling aer_recover_queue(). And then call
do_recovery() based on AER severity.

Hmmm?

-- 
Regards/Gruss,
    Boris.

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
-- 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] acpi: apei: call into AER handling regardless of severity
  2017-08-30 15:16             ` Borislav Petkov
@ 2017-08-30 15:31               ` Sinan Kaya
  2017-08-30 15:42                 ` Baicar, Tyler
  2017-08-30 17:02                 ` Borislav Petkov
  0 siblings, 2 replies; 16+ messages in thread
From: Sinan Kaya @ 2017-08-30 15:31 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Baicar, Tyler, Tony Luck, rjw, lenb, will.deacon, james.morse,
	prarit, punit.agrawal, shiju.jose, andriy.shevchenko, linux-acpi,
	linux-kernel, Linux PCI, Huang Ying

On 8/30/2017 11:16 AM, Borislav Petkov wrote:
> On Wed, Aug 30, 2017 at 10:05:44AM -0400, Sinan Kaya wrote:
>> Link reset is not the only recovery mechanism. In the case of nonfatal
>> errors, it is assumed that the endpoint CSR is still reachable.
>> Error is propagated the PCIe endpoint driver. Endpoint driver does a
>> re-initialization, we are back in business.
> 
> I'm assuming that's broadcast_error_message()'s job.
> 

That's right. Each driver provides an err_handler hook. broadcast function
calls these. 

static struct pci_driver e1000_driver = {
	..
	.err_handler = &e1000_err_handler
};

struct pci_error_handlers {
	...
	pci_ers_result_t (*error_detected)(struct pci_dev *dev,
					   enum pci_channel_state error);
}


>> That's not true. The GHES code is changing the severity here before posting
>> to the AER driver in ghes_do_proc().
>>
>> 	if (gdata->flags & CPER_SEC_RESET)
>> 		aer_severity = AER_FATAL;
> 
> You're missing the point that we would walk into that if branch *only* for
> 
>                         if (sev == GHES_SEV_RECOVERABLE &&
>                             sec_sev == GHES_SEV_RECOVERABLE
> 
> severities. So if you have an AER_FATAL error but ghes severities are
> not GHES_SEV_RECOVERABLE, nothing happens.

I see. We should probably try to do something only if GHES_SEV_CORRECTED or
GHES_SEV_RECOVERABLE.

If somebody wants to crash the system with GHES_SEV_PANIC, there is no point
in doing additional work.

> 
>> No, AER ISR is not set up if firmware first is enabled.
> 
> So then this is a major suckage. We do AER recovery on FF systems only
> for GHES_SEV_RECOVERABLE severity.
> 
>> The behavior should match non firmware-first case ideally.
>>
>> 1. Print all correctable errors.
>> 2. Go to do_recovery for all uncorrectable errors including fatal and
>> non-fatal. 
>>
>> This is also what AER driver does in the absence of firmware first via
>> handle_error_source().
> 
> Yes, that makes sense.
> 
> Which would mean that we'd call aer_recover_queue() regardless of GHES
> severity but we'd do recovery only if GHES_SEV_RECOVERABLE is set
> or CPER_SEC_RESET. I.e., we can communicate all that by setting the
> correct AER severity before calling aer_recover_queue(). And then call
> do_recovery() based on AER severity.
> 
> Hmmm?
> 

Sounds good. Do you still want to do PCIe recovery in the case of
GHES_SEV_PANIC or if some FW returns GHES_SEV_NO?

-- 
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] acpi: apei: call into AER handling regardless of severity
  2017-08-30 15:31               ` Sinan Kaya
@ 2017-08-30 15:42                 ` Baicar, Tyler
  2017-08-30 17:14                   ` Borislav Petkov
  2017-08-30 17:02                 ` Borislav Petkov
  1 sibling, 1 reply; 16+ messages in thread
From: Baicar, Tyler @ 2017-08-30 15:42 UTC (permalink / raw)
  To: Sinan Kaya, Borislav Petkov
  Cc: Tony Luck, rjw, lenb, will.deacon, james.morse, prarit,
	punit.agrawal, shiju.jose, andriy.shevchenko, linux-acpi,
	linux-kernel, Linux PCI, Huang Ying

On 8/30/2017 9:31 AM, Sinan Kaya wrote:
> On 8/30/2017 11:16 AM, Borislav Petkov wrote:
>> On Wed, Aug 30, 2017 at 10:05:44AM -0400, Sinan Kaya wrote:
>>> Link reset is not the only recovery mechanism. In the case of nonfatal
>>> errors, it is assumed that the endpoint CSR is still reachable.
>>> Error is propagated the PCIe endpoint driver. Endpoint driver does a
>>> re-initialization, we are back in business.
>> I'm assuming that's broadcast_error_message()'s job.
>>
> That's right. Each driver provides an err_handler hook. broadcast function
> calls these.
>
> static struct pci_driver e1000_driver = {
> 	..
> 	.err_handler = &e1000_err_handler
> };
>
> struct pci_error_handlers {
> 	...
> 	pci_ers_result_t (*error_detected)(struct pci_dev *dev,
> 					   enum pci_channel_state error);
> }
>
>
>>> That's not true. The GHES code is changing the severity here before posting
>>> to the AER driver in ghes_do_proc().
>>>
>>> 	if (gdata->flags & CPER_SEC_RESET)
>>> 		aer_severity = AER_FATAL;
>> You're missing the point that we would walk into that if branch *only* for
>>
>>                          if (sev == GHES_SEV_RECOVERABLE &&
>>                              sec_sev == GHES_SEV_RECOVERABLE
>>
>> severities. So if you have an AER_FATAL error but ghes severities are
>> not GHES_SEV_RECOVERABLE, nothing happens.
> I see. We should probably try to do something only if GHES_SEV_CORRECTED or
> GHES_SEV_RECOVERABLE.
>
> If somebody wants to crash the system with GHES_SEV_PANIC, there is no point
> in doing additional work.
See below.
>>> No, AER ISR is not set up if firmware first is enabled.
>> So then this is a major suckage. We do AER recovery on FF systems only
>> for GHES_SEV_RECOVERABLE severity.
>>
>>> The behavior should match non firmware-first case ideally.
>>>
>>> 1. Print all correctable errors.
>>> 2. Go to do_recovery for all uncorrectable errors including fatal and
>>> non-fatal.
>>>
>>> This is also what AER driver does in the absence of firmware first via
>>> handle_error_source().
>> Yes, that makes sense.
>>
>> Which would mean that we'd call aer_recover_queue() regardless of GHES
>> severity but we'd do recovery only if GHES_SEV_RECOVERABLE is set
>> or CPER_SEC_RESET. I.e., we can communicate all that by setting the
>> correct AER severity before calling aer_recover_queue(). And then call
>> do_recovery() based on AER severity.
>>
>> Hmmm?
>>
> Sounds good. Do you still want to do PCIe recovery in the case of
> GHES_SEV_PANIC or if some FW returns GHES_SEV_NO?
>
We do not need to worry about the GHES_SEV_PANIC case. Those get sent to 
__ghes_panic() in ghes_proc() without even making it to ghes_do_proc(). 
Those errors are just printed and then the kernel panics.

I think with my two patches we will have the desired functionality:

GHES_SEV_CORRECTABLE -> AER_CORRECTABLE -> Print AER info, but do not 
call do_recovery

GHES_SEV_RECOVERABLE -> AER_NONFATAL -> Print AER info and do_recovery

GHES_RECOVERABLE and CPER_SEC_RESET -> AER_FATAL -> Print AER info and 
do_recover

Thanks,
Tyler

-- 
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] acpi: apei: call into AER handling regardless of severity
  2017-08-30 15:31               ` Sinan Kaya
  2017-08-30 15:42                 ` Baicar, Tyler
@ 2017-08-30 17:02                 ` Borislav Petkov
  1 sibling, 0 replies; 16+ messages in thread
From: Borislav Petkov @ 2017-08-30 17:02 UTC (permalink / raw)
  To: Sinan Kaya
  Cc: Baicar, Tyler, Tony Luck, rjw, lenb, will.deacon, james.morse,
	prarit, punit.agrawal, shiju.jose, andriy.shevchenko, linux-acpi,
	linux-kernel, Linux PCI, Huang Ying

On Wed, Aug 30, 2017 at 11:31:06AM -0400, Sinan Kaya wrote:
> I see. We should probably try to do something only if GHES_SEV_CORRECTED or
> GHES_SEV_RECOVERABLE.
> 
> If somebody wants to crash the system with GHES_SEV_PANIC, there is no point
> in doing additional work.

Makes sense.

Whatever we do, I'd like to have this all nicely documented *why* we're
doing the recovery policy we're doing.

> Sounds good. Do you still want to do PCIe recovery in the case of
> GHES_SEV_PANIC or if some FW returns GHES_SEV_NO?

So I read GHES_SEV_PANIC as: we should panic and stop any processing
whatsoever ASAP in order to avoid further error propagation. So doing
recovery there might *actually* be a bad idea.

GHES_SEV_NO would map to AER_CORRECTABLE and I think that would mean,
print the error to let the user know but no need to recover because no
harm was done.

I *think*.

-- 
Regards/Gruss,
    Boris.

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
-- 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] acpi: apei: call into AER handling regardless of severity
  2017-08-30 15:42                 ` Baicar, Tyler
@ 2017-08-30 17:14                   ` Borislav Petkov
  2017-08-30 18:09                     ` Baicar, Tyler
  0 siblings, 1 reply; 16+ messages in thread
From: Borislav Petkov @ 2017-08-30 17:14 UTC (permalink / raw)
  To: Baicar, Tyler
  Cc: Sinan Kaya, Tony Luck, rjw, lenb, will.deacon, james.morse,
	prarit, punit.agrawal, shiju.jose, andriy.shevchenko, linux-acpi,
	linux-kernel, Linux PCI, Huang Ying

On Wed, Aug 30, 2017 at 09:42:08AM -0600, Baicar, Tyler wrote:
> I think with my two patches we will have the desired functionality:
> 
> GHES_SEV_CORRECTABLE -> AER_CORRECTABLE -> Print AER info, but do not call
> do_recovery
> 
> GHES_SEV_RECOVERABLE -> AER_NONFATAL -> Print AER info and do_recovery
> 
> GHES_RECOVERABLE and CPER_SEC_RESET -> AER_FATAL -> Print AER info and
> do_recover

Right, so I'd like to you create a separate function ghes_do_proc_aer()
or ghes_handle_aer() or so and carve out all the code inside #ifdef
CONFIG_ACPI_APEI_PCIEAER into it, add your two changes to the patch and
slap a big fat comment above the new function explaining *why* we're
doing what we're doing and how we're mapping all the severities to AER
severity in order to do recovery and/or only to print the error.

So that it is known and people can see in the future why we're doing
this and what the logic has been and what kind of policy we're chasing
and so on and so on...

Ok?

Thanks.

-- 
Regards/Gruss,
    Boris.

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
-- 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] acpi: apei: call into AER handling regardless of severity
  2017-08-30 17:14                   ` Borislav Petkov
@ 2017-08-30 18:09                     ` Baicar, Tyler
  0 siblings, 0 replies; 16+ messages in thread
From: Baicar, Tyler @ 2017-08-30 18:09 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Sinan Kaya, Tony Luck, rjw, lenb, will.deacon, james.morse,
	prarit, punit.agrawal, shiju.jose, andriy.shevchenko, linux-acpi,
	linux-kernel, Linux PCI, Huang Ying

On 8/30/2017 11:14 AM, Borislav Petkov wrote:
> On Wed, Aug 30, 2017 at 09:42:08AM -0600, Baicar, Tyler wrote:
>> I think with my two patches we will have the desired functionality:
>>
>> GHES_SEV_CORRECTABLE -> AER_CORRECTABLE -> Print AER info, but do not call
>> do_recovery
>>
>> GHES_SEV_RECOVERABLE -> AER_NONFATAL -> Print AER info and do_recovery
>>
>> GHES_RECOVERABLE and CPER_SEC_RESET -> AER_FATAL -> Print AER info and
>> do_recover
> Right, so I'd like to you create a separate function ghes_do_proc_aer()
> or ghes_handle_aer() or so and carve out all the code inside #ifdef
> CONFIG_ACPI_APEI_PCIEAER into it, add your two changes to the patch and
> slap a big fat comment above the new function explaining *why* we're
> doing what we're doing and how we're mapping all the severities to AER
> severity in order to do recovery and/or only to print the error.
>
> So that it is known and people can see in the future why we're doing
> this and what the logic has been and what kind of policy we're chasing
> and so on and so on...
>
> Ok?
>
> Thanks.
Yes, I can do that.

Thanks,
Tyler

-- 
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project.

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2017-08-30 18:09 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-08-28 17:11 [PATCH] acpi: apei: call into AER handling regardless of severity Tyler Baicar
2017-08-28 20:52 ` Rafael J. Wysocki
2017-08-29  8:20 ` Borislav Petkov
2017-08-29 21:27   ` Baicar, Tyler
2017-08-29 22:19     ` Borislav Petkov
2017-08-29 22:34       ` Sinan Kaya
2017-08-30 10:16         ` Borislav Petkov
2017-08-30 14:05           ` Sinan Kaya
2017-08-30 15:16             ` Borislav Petkov
2017-08-30 15:31               ` Sinan Kaya
2017-08-30 15:42                 ` Baicar, Tyler
2017-08-30 17:14                   ` Borislav Petkov
2017-08-30 18:09                     ` Baicar, Tyler
2017-08-30 17:02                 ` Borislav Petkov
2017-08-29 23:06       ` Luck, Tony
2017-08-29 23:06         ` Luck, Tony

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.