linux-efi.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 1/1] efi: cper: print AER info of PCIe fatal error
@ 2019-07-12  2:20 Xiaofei Tan
  2019-07-25 12:44 ` James Morse
  0 siblings, 1 reply; 3+ messages in thread
From: Xiaofei Tan @ 2019-07-12  2:20 UTC (permalink / raw)
  To: linux-kernel
  Cc: Xiaofei Tan, linux-acpi, linux-efi, rjw, lenb, tony.luck, bp,
	ying.huang, ross.lagerwall, ard.biesheuvel, lance.ortiz

AER info of PCIe fatal error is not printed in the current driver.
Because APEI driver will panic directly for fatal error, and can't
run to the place of printing AER info.

An example log is as following:
[ 3157.655028] {763}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 11
[ 3157.663610] {763}[Hardware Error]: event severity: fatal
[ 3157.663612] {763}[Hardware Error]:  Error 0, type: fatal
[ 3157.663614] {763}[Hardware Error]:   section_type: PCIe error
[ 3157.680328] {763}[Hardware Error]:   port_type: 0, PCIe end point
[ 3157.680329] {763}[Hardware Error]:   version: 4.0
[ 3157.680332] {763}[Hardware Error]:   command: 0x0000, status: 0x0010
[ 3157.698757] {763}[Hardware Error]:   device_id: 0000:82:00.0
[ 3157.698758] {763}[Hardware Error]:   slot: 0
[ 3157.698759] {763}[Hardware Error]:   secondary_bus: 0x00
[ 3157.698760] {763}[Hardware Error]:   vendor_id: 0x8086, device_id: 0x10fb
[ 3157.698761] {763}[Hardware Error]:   class_code: 000002
[ 3157.698825] Kernel panic - not syncing: Fatal hardware error!

This issue was imported by the patch, '37448adfc7ce ("aerdrv: Move
cper_print_aer() call out of interrupt context")'. To fix this issue,
this patch adds print of AER info in cper_print_pcie() for fatal error.

Here is the example log after this patch applied:
[ 7032.893566] {24}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 10
[ 7032.901965] {24}[Hardware Error]: event severity: fatal
[ 7032.907166] {24}[Hardware Error]:  Error 0, type: fatal
[ 7032.912366] {24}[Hardware Error]:   section_type: PCIe error
[ 7032.917998] {24}[Hardware Error]:   port_type: 0, PCIe end point
[ 7032.923974] {24}[Hardware Error]:   version: 4.0
[ 7032.928569] {24}[Hardware Error]:   command: 0x0546, status: 0x4010
[ 7032.934806] {24}[Hardware Error]:   device_id: 0000:01:00.0
[ 7032.940352] {24}[Hardware Error]:   slot: 0
[ 7032.944514] {24}[Hardware Error]:   secondary_bus: 0x00
[ 7032.949714] {24}[Hardware Error]:   vendor_id: 0x15b3, device_id: 0x1019
[ 7032.956381] {24}[Hardware Error]:   class_code: 000002
[ 7032.961495] {24}[Hardware Error]:   aer_uncor_status: 0x00040000, aer_uncor_mask: 0x00000000
[ 7032.969891] {24}[Hardware Error]:   aer_uncor_severity: 0x00062010
[ 7032.976042] {24}[Hardware Error]:   TLP Header: 000000c0 01010000 00000001 00000000
[ 7032.983663] Kernel panic - not syncing: Fatal hardware error!

Fixes: 37448adfc7ce ("aerdrv: Move cper_print_aer() call out of
interrupt context")

Signed-off-by: Xiaofei Tan <tanxiaofei@huawei.com>
---
 drivers/firmware/efi/cper.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/drivers/firmware/efi/cper.c b/drivers/firmware/efi/cper.c
index 8fa977c..bf8600d 100644
--- a/drivers/firmware/efi/cper.c
+++ b/drivers/firmware/efi/cper.c
@@ -390,6 +390,19 @@ static void cper_print_pcie(const char *pfx, const struct cper_sec_pcie *pcie,
 		printk(
 	"%s""bridge: secondary_status: 0x%04x, control: 0x%04x\n",
 	pfx, pcie->bridge.secondary_status, pcie->bridge.control);
+	if (pcie->validation_bits & CPER_PCIE_VALID_AER_INFO &&
+	    gdata->error_severity & CPER_SEV_FATAL) {
+		struct aer_capability_regs *aer;
+
+		aer = (struct aer_capability_regs *)pcie->aer_info;
+		printk("%saer_uncor_status: 0x%08x, aer_uncor_mask: 0x%08x\n",
+		       pfx, aer->uncor_status, aer->uncor_mask);
+		printk("%saer_uncor_severity: 0x%08x\n",
+		       pfx, aer->uncor_severity);
+		printk("%sTLP Header: %08x %08x %08x %08x\n", pfx,
+		       aer->header_log.dw0, aer->header_log.dw1,
+		       aer->header_log.dw2, aer->header_log.dw3);
+	}
 }
 
 static void cper_print_tstamp(const char *pfx,
-- 
2.8.1


^ permalink raw reply related	[flat|nested] 3+ messages in thread

* Re: [PATCH 1/1] efi: cper: print AER info of PCIe fatal error
  2019-07-12  2:20 [PATCH 1/1] efi: cper: print AER info of PCIe fatal error Xiaofei Tan
@ 2019-07-25 12:44 ` James Morse
  2019-07-25 13:59   ` tanxiaofei
  0 siblings, 1 reply; 3+ messages in thread
From: James Morse @ 2019-07-25 12:44 UTC (permalink / raw)
  To: Xiaofei Tan
  Cc: linux-kernel, linux-acpi, linux-efi, rjw, lenb, tony.luck, bp,
	ying.huang, ross.lagerwall, ard.biesheuvel, lance.ortiz

Hi,

On 12/07/2019 03:20, Xiaofei Tan wrote:
> AER info of PCIe fatal error is not printed in the current driver.
> Because APEI driver will panic directly for fatal error, and can't
> run to the place of printing AER info.
> 
> An example log is as following:
> [ 3157.655028] {763}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 11
> [ 3157.663610] {763}[Hardware Error]: event severity: fatal
> [ 3157.663612] {763}[Hardware Error]:  Error 0, type: fatal
> [ 3157.663614] {763}[Hardware Error]:   section_type: PCIe error
> [ 3157.680328] {763}[Hardware Error]:   port_type: 0, PCIe end point
> [ 3157.680329] {763}[Hardware Error]:   version: 4.0
> [ 3157.680332] {763}[Hardware Error]:   command: 0x0000, status: 0x0010
> [ 3157.698757] {763}[Hardware Error]:   device_id: 0000:82:00.0
> [ 3157.698758] {763}[Hardware Error]:   slot: 0
> [ 3157.698759] {763}[Hardware Error]:   secondary_bus: 0x00
> [ 3157.698760] {763}[Hardware Error]:   vendor_id: 0x8086, device_id: 0x10fb
> [ 3157.698761] {763}[Hardware Error]:   class_code: 000002
> [ 3157.698825] Kernel panic - not syncing: Fatal hardware error!
> 
> This issue was imported by the patch, '37448adfc7ce ("aerdrv: Move
> cper_print_aer() call out of interrupt context")'. To fix this issue,
> this patch adds print of AER info in cper_print_pcie() for fatal error.
> 
> Here is the example log after this patch applied:
> [ 7032.893566] {24}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 10
> [ 7032.901965] {24}[Hardware Error]: event severity: fatal
> [ 7032.907166] {24}[Hardware Error]:  Error 0, type: fatal
> [ 7032.912366] {24}[Hardware Error]:   section_type: PCIe error
> [ 7032.917998] {24}[Hardware Error]:   port_type: 0, PCIe end point
> [ 7032.923974] {24}[Hardware Error]:   version: 4.0
> [ 7032.928569] {24}[Hardware Error]:   command: 0x0546, status: 0x4010
> [ 7032.934806] {24}[Hardware Error]:   device_id: 0000:01:00.0
> [ 7032.940352] {24}[Hardware Error]:   slot: 0
> [ 7032.944514] {24}[Hardware Error]:   secondary_bus: 0x00
> [ 7032.949714] {24}[Hardware Error]:   vendor_id: 0x15b3, device_id: 0x1019
> [ 7032.956381] {24}[Hardware Error]:   class_code: 000002
> [ 7032.961495] {24}[Hardware Error]:   aer_uncor_status: 0x00040000, aer_uncor_mask: 0x00000000
> [ 7032.969891] {24}[Hardware Error]:   aer_uncor_severity: 0x00062010
> [ 7032.976042] {24}[Hardware Error]:   TLP Header: 000000c0 01010000 00000001 00000000
> [ 7032.983663] Kernel panic - not syncing: Fatal hardware error!

> Fixes: 37448adfc7ce ("aerdrv: Move cper_print_aer() call out of
> interrupt context")

(Please put this all on one line)


> diff --git a/drivers/firmware/efi/cper.c b/drivers/firmware/efi/cper.c
> index 8fa977c..bf8600d 100644
> --- a/drivers/firmware/efi/cper.c
> +++ b/drivers/firmware/efi/cper.c
> @@ -390,6 +390,19 @@ static void cper_print_pcie(const char *pfx, const struct cper_sec_pcie *pcie,
>  		printk(
>  	"%s""bridge: secondary_status: 0x%04x, control: 0x%04x\n",
>  	pfx, pcie->bridge.secondary_status, pcie->bridge.control);

It may be worth a comment explaining why we only do this for fatal errors. Something like:
| /* Fatal errors call __ghes_panic() before the AER handler gets to print this */


> +	if (pcie->validation_bits & CPER_PCIE_VALID_AER_INFO &&
> +	    gdata->error_severity & CPER_SEV_FATAL) {
> +		struct aer_capability_regs *aer;
> +
> +		aer = (struct aer_capability_regs *)pcie->aer_info;
> +		printk("%saer_uncor_status: 0x%08x, aer_uncor_mask: 0x%08x\n",

The convention in the rest of the file is for the prefix format string to be separate. i.e:
| "%s""aer_uncor_status: ..."

Could it be the same for consistency?

> +		       pfx, aer->uncor_status, aer->uncor_mask);
> +		printk("%saer_uncor_severity: 0x%08x\n",
> +		       pfx, aer->uncor_severity);
> +		printk("%sTLP Header: %08x %08x %08x %08x\n", pfx,
> +		       aer->header_log.dw0, aer->header_log.dw1,
> +		       aer->header_log.dw2, aer->header_log.dw3);
> +	}
>  }

Regardless,
Reviewed-by; James Morse <james.morse@arm.com>


Thanks,

James

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: [PATCH 1/1] efi: cper: print AER info of PCIe fatal error
  2019-07-25 12:44 ` James Morse
@ 2019-07-25 13:59   ` tanxiaofei
  0 siblings, 0 replies; 3+ messages in thread
From: tanxiaofei @ 2019-07-25 13:59 UTC (permalink / raw)
  To: James Morse
  Cc: linux-kernel, linux-acpi, linux-efi, rjw, lenb, tony.luck, bp,
	ying.huang, ross.lagerwall, ard.biesheuvel, lance.ortiz

Hi James,
Thanks for the review.

On 2019/7/25 20:44, James Morse wrote:
> Hi,
> 
> On 12/07/2019 03:20, Xiaofei Tan wrote:
>> AER info of PCIe fatal error is not printed in the current driver.
>> Because APEI driver will panic directly for fatal error, and can't
>> run to the place of printing AER info.
>>
>> An example log is as following:
>> [ 3157.655028] {763}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 11
>> [ 3157.663610] {763}[Hardware Error]: event severity: fatal
>> [ 3157.663612] {763}[Hardware Error]:  Error 0, type: fatal
>> [ 3157.663614] {763}[Hardware Error]:   section_type: PCIe error
>> [ 3157.680328] {763}[Hardware Error]:   port_type: 0, PCIe end point
>> [ 3157.680329] {763}[Hardware Error]:   version: 4.0
>> [ 3157.680332] {763}[Hardware Error]:   command: 0x0000, status: 0x0010
>> [ 3157.698757] {763}[Hardware Error]:   device_id: 0000:82:00.0
>> [ 3157.698758] {763}[Hardware Error]:   slot: 0
>> [ 3157.698759] {763}[Hardware Error]:   secondary_bus: 0x00
>> [ 3157.698760] {763}[Hardware Error]:   vendor_id: 0x8086, device_id: 0x10fb
>> [ 3157.698761] {763}[Hardware Error]:   class_code: 000002
>> [ 3157.698825] Kernel panic - not syncing: Fatal hardware error!
>>
>> This issue was imported by the patch, '37448adfc7ce ("aerdrv: Move
>> cper_print_aer() call out of interrupt context")'. To fix this issue,
>> this patch adds print of AER info in cper_print_pcie() for fatal error.
>>
>> Here is the example log after this patch applied:
>> [ 7032.893566] {24}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 10
>> [ 7032.901965] {24}[Hardware Error]: event severity: fatal
>> [ 7032.907166] {24}[Hardware Error]:  Error 0, type: fatal
>> [ 7032.912366] {24}[Hardware Error]:   section_type: PCIe error
>> [ 7032.917998] {24}[Hardware Error]:   port_type: 0, PCIe end point
>> [ 7032.923974] {24}[Hardware Error]:   version: 4.0
>> [ 7032.928569] {24}[Hardware Error]:   command: 0x0546, status: 0x4010
>> [ 7032.934806] {24}[Hardware Error]:   device_id: 0000:01:00.0
>> [ 7032.940352] {24}[Hardware Error]:   slot: 0
>> [ 7032.944514] {24}[Hardware Error]:   secondary_bus: 0x00
>> [ 7032.949714] {24}[Hardware Error]:   vendor_id: 0x15b3, device_id: 0x1019
>> [ 7032.956381] {24}[Hardware Error]:   class_code: 000002
>> [ 7032.961495] {24}[Hardware Error]:   aer_uncor_status: 0x00040000, aer_uncor_mask: 0x00000000
>> [ 7032.969891] {24}[Hardware Error]:   aer_uncor_severity: 0x00062010
>> [ 7032.976042] {24}[Hardware Error]:   TLP Header: 000000c0 01010000 00000001 00000000
>> [ 7032.983663] Kernel panic - not syncing: Fatal hardware error!
> 
>> Fixes: 37448adfc7ce ("aerdrv: Move cper_print_aer() call out of
>> interrupt context")
> 
> (Please put this all on one line)
> 

OK.

>> diff --git a/drivers/firmware/efi/cper.c b/drivers/firmware/efi/cper.c
>> index 8fa977c..bf8600d 100644
>> --- a/drivers/firmware/efi/cper.c
>> +++ b/drivers/firmware/efi/cper.c
>> @@ -390,6 +390,19 @@ static void cper_print_pcie(const char *pfx, const struct cper_sec_pcie *pcie,
>>  		printk(
>>  	"%s""bridge: secondary_status: 0x%04x, control: 0x%04x\n",
>>  	pfx, pcie->bridge.secondary_status, pcie->bridge.control);
> 
> It may be worth a comment explaining why we only do this for fatal errors. Something like:
> | /* Fatal errors call __ghes_panic() before the AER handler gets to print this */
> 

OK. I will add this comment.

> 
>> +	if (pcie->validation_bits & CPER_PCIE_VALID_AER_INFO &&
>> +	    gdata->error_severity & CPER_SEV_FATAL) {
>> +		struct aer_capability_regs *aer;
>> +
>> +		aer = (struct aer_capability_regs *)pcie->aer_info;
>> +		printk("%saer_uncor_status: 0x%08x, aer_uncor_mask: 0x%08x\n",
> 
> The convention in the rest of the file is for the prefix format string to be separate. i.e:
> | "%s""aer_uncor_status: ..."
> 
> Could it be the same for consistency?
>

That way is not accepted by checkpatch.pl anymore, and was not used in some new commit.
Such as :
printk("%ssection_type: ARM processor error\n", newpfx);
and
printk("%ssection_type: IA32/X64 processor error\n", newpfx);

>> +		       pfx, aer->uncor_status, aer->uncor_mask);
>> +		printk("%saer_uncor_severity: 0x%08x\n",
>> +		       pfx, aer->uncor_severity);
>> +		printk("%sTLP Header: %08x %08x %08x %08x\n", pfx,
>> +		       aer->header_log.dw0, aer->header_log.dw1,
>> +		       aer->header_log.dw2, aer->header_log.dw3);
>> +	}
>>  }
> 
> Regardless,
> Reviewed-by; James Morse <james.morse@arm.com>
> 
> 
> Thanks,
> 
> James
> 
> .
> 

-- 
 thanks
tanxiaofei


^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2019-07-25 13:59 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-07-12  2:20 [PATCH 1/1] efi: cper: print AER info of PCIe fatal error Xiaofei Tan
2019-07-25 12:44 ` James Morse
2019-07-25 13:59   ` tanxiaofei

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).