* [PATCH] powerpc/powernv: Show checkstop reason for NPU2 HMIs
@ 2019-05-23 12:28 Frederic Barrat
  2019-05-23 13:45 ` Michael Ellerman
                   ` (2 more replies)
  0 siblings, 3 replies; 5+ messages in thread
From: Frederic Barrat @ 2019-05-23 12:28 UTC (permalink / raw)
  To: linuxppc-dev, ajd, arbab, aik; +Cc: clombard, groug

If the kernel is notified of an HMI caused by the NPU2, it's currently
not being recognized and it logs the default message:

    Unknown Malfunction Alert of type 3

The NPU on Power 9 has 3 Fault Isolation Registers, so that's a lot of
possible causes, but we should at least log that it's an NPU problem
and report which FIR and which bit were raised if opal gave us the
information.

Signed-off-by: Frederic Barrat <fbarrat@linux.ibm.com>
---

Could be merged independently of the matching skiboot change (the
opal-api.h change is already in the skiboot tree), but works better
with it:
http://patchwork.ozlabs.org/patch/1104076/


 arch/powerpc/include/asm/opal-api.h       |  1 +
 arch/powerpc/platforms/powernv/opal-hmi.c | 40 +++++++++++++++++++++++
 2 files changed, 41 insertions(+)

diff --git a/arch/powerpc/include/asm/opal-api.h b/arch/powerpc/include/asm/opal-api.h
index e1577cfa7186..2492fe248e1e 100644
--- a/arch/powerpc/include/asm/opal-api.h
+++ b/arch/powerpc/include/asm/opal-api.h
@@ -568,6 +568,7 @@ enum OpalHMI_XstopType {
 	CHECKSTOP_TYPE_UNKNOWN	=	0,
 	CHECKSTOP_TYPE_CORE	=	1,
 	CHECKSTOP_TYPE_NX	=	2,
+	CHECKSTOP_TYPE_NPU	=	3
 };
 
 enum OpalHMI_CoreXstopReason {
diff --git a/arch/powerpc/platforms/powernv/opal-hmi.c b/arch/powerpc/platforms/powernv/opal-hmi.c
index 586ec71a4e17..de12a240b477 100644
--- a/arch/powerpc/platforms/powernv/opal-hmi.c
+++ b/arch/powerpc/platforms/powernv/opal-hmi.c
@@ -149,6 +149,43 @@ static void print_nx_checkstop_reason(const char *level,
 					xstop_reason[i].description);
 }
 
+static void print_npu_checkstop_reason(const char *level,
+					struct OpalHMIEvent *hmi_evt)
+{
+	uint8_t reason, reason_count, i;
+
+	/*
+	 * We may not have a checkstop reason on some combination of
+	 * hardware and/or skiboot version
+	 */
+	if (!hmi_evt->u.xstop_error.xstop_reason) {
+		printk("%s	NPU checkstop on chip %x\n", level,
+			be32_to_cpu(hmi_evt->u.xstop_error.u.chip_id));
+		return;
+	}
+
+	/*
+	 * NPU2 has 3 FIRs. Reason encoded on a byte as:
+	 *   2 bits for the FIR number
+	 *   6 bits for the bit number
+	 * It may be possible to find several reasons.
+	 *
+	 * We don't display a specific message per FIR bit as there
+	 * are too many and most are meaningless without the workbook
+	 * and/or hw team help anyway.
+	 */
+	reason_count = sizeof(hmi_evt->u.xstop_error.xstop_reason) /
+		sizeof(reason);
+	for (i = 0; i < reason_count; i++) {
+		reason = (hmi_evt->u.xstop_error.xstop_reason >> (8 * i)) & 0xFF;
+		if (reason)
+			printk("%s	NPU checkstop on chip %x: FIR%d bit %d is set\n",
+				level,
+				be32_to_cpu(hmi_evt->u.xstop_error.u.chip_id),
+				reason >> 6, reason & 0x3F);
+	}
+}
+
 static void print_checkstop_reason(const char *level,
 					struct OpalHMIEvent *hmi_evt)
 {
@@ -160,6 +197,9 @@ static void print_checkstop_reason(const char *level,
 	case CHECKSTOP_TYPE_NX:
 		print_nx_checkstop_reason(level, hmi_evt);
 		break;
+	case CHECKSTOP_TYPE_NPU:
+		print_npu_checkstop_reason(level, hmi_evt);
+		break;
 	default:
 		printk("%s	Unknown Malfunction Alert of type %d\n",
 		       level, type);
-- 
2.21.0



* Re: [PATCH] powerpc/powernv: Show checkstop reason for NPU2 HMIs
  2019-05-23 12:28 [PATCH] powerpc/powernv: Show checkstop reason for NPU2 HMIs Frederic Barrat
@ 2019-05-23 13:45 ` Michael Ellerman
  2019-05-23 14:52   ` Frederic Barrat
  2019-05-30  2:00 ` Andrew Donnellan
  2019-06-03 12:32 ` Michael Ellerman
  2 siblings, 1 reply; 5+ messages in thread
From: Michael Ellerman @ 2019-05-23 13:45 UTC (permalink / raw)
  To: Frederic Barrat, linuxppc-dev, ajd, arbab, aik; +Cc: clombard, groug

Frederic Barrat <fbarrat@linux.ibm.com> writes:

> If the kernel is notified of an HMI caused by the NPU2, it's currently
> not being recognized and it logs the default message:
>
>     Unknown Malfunction Alert of type 3
>
> The NPU on Power 9 has 3 Fault Isolation Registers, so that's a lot of
> possible causes, but we should at least log that it's an NPU problem
> and report which FIR and which bit were raised if opal gave us the
> information.
>
> Signed-off-by: Frederic Barrat <fbarrat@linux.ibm.com>
> ---
>
> Could be merged independently of the matching skiboot change (the
> opal-api.h change is already in the skiboot tree), but works better
> with it:
> http://patchwork.ozlabs.org/patch/1104076/

Well it *must* work with or without the skiboot change, because old/new
kernels will run on old/new skiboots.

It looks like it will work fine, we just won't get any extra information
in xstop_reason, right?

cheers



* Re: [PATCH] powerpc/powernv: Show checkstop reason for NPU2 HMIs
  2019-05-23 13:45 ` Michael Ellerman
@ 2019-05-23 14:52   ` Frederic Barrat
  0 siblings, 0 replies; 5+ messages in thread
From: Frederic Barrat @ 2019-05-23 14:52 UTC (permalink / raw)
  To: Michael Ellerman, linuxppc-dev, ajd, arbab, aik; +Cc: clombard, groug



On 23/05/2019 at 15:45, Michael Ellerman wrote:
> Frederic Barrat <fbarrat@linux.ibm.com> writes:
> 
>> If the kernel is notified of an HMI caused by the NPU2, it's currently
>> not being recognized and it logs the default message:
>>
>>      Unknown Malfunction Alert of type 3
>>
>> The NPU on Power 9 has 3 Fault Isolation Registers, so that's a lot of
>> possible causes, but we should at least log that it's an NPU problem
>> and report which FIR and which bit were raised if opal gave us the
>> information.
>>
>> Signed-off-by: Frederic Barrat <fbarrat@linux.ibm.com>
>> ---
>>
>> Could be merged independently of the matching skiboot change (the
>> opal-api.h change is already in the skiboot tree), but works better
>> with it:
>> http://patchwork.ozlabs.org/patch/1104076/
> 
> Well it *must* work with or without the skiboot change, because old/new
> kernels will run on old/new skiboots.
> 
> It looks like it will work fine, we just won't get any extra information
> in xstop_reason, right?


Yes, that's understood, and it was tested. On an old skiboot, we now
print that we got an NPU checkstop (instead of the "unknown
malfunction alert"); we just won't have the extra FIR info. That's what
I meant by "works better with the skiboot patch".

   Fred
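
For illustration, the two behaviours described above can be sketched in
user space. This is a minimal sketch, not the kernel code:
report_npu_checkstop() and decode_npu_reason() are hypothetical names,
printf() stands in for printk(), and the chip ID and reason word are
assumed to be already in host byte order.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical holder for one decoded reason byte. */
struct npu_reason {
	unsigned int fir;	/* top 2 bits: FIR number */
	unsigned int bit;	/* low 6 bits: bit number within that FIR */
};

/* Split one reason byte using the 2-bit/6-bit layout from the patch. */
static struct npu_reason decode_npu_reason(uint8_t reason)
{
	struct npu_reason r = { .fir = reason >> 6, .bit = reason & 0x3F };

	return r;
}

/*
 * Print one line per non-zero reason byte, or a single generic line
 * when no reason is available (e.g. an older skiboot).
 * Returns the number of lines printed.
 */
static int report_npu_checkstop(uint32_t chip_id, uint32_t xstop_reason)
{
	int lines = 0;
	unsigned int i;

	if (!xstop_reason) {
		printf("NPU checkstop on chip %x\n", (unsigned int)chip_id);
		return 1;
	}

	for (i = 0; i < sizeof(xstop_reason); i++) {
		uint8_t reason = (xstop_reason >> (8 * i)) & 0xFF;
		struct npu_reason r;

		/* A zero byte is treated as an empty slot, as in the patch. */
		if (!reason)
			continue;

		r = decode_npu_reason(reason);
		printf("NPU checkstop on chip %x: FIR%u bit %u is set\n",
		       (unsigned int)chip_id, r.fir, r.bit);
		lines++;
	}
	return lines;
}
```

Calling report_npu_checkstop(8, 0) models the old-skiboot case and
prints only the generic line. Calling report_npu_checkstop(8, 0x8541)
prints one line for FIR1 bit 1 and one for FIR2 bit 5.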






* Re: [PATCH] powerpc/powernv: Show checkstop reason for NPU2 HMIs
  2019-05-23 12:28 [PATCH] powerpc/powernv: Show checkstop reason for NPU2 HMIs Frederic Barrat
  2019-05-23 13:45 ` Michael Ellerman
@ 2019-05-30  2:00 ` Andrew Donnellan
  2019-06-03 12:32 ` Michael Ellerman
  2 siblings, 0 replies; 5+ messages in thread
From: Andrew Donnellan @ 2019-05-30  2:00 UTC (permalink / raw)
  To: Frederic Barrat, linuxppc-dev, arbab, aik; +Cc: clombard, groug

On 23/5/19 10:28 pm, Frederic Barrat wrote:
> If the kernel is notified of an HMI caused by the NPU2, it's currently
> not being recognized and it logs the default message:
> 
>      Unknown Malfunction Alert of type 3
> 
> The NPU on Power 9 has 3 Fault Isolation Registers, so that's a lot of
> possible causes, but we should at least log that it's an NPU problem
> and report which FIR and which bit were raised if opal gave us the
> information.
> 
> Signed-off-by: Frederic Barrat <fbarrat@linux.ibm.com>
> ---
> 
> Could be merged independently of the matching skiboot change (the
> opal-api.h change is already in the skiboot tree), but works better
> with it:
> http://patchwork.ozlabs.org/patch/1104076/

Still always safest to hold off until merged into skiboot :)

Reviewed-by: Andrew Donnellan <ajd@linux.ibm.com>

> 
> 
>   arch/powerpc/include/asm/opal-api.h       |  1 +
>   arch/powerpc/platforms/powernv/opal-hmi.c | 40 +++++++++++++++++++++++
>   2 files changed, 41 insertions(+)
> 
> diff --git a/arch/powerpc/include/asm/opal-api.h b/arch/powerpc/include/asm/opal-api.h
> index e1577cfa7186..2492fe248e1e 100644
> --- a/arch/powerpc/include/asm/opal-api.h
> +++ b/arch/powerpc/include/asm/opal-api.h
> @@ -568,6 +568,7 @@ enum OpalHMI_XstopType {
>   	CHECKSTOP_TYPE_UNKNOWN	=	0,
>   	CHECKSTOP_TYPE_CORE	=	1,
>   	CHECKSTOP_TYPE_NX	=	2,
> +	CHECKSTOP_TYPE_NPU	=	3
>   };
>   
>   enum OpalHMI_CoreXstopReason {
> diff --git a/arch/powerpc/platforms/powernv/opal-hmi.c b/arch/powerpc/platforms/powernv/opal-hmi.c
> index 586ec71a4e17..de12a240b477 100644
> --- a/arch/powerpc/platforms/powernv/opal-hmi.c
> +++ b/arch/powerpc/platforms/powernv/opal-hmi.c
> @@ -149,6 +149,43 @@ static void print_nx_checkstop_reason(const char *level,
>   					xstop_reason[i].description);
>   }
>   
> +static void print_npu_checkstop_reason(const char *level,
> +					struct OpalHMIEvent *hmi_evt)
> +{
> +	uint8_t reason, reason_count, i;
> +
> +	/*
> +	 * We may not have a checkstop reason on some combination of
> +	 * hardware and/or skiboot version
> +	 */
> +	if (!hmi_evt->u.xstop_error.xstop_reason) {
> +		printk("%s	NPU checkstop on chip %x\n", level,
> +			be32_to_cpu(hmi_evt->u.xstop_error.u.chip_id));
> +		return;
> +	}
> +
> +	/*
> +	 * NPU2 has 3 FIRs. Reason encoded on a byte as:
> +	 *   2 bits for the FIR number
> +	 *   6 bits for the bit number
> +	 * It may be possible to find several reasons.
> +	 *
> +	 * We don't display a specific message per FIR bit as there
> +	 * are too many and most are meaningless without the workbook
> +	 * and/or hw team help anyway.
> +	 */
> +	reason_count = sizeof(hmi_evt->u.xstop_error.xstop_reason) /
> +		sizeof(reason);
> +	for (i = 0; i < reason_count; i++) {
> +		reason = (hmi_evt->u.xstop_error.xstop_reason >> (8 * i)) & 0xFF;

very nitpicky: should we call be32_to_cpu() so that the bits appear in 
order?
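
To make the byte-order question concrete, here is a small user-space
sketch. It assumes, as the question implies, that xstop_reason arrives
as a 32-bit big-endian word. be32_bytes_to_host() and collect_reasons()
are hypothetical helpers, not kernel APIs, and the wire bytes in the
note below are made up for the example.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for be32_to_cpu(): build a host integer from
 * four big-endian bytes. The result is the same on any host.
 */
static uint32_t be32_bytes_to_host(const uint8_t b[4])
{
	return ((uint32_t)b[0] << 24) | ((uint32_t)b[1] << 16) |
	       ((uint32_t)b[2] << 8) | (uint32_t)b[3];
}

/*
 * Walk the word from the least significant byte upwards, as the loop
 * in the patch does, and store the non-zero reason bytes in visiting
 * order. Returns how many were stored.
 */
static size_t collect_reasons(uint32_t value, uint8_t out[4])
{
	size_t n = 0;
	unsigned int i;

	for (i = 0; i < 4; i++) {
		uint8_t reason = (value >> (8 * i)) & 0xFF;

		if (reason)
			out[n++] = reason;
	}
	return n;
}
```

With wire bytes {0x41, 0x85, 0x00, 0x00}, the converted value is
0x41850000 on any host and the loop visits 0x85, then 0x41. Reading the
same four bytes without conversion on a little-endian host gives
0x00008541, and the loop visits 0x41, then 0x85. The same reasons are
reported either way; only the order depends on the host.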

> +		if (reason)
> +			printk("%s	NPU checkstop on chip %x: FIR%d bit %d is set\n",
> +				level,
> +				be32_to_cpu(hmi_evt->u.xstop_error.u.chip_id),
> +				reason >> 6, reason & 0x3F);
> +	}
> +}
> +
>   static void print_checkstop_reason(const char *level,
>   					struct OpalHMIEvent *hmi_evt)
>   {
> @@ -160,6 +197,9 @@ static void print_checkstop_reason(const char *level,
>   	case CHECKSTOP_TYPE_NX:
>   		print_nx_checkstop_reason(level, hmi_evt);
>   		break;
> +	case CHECKSTOP_TYPE_NPU:
> +		print_npu_checkstop_reason(level, hmi_evt);
> +		break;
>   	default:
>   		printk("%s	Unknown Malfunction Alert of type %d\n",
>   		       level, type);
> 

-- 
Andrew Donnellan              OzLabs, ADL Canberra
ajd@linux.ibm.com             IBM Australia Limited



* Re: [PATCH] powerpc/powernv: Show checkstop reason for NPU2 HMIs
  2019-05-23 12:28 [PATCH] powerpc/powernv: Show checkstop reason for NPU2 HMIs Frederic Barrat
  2019-05-23 13:45 ` Michael Ellerman
  2019-05-30  2:00 ` Andrew Donnellan
@ 2019-06-03 12:32 ` Michael Ellerman
  2 siblings, 0 replies; 5+ messages in thread
From: Michael Ellerman @ 2019-06-03 12:32 UTC (permalink / raw)
  To: Frederic Barrat, linuxppc-dev, ajd, arbab, aik; +Cc: clombard, groug

On Thu, 2019-05-23 at 12:28:04 UTC, Frederic Barrat wrote:
> If the kernel is notified of an HMI caused by the NPU2, it's currently
> not being recognized and it logs the default message:
> 
>     Unknown Malfunction Alert of type 3
> 
> The NPU on Power 9 has 3 Fault Isolation Registers, so that's a lot of
> possible causes, but we should at least log that it's an NPU problem
> and report which FIR and which bit were raised if opal gave us the
> information.
> 
> Signed-off-by: Frederic Barrat <fbarrat@linux.ibm.com>
> Reviewed-by: Andrew Donnellan <ajd@linux.ibm.com>

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/89d87bcba2874d824affb7842bb3960c

cheers

