linux-cxl.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RFC PATCH 0/2] CXL UE RAS Multiple Header Logging support
@ 2023-01-13 15:40 Jonathan Cameron
  2023-01-13 15:40 ` [RFC PATCH 1/2] cxl: RAS: Multiple header recording support Jonathan Cameron
  2023-01-13 15:40 ` [RFC PATCH 2/2] cxl: Add tprintk support for header log hex dump Jonathan Cameron
  0 siblings, 2 replies; 10+ messages in thread
From: Jonathan Cameron @ 2023-01-13 15:40 UTC (permalink / raw)
  To: linux-cxl, dan.j.williams
  Cc: linuxarm, ira.weiny, vishal.l.verma, alison.schofield, Dave Jiang

CXL UE RAS Error reporting allows an EP to report the capability of
recording Multiple Header Logs for uncorrectable errors.
Unlike equivalent feature in PCIe, there is no enable control
for this feature, so a supporting device may be expecting
a more complex software flow than that necessary for devices
that do not support this feature. Documentation of this feature
is sparse, with assumption it works the same as PCIe.

There are hardware implementation choices allowed in the
equivalent PCIe r6.0 base spec section (6.4.2.4) that could
be safely used with the existing code, even with Multiple
Header Recording support but there are others that cannot.

The issue is what happens when the EP is doing Multiple Header
Recording but then the software writes 1 to clear more than one
status bit at the time (PCIe spec warns against doing this
 - but it is what the current kernel code will do):
Option 1)
  It does the nice thing and clears all matching errors.
  Note this is a bit strange for the case where the device
  supports logging multiple instances of a given error - so
  the two can't be combined cleanly.  With that feature
  I can't see how anyone could implement hardware that coped
  cleanly with the wrong software flow.
Option 2)
  It clears only the first error bit leaving a bunch of error
  bits set (note that if it has recorded multiple errors of
  same type it might not even do that). These are sticky
  across resets, so you will probably end up coming back up
  and immediately seeing an error.

So whilst you can design an EP to safe against non MH recording
aware software, it isn't generally the case. As we don't have
an explicit enable on CXL we have to handle anything reporting
the capability in a MH safe fashion.

This feature was developed against emulation in QEMU.
The relevant patches have not yet been posted but can be found on
https://gitlab.com/jic23/qemu/-/commits/cxl-2023-01-11
along with description of how to inject errors in the patch
descriptions.  I'll post them for review for QEMU inclusion
shortly.

RFC simply because the lack of specification detail means I am
less sure on this code than I would normally be. Unfortunately it
could be argued that the first patch is a fix for the
current upstream CXL RAS support.  If we want a simpler fix
one option would be to just fail to enable RAS support if
Multiple Header recording capability bit is set.  Or we
decide that it doesn't matter for now and add support for this
feature via the normal merge cycle.

Second patch is just there to make this easier to test as
no additional software is needed to print the header log.

Base is rather messy due to a clash between multiple cxl tree
branches.
cxl/fixes with the trace move on cxl/next cherry picked on top
as it moves the code that was fixed.

Jonathan Cameron (2):
  cxl: RAS: Multiple header recording support
  cxl: Add tprintk support for header log hex dump

 drivers/cxl/core/pci.c   | 17 ++++++++++++-----
 drivers/cxl/core/trace.h |  7 +++++--
 drivers/cxl/cxl.h        |  1 +
 3 files changed, 18 insertions(+), 7 deletions(-)

-- 
2.37.2


^ permalink raw reply	[flat|nested] 10+ messages in thread

* [RFC PATCH 1/2] cxl: RAS: Multiple header recording support
  2023-01-13 15:40 [RFC PATCH 0/2] CXL UE RAS Multiple Header Logging support Jonathan Cameron
@ 2023-01-13 15:40 ` Jonathan Cameron
  2023-01-17 18:05   ` Dave Jiang
  2023-01-13 15:40 ` [RFC PATCH 2/2] cxl: Add tprintk support for header log hex dump Jonathan Cameron
  1 sibling, 1 reply; 10+ messages in thread
From: Jonathan Cameron @ 2023-01-13 15:40 UTC (permalink / raw)
  To: linux-cxl, dan.j.williams
  Cc: linuxarm, ira.weiny, vishal.l.verma, alison.schofield, Dave Jiang

Similar to PCIe, CXL devices may support logging multiple headers
corresponding to multiple errors as reported via the CXL RAS capability.

Unlike PCIe, in CXL there is no Multiple Header Recording Enable bit
and the CXL r3.0 specification is sparse on details. As such, the
kernel should allow for any reasonable interpretation including
endpoints for which the capability bit is set that behave as per
the PCIe equivalent definitions (with assumption that the missing
'enable bit' is set). Note that behaving as if Multiple Headers
are being logged is also valid behavior when they are not so this
approach should be safe with all sensible specification interpretations.

By repeatedly attempting to clear a single bit corresponding to the reported
First Error (may need multiple goes if multiple records of same type
are tracked by the hardware) the additional header logs may be obtained.

Note that each trace record only records the FE in the status.
We could record them all as done without Multi header recording
capability but that seemed less intuitive to me.

Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
---
 drivers/cxl/core/pci.c | 17 ++++++++++++-----
 drivers/cxl/cxl.h      |  1 +
 2 files changed, 13 insertions(+), 5 deletions(-)

diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
index 184ead6a2796..6fd311e313c6 100644
--- a/drivers/cxl/core/pci.c
+++ b/drivers/cxl/core/pci.c
@@ -673,10 +673,13 @@ static bool cxl_report_and_clear(struct cxl_dev_state *cxlds)
 	void __iomem *addr;
 	u32 status;
 	u32 fe;
+	bool mh;
 
 	if (!cxlds->regs.ras)
 		return false;
 
+next_record:
+	mh = false;
 	addr = cxlds->regs.ras + CXL_RAS_UNCORRECTABLE_STATUS_OFFSET;
 	status = readl(addr);
 	if (!(status & CXL_RAS_UNCORRECTABLE_STATUS_MASK))
@@ -684,11 +687,13 @@ static bool cxl_report_and_clear(struct cxl_dev_state *cxlds)
 
 	/* If multiple errors, log header points to first error from ctrl reg */
 	if (hweight32(status) > 1) {
-		void __iomem *rcc_addr =
-			cxlds->regs.ras + CXL_RAS_CAP_CONTROL_OFFSET;
-
-		fe = BIT(FIELD_GET(CXL_RAS_CAP_CONTROL_FE_MASK,
-				   readl(rcc_addr)));
+		u32 capctrl = readl(cxlds->regs.ras + CXL_RAS_CAP_CONTROL_OFFSET);
+		fe = BIT(FIELD_GET(CXL_RAS_CAP_CONTROL_FE_MASK, capctrl));
+		if (FIELD_GET(CXL_RAS_CAP_CONTROL_MH_REC_CAP, capctrl)) {
+			mh = true;
+			/* Report and clear only first error */
+			status = fe;
+		}
 	} else {
 		fe = status;
 	}
@@ -696,6 +701,8 @@ static bool cxl_report_and_clear(struct cxl_dev_state *cxlds)
 	header_log_copy(cxlds, hl);
 	trace_cxl_aer_uncorrectable_error(dev, status, fe, hl);
 	writel(status & CXL_RAS_UNCORRECTABLE_STATUS_MASK, addr);
+	if (mh)
+		goto next_record;
 
 	return true;
 }
diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index aa3af3bb73b2..ee31a99073c2 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -138,6 +138,7 @@ static inline int ways_to_eiw(unsigned int ways, u8 *eiw)
 #define   CXL_RAS_CORRECTABLE_MASK_MASK GENMASK(6, 0)
 #define CXL_RAS_CAP_CONTROL_OFFSET 0x14
 #define CXL_RAS_CAP_CONTROL_FE_MASK GENMASK(5, 0)
+#define CXL_RAS_CAP_CONTROL_MH_REC_CAP BIT(9)
 #define CXL_RAS_HEADER_LOG_OFFSET 0x18
 #define CXL_RAS_CAPABILITY_LENGTH 0x58
 #define CXL_HEADERLOG_SIZE SZ_512
-- 
2.37.2


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [RFC PATCH 2/2] cxl: Add tprintk support for header log hex dump
  2023-01-13 15:40 [RFC PATCH 0/2] CXL UE RAS Multiple Header Logging support Jonathan Cameron
  2023-01-13 15:40 ` [RFC PATCH 1/2] cxl: RAS: Multiple header recording support Jonathan Cameron
@ 2023-01-13 15:40 ` Jonathan Cameron
  2023-01-17 18:08   ` Dave Jiang
  1 sibling, 1 reply; 10+ messages in thread
From: Jonathan Cameron @ 2023-01-13 15:40 UTC (permalink / raw)
  To: linux-cxl, dan.j.williams
  Cc: linuxarm, ira.weiny, vishal.l.verma, alison.schofield, Dave Jiang

May not make sense in general, but very helpful when writing multiple
header logging support.

Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
---
 drivers/cxl/core/trace.h | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/drivers/cxl/core/trace.h b/drivers/cxl/core/trace.h
index 20ca2fe2ca8e..64f6ad13529d 100644
--- a/drivers/cxl/core/trace.h
+++ b/drivers/cxl/core/trace.h
@@ -62,10 +62,13 @@ TRACE_EVENT(cxl_aer_uncorrectable_error,
 		 */
 		memcpy(__entry->header_log, hl, CXL_HEADERLOG_SIZE);
 	),
-	TP_printk("%s: status: '%s' first_error: '%s'",
+	TP_printk("%s: status: '%s' first_error: '%s' header_log: %s",
 		  __get_str(dev_name),
 		  show_uc_errs(__entry->status),
-		  show_uc_errs(__entry->first_error)
+		  show_uc_errs(__entry->first_error),
+		__print_hex_dump("", DUMP_PREFIX_OFFSET, 32, 4,
+				(char *)__entry->header_log,
+				CXL_HEADERLOG_SIZE_U32, false)
 	)
 );
 
-- 
2.37.2


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [RFC PATCH 1/2] cxl: RAS: Multiple header recording support
  2023-01-13 15:40 ` [RFC PATCH 1/2] cxl: RAS: Multiple header recording support Jonathan Cameron
@ 2023-01-17 18:05   ` Dave Jiang
  0 siblings, 0 replies; 10+ messages in thread
From: Dave Jiang @ 2023-01-17 18:05 UTC (permalink / raw)
  To: Jonathan Cameron, linux-cxl, dan.j.williams
  Cc: linuxarm, ira.weiny, vishal.l.verma, alison.schofield



On 1/13/23 8:40 AM, Jonathan Cameron wrote:
> Similar to PCIe, CXL devices may support logging multiple headers
> corresponding to multiple errors as reported via the CXL RAS capability.
> 
> Unlike PCIe, in CXL there is no Multiple Header Recording Enable bit
> and the CXL r3.0 specification is sparse on details. As such, the
> kernel should allow for any reasonable interpretation including
> endpoints for which the capability bit is set that behave as per
> the PCIe equivalent definitions (with assumption that the missing
> 'enable bit' is set). Note that behaving as if Multiple Headers
> are being logged is also valid behavior when they are not so this
> approach should be safe with all sensible specification interpretations.
> 
> By repeatedly attempting to clear a single bit corresponding to the reported
> First Error (may need multiple goes if multiple records of same type
> are tracked by the hardware) the additional header logs may be obtained.
> 
> Note that each trace record only records the FE in the status.
> We could record them all as done without Multi header recording
> capability but that seemed less intuitive to me.
> 
> Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>

Looks reasonable

Reviewed-by: Dave Jiang <dave.jiang@intel.com>

> ---
>   drivers/cxl/core/pci.c | 17 ++++++++++++-----
>   drivers/cxl/cxl.h      |  1 +
>   2 files changed, 13 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
> index 184ead6a2796..6fd311e313c6 100644
> --- a/drivers/cxl/core/pci.c
> +++ b/drivers/cxl/core/pci.c
> @@ -673,10 +673,13 @@ static bool cxl_report_and_clear(struct cxl_dev_state *cxlds)
>   	void __iomem *addr;
>   	u32 status;
>   	u32 fe;
> +	bool mh;
>   
>   	if (!cxlds->regs.ras)
>   		return false;
>   
> +next_record:
> +	mh = false;
>   	addr = cxlds->regs.ras + CXL_RAS_UNCORRECTABLE_STATUS_OFFSET;
>   	status = readl(addr);
>   	if (!(status & CXL_RAS_UNCORRECTABLE_STATUS_MASK))
> @@ -684,11 +687,13 @@ static bool cxl_report_and_clear(struct cxl_dev_state *cxlds)
>   
>   	/* If multiple errors, log header points to first error from ctrl reg */
>   	if (hweight32(status) > 1) {
> -		void __iomem *rcc_addr =
> -			cxlds->regs.ras + CXL_RAS_CAP_CONTROL_OFFSET;
> -
> -		fe = BIT(FIELD_GET(CXL_RAS_CAP_CONTROL_FE_MASK,
> -				   readl(rcc_addr)));
> +		u32 capctrl = readl(cxlds->regs.ras + CXL_RAS_CAP_CONTROL_OFFSET);
> +		fe = BIT(FIELD_GET(CXL_RAS_CAP_CONTROL_FE_MASK, capctrl));
> +		if (FIELD_GET(CXL_RAS_CAP_CONTROL_MH_REC_CAP, capctrl)) {
> +			mh = true;
> +			/* Report and clear only first error */
> +			status = fe;
> +		}
>   	} else {
>   		fe = status;
>   	}
> @@ -696,6 +701,8 @@ static bool cxl_report_and_clear(struct cxl_dev_state *cxlds)
>   	header_log_copy(cxlds, hl);
>   	trace_cxl_aer_uncorrectable_error(dev, status, fe, hl);
>   	writel(status & CXL_RAS_UNCORRECTABLE_STATUS_MASK, addr);
> +	if (mh)
> +		goto next_record;
>   
>   	return true;
>   }
> diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
> index aa3af3bb73b2..ee31a99073c2 100644
> --- a/drivers/cxl/cxl.h
> +++ b/drivers/cxl/cxl.h
> @@ -138,6 +138,7 @@ static inline int ways_to_eiw(unsigned int ways, u8 *eiw)
>   #define   CXL_RAS_CORRECTABLE_MASK_MASK GENMASK(6, 0)
>   #define CXL_RAS_CAP_CONTROL_OFFSET 0x14
>   #define CXL_RAS_CAP_CONTROL_FE_MASK GENMASK(5, 0)
> +#define CXL_RAS_CAP_CONTROL_MH_REC_CAP BIT(9)
>   #define CXL_RAS_HEADER_LOG_OFFSET 0x18
>   #define CXL_RAS_CAPABILITY_LENGTH 0x58
>   #define CXL_HEADERLOG_SIZE SZ_512

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [RFC PATCH 2/2] cxl: Add tprintk support for header log hex dump
  2023-01-13 15:40 ` [RFC PATCH 2/2] cxl: Add tprintk support for header log hex dump Jonathan Cameron
@ 2023-01-17 18:08   ` Dave Jiang
  2023-01-18  9:43     ` Jonathan Cameron
  0 siblings, 1 reply; 10+ messages in thread
From: Dave Jiang @ 2023-01-17 18:08 UTC (permalink / raw)
  To: Jonathan Cameron, linux-cxl, dan.j.williams
  Cc: linuxarm, ira.weiny, vishal.l.verma, alison.schofield



On 1/13/23 8:40 AM, Jonathan Cameron wrote:
> May not make sense in general, but very helpful when writing multiple
> header logging support.
> 
> Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>

The user logging software would be retrieving from the log data directly 
instead of parsing the trace output right? Will people visually inspect 
the formatted trace output? Useful for debugging?

> ---
>   drivers/cxl/core/trace.h | 7 +++++--
>   1 file changed, 5 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/cxl/core/trace.h b/drivers/cxl/core/trace.h
> index 20ca2fe2ca8e..64f6ad13529d 100644
> --- a/drivers/cxl/core/trace.h
> +++ b/drivers/cxl/core/trace.h
> @@ -62,10 +62,13 @@ TRACE_EVENT(cxl_aer_uncorrectable_error,
>   		 */
>   		memcpy(__entry->header_log, hl, CXL_HEADERLOG_SIZE);
>   	),
> -	TP_printk("%s: status: '%s' first_error: '%s'",
> +	TP_printk("%s: status: '%s' first_error: '%s' header_log: %s",
>   		  __get_str(dev_name),
>   		  show_uc_errs(__entry->status),
> -		  show_uc_errs(__entry->first_error)
> +		  show_uc_errs(__entry->first_error),
> +		__print_hex_dump("", DUMP_PREFIX_OFFSET, 32, 4,
> +				(char *)__entry->header_log,
> +				CXL_HEADERLOG_SIZE_U32, false)
>   	)
>   );
>   

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [RFC PATCH 2/2] cxl: Add tprintk support for header log hex dump
  2023-01-17 18:08   ` Dave Jiang
@ 2023-01-18  9:43     ` Jonathan Cameron
  2023-01-18 15:19       ` Dave Jiang
  0 siblings, 1 reply; 10+ messages in thread
From: Jonathan Cameron @ 2023-01-18  9:43 UTC (permalink / raw)
  To: Dave Jiang
  Cc: linux-cxl, dan.j.williams, linuxarm, ira.weiny, vishal.l.verma,
	alison.schofield

On Tue, 17 Jan 2023 11:08:26 -0700
Dave Jiang <dave.jiang@intel.com> wrote:

> On 1/13/23 8:40 AM, Jonathan Cameron wrote:
> > May not make sense in general, but very helpful when writing multiple
> > header logging support.
> > 
> > Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>  
> 
> The user logging software would be retrieving from the log data directly 
> instead of parsing the trace output right? Will people visually inspect 
> the formatted trace output? Useful for debugging?

Rather depends on the use case and rate of expected errors + if you are
trying to debug errors before your logging infrastructure is up.

Sure in production, likely these tracepoints would just get stored to a DB for later
querying.  But in debug, I often just use the kernel parameter
to directly dump them in the main kernel log, or perf's own decoding of a trace.

The nearest equivalent to this is AER which does print the log.
https://elixir.bootlin.com/linux/v6.2-rc4/source/include/ras/ras_event.h#L337

though that uses __print_array so maybe that's a better choice.
Don't get the nice offsets though if we do that though...

I'm not fussed about this patch going upstream or not, it just makes testing
patch 1 a lot easier :)

Jonathan


> 
> > ---
> >   drivers/cxl/core/trace.h | 7 +++++--
> >   1 file changed, 5 insertions(+), 2 deletions(-)
> > 
> > diff --git a/drivers/cxl/core/trace.h b/drivers/cxl/core/trace.h
> > index 20ca2fe2ca8e..64f6ad13529d 100644
> > --- a/drivers/cxl/core/trace.h
> > +++ b/drivers/cxl/core/trace.h
> > @@ -62,10 +62,13 @@ TRACE_EVENT(cxl_aer_uncorrectable_error,
> >   		 */
> >   		memcpy(__entry->header_log, hl, CXL_HEADERLOG_SIZE);
> >   	),
> > -	TP_printk("%s: status: '%s' first_error: '%s'",
> > +	TP_printk("%s: status: '%s' first_error: '%s' header_log: %s",
> >   		  __get_str(dev_name),
> >   		  show_uc_errs(__entry->status),
> > -		  show_uc_errs(__entry->first_error)
> > +		  show_uc_errs(__entry->first_error),
> > +		__print_hex_dump("", DUMP_PREFIX_OFFSET, 32, 4,
> > +				(char *)__entry->header_log,
> > +				CXL_HEADERLOG_SIZE_U32, false)
> >   	)
> >   );
> >     


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [RFC PATCH 2/2] cxl: Add tprintk support for header log hex dump
  2023-01-18  9:43     ` Jonathan Cameron
@ 2023-01-18 15:19       ` Dave Jiang
  0 siblings, 0 replies; 10+ messages in thread
From: Dave Jiang @ 2023-01-18 15:19 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: linux-cxl, dan.j.williams, linuxarm, ira.weiny, vishal.l.verma,
	alison.schofield



On 1/18/23 2:43 AM, Jonathan Cameron wrote:
> On Tue, 17 Jan 2023 11:08:26 -0700
> Dave Jiang <dave.jiang@intel.com> wrote:
> 
>> On 1/13/23 8:40 AM, Jonathan Cameron wrote:
>>> May not make sense in general, but very helpful when writing multiple
>>> header logging support.
>>>
>>> Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
>>
>> The user logging software would be retrieving from the log data directly
>> instead of parsing the trace output right? Will people visually inspect
>> the formatted trace output? Useful for debugging?
> 
> Rather depends on the use case and rate of expected errors + if you are
> trying to debug errors before your logging infrastructure is up.
> 
> Sure in production, likely these tracepoints would just get stored to a DB for later
> querying.  But in debug, I often just use the kernel parameter
> to directly dump them in the main kernel log, or perf's own decoding of a trace.
> 
> The nearest equivalent to this is AER which does print the log.
> https://elixir.bootlin.com/linux/v6.2-rc4/source/include/ras/ras_event.h#L337
> 
> though that uses __print_array so maybe that's a better choice.
> Don't get the nice offsets though if we do that though...
> 
> I'm not fussed about this patch going upstream or not, it just makes testing
> patch 1 a lot easier :)

Thanks for the explanation.
Reviewed-by: Dave Jiang <dave.jiang@intel.com>

> 
> Jonathan
> 
> 
>>
>>> ---
>>>    drivers/cxl/core/trace.h | 7 +++++--
>>>    1 file changed, 5 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/drivers/cxl/core/trace.h b/drivers/cxl/core/trace.h
>>> index 20ca2fe2ca8e..64f6ad13529d 100644
>>> --- a/drivers/cxl/core/trace.h
>>> +++ b/drivers/cxl/core/trace.h
>>> @@ -62,10 +62,13 @@ TRACE_EVENT(cxl_aer_uncorrectable_error,
>>>    		 */
>>>    		memcpy(__entry->header_log, hl, CXL_HEADERLOG_SIZE);
>>>    	),
>>> -	TP_printk("%s: status: '%s' first_error: '%s'",
>>> +	TP_printk("%s: status: '%s' first_error: '%s' header_log: %s",
>>>    		  __get_str(dev_name),
>>>    		  show_uc_errs(__entry->status),
>>> -		  show_uc_errs(__entry->first_error)
>>> +		  show_uc_errs(__entry->first_error),
>>> +		__print_hex_dump("", DUMP_PREFIX_OFFSET, 32, 4,
>>> +				(char *)__entry->header_log,
>>> +				CXL_HEADERLOG_SIZE_U32, false)
>>>    	)
>>>    );
>>>      
> 

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [RFC PATCH 0/2] CXL UE RAS Multiple Header Logging support
  2023-01-13 15:53 ` Jonathan Cameron
@ 2023-01-17 17:36   ` Dave Jiang
  0 siblings, 0 replies; 10+ messages in thread
From: Dave Jiang @ 2023-01-17 17:36 UTC (permalink / raw)
  To: Jonathan Cameron, linux-cxl, dan.j.williams
  Cc: linuxarm, ira.weiny, vishal.l.verma, alison.schofield



On 1/13/23 8:53 AM, Jonathan Cameron wrote:
> On Fri, 13 Jan 2023 15:40:09 +0000
> Jonathan Cameron <Jonathan.Cameron@huawei.com> wrote:
> 
> Missed Dave Jiang off cc so resent
> (I thought I'd hit cancel fast enough but apparently not)
> Sorry for the noise!

No worries. I see them from the list either way.

> 
>> CXL UE RAS Error reporting allows an EP to report the capability of
>> recording Multiple Header Logs for uncorrectable errors.
>> Unlike equivalent feature in PCIe, there is no enable control
>> for this feature, so a supporting device may be expecting
>> a more complex software flow than that necessary for devices
>> that do not support this feature. Documentation of this feature
>> is sparse, with assumption it works the same as PCIe.
>>
>> There are hardware implementation choices allowed in the
>> equivalent PCIe r6.0 base spec section (6.4.2.4) that could
>> be safely used with the existing code, even with Multiple
>> Header Recording support but there are others that cannot.
>>
>> The issue is what happens when the EP is doing Multiple Header
>> Recording but then the software writes 1 to clear more than one
>> status bit at the time (PCIe spec warns against doing this
>>   - but it is what the current kernel code will do):
>> Option 1)
>>    It does the nice thing and clears all matching errors.
>>    Note this is a bit strange for the case where the device
>>    supports logging multiple instances of a given error - so
>>    the two can't be combined cleanly.  With that feature
>>    I can't see how anyone could implement hardware that coped
>>    cleanly with the wrong software flow.
>> Option 2)
>>    It clears only the first error bit leaving a bunch of error
>>    bits set (note that if it has recorded multiple errors of
>>    same type it might not even do that). These are sticky
>>    across resets, so you will probably end up coming back up
>>    and immediately seeing an error.
>>
>> So whilst you can design an EP to safe against non MH recording
>> aware software, it isn't generally the case. As we don't have
>> an explicit enable on CXL we have to handle anything reporting
>> the capability in a MH safe fashion.
>>
>> This feature was developed against emulation in QEMU.
>> The relevant patches have not yet been posted but can be found on
>> https://gitlab.com/jic23/qemu/-/commits/cxl-2023-01-11
>> along with description of how to inject errors in the patch
>> descriptions.  I'll post them for review for QEMU inclusion
>> shortly.
>>
>> RFC simply because the lack of specification detail means I am
>> less sure on this code than I would normally be. Unfortunately it
>> could be argued that the first patch is a fix for the
>> current upstream CXL RAS support.  If we want a simpler fix
>> one option would be to just fail to enable RAS support if
>> Multiple Header recording capability bit is set.  Or we
>> decide that it doesn't matter for now and add support for this
>> feature via the normal merge cycle.
>>
>> Second patch is just there to make this easier to test as
>> no additional software is needed to print the header log.
>>
>> Base is rather messy due to a clash between multiple cxl tree
>> branches.
>> cxl/fixes with the trace move on cxl/next cherry picked on top
>> as it moves the code that was fixed.
>>
>> Jonathan Cameron (2):
>>    cxl: RAS: Multiple header recording support
>>    cxl: Add tprintk support for header log hex dump
>>
>>   drivers/cxl/core/pci.c   | 17 ++++++++++++-----
>>   drivers/cxl/core/trace.h |  7 +++++--
>>   drivers/cxl/cxl.h        |  1 +
>>   3 files changed, 18 insertions(+), 7 deletions(-)
>>
> 

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [RFC PATCH 0/2] CXL UE RAS Multiple Header Logging support
  2023-01-13 15:40 [RFC PATCH 0/2] CXL UE RAS Multiple Header Logging support Jonathan Cameron
@ 2023-01-13 15:53 ` Jonathan Cameron
  2023-01-17 17:36   ` Dave Jiang
  0 siblings, 1 reply; 10+ messages in thread
From: Jonathan Cameron @ 2023-01-13 15:53 UTC (permalink / raw)
  To: linux-cxl, dan.j.williams
  Cc: linuxarm, ira.weiny, vishal.l.verma, alison.schofield

On Fri, 13 Jan 2023 15:40:09 +0000
Jonathan Cameron <Jonathan.Cameron@huawei.com> wrote:

Missed Dave Jiang off cc so resent
(I thought I'd hit cancel fast enough but apparently not)
Sorry for the noise!

> CXL UE RAS Error reporting allows an EP to report the capability of
> recording Multiple Header Logs for uncorrectable errors.
> Unlike equivalent feature in PCIe, there is no enable control
> for this feature, so a supporting device may be expecting
> a more complex software flow than that necessary for devices
> that do not support this feature. Documentation of this feature
> is sparse, with assumption it works the same as PCIe.
> 
> There are hardware implementation choices allowed in the
> equivalent PCIe r6.0 base spec section (6.4.2.4) that could
> be safely used with the existing code, even with Multiple
> Header Recording support but there are others that cannot.
> 
> The issue is what happens when the EP is doing Multiple Header
> Recording but then the software writes 1 to clear more than one
> status bit at the time (PCIe spec warns against doing this
>  - but it is what the current kernel code will do):
> Option 1)
>   It does the nice thing and clears all matching errors.
>   Note this is a bit strange for the case where the device
>   supports logging multiple instances of a given error - so
>   the two can't be combined cleanly.  With that feature
>   I can't see how anyone could implement hardware that coped
>   cleanly with the wrong software flow.
> Option 2)
>   It clears only the first error bit leaving a bunch of error
>   bits set (note that if it has recorded multiple errors of
>   same type it might not even do that). These are sticky
>   across resets, so you will probably end up coming back up
>   and immediately seeing an error.
> 
> So whilst you can design an EP to safe against non MH recording
> aware software, it isn't generally the case. As we don't have
> an explicit enable on CXL we have to handle anything reporting
> the capability in a MH safe fashion.
> 
> This feature was developed against emulation in QEMU.
> The relevant patches have not yet been posted but can be found on
> https://gitlab.com/jic23/qemu/-/commits/cxl-2023-01-11
> along with description of how to inject errors in the patch
> descriptions.  I'll post them for review for QEMU inclusion
> shortly.
> 
> RFC simply because the lack of specification detail means I am
> less sure on this code than I would normally be. Unfortunately it
> could be argued that the first patch is a fix for the
> current upstream CXL RAS support.  If we want a simpler fix
> one option would be to just fail to enable RAS support if
> Multiple Header recording capability bit is set.  Or we
> decide that it doesn't matter for now and add support for this
> feature via the normal merge cycle.
> 
> Second patch is just there to make this easier to test as
> no additional software is needed to print the header log.
> 
> Base is rather messy due to a clash between multiple cxl tree
> branches.
> cxl/fixes with the trace move on cxl/next cherry picked on top
> as it moves the code that was fixed.
> 
> Jonathan Cameron (2):
>   cxl: RAS: Multiple header recording support
>   cxl: Add tprintk support for header log hex dump
> 
>  drivers/cxl/core/pci.c   | 17 ++++++++++++-----
>  drivers/cxl/core/trace.h |  7 +++++--
>  drivers/cxl/cxl.h        |  1 +
>  3 files changed, 18 insertions(+), 7 deletions(-)
> 


^ permalink raw reply	[flat|nested] 10+ messages in thread

* [RFC PATCH 0/2] CXL UE RAS Multiple Header Logging support
@ 2023-01-13 15:40 Jonathan Cameron
  2023-01-13 15:53 ` Jonathan Cameron
  0 siblings, 1 reply; 10+ messages in thread
From: Jonathan Cameron @ 2023-01-13 15:40 UTC (permalink / raw)
  To: linux-cxl, dan.j.williams
  Cc: linuxarm, ira.weiny, vishal.l.verma, alison.schofield

CXL UE RAS Error reporting allows an EP to report the capability of
recording Multiple Header Logs for uncorrectable errors.
Unlike equivalent feature in PCIe, there is no enable control
for this feature, so a supporting device may be expecting
a more complex software flow than that necessary for devices
that do not support this feature. Documentation of this feature
is sparse, with assumption it works the same as PCIe.

There are hardware implementation choices allowed in the
equivalent PCIe r6.0 base spec section (6.4.2.4) that could
be safely used with the existing code, even with Multiple
Header Recording support but there are others that cannot.

The issue is what happens when the EP is doing Multiple Header
Recording but then the software writes 1 to clear more than one
status bit at the time (PCIe spec warns against doing this
 - but it is what the current kernel code will do):
Option 1)
  It does the nice thing and clears all matching errors.
  Note this is a bit strange for the case where the device
  supports logging multiple instances of a given error - so
  the two can't be combined cleanly.  With that feature
  I can't see how anyone could implement hardware that coped
  cleanly with the wrong software flow.
Option 2)
  It clears only the first error bit leaving a bunch of error
  bits set (note that if it has recorded multiple errors of
  same type it might not even do that). These are sticky
  across resets, so you will probably end up coming back up
  and immediately seeing an error.

So whilst you can design an EP to safe against non MH recording
aware software, it isn't generally the case. As we don't have
an explicit enable on CXL we have to handle anything reporting
the capability in a MH safe fashion.

This feature was developed against emulation in QEMU.
The relevant patches have not yet been posted but can be found on
https://gitlab.com/jic23/qemu/-/commits/cxl-2023-01-11
along with description of how to inject errors in the patch
descriptions.  I'll post them for review for QEMU inclusion
shortly.

RFC simply because the lack of specification detail means I am
less sure on this code than I would normally be. Unfortunately it
could be argued that the first patch is a fix for the
current upstream CXL RAS support.  If we want a simpler fix
one option would be to just fail to enable RAS support if
Multiple Header recording capability bit is set.  Or we
decide that it doesn't matter for now and add support for this
feature via the normal merge cycle.

Second patch is just there to make this easier to test as
no additional software is needed to print the header log.

Base is rather messy due to a clash between multiple cxl tree
branches.
cxl/fixes with the trace move on cxl/next cherry picked on top
as it moves the code that was fixed.

Jonathan Cameron (2):
  cxl: RAS: Multiple header recording support
  cxl: Add tprintk support for header log hex dump

 drivers/cxl/core/pci.c   | 17 ++++++++++++-----
 drivers/cxl/core/trace.h |  7 +++++--
 drivers/cxl/cxl.h        |  1 +
 3 files changed, 18 insertions(+), 7 deletions(-)

-- 
2.37.2


^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2023-01-18 15:22 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-01-13 15:40 [RFC PATCH 0/2] CXL UE RAS Multiple Header Logging support Jonathan Cameron
2023-01-13 15:40 ` [RFC PATCH 1/2] cxl: RAS: Multiple header recording support Jonathan Cameron
2023-01-17 18:05   ` Dave Jiang
2023-01-13 15:40 ` [RFC PATCH 2/2] cxl: Add tprintk support for header log hex dump Jonathan Cameron
2023-01-17 18:08   ` Dave Jiang
2023-01-18  9:43     ` Jonathan Cameron
2023-01-18 15:19       ` Dave Jiang
  -- strict thread matches above, loose matches on Subject: below --
2023-01-13 15:40 [RFC PATCH 0/2] CXL UE RAS Multiple Header Logging support Jonathan Cameron
2023-01-13 15:53 ` Jonathan Cameron
2023-01-17 17:36   ` Dave Jiang

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).