All of lore.kernel.org
 help / color / mirror / Atom feed
* EDAC driver for ARMv8 L1/L2 cache
       [not found] <VI1PR04MB2078DBC9381EA574CE5DA4B09A100@VI1PR04MB2078.eurprd04.prod.outlook.com>
@ 2018-01-09 21:43 ` Borislav Petkov
  2018-01-09 21:51   ` York Sun
  0 siblings, 1 reply; 25+ messages in thread
From: Borislav Petkov @ 2018-01-09 21:43 UTC (permalink / raw)
  To: linux-arm-kernel

Adding some more people to CC.

On Tue, Jan 09, 2018 at 08:48:43PM +0000, York Sun wrote:
> Borislav,
> 
> Are you aware of any existing (or in development) EDAC driver for ARMv8
> L1/L2 cache? I am thinking to write one if not available yet.

no I'm not but I see two EDAC drivers for ARM64: thunderx and xgene.

Please synchronize with their authors and ARM people what would be the
best thing to do and try extracting shared functionality from them
into a common compilation unit instead of duplicating it. I don't want
separate drivers per functional unit.

Thx.

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* EDAC driver for ARMv8 L1/L2 cache
  2018-01-09 21:43 ` EDAC driver for ARMv8 L1/L2 cache Borislav Petkov
@ 2018-01-09 21:51   ` York Sun
  2018-01-09 22:00     ` Borislav Petkov
  2018-01-12 16:38     ` Thor Thayer
  0 siblings, 2 replies; 25+ messages in thread
From: York Sun @ 2018-01-09 21:51 UTC (permalink / raw)
  To: linux-arm-kernel

On 01/09/2018 01:43 PM, Borislav Petkov wrote:
> Adding some more people to CC.
> 
> On Tue, Jan 09, 2018 at 08:48:43PM +0000, York Sun wrote:
>> Borislav,
>>
>> Are you aware of any existing (or in development) EDAC driver for ARMv8
>> L1/L2 cache? I am thinking to write one if not available yet.
> 
> no I'm not but I see two EDAC drivers for ARM64: thunderx and xgene.
> 
> Please synchronize with their authors and ARM people what would be the
> best thing to do and try extracting shared functionality from them
> into a common compilation unit instead of duplicating it. I don't want
> separate drivers per functional unit.
> 

Thanks for the pointer. Thunderx and xgene's drivers have different
implementation on the hardware. I found one patch closer to what I
expect, https://patchwork.kernel.org/patch/7513231/. I don't see
activities after 2015. I will reach out to the author.

York

^ permalink raw reply	[flat|nested] 25+ messages in thread

* EDAC driver for ARMv8 L1/L2 cache
  2018-01-09 21:51   ` York Sun
@ 2018-01-09 22:00     ` Borislav Petkov
  2018-01-12 16:38     ` Thor Thayer
  1 sibling, 0 replies; 25+ messages in thread
From: Borislav Petkov @ 2018-01-09 22:00 UTC (permalink / raw)
  To: linux-arm-kernel

On Tue, Jan 09, 2018 at 09:51:32PM +0000, York Sun wrote:
> On 01/09/2018 01:43 PM, Borislav Petkov wrote:
> > Adding some more people to CC.
> > 
> > On Tue, Jan 09, 2018 at 08:48:43PM +0000, York Sun wrote:
> >> Borislav,
> >>
> >> Are you aware of any existing (or in development) EDAC driver for ARMv8
> >> L1/L2 cache? I am thinking to write one if not available yet.
> > 
> > no I'm not but I see two EDAC drivers for ARM64: thunderx and xgene.
> > 
> > Please synchronize with their authors and ARM people what would be the
> > best thing to do and try extracting shared functionality from them
> > into a common compilation unit instead of duplicating it. I don't want
> > separate drivers per functional unit.
> > 
> 
> Thanks for the pointer. Thunderx and xgene's drivers have different
> implementation on the hardware. I found one patch closer to what I
> expect, https://patchwork.kernel.org/patch/7513231/. I don't see
> activities after 2015. I will reach out to the author.

He's probably busy with SME/SEV right now.

+ Marc.

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* EDAC driver for ARMv8 L1/L2 cache
  2018-01-09 21:51   ` York Sun
  2018-01-09 22:00     ` Borislav Petkov
@ 2018-01-12 16:38     ` Thor Thayer
  2018-01-12 16:48       ` York Sun
  1 sibling, 1 reply; 25+ messages in thread
From: Thor Thayer @ 2018-01-12 16:38 UTC (permalink / raw)
  To: linux-arm-kernel

On 01/09/2018 03:51 PM, York Sun wrote:
> On 01/09/2018 01:43 PM, Borislav Petkov wrote:
>> Adding some more people to CC.
>>
>> On Tue, Jan 09, 2018 at 08:48:43PM +0000, York Sun wrote:
>>> Borislav,
>>>
>>> Are you aware of any existing (or in development) EDAC driver for ARMv8
>>> L1/L2 cache? I am thinking to write one if not available yet.
>>
>> no I'm not but I see two EDAC drivers for ARM64: thunderx and xgene.
>>
>> Please synchronize with their authors and ARM people what would be the
>> best thing to do and try extracting shared functionality from them
>> into a common compilation unit instead of duplicating it. I don't want
>> separate drivers per functional unit.
>>
> 
> Thanks for the pointer. Thunderx and xgene's drivers have different
> implementation on the hardware. I found one patch closer to what I
> expect, https://patchwork.kernel.org/patch/7513231/. I don't see
> activities after 2015. I will reach out to the author.
> 
> York
> 

Hi York,

I'll be adding EDAC support for our Stratix10 (Quad ARM Cortex A-53) at 
some point but I'm not sure when it is scheduled - probably in the next 
few months. I'll be going through the same process of comparing our 
architecture to others. Below are links to the Stratix10.

https://www.altera.com/products/soc/portfolio/stratix-10-soc/overview.html
https://www.altera.com/products/soc/portfolio/stratix-10-soc/support.html


> --
> To unsubscribe from this list: send the line "unsubscribe linux-edac" in
> the body of a message to majordomo at vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

^ permalink raw reply	[flat|nested] 25+ messages in thread

* EDAC driver for ARMv8 L1/L2 cache
  2018-01-12 16:38     ` Thor Thayer
@ 2018-01-12 16:48       ` York Sun
  2018-01-12 17:12         ` Borislav Petkov
  2018-01-12 17:39         ` Thor Thayer
  0 siblings, 2 replies; 25+ messages in thread
From: York Sun @ 2018-01-12 16:48 UTC (permalink / raw)
  To: linux-arm-kernel

On 01/12/2018 08:35 AM, Thor Thayer wrote:
> On 01/09/2018 03:51 PM, York Sun wrote:
>> On 01/09/2018 01:43 PM, Borislav Petkov wrote:
>>> Adding some more people to CC.
>>>
>>> On Tue, Jan 09, 2018 at 08:48:43PM +0000, York Sun wrote:
>>>> Borislav,
>>>>
>>>> Are you aware of any existing (or in development) EDAC driver for ARMv8
>>>> L1/L2 cache? I am thinking to write one if not available yet.
>>>
>>> no I'm not but I see two EDAC drivers for ARM64: thunderx and xgene.
>>>
>>> Please synchronize with their authors and ARM people what would be the
>>> best thing to do and try extracting shared functionality from them
>>> into a common compilation unit instead of duplicating it. I don't want
>>> separate drivers per functional unit.
>>>
>>
>> Thanks for the pointer. Thunderx and xgene's drivers have different
>> implementation on the hardware. I found one patch closer to what I
>> expect, https://emea01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fpatchwork.kernel.org%2Fpatch%2F7513231%2F&data=02%7C01%7Cyork.sun%40nxp.com%7C29cd81c44431418e2a6308d559da7d3b%7C686ea1d3bc2b4c6fa92cd99c5c301635%7C0%7C0%7C636513717317947616&sdata=flIZL8mxGc4VWbTpl5nPjkRXaB8uNjcc382HcXO08qE%3D&reserved=0. I don't see
>> activities after 2015. I will reach out to the author.
>>
>> York
>>
> 
> Hi York,
> 
> I'll be adding EDAC support for our Stratix10 (Quad ARM Cortex A-53) at 
> some point but I'm not sure when it is scheduled - probably in the next 
> few months. I'll be going through the same process of comparing our 
> architecture to others. Below are links to the Stratix10.
>

I see Stratix10 has A53 core. I am concerned on reading the
CPUMERRSR_EL1 and L2MERRSR_EL1. The are IMPLEMENTATION DEFINED
registers. They may not be available on all SoCs, or all time. The
CPUMERRSR_EL1 needs to be read on each core, the L2MERRSR_EL1 needs to
be read on each cluster. I am not sure how to handle them in EDAC driver
yet. I am hoping Mark Rutland can shed some light as he commented on
similar patches [1][2] before.

York

[1] https://patchwork.kernel.org/patch/7460391/
[2] https://patchwork.kernel.org/patch/7513231/

^ permalink raw reply	[flat|nested] 25+ messages in thread

* EDAC driver for ARMv8 L1/L2 cache
  2018-01-12 16:48       ` York Sun
@ 2018-01-12 17:12         ` Borislav Petkov
  2018-01-12 17:17           ` York Sun
  2018-01-12 17:23           ` Mark Rutland
  2018-01-12 17:39         ` Thor Thayer
  1 sibling, 2 replies; 25+ messages in thread
From: Borislav Petkov @ 2018-01-12 17:12 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, Jan 12, 2018 at 04:48:05PM +0000, York Sun wrote:
> I see Stratix10 has A53 core. I am concerned on reading the
> CPUMERRSR_EL1 and L2MERRSR_EL1. The are IMPLEMENTATION DEFINED
> registers. They may not be available on all SoCs, or all time.

Is there something like CPUID on x86, on ARM64 which denotes presence of
a certain feature?

Or is that thing devicetree?

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* EDAC driver for ARMv8 L1/L2 cache
  2018-01-12 17:12         ` Borislav Petkov
@ 2018-01-12 17:17           ` York Sun
  2018-01-12 17:38             ` Mark Rutland
  2018-01-12 17:23           ` Mark Rutland
  1 sibling, 1 reply; 25+ messages in thread
From: York Sun @ 2018-01-12 17:17 UTC (permalink / raw)
  To: linux-arm-kernel

On 01/12/2018 09:13 AM, Borislav Petkov wrote:
> On Fri, Jan 12, 2018 at 04:48:05PM +0000, York Sun wrote:
>> I see Stratix10 has A53 core. I am concerned on reading the
>> CPUMERRSR_EL1 and L2MERRSR_EL1. The are IMPLEMENTATION DEFINED
>> registers. They may not be available on all SoCs, or all time.
> 
> Is there something like CPUID on x86, on ARM64 which denotes presence of
> a certain feature?
> 
> Or is that thing devicetree?
> 

This feature is available on the SoC I am working on (NXP LS1046A). It
seems always there. I don't know if there is any register denoting the
existence of such feature. I guess we can use device tree if this
feature exists. Not sure if big.LITTLE is a concern here.

York

^ permalink raw reply	[flat|nested] 25+ messages in thread

* EDAC driver for ARMv8 L1/L2 cache
  2018-01-12 17:12         ` Borislav Petkov
  2018-01-12 17:17           ` York Sun
@ 2018-01-12 17:23           ` Mark Rutland
  1 sibling, 0 replies; 25+ messages in thread
From: Mark Rutland @ 2018-01-12 17:23 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, Jan 12, 2018 at 06:12:51PM +0100, Borislav Petkov wrote:
> On Fri, Jan 12, 2018 at 04:48:05PM +0000, York Sun wrote:
> > I see Stratix10 has A53 core. I am concerned on reading the
> > CPUMERRSR_EL1 and L2MERRSR_EL1. The are IMPLEMENTATION DEFINED
> > registers. They may not be available on all SoCs, or all time.
> 
> Is there something like CPUID on x86, on ARM64 which denotes presence of
> a certain feature?
> 
> Or is that thing devicetree?

We have ID registers, like CPUID, but for various reasons those aren't
sufficient to guarantee that IMPLEMENTATION DEFINED features can be
used.

If we want to support this, we'd certainly need something in DT.

For ACPI systems I'd expect this to be hidden behind APEI or similar,
with the kernel staying well clear of these registers.

Thanks,
Mark.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* EDAC driver for ARMv8 L1/L2 cache
  2018-01-12 17:17           ` York Sun
@ 2018-01-12 17:38             ` Mark Rutland
  2018-01-12 17:44               ` York Sun
  0 siblings, 1 reply; 25+ messages in thread
From: Mark Rutland @ 2018-01-12 17:38 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, Jan 12, 2018 at 05:17:54PM +0000, York Sun wrote:
> On 01/12/2018 09:13 AM, Borislav Petkov wrote:
> > On Fri, Jan 12, 2018 at 04:48:05PM +0000, York Sun wrote:
> >> I see Stratix10 has A53 core. I am concerned on reading the
> >> CPUMERRSR_EL1 and L2MERRSR_EL1. The are IMPLEMENTATION DEFINED
> >> registers. They may not be available on all SoCs, or all time.
> > 
> > Is there something like CPUID on x86, on ARM64 which denotes presence of
> > a certain feature?
> > 
> > Or is that thing devicetree?
> 
> This feature is available on the SoC I am working on (NXP LS1046A). It
> seems always there. I don't know if there is any register denoting the
> existence of such feature.

There is no architectural register describing this.

Judging by the Cortex-A53 TRM, there is no IMP DEF / auxilliary register
describing this.

Regardless, a DT binding is necessary due to potential interactions with
FW, hypervisors, etc.

> I guess we can use device tree if this feature exists. Not sure if
> big.LITTLE is a concern here.

There are big.LITTLE systems with Cortex-A53, so we definitely care
about big.LITTLE here.

Thanks,
Mark.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* EDAC driver for ARMv8 L1/L2 cache
  2018-01-12 16:48       ` York Sun
  2018-01-12 17:12         ` Borislav Petkov
@ 2018-01-12 17:39         ` Thor Thayer
  1 sibling, 0 replies; 25+ messages in thread
From: Thor Thayer @ 2018-01-12 17:39 UTC (permalink / raw)
  To: linux-arm-kernel

On 01/12/2018 10:48 AM, York Sun wrote:
> On 01/12/2018 08:35 AM, Thor Thayer wrote:
>> On 01/09/2018 03:51 PM, York Sun wrote:
>>> On 01/09/2018 01:43 PM, Borislav Petkov wrote:
>>>> Adding some more people to CC.
>>>>
>>>> On Tue, Jan 09, 2018 at 08:48:43PM +0000, York Sun wrote:
>>>>> Borislav,
>>>>>
>>>>> Are you aware of any existing (or in development) EDAC driver for ARMv8
>>>>> L1/L2 cache? I am thinking to write one if not available yet.
>>>>
>>>> no I'm not but I see two EDAC drivers for ARM64: thunderx and xgene.
>>>>
>>>> Please synchronize with their authors and ARM people what would be the
>>>> best thing to do and try extracting shared functionality from them
>>>> into a common compilation unit instead of duplicating it. I don't want
>>>> separate drivers per functional unit.
>>>>
>>>
>>> Thanks for the pointer. Thunderx and xgene's drivers have different
>>> implementation on the hardware. I found one patch closer to what I
>>> expect, https://emea01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fpatchwork.kernel.org%2Fpatch%2F7513231%2F&data=02%7C01%7Cyork.sun%40nxp.com%7C29cd81c44431418e2a6308d559da7d3b%7C686ea1d3bc2b4c6fa92cd99c5c301635%7C0%7C0%7C636513717317947616&sdata=flIZL8mxGc4VWbTpl5nPjkRXaB8uNjcc382HcXO08qE%3D&reserved=0. I don't see
>>> activities after 2015. I will reach out to the author.
>>>
>>> York
>>>
>>
>> Hi York,
>>
>> I'll be adding EDAC support for our Stratix10 (Quad ARM Cortex A-53) at
>> some point but I'm not sure when it is scheduled - probably in the next
>> few months. I'll be going through the same process of comparing our
>> architecture to others. Below are links to the Stratix10.
>>
> 
> I see Stratix10 has A53 core. I am concerned on reading the
> CPUMERRSR_EL1 and L2MERRSR_EL1. The are IMPLEMENTATION DEFINED
> registers. They may not be available on all SoCs, or all time. The
> CPUMERRSR_EL1 needs to be read on each core, the L2MERRSR_EL1 needs to
> be read on each cluster. I am not sure how to handle them in EDAC driver
> yet. I am hoping Mark Rutland can shed some light as he commented on
> similar patches [1][2] before.
> 
> York
> 
> [1] https://patchwork.kernel.org/patch/7460391/
> [2] https://patchwork.kernel.org/patch/7513231/

I haven't started digging into EDAC for the Stratix10 yet so I don't 
have a comment but I'll check out your links. Thanks!

> --
> To unsubscribe from this list: send the line "unsubscribe linux-edac" in
> the body of a message to majordomo at vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

^ permalink raw reply	[flat|nested] 25+ messages in thread

* EDAC driver for ARMv8 L1/L2 cache
  2018-01-12 17:38             ` Mark Rutland
@ 2018-01-12 17:44               ` York Sun
  2018-01-12 18:00                 ` Mark Rutland
  0 siblings, 1 reply; 25+ messages in thread
From: York Sun @ 2018-01-12 17:44 UTC (permalink / raw)
  To: linux-arm-kernel

On 01/12/2018 09:38 AM, Mark Rutland wrote:
> On Fri, Jan 12, 2018 at 05:17:54PM +0000, York Sun wrote:
>> On 01/12/2018 09:13 AM, Borislav Petkov wrote:
>>> On Fri, Jan 12, 2018 at 04:48:05PM +0000, York Sun wrote:
>>>> I see Stratix10 has A53 core. I am concerned on reading the
>>>> CPUMERRSR_EL1 and L2MERRSR_EL1. The are IMPLEMENTATION DEFINED
>>>> registers. They may not be available on all SoCs, or all time.
>>>
>>> Is there something like CPUID on x86, on ARM64 which denotes presence of
>>> a certain feature?
>>>
>>> Or is that thing devicetree?
>>
>> This feature is available on the SoC I am working on (NXP LS1046A). It
>> seems always there. I don't know if there is any register denoting the
>> existence of such feature.
> 
> There is no architectural register describing this.
> 
> Judging by the Cortex-A53 TRM, there is no IMP DEF / auxilliary register
> describing this.
> 
> Regardless, a DT binding is necessary due to potential interactions with
> FW, hypervisors, etc.
> 
>> I guess we can use device tree if this feature exists. Not sure if
>> big.LITTLE is a concern here.
> 
> There are big.LITTLE systems with Cortex-A53, so we definitely care
> about big.LITTLE here.
> 

For a given system, for example A72-A53 big.LITTLE configuration, the
feature is known and can be described in DT. We have to detect which
core is running to determine if this feature is available. Does this
sound right?

York

^ permalink raw reply	[flat|nested] 25+ messages in thread

* EDAC driver for ARMv8 L1/L2 cache
  2018-01-12 17:44               ` York Sun
@ 2018-01-12 18:00                 ` Mark Rutland
  2018-01-12 18:16                   ` York Sun
  0 siblings, 1 reply; 25+ messages in thread
From: Mark Rutland @ 2018-01-12 18:00 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, Jan 12, 2018 at 05:44:56PM +0000, York Sun wrote:
> On 01/12/2018 09:38 AM, Mark Rutland wrote:
> > On Fri, Jan 12, 2018 at 05:17:54PM +0000, York Sun wrote:
> >> On 01/12/2018 09:13 AM, Borislav Petkov wrote:
> >>> On Fri, Jan 12, 2018 at 04:48:05PM +0000, York Sun wrote:
> >>>> I see Stratix10 has A53 core. I am concerned on reading the
> >>>> CPUMERRSR_EL1 and L2MERRSR_EL1. The are IMPLEMENTATION DEFINED
> >>>> registers. They may not be available on all SoCs, or all time.
> >>>
> >>> Is there something like CPUID on x86, on ARM64 which denotes presence of
> >>> a certain feature?
> >>>
> >>> Or is that thing devicetree?
> >>
> >> This feature is available on the SoC I am working on (NXP LS1046A). It
> >> seems always there. I don't know if there is any register denoting the
> >> existence of such feature.
> > 
> > There is no architectural register describing this.
> > 
> > Judging by the Cortex-A53 TRM, there is no IMP DEF / auxilliary register
> > describing this.
> > 
> > Regardless, a DT binding is necessary due to potential interactions with
> > FW, hypervisors, etc.
> > 
> >> I guess we can use device tree if this feature exists. Not sure if
> >> big.LITTLE is a concern here.
> > 
> > There are big.LITTLE systems with Cortex-A53, so we definitely care
> > about big.LITTLE here.
> > 
> 
> For a given system, for example A72-A53 big.LITTLE configuration, the
> feature is known and can be described in DT. We have to detect which
> core is running to determine if this feature is available. Does this
> sound right?

The binding will need to explicitly describe the set of CPUs the feature
is usable on.

On a big.LITTLE system, I'd expect multiple nodes in the DT, similar to
what we have for PMUs. Even if all CPUs have some EDAC functionality, it
will differ across microarchitectures.

e.g. we'd have something like:

edac-a72 {
	compatible = "arm,cortex-a72-edac";
	cpus = <&cpu0>, <&cpu 1>;
	...
};

edac-a53 {
	compatible = "arm,cortex-a53-edac";
	cpus = <&cpu2>, <&cpu 3>;
	...
};

... and it may get more complicated from there on. There may be other
users like FW or Secure OSs that we have to interact with, this might
get arbitrarily reset across idle, etc.

The Cortex-A53 TRM doesn't use the term EDAC at all, and refers
separately to "CPU Memory Error" functionality an "L2 Memory Error"
functionality. I don't know if those are expected to be used separately.

Thanks,
Mark.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* EDAC driver for ARMv8 L1/L2 cache
  2018-01-12 18:00                 ` Mark Rutland
@ 2018-01-12 18:16                   ` York Sun
  2018-01-13 11:31                     ` Borislav Petkov
  0 siblings, 1 reply; 25+ messages in thread
From: York Sun @ 2018-01-12 18:16 UTC (permalink / raw)
  To: linux-arm-kernel

On 01/12/2018 10:01 AM, Mark Rutland wrote:
> On Fri, Jan 12, 2018 at 05:44:56PM +0000, York Sun wrote:
>> On 01/12/2018 09:38 AM, Mark Rutland wrote:
>>> On Fri, Jan 12, 2018 at 05:17:54PM +0000, York Sun wrote:
>>>> On 01/12/2018 09:13 AM, Borislav Petkov wrote:
>>>>> On Fri, Jan 12, 2018 at 04:48:05PM +0000, York Sun wrote:
>>>>>> I see Stratix10 has A53 core. I am concerned on reading the
>>>>>> CPUMERRSR_EL1 and L2MERRSR_EL1. The are IMPLEMENTATION DEFINED
>>>>>> registers. They may not be available on all SoCs, or all time.
>>>>>
>>>>> Is there something like CPUID on x86, on ARM64 which denotes presence of
>>>>> a certain feature?
>>>>>
>>>>> Or is that thing devicetree?
>>>>
>>>> This feature is available on the SoC I am working on (NXP LS1046A). It
>>>> seems always there. I don't know if there is any register denoting the
>>>> existence of such feature.
>>>
>>> There is no architectural register describing this.
>>>
>>> Judging by the Cortex-A53 TRM, there is no IMP DEF / auxilliary register
>>> describing this.
>>>
>>> Regardless, a DT binding is necessary due to potential interactions with
>>> FW, hypervisors, etc.
>>>
>>>> I guess we can use device tree if this feature exists. Not sure if
>>>> big.LITTLE is a concern here.
>>>
>>> There are big.LITTLE systems with Cortex-A53, so we definitely care
>>> about big.LITTLE here.
>>>
>>
>> For a given system, for example A72-A53 big.LITTLE configuration, the
>> feature is known and can be described in DT. We have to detect which
>> core is running to determine if this feature is available. Does this
>> sound right?
> 
> The binding will need to explicitly describe the set of CPUs the feature
> is usable on.
> 
> On a big.LITTLE system, I'd expect multiple nodes in the DT, similar to
> what we have for PMUs. Even if all CPUs have some EDAC functionality, it
> will differ across microarchitectures.
> 
> e.g. we'd have something like:
> 
> edac-a72 {
> 	compatible = "arm,cortex-a72-edac";
> 	cpus = <&cpu0>, <&cpu 1>;
> 	...
> };
> 
> edac-a53 {
> 	compatible = "arm,cortex-a53-edac";
> 	cpus = <&cpu2>, <&cpu 3>;
> 	...
> };
> 
> ... and it may get more complicated from there on. There may be other
> users like FW or Secure OSs that we have to interact with, this might
> get arbitrarily reset across idle, etc.
> 
> The Cortex-A53 TRM doesn't use the term EDAC at all, and refers
> separately to "CPU Memory Error" functionality an "L2 Memory Error"
> functionality. I don't know if those are expected to be used separately.
> 

Mark,

Thanks a lot for the guidance. I now have a better idea to deal with
big.LITTLE systems.

Can you explain more about the IMPLEMENTATION DEFINED registers? I
consulted our design team and was told there isn't a build parameter to
opt in or out these registers. So they are always available in our SoCs.
Under what condition these IMPLEMENTATION DEFINED registers would not be
available?

York

^ permalink raw reply	[flat|nested] 25+ messages in thread

* EDAC driver for ARMv8 L1/L2 cache
  2018-01-12 18:16                   ` York Sun
@ 2018-01-13 11:31                     ` Borislav Petkov
  2018-01-15 14:21                       ` Mark Rutland
  0 siblings, 1 reply; 25+ messages in thread
From: Borislav Petkov @ 2018-01-13 11:31 UTC (permalink / raw)
  To: linux-arm-kernel

Btw,

I jus received a driver for this:

https://lkml.kernel.org/r/1515804626-21254-1-git-send-email-kyan at codeaurora.org

and I think you guys should first talk and do one version which works
for all.

Adding Kyle to CC.

@Kyle: there's also a driver here: https://patchwork.kernel.org/patch/7513231

Also, looking at it, you probably should call it arm64_edac or so, to
be clear what it is. armv8 - although known to ARM people - makes me go
look at wikipedia to refresh what v8 was again.

Also, I'd like you to figure out who's going to be maintaining it so
that get_maintainer.pl can find you on bug reports.

Thx.

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* EDAC driver for ARMv8 L1/L2 cache
  2018-01-13 11:31                     ` Borislav Petkov
@ 2018-01-15 14:21                       ` Mark Rutland
  2018-01-15 14:32                         ` Borislav Petkov
  0 siblings, 1 reply; 25+ messages in thread
From: Mark Rutland @ 2018-01-15 14:21 UTC (permalink / raw)
  To: linux-arm-kernel

On Sat, Jan 13, 2018 at 12:31:56PM +0100, Borislav Petkov wrote:
> Btw,
> 
> I jus received a driver for this:
> 
> https://lkml.kernel.org/r/1515804626-21254-1-git-send-email-kyan at codeaurora.org
> 
> and I think you guys should first talk and do one version which works
> for all.

Unfortunately, in the asbence of the RAS extensions, any EDAC
functionality is IMPLEMENTATION DEFINED, and each CPU does it
differently, with different registers and no common programming model.

The driver linked above is *not* a generic ARMv8 EDAC driver. AFAICT< it
is specific to some Qualcomm Kryo CPUs. I'll reply there.

> 
> Adding Kyle to CC.
> 
> @Kyle: there's also a driver here: https://patchwork.kernel.org/patch/7513231
> 
> Also, looking at it, you probably should call it arm64_edac or so, to
> be clear what it is. armv8 - although known to ARM people - makes me go
> look at wikipedia to refresh what v8 was again.

I'm not sure it's possible to cover all potential EDAC implementations
behind the same driver.

If we need these drivers, they should be <cpuname>_edac or <soc>_edac.

Thanks,
Mark.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* EDAC driver for ARMv8 L1/L2 cache
  2018-01-15 14:21                       ` Mark Rutland
@ 2018-01-15 14:32                         ` Borislav Petkov
  2018-01-15 14:49                           ` Mark Rutland
  2018-01-15 16:19                           ` York Sun
  0 siblings, 2 replies; 25+ messages in thread
From: Borislav Petkov @ 2018-01-15 14:32 UTC (permalink / raw)
  To: linux-arm-kernel

On Mon, Jan 15, 2018 at 02:21:54PM +0000, Mark Rutland wrote:
> I'm not sure it's possible to cover all potential EDAC implementations
> behind the same driver.
> 
> If we need these drivers, they should be <cpuname>_edac or <soc>_edac.

Yuck, I wanted to avoid that...

Oh well, can we at least share the barebones design and exchange
registers/DT only or is it more complicated than that?

Thx.

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* EDAC driver for ARMv8 L1/L2 cache
  2018-01-15 14:32                         ` Borislav Petkov
@ 2018-01-15 14:49                           ` Mark Rutland
  2018-01-15 16:19                           ` York Sun
  1 sibling, 0 replies; 25+ messages in thread
From: Mark Rutland @ 2018-01-15 14:49 UTC (permalink / raw)
  To: linux-arm-kernel

On Mon, Jan 15, 2018 at 03:32:49PM +0100, Borislav Petkov wrote:
> On Mon, Jan 15, 2018 at 02:21:54PM +0000, Mark Rutland wrote:
> > I'm not sure it's possible to cover all potential EDAC implementations
> > behind the same driver.
> > 
> > If we need these drivers, they should be <cpuname>_edac or <soc>_edac.
> 
> Yuck, I wanted to avoid that...
> 
> Oh well, can we at least share the barebones design and exchange
> registers/DT only or is it more complicated than that?

Unfortunately, it's likely to be more complicated than that. The HW may
be wildly different.

The ARMv8 RAS extensions provide some level of architected EDAC support,
which should be more uniform across CPUs, FW, hypervisors, etc. That we
could support as armv8_ras_edac, and going forward I'd hope that
implementations align with that.

Thanks,
Mark.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* EDAC driver for ARMv8 L1/L2 cache
  2018-01-15 14:32                         ` Borislav Petkov
  2018-01-15 14:49                           ` Mark Rutland
@ 2018-01-15 16:19                           ` York Sun
  2018-01-15 23:23                             ` Borislav Petkov
  1 sibling, 1 reply; 25+ messages in thread
From: York Sun @ 2018-01-15 16:19 UTC (permalink / raw)
  To: linux-arm-kernel

On 01/15/2018 06:32 AM, Borislav Petkov wrote:
> On Mon, Jan 15, 2018 at 02:21:54PM +0000, Mark Rutland wrote:
>> I'm not sure it's possible to cover all potential EDAC implementations
>> behind the same driver.
>>
>> If we need these drivers, they should be <cpuname>_edac or <soc>_edac.
> 
> Yuck, I wanted to avoid that...
> 
> Oh well, can we at least share the barebones design and exchange
> registers/DT only or is it more complicated than that?
> 

I have different plan on the driver. Since I don't get interrupt on
correctable errors, my thinking is to use dynamic polling interval. With
more correctable errors, the polling interval is decreased to a
threshold, then further action needs to be taken (at least I would raise
an error message). The idea is uncorrectable error proceeds from
increasing correctable errors. I would use per CPU data structure so we
know which core has increasing errors. For embedded system, we may
shutdown the core(s) with error to protect the system from critical
failure. Similarly but differently, L2 cache is shared on the same
cluster. We may have to shutdown the whole cluster if we have excessive
correctable errors. For server, it may simply shutdown. Of course the
decision is not made by the driver, but by RAS or other monitoring policy.

Any comment is welcomed.

York

^ permalink raw reply	[flat|nested] 25+ messages in thread

* EDAC driver for ARMv8 L1/L2 cache
  2018-01-15 16:19                           ` York Sun
@ 2018-01-15 23:23                             ` Borislav Petkov
  2018-01-15 23:28                               ` York Sun
  0 siblings, 1 reply; 25+ messages in thread
From: Borislav Petkov @ 2018-01-15 23:23 UTC (permalink / raw)
  To: linux-arm-kernel

On Mon, Jan 15, 2018 at 04:19:09PM +0000, York Sun wrote:
> I have different plan on the driver. Since I don't get interrupt on
> correctable errors,

Is that something nxp-specific or generic ARM64 thing?

If it is the former, then you'd need a special driver for your hardware,
as much as I don't like the idea of having per-vendor driver...

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* EDAC driver for ARMv8 L1/L2 cache
  2018-01-15 23:23                             ` Borislav Petkov
@ 2018-01-15 23:28                               ` York Sun
  2018-01-15 23:52                                 ` Borislav Petkov
  0 siblings, 1 reply; 25+ messages in thread
From: York Sun @ 2018-01-15 23:28 UTC (permalink / raw)
  To: linux-arm-kernel

On 01/15/2018 03:23 PM, Borislav Petkov wrote:
> On Mon, Jan 15, 2018 at 04:19:09PM +0000, York Sun wrote:
>> I have different plan on the driver. Since I don't get interrupt on
>> correctable errors,
> 
> Is that something nxp-specific or generic ARM64 thing?
> 
> If it is the former, then you'd need a special driver for your hardware,
> as much as I don't like the idea of having per-vendor driver...
> 
It is generic ARM64 thing. I believe only SError interrupt is available.
But SError can be caused by many reasons. For L1/L2 only uncorrectable
errors trigger SError. My first step is to deal with correctable errors
and to raise a flag somehow (Haven't think that through yet) when
excessive errors are reported. Next step I will deal with the
uncorrectable errors with interrupt.

York

^ permalink raw reply	[flat|nested] 25+ messages in thread

* EDAC driver for ARMv8 L1/L2 cache
  2018-01-15 23:28                               ` York Sun
@ 2018-01-15 23:52                                 ` Borislav Petkov
  2018-01-16  0:16                                   ` York Sun
  2018-02-01 20:56                                   ` York Sun
  0 siblings, 2 replies; 25+ messages in thread
From: Borislav Petkov @ 2018-01-15 23:52 UTC (permalink / raw)
  To: linux-arm-kernel

On Mon, Jan 15, 2018 at 11:28:14PM +0000, York Sun wrote:
> It is generic ARM64 thing. I believe only SError interrupt is available.

So if it is, then I'd suggest you hammer out a proper design with the
ARM folks.

> But SError can be caused by many reasons. For L1/L2 only uncorrectable
> errors trigger SError. My first step is to deal with correctable errors
> and to raise a flag somehow (Haven't think that through yet) when
> excessive errors are reported. Next step I will deal with the
> uncorrectable errors with interrupt.

You need to think of the use cases first and where it makes *sense* to
do recovery actions and what those actions will be. In general make the
system more resilient. Simply reporting some error numbers just for the
sake of it, doesn't mean a whole lot.

And once you have those, figure out what kind of error info you need
from the hardware in order to do those recovery actions.

>From your other mail:
> I have different plan on the driver. Since I don't get interrupt on
> correctable errors, my thinking is to use dynamic polling interval. With
> more correctable errors, the polling interval is decreased to a
> threshold, then further action needs to be taken (at least I would raise
> an error message). The idea is uncorrectable error proceeds from
> increasing correctable errors.

Is that always the case? I wouldn't be so sure.

> I would use per CPU data structure so we know which core has
> increasing errors. For embedded system, we may shutdown the core(s)
> with error to protect the system from critical failure. Similarly but
> differently, L2 cache is shared on the same cluster.

All this implies is that the cores are the most prone to generate
errors. I wouldn't be so sure. This is almost never the case on x86, for
example. The main "fun" there is DRAM. The cores almost never.

So think about first which hw component really needs RAS protection and
why exactly.

> We may have to shutdown the whole cluster if we have excessive
> correctable errors.

Why?

Correctable errors are normally corrected by hardware and they don't
have any effect on system state. Why shut down?

You probably want to slow down operation or do some other actions to
improve the situation first. Immediately shutting down is not really
RAS.

> For server, it may simply shutdown.

This is not really gracefully handling of errors.

You need to gradually diminish system performance to alleviate error
levels and if that doesn't work, then warn and shutdown.

> Of course the decision is not made by the driver, but by RAS or other
> monitoring policy.

Just read up on RAS strategies and policies to get a better idea what
others are doing before implementing something which might not bring a
whole lot.

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* EDAC driver for ARMv8 L1/L2 cache
  2018-01-15 23:52                                 ` Borislav Petkov
@ 2018-01-16  0:16                                   ` York Sun
  2018-02-01 20:56                                   ` York Sun
  1 sibling, 0 replies; 25+ messages in thread
From: York Sun @ 2018-01-16  0:16 UTC (permalink / raw)
  To: linux-arm-kernel

On 01/15/2018 03:52 PM, Borislav Petkov wrote:
> On Mon, Jan 15, 2018 at 11:28:14PM +0000, York Sun wrote:
>> It is generic ARM64 thing. I believe only SError interrupt is available.
> 
> So if it is, then I'd suggest you hammer out a proper design with the
> ARM folks.
> 
>> But SError can be caused by many reasons. For L1/L2 only uncorrectable
>> errors trigger SError. My first step is to deal with correctable errors
>> and to raise a flag somehow (Haven't think that through yet) when
>> excessive errors are reported. Next step I will deal with the
>> uncorrectable errors with interrupt.
> 
> You need to think of the use cases first and where it makes *sense* to
> do recovery actions and what those actions will be. In general make the
> system more resilient. Simply reporting some error numbers just for the
> sake of it, doesn't mean a whole lot.

The correctable errors are corrected, so no further action is needed.
The uncorrectable errors cannot be corrected. It is too bad we have
them. But that doesn't mean we should sit there and wait for the system
to fail. Further comments below.

> 
> And once you have those, figure out what kind of error info you need
> from the hardware in order to do those recovery actions.
> 
> From your other mail:
>> I have different plan on the driver. Since I don't get interrupt on
>> correctable errors, my thinking is to use dynamic polling interval. With
>> more correctable errors, the polling interval is decreased to a
>> threshold, then further action needs to be taken (at least I would raise
>> an error message). The idea is uncorrectable error proceeds from
>> increasing correctable errors.
> > Is that always the case? I wouldn't be so sure.

I don't have first hand experience. I was told some research showed the
uncorrectable errors happen after excessive correctable errors, while
the hardware degrades. If we are interested in preventing the
uncorrectable errors from happening, the only indicator we may have is
the trend of errors.

> 
>> I would use per CPU data structure so we know which core has
>> increasing errors. For embedded system, we may shutdown the core(s)
>> with error to protect the system from critical failure. Similarly but
>> differently, L2 cache is shared on the same cluster.
> 
> All this implies is that the cores are the most prone to generate
> errors. I wouldn't be so sure. This is almost never the case on x86, for
> example. The main "fun" there is DRAM. The cores almost never.

We already have DRAM EDAC driver. I believe CPU cache is unlikely to
have errors. But when the error happens, it's better to know. I am doing
this L1/L2 cache EDAC driver because we have customer showing interest.
Honestly neither we nor our customer know what can be done at this moment.

> 
> So think about first which hw component really needs RAS protection and
> why exactly.
> 
>> We may have to shutdown the whole cluster if we have excessive
>> correctable errors.
> 
> Why?

L2 cache is shared by all cores on the same cluster. I don't think it is
a good idea to turn off cache on the affected cores. Powering off those
cores (consequently powering off the cluster) may run the system longer
until a replacement can be deployed.

> 
> Correctable errors are normally corrected by hardware and they don't
> have any effect on system state. Why shut down?

If we see increasing correctable errors, it may be a sign that
uncorrectable errors is just around the corner. Again, it is based on
the same theory above.

> 
> You probably want to slow down operation or do some other actions to
> improve the situation first. Immediately shutting down is not really
> RAS.
> 
>> For server, it may simply shutdown.
> 
> This is not really gracefully handling of errors.
> 
> You need to gradually diminish system performance to alleviate error
> levels and if that doesn't work, then warn and shutdown.

Like I said, the decision shouldn't be made by the EDAC driver. It only
reports the errors. Further action can be taken by monitoring software.

> 
>> Of course the decision is not made by the driver, but by RAS or other
>> monitoring policy.
> 
> Just read up on RAS strategies and policies to get a better idea what
> others are doing before implementing something which might not bring a
> whole lot.
> 

Agree. I am new on this topic even I am not new to Linux driver.

York

^ permalink raw reply	[flat|nested] 25+ messages in thread

* EDAC driver for ARMv8 L1/L2 cache
  2018-01-15 23:52                                 ` Borislav Petkov
  2018-01-16  0:16                                   ` York Sun
@ 2018-02-01 20:56                                   ` York Sun
  2018-02-08 15:31                                     ` James Morse
  1 sibling, 1 reply; 25+ messages in thread
From: York Sun @ 2018-02-01 20:56 UTC (permalink / raw)
  To: linux-arm-kernel

On 01/15/2018 03:52 PM, Borislav Petkov wrote:
> On Mon, Jan 15, 2018 at 11:28:14PM +0000, York Sun wrote:
>> It is generic ARM64 thing. I believe only SError interrupt is available.
> 
> So if it is, then I'd suggest you hammer out a proper design with the
> ARM folks.
> 

I made some progress and need some help on coding. On the platform I am
working on, it has A53 cores. Each A53 core has a signal nINTERRIRQ.
They are connected to one GIC interrupt. I managed to inject errors to
some safe address without triggering system error and I got the
interrupt. I will need to find out which core has errors by reading
register on each core (and clear the interrupt). How can I do this
within interrupt service routine? I tried to use
smp_call_function_single() but it doesn't like the IRQ being disabled.

Any suggestion is appreciated!

York

^ permalink raw reply	[flat|nested] 25+ messages in thread

* EDAC driver for ARMv8 L1/L2 cache
  2018-02-01 20:56                                   ` York Sun
@ 2018-02-08 15:31                                     ` James Morse
  2018-02-08 15:53                                       ` York Sun
  0 siblings, 1 reply; 25+ messages in thread
From: James Morse @ 2018-02-08 15:31 UTC (permalink / raw)
  To: linux-arm-kernel

Hi York,

On 01/02/18 20:56, York Sun wrote:
> On 01/15/2018 03:52 PM, Borislav Petkov wrote:
>> On Mon, Jan 15, 2018 at 11:28:14PM +0000, York Sun wrote:
>>> It is generic ARM64 thing. I believe only SError interrupt is available.
>>
>> So if it is, then I'd suggest you hammer out a proper design with the
>> ARM folks.

> I made some progress and need some help on coding. On the platform I am
> working on, it has A53 cores. Each A53 core has a signal nINTERRIRQ.
> They are connected to one GIC interrupt.

Is this a fatal signal for the CPU that should have received it? The signals
start out as being per-cpu, but configured like this you can only take the
interrupt one one CPU...

(is this thing edge or level triggered?)


> I managed to inject errors to some safe address without triggering system
> error and I got the interrupt.

(okay, sounds like its a corrected error)


> I will need to find out which core has errors by reading
> register on each core (and clear the interrupt). How can I do this
> within interrupt service routine? I tried to use
> smp_call_function_single() but it doesn't like the IRQ being disabled.

mm/memory-failure.c:memory_failure_queue() has an example of how you could do
this. It uses 'schedule_work_on()' to re-run after the IRQ, from there you
should be able to call something like on_each_cpu() or smp_call_function_many().

If this thing is level triggered you can't escape the irq-handler until its
cleared. If it needs clearing by a remote CPU this is going to be a problem.


Thanks,

James

^ permalink raw reply	[flat|nested] 25+ messages in thread

* EDAC driver for ARMv8 L1/L2 cache
  2018-02-08 15:31                                     ` James Morse
@ 2018-02-08 15:53                                       ` York Sun
  0 siblings, 0 replies; 25+ messages in thread
From: York Sun @ 2018-02-08 15:53 UTC (permalink / raw)
  To: linux-arm-kernel

On 02/08/2018 07:33 AM, James Morse wrote:
> Hi York,
> 
> On 01/02/18 20:56, York Sun wrote:
>> On 01/15/2018 03:52 PM, Borislav Petkov wrote:
>>> On Mon, Jan 15, 2018 at 11:28:14PM +0000, York Sun wrote:
>>>> It is generic ARM64 thing. I believe only SError interrupt is available.
>>>
>>> So if it is, then I'd suggest you hammer out a proper design with the
>>> ARM folks.
> 
>> I made some progress and need some help on coding. On the platform I am
>> working on, it has A53 cores. Each A53 core has a signal nINTERRIRQ.
>> They are connected to one GIC interrupt.
> 
> Is this a fatal signal for the CPU that should have received it? The signals
> start out as being per-cpu, but configured like this you can only take the
> interrupt one one CPU...
> 
> (is this thing edge or level triggered?)
> 

Level

> 
>> I managed to inject errors to some safe address without triggering system
>> error and I got the interrupt.
> 
> (okay, sounds like its a corrected error)


Double-bit error. I got the interrupt nINTERRIRQ.

> 
> 
>> I will need to find out which core has errors by reading
>> register on each core (and clear the interrupt). How can I do this
>> within interrupt service routine? I tried to use
>> smp_call_function_single() but it doesn't like the IRQ being disabled.
> 
> mm/memory-failure.c:memory_failure_queue() has an example of how you could do
> this. It uses 'schedule_work_on()' to re-run after the IRQ, from there you
> should be able to call something like on_each_cpu() or smp_call_function_many().
> 
> If this thing is level triggered you can't escape the irq-handler until its
> cleared. If it needs clearing by a remote CPU this is going to be a problem.
> 

Yeah. For now I use smp_call_function_single_async() and it looks OK. I
will send out patch for review after clearing some other hardware
questions with ARM support.

York

^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2018-02-08 15:53 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <VI1PR04MB2078DBC9381EA574CE5DA4B09A100@VI1PR04MB2078.eurprd04.prod.outlook.com>
2018-01-09 21:43 ` EDAC driver for ARMv8 L1/L2 cache Borislav Petkov
2018-01-09 21:51   ` York Sun
2018-01-09 22:00     ` Borislav Petkov
2018-01-12 16:38     ` Thor Thayer
2018-01-12 16:48       ` York Sun
2018-01-12 17:12         ` Borislav Petkov
2018-01-12 17:17           ` York Sun
2018-01-12 17:38             ` Mark Rutland
2018-01-12 17:44               ` York Sun
2018-01-12 18:00                 ` Mark Rutland
2018-01-12 18:16                   ` York Sun
2018-01-13 11:31                     ` Borislav Petkov
2018-01-15 14:21                       ` Mark Rutland
2018-01-15 14:32                         ` Borislav Petkov
2018-01-15 14:49                           ` Mark Rutland
2018-01-15 16:19                           ` York Sun
2018-01-15 23:23                             ` Borislav Petkov
2018-01-15 23:28                               ` York Sun
2018-01-15 23:52                                 ` Borislav Petkov
2018-01-16  0:16                                   ` York Sun
2018-02-01 20:56                                   ` York Sun
2018-02-08 15:31                                     ` James Morse
2018-02-08 15:53                                       ` York Sun
2018-01-12 17:23           ` Mark Rutland
2018-01-12 17:39         ` Thor Thayer

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.