* EDAC driver for ARMv8 L1/L2 cache [not found] <VI1PR04MB2078DBC9381EA574CE5DA4B09A100@VI1PR04MB2078.eurprd04.prod.outlook.com> @ 2018-01-09 21:43 ` Borislav Petkov 2018-01-09 21:51 ` York Sun 0 siblings, 1 reply; 25+ messages in thread From: Borislav Petkov @ 2018-01-09 21:43 UTC (permalink / raw) To: linux-arm-kernel Adding some more people to CC. On Tue, Jan 09, 2018 at 08:48:43PM +0000, York Sun wrote: > Borislav, > > Are you aware of any existing (or in development) EDAC driver for ARMv8 > L1/L2 cache? I am thinking to write one if not available yet. no I'm not but I see two EDAC drivers for ARM64: thunderx and xgene. Please synchronize with their authors and ARM people what would be the best thing to do and try extracting shared functionality from them into a common compilation unit instead of duplicating it. I don't want separate drivers per functional unit. Thx. -- Regards/Gruss, Boris. Good mailing practices for 400: avoid top-posting and trim the reply. ^ permalink raw reply [flat|nested] 25+ messages in thread
* EDAC driver for ARMv8 L1/L2 cache 2018-01-09 21:43 ` EDAC driver for ARMv8 L1/L2 cache Borislav Petkov @ 2018-01-09 21:51 ` York Sun 2018-01-09 22:00 ` Borislav Petkov 2018-01-12 16:38 ` Thor Thayer 0 siblings, 2 replies; 25+ messages in thread From: York Sun @ 2018-01-09 21:51 UTC (permalink / raw) To: linux-arm-kernel On 01/09/2018 01:43 PM, Borislav Petkov wrote: > Adding some more people to CC. > > On Tue, Jan 09, 2018 at 08:48:43PM +0000, York Sun wrote: >> Borislav, >> >> Are you aware of any existing (or in development) EDAC driver for ARMv8 >> L1/L2 cache? I am thinking to write one if not available yet. > > no I'm not but I see two EDAC drivers for ARM64: thunderx and xgene. > > Please synchronize with their authors and ARM people what would be the > best thing to do and try extracting shared functionality from them > into a common compilation unit instead of duplicating it. I don't want > separate drivers per functional unit. > Thanks for the pointer. Thunderx and xgene's drivers have different implementation on the hardware. I found one patch closer to what I expect, https://patchwork.kernel.org/patch/7513231/. I don't see activities after 2015. I will reach out to the author. York ^ permalink raw reply [flat|nested] 25+ messages in thread
* EDAC driver for ARMv8 L1/L2 cache 2018-01-09 21:51 ` York Sun @ 2018-01-09 22:00 ` Borislav Petkov 2018-01-12 16:38 ` Thor Thayer 1 sibling, 0 replies; 25+ messages in thread From: Borislav Petkov @ 2018-01-09 22:00 UTC (permalink / raw) To: linux-arm-kernel On Tue, Jan 09, 2018 at 09:51:32PM +0000, York Sun wrote: > On 01/09/2018 01:43 PM, Borislav Petkov wrote: > > Adding some more people to CC. > > > > On Tue, Jan 09, 2018 at 08:48:43PM +0000, York Sun wrote: > >> Borislav, > >> > >> Are you aware of any existing (or in development) EDAC driver for ARMv8 > >> L1/L2 cache? I am thinking to write one if not available yet. > > > > no I'm not but I see two EDAC drivers for ARM64: thunderx and xgene. > > > > Please synchronize with their authors and ARM people what would be the > > best thing to do and try extracting shared functionality from them > > into a common compilation unit instead of duplicating it. I don't want > > separate drivers per functional unit. > > > > Thanks for the pointer. Thunderx and xgene's drivers have different > implementation on the hardware. I found one patch closer to what I > expect, https://patchwork.kernel.org/patch/7513231/. I don't see > activities after 2015. I will reach out to the author. He's probably busy with SME/SEV right now. + Marc. -- Regards/Gruss, Boris. Good mailing practices for 400: avoid top-posting and trim the reply. ^ permalink raw reply [flat|nested] 25+ messages in thread
* EDAC driver for ARMv8 L1/L2 cache 2018-01-09 21:51 ` York Sun 2018-01-09 22:00 ` Borislav Petkov @ 2018-01-12 16:38 ` Thor Thayer 2018-01-12 16:48 ` York Sun 1 sibling, 1 reply; 25+ messages in thread From: Thor Thayer @ 2018-01-12 16:38 UTC (permalink / raw) To: linux-arm-kernel On 01/09/2018 03:51 PM, York Sun wrote: > On 01/09/2018 01:43 PM, Borislav Petkov wrote: >> Adding some more people to CC. >> >> On Tue, Jan 09, 2018 at 08:48:43PM +0000, York Sun wrote: >>> Borislav, >>> >>> Are you aware of any existing (or in development) EDAC driver for ARMv8 >>> L1/L2 cache? I am thinking to write one if not available yet. >> >> no I'm not but I see two EDAC drivers for ARM64: thunderx and xgene. >> >> Please synchronize with their authors and ARM people what would be the >> best thing to do and try extracting shared functionality from them >> into a common compilation unit instead of duplicating it. I don't want >> separate drivers per functional unit. >> > > Thanks for the pointer. Thunderx and xgene's drivers have different > implementation on the hardware. I found one patch closer to what I > expect, https://patchwork.kernel.org/patch/7513231/. I don't see > activities after 2015. I will reach out to the author. > > York > Hi York, I'll be adding EDAC support for our Stratix10 (Quad ARM Cortex A-53) at some point but I'm not sure when it is scheduled - probably in the next few months. I'll be going through the same process of comparing our architecture to others. Below are links to the Stratix10. https://www.altera.com/products/soc/portfolio/stratix-10-soc/overview.html https://www.altera.com/products/soc/portfolio/stratix-10-soc/support.html > -- > To unsubscribe from this list: send the line "unsubscribe linux-edac" in > the body of a message to majordomo at vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > ^ permalink raw reply [flat|nested] 25+ messages in thread
* EDAC driver for ARMv8 L1/L2 cache 2018-01-12 16:38 ` Thor Thayer @ 2018-01-12 16:48 ` York Sun 2018-01-12 17:12 ` Borislav Petkov 2018-01-12 17:39 ` Thor Thayer 0 siblings, 2 replies; 25+ messages in thread From: York Sun @ 2018-01-12 16:48 UTC (permalink / raw) To: linux-arm-kernel On 01/12/2018 08:35 AM, Thor Thayer wrote: > On 01/09/2018 03:51 PM, York Sun wrote: >> On 01/09/2018 01:43 PM, Borislav Petkov wrote: >>> Adding some more people to CC. >>> >>> On Tue, Jan 09, 2018 at 08:48:43PM +0000, York Sun wrote: >>>> Borislav, >>>> >>>> Are you aware of any existing (or in development) EDAC driver for ARMv8 >>>> L1/L2 cache? I am thinking to write one if not available yet. >>> >>> no I'm not but I see two EDAC drivers for ARM64: thunderx and xgene. >>> >>> Please synchronize with their authors and ARM people what would be the >>> best thing to do and try extracting shared functionality from them >>> into a common compilation unit instead of duplicating it. I don't want >>> separate drivers per functional unit. >>> >> >> Thanks for the pointer. Thunderx and xgene's drivers have different >> implementation on the hardware. I found one patch closer to what I >> expect, https://emea01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fpatchwork.kernel.org%2Fpatch%2F7513231%2F&data=02%7C01%7Cyork.sun%40nxp.com%7C29cd81c44431418e2a6308d559da7d3b%7C686ea1d3bc2b4c6fa92cd99c5c301635%7C0%7C0%7C636513717317947616&sdata=flIZL8mxGc4VWbTpl5nPjkRXaB8uNjcc382HcXO08qE%3D&reserved=0. I don't see >> activities after 2015. I will reach out to the author. >> >> York >> > > Hi York, > > I'll be adding EDAC support for our Stratix10 (Quad ARM Cortex A-53) at > some point but I'm not sure when it is scheduled - probably in the next > few months. I'll be going through the same process of comparing our > architecture to others. Below are links to the Stratix10. > I see Stratix10 has A53 core. I am concerned on reading the CPUMERRSR_EL1 and L2MERRSR_EL1. The are IMPLEMENTATION DEFINED registers. They may not be available on all SoCs, or all time. The CPUMERRSR_EL1 needs to be read on each core, the L2MERRSR_EL1 needs to be read on each cluster. I am not sure how to handle them in EDAC driver yet. I am hoping Mark Rutland can shed some light as he commented on similar patches [1][2] before. York [1] https://patchwork.kernel.org/patch/7460391/ [2] https://patchwork.kernel.org/patch/7513231/ ^ permalink raw reply [flat|nested] 25+ messages in thread
* EDAC driver for ARMv8 L1/L2 cache 2018-01-12 16:48 ` York Sun @ 2018-01-12 17:12 ` Borislav Petkov 2018-01-12 17:17 ` York Sun 2018-01-12 17:23 ` Mark Rutland 2018-01-12 17:39 ` Thor Thayer 1 sibling, 2 replies; 25+ messages in thread From: Borislav Petkov @ 2018-01-12 17:12 UTC (permalink / raw) To: linux-arm-kernel On Fri, Jan 12, 2018 at 04:48:05PM +0000, York Sun wrote: > I see Stratix10 has A53 core. I am concerned on reading the > CPUMERRSR_EL1 and L2MERRSR_EL1. The are IMPLEMENTATION DEFINED > registers. They may not be available on all SoCs, or all time. Is there something like CPUID on x86, on ARM64 which denotes presence of a certain feature? Or is that thing devicetree? -- Regards/Gruss, Boris. Good mailing practices for 400: avoid top-posting and trim the reply. ^ permalink raw reply [flat|nested] 25+ messages in thread
* EDAC driver for ARMv8 L1/L2 cache 2018-01-12 17:12 ` Borislav Petkov @ 2018-01-12 17:17 ` York Sun 2018-01-12 17:38 ` Mark Rutland 2018-01-12 17:23 ` Mark Rutland 1 sibling, 1 reply; 25+ messages in thread From: York Sun @ 2018-01-12 17:17 UTC (permalink / raw) To: linux-arm-kernel On 01/12/2018 09:13 AM, Borislav Petkov wrote: > On Fri, Jan 12, 2018 at 04:48:05PM +0000, York Sun wrote: >> I see Stratix10 has A53 core. I am concerned on reading the >> CPUMERRSR_EL1 and L2MERRSR_EL1. The are IMPLEMENTATION DEFINED >> registers. They may not be available on all SoCs, or all time. > > Is there something like CPUID on x86, on ARM64 which denotes presence of > a certain feature? > > Or is that thing devicetree? > This feature is available on the SoC I am working on (NXP LS1046A). It seems always there. I don't know if there is any register denoting the existence of such feature. I guess we can use device tree if this feature exists. Not sure if big.LITTLE is a concern here. York ^ permalink raw reply [flat|nested] 25+ messages in thread
* EDAC driver for ARMv8 L1/L2 cache 2018-01-12 17:17 ` York Sun @ 2018-01-12 17:38 ` Mark Rutland 2018-01-12 17:44 ` York Sun 0 siblings, 1 reply; 25+ messages in thread From: Mark Rutland @ 2018-01-12 17:38 UTC (permalink / raw) To: linux-arm-kernel On Fri, Jan 12, 2018 at 05:17:54PM +0000, York Sun wrote: > On 01/12/2018 09:13 AM, Borislav Petkov wrote: > > On Fri, Jan 12, 2018 at 04:48:05PM +0000, York Sun wrote: > >> I see Stratix10 has A53 core. I am concerned on reading the > >> CPUMERRSR_EL1 and L2MERRSR_EL1. The are IMPLEMENTATION DEFINED > >> registers. They may not be available on all SoCs, or all time. > > > > Is there something like CPUID on x86, on ARM64 which denotes presence of > > a certain feature? > > > > Or is that thing devicetree? > > This feature is available on the SoC I am working on (NXP LS1046A). It > seems always there. I don't know if there is any register denoting the > existence of such feature. There is no architectural register describing this. Judging by the Cortex-A53 TRM, there is no IMP DEF / auxilliary register describing this. Regardless, a DT binding is necessary due to potential interactions with FW, hypervisors, etc. > I guess we can use device tree if this feature exists. Not sure if > big.LITTLE is a concern here. There are big.LITTLE systems with Cortex-A53, so we definitely care about big.LITTLE here. Thanks, Mark. ^ permalink raw reply [flat|nested] 25+ messages in thread
* EDAC driver for ARMv8 L1/L2 cache 2018-01-12 17:38 ` Mark Rutland @ 2018-01-12 17:44 ` York Sun 2018-01-12 18:00 ` Mark Rutland 0 siblings, 1 reply; 25+ messages in thread From: York Sun @ 2018-01-12 17:44 UTC (permalink / raw) To: linux-arm-kernel On 01/12/2018 09:38 AM, Mark Rutland wrote: > On Fri, Jan 12, 2018 at 05:17:54PM +0000, York Sun wrote: >> On 01/12/2018 09:13 AM, Borislav Petkov wrote: >>> On Fri, Jan 12, 2018 at 04:48:05PM +0000, York Sun wrote: >>>> I see Stratix10 has A53 core. I am concerned on reading the >>>> CPUMERRSR_EL1 and L2MERRSR_EL1. The are IMPLEMENTATION DEFINED >>>> registers. They may not be available on all SoCs, or all time. >>> >>> Is there something like CPUID on x86, on ARM64 which denotes presence of >>> a certain feature? >>> >>> Or is that thing devicetree? >> >> This feature is available on the SoC I am working on (NXP LS1046A). It >> seems always there. I don't know if there is any register denoting the >> existence of such feature. > > There is no architectural register describing this. > > Judging by the Cortex-A53 TRM, there is no IMP DEF / auxilliary register > describing this. > > Regardless, a DT binding is necessary due to potential interactions with > FW, hypervisors, etc. > >> I guess we can use device tree if this feature exists. Not sure if >> big.LITTLE is a concern here. > > There are big.LITTLE systems with Cortex-A53, so we definitely care > about big.LITTLE here. > For a given system, for example A72-A53 big.LITTLE configuration, the feature is known and can be described in DT. We have to detect which core is running to determine if this feature is available. Does this sound right? York ^ permalink raw reply [flat|nested] 25+ messages in thread
* EDAC driver for ARMv8 L1/L2 cache 2018-01-12 17:44 ` York Sun @ 2018-01-12 18:00 ` Mark Rutland 2018-01-12 18:16 ` York Sun 0 siblings, 1 reply; 25+ messages in thread From: Mark Rutland @ 2018-01-12 18:00 UTC (permalink / raw) To: linux-arm-kernel On Fri, Jan 12, 2018 at 05:44:56PM +0000, York Sun wrote: > On 01/12/2018 09:38 AM, Mark Rutland wrote: > > On Fri, Jan 12, 2018 at 05:17:54PM +0000, York Sun wrote: > >> On 01/12/2018 09:13 AM, Borislav Petkov wrote: > >>> On Fri, Jan 12, 2018 at 04:48:05PM +0000, York Sun wrote: > >>>> I see Stratix10 has A53 core. I am concerned on reading the > >>>> CPUMERRSR_EL1 and L2MERRSR_EL1. The are IMPLEMENTATION DEFINED > >>>> registers. They may not be available on all SoCs, or all time. > >>> > >>> Is there something like CPUID on x86, on ARM64 which denotes presence of > >>> a certain feature? > >>> > >>> Or is that thing devicetree? > >> > >> This feature is available on the SoC I am working on (NXP LS1046A). It > >> seems always there. I don't know if there is any register denoting the > >> existence of such feature. > > > > There is no architectural register describing this. > > > > Judging by the Cortex-A53 TRM, there is no IMP DEF / auxilliary register > > describing this. > > > > Regardless, a DT binding is necessary due to potential interactions with > > FW, hypervisors, etc. > > > >> I guess we can use device tree if this feature exists. Not sure if > >> big.LITTLE is a concern here. > > > > There are big.LITTLE systems with Cortex-A53, so we definitely care > > about big.LITTLE here. > > > > For a given system, for example A72-A53 big.LITTLE configuration, the > feature is known and can be described in DT. We have to detect which > core is running to determine if this feature is available. Does this > sound right? The binding will need to explicitly describe the set of CPUs the feature is usable on. On a big.LITTLE system, I'd expect multiple nodes in the DT, similar to what we have for PMUs. Even if all CPUs have some EDAC functionality, it will differ across microarchitectures. e.g. we'd have something like: edac-a72 { compatible = "arm,cortex-a72-edac"; cpus = <&cpu0>, <&cpu 1>; ... }; edac-a53 { compatible = "arm,cortex-a53-edac"; cpus = <&cpu2>, <&cpu 3>; ... }; ... and it may get more complicated from there on. There may be other users like FW or Secure OSs that we have to interact with, this might get arbitrarily reset across idle, etc. The Cortex-A53 TRM doesn't use the term EDAC at all, and refers separately to "CPU Memory Error" functionality an "L2 Memory Error" functionality. I don't know if those are expected to be used separately. Thanks, Mark. ^ permalink raw reply [flat|nested] 25+ messages in thread
* EDAC driver for ARMv8 L1/L2 cache 2018-01-12 18:00 ` Mark Rutland @ 2018-01-12 18:16 ` York Sun 2018-01-13 11:31 ` Borislav Petkov 0 siblings, 1 reply; 25+ messages in thread From: York Sun @ 2018-01-12 18:16 UTC (permalink / raw) To: linux-arm-kernel On 01/12/2018 10:01 AM, Mark Rutland wrote: > On Fri, Jan 12, 2018 at 05:44:56PM +0000, York Sun wrote: >> On 01/12/2018 09:38 AM, Mark Rutland wrote: >>> On Fri, Jan 12, 2018 at 05:17:54PM +0000, York Sun wrote: >>>> On 01/12/2018 09:13 AM, Borislav Petkov wrote: >>>>> On Fri, Jan 12, 2018 at 04:48:05PM +0000, York Sun wrote: >>>>>> I see Stratix10 has A53 core. I am concerned on reading the >>>>>> CPUMERRSR_EL1 and L2MERRSR_EL1. The are IMPLEMENTATION DEFINED >>>>>> registers. They may not be available on all SoCs, or all time. >>>>> >>>>> Is there something like CPUID on x86, on ARM64 which denotes presence of >>>>> a certain feature? >>>>> >>>>> Or is that thing devicetree? >>>> >>>> This feature is available on the SoC I am working on (NXP LS1046A). It >>>> seems always there. I don't know if there is any register denoting the >>>> existence of such feature. >>> >>> There is no architectural register describing this. >>> >>> Judging by the Cortex-A53 TRM, there is no IMP DEF / auxilliary register >>> describing this. >>> >>> Regardless, a DT binding is necessary due to potential interactions with >>> FW, hypervisors, etc. >>> >>>> I guess we can use device tree if this feature exists. Not sure if >>>> big.LITTLE is a concern here. >>> >>> There are big.LITTLE systems with Cortex-A53, so we definitely care >>> about big.LITTLE here. >>> >> >> For a given system, for example A72-A53 big.LITTLE configuration, the >> feature is known and can be described in DT. We have to detect which >> core is running to determine if this feature is available. Does this >> sound right? > > The binding will need to explicitly describe the set of CPUs the feature > is usable on. > > On a big.LITTLE system, I'd expect multiple nodes in the DT, similar to > what we have for PMUs. Even if all CPUs have some EDAC functionality, it > will differ across microarchitectures. > > e.g. we'd have something like: > > edac-a72 { > compatible = "arm,cortex-a72-edac"; > cpus = <&cpu0>, <&cpu 1>; > ... > }; > > edac-a53 { > compatible = "arm,cortex-a53-edac"; > cpus = <&cpu2>, <&cpu 3>; > ... > }; > > ... and it may get more complicated from there on. There may be other > users like FW or Secure OSs that we have to interact with, this might > get arbitrarily reset across idle, etc. > > The Cortex-A53 TRM doesn't use the term EDAC at all, and refers > separately to "CPU Memory Error" functionality an "L2 Memory Error" > functionality. I don't know if those are expected to be used separately. > Mark, Thanks a lot for the guidance. I now have a better idea to deal with big.LITTLE systems. Can you explain more about the IMPLEMENTATION DEFINED registers? I consulted our design team and was told there isn't a build parameter to opt in or out these registers. So they are always available in our SoCs. Under what condition these IMPLEMENTATION DEFINED registers would not be available? York ^ permalink raw reply [flat|nested] 25+ messages in thread
* EDAC driver for ARMv8 L1/L2 cache 2018-01-12 18:16 ` York Sun @ 2018-01-13 11:31 ` Borislav Petkov 2018-01-15 14:21 ` Mark Rutland 0 siblings, 1 reply; 25+ messages in thread From: Borislav Petkov @ 2018-01-13 11:31 UTC (permalink / raw) To: linux-arm-kernel Btw, I jus received a driver for this: https://lkml.kernel.org/r/1515804626-21254-1-git-send-email-kyan at codeaurora.org and I think you guys should first talk and do one version which works for all. Adding Kyle to CC. @Kyle: there's also a driver here: https://patchwork.kernel.org/patch/7513231 Also, looking at it, you probably should call it arm64_edac or so, to be clear what it is. armv8 - although known to ARM people - makes me go look at wikipedia to refresh what v8 was again. Also, I'd like you to figure out who's going to be maintaining it so that get_maintainer.pl can find you on bug reports. Thx. -- Regards/Gruss, Boris. Good mailing practices for 400: avoid top-posting and trim the reply. ^ permalink raw reply [flat|nested] 25+ messages in thread
* EDAC driver for ARMv8 L1/L2 cache 2018-01-13 11:31 ` Borislav Petkov @ 2018-01-15 14:21 ` Mark Rutland 2018-01-15 14:32 ` Borislav Petkov 0 siblings, 1 reply; 25+ messages in thread From: Mark Rutland @ 2018-01-15 14:21 UTC (permalink / raw) To: linux-arm-kernel On Sat, Jan 13, 2018 at 12:31:56PM +0100, Borislav Petkov wrote: > Btw, > > I jus received a driver for this: > > https://lkml.kernel.org/r/1515804626-21254-1-git-send-email-kyan at codeaurora.org > > and I think you guys should first talk and do one version which works > for all. Unfortunately, in the asbence of the RAS extensions, any EDAC functionality is IMPLEMENTATION DEFINED, and each CPU does it differently, with different registers and no common programming model. The driver linked above is *not* a generic ARMv8 EDAC driver. AFAICT< it is specific to some Qualcomm Kryo CPUs. I'll reply there. > > Adding Kyle to CC. > > @Kyle: there's also a driver here: https://patchwork.kernel.org/patch/7513231 > > Also, looking at it, you probably should call it arm64_edac or so, to > be clear what it is. armv8 - although known to ARM people - makes me go > look at wikipedia to refresh what v8 was again. I'm not sure it's possible to cover all potential EDAC implementations behind the same driver. If we need these drivers, they should be <cpuname>_edac or <soc>_edac. Thanks, Mark. ^ permalink raw reply [flat|nested] 25+ messages in thread
* EDAC driver for ARMv8 L1/L2 cache 2018-01-15 14:21 ` Mark Rutland @ 2018-01-15 14:32 ` Borislav Petkov 2018-01-15 14:49 ` Mark Rutland 2018-01-15 16:19 ` York Sun 0 siblings, 2 replies; 25+ messages in thread From: Borislav Petkov @ 2018-01-15 14:32 UTC (permalink / raw) To: linux-arm-kernel On Mon, Jan 15, 2018 at 02:21:54PM +0000, Mark Rutland wrote: > I'm not sure it's possible to cover all potential EDAC implementations > behind the same driver. > > If we need these drivers, they should be <cpuname>_edac or <soc>_edac. Yuck, I wanted to avoid that... Oh well, can we at least share the barebones design and exchange registers/DT only or is it more complicated than that? Thx. -- Regards/Gruss, Boris. Good mailing practices for 400: avoid top-posting and trim the reply. ^ permalink raw reply [flat|nested] 25+ messages in thread
* EDAC driver for ARMv8 L1/L2 cache 2018-01-15 14:32 ` Borislav Petkov @ 2018-01-15 14:49 ` Mark Rutland 2018-01-15 16:19 ` York Sun 1 sibling, 0 replies; 25+ messages in thread From: Mark Rutland @ 2018-01-15 14:49 UTC (permalink / raw) To: linux-arm-kernel On Mon, Jan 15, 2018 at 03:32:49PM +0100, Borislav Petkov wrote: > On Mon, Jan 15, 2018 at 02:21:54PM +0000, Mark Rutland wrote: > > I'm not sure it's possible to cover all potential EDAC implementations > > behind the same driver. > > > > If we need these drivers, they should be <cpuname>_edac or <soc>_edac. > > Yuck, I wanted to avoid that... > > Oh well, can we at least share the barebones design and exchange > registers/DT only or is it more complicated than that? Unfortunately, it's likely to be more complicated than that. The HW may be wildly different. The ARMv8 RAS extensions provide some level of architected EDAC support, which should be more uniform across CPUs, FW, hypervisors, etc. That we could support as armv8_ras_edac, and going forward I'd hope that implementations align with that. Thanks, Mark. ^ permalink raw reply [flat|nested] 25+ messages in thread
* EDAC driver for ARMv8 L1/L2 cache 2018-01-15 14:32 ` Borislav Petkov 2018-01-15 14:49 ` Mark Rutland @ 2018-01-15 16:19 ` York Sun 2018-01-15 23:23 ` Borislav Petkov 1 sibling, 1 reply; 25+ messages in thread From: York Sun @ 2018-01-15 16:19 UTC (permalink / raw) To: linux-arm-kernel On 01/15/2018 06:32 AM, Borislav Petkov wrote: > On Mon, Jan 15, 2018 at 02:21:54PM +0000, Mark Rutland wrote: >> I'm not sure it's possible to cover all potential EDAC implementations >> behind the same driver. >> >> If we need these drivers, they should be <cpuname>_edac or <soc>_edac. > > Yuck, I wanted to avoid that... > > Oh well, can we at least share the barebones design and exchange > registers/DT only or is it more complicated than that? > I have different plan on the driver. Since I don't get interrupt on correctable errors, my thinking is to use dynamic polling interval. With more correctable errors, the polling interval is decreased to a threshold, then further action needs to be taken (at least I would raise an error message). The idea is uncorrectable error proceeds from increasing correctable errors. I would use per CPU data structure so we know which core has increasing errors. For embedded system, we may shutdown the core(s) with error to protect the system from critical failure. Similarly but differently, L2 cache is shared on the same cluster. We may have to shutdown the whole cluster if we have excessive correctable errors. For server, it may simply shutdown. Of course the decision is not made by the driver, but by RAS or other monitoring policy. Any comment is welcomed. York ^ permalink raw reply [flat|nested] 25+ messages in thread
* EDAC driver for ARMv8 L1/L2 cache 2018-01-15 16:19 ` York Sun @ 2018-01-15 23:23 ` Borislav Petkov 2018-01-15 23:28 ` York Sun 0 siblings, 1 reply; 25+ messages in thread From: Borislav Petkov @ 2018-01-15 23:23 UTC (permalink / raw) To: linux-arm-kernel On Mon, Jan 15, 2018 at 04:19:09PM +0000, York Sun wrote: > I have different plan on the driver. Since I don't get interrupt on > correctable errors, Is that something nxp-specific or generic ARM64 thing? If it is the former, then you'd need a special driver for your hardware, as much as I don't like the idea of having per-vendor driver... -- Regards/Gruss, Boris. Good mailing practices for 400: avoid top-posting and trim the reply. ^ permalink raw reply [flat|nested] 25+ messages in thread
* EDAC driver for ARMv8 L1/L2 cache 2018-01-15 23:23 ` Borislav Petkov @ 2018-01-15 23:28 ` York Sun 2018-01-15 23:52 ` Borislav Petkov 0 siblings, 1 reply; 25+ messages in thread From: York Sun @ 2018-01-15 23:28 UTC (permalink / raw) To: linux-arm-kernel On 01/15/2018 03:23 PM, Borislav Petkov wrote: > On Mon, Jan 15, 2018 at 04:19:09PM +0000, York Sun wrote: >> I have different plan on the driver. Since I don't get interrupt on >> correctable errors, > > Is that something nxp-specific or generic ARM64 thing? > > If it is the former, then you'd need a special driver for your hardware, > as much as I don't like the idea of having per-vendor driver... > It is generic ARM64 thing. I believe only SError interrupt is available. But SError can be caused by many reasons. For L1/L2 only uncorrectable errors trigger SError. My first step is to deal with correctable errors and to raise a flag somehow (Haven't think that through yet) when excessive errors are reported. Next step I will deal with the uncorrectable errors with interrupt. York ^ permalink raw reply [flat|nested] 25+ messages in thread
* EDAC driver for ARMv8 L1/L2 cache 2018-01-15 23:28 ` York Sun @ 2018-01-15 23:52 ` Borislav Petkov 2018-01-16 0:16 ` York Sun 2018-02-01 20:56 ` York Sun 0 siblings, 2 replies; 25+ messages in thread From: Borislav Petkov @ 2018-01-15 23:52 UTC (permalink / raw) To: linux-arm-kernel On Mon, Jan 15, 2018 at 11:28:14PM +0000, York Sun wrote: > It is generic ARM64 thing. I believe only SError interrupt is available. So if it is, then I'd suggest you hammer out a proper design with the ARM folks. > But SError can be caused by many reasons. For L1/L2 only uncorrectable > errors trigger SError. My first step is to deal with correctable errors > and to raise a flag somehow (Haven't think that through yet) when > excessive errors are reported. Next step I will deal with the > uncorrectable errors with interrupt. You need to think of the use cases first and where it makes *sense* to do recovery actions and what those actions will be. In general make the system more resilient. Simply reporting some error numbers just for the sake of it, doesn't mean a whole lot. And once you have those, figure out what kind of error info you need from the hardware in order to do those recovery actions. >From your other mail: > I have different plan on the driver. Since I don't get interrupt on > correctable errors, my thinking is to use dynamic polling interval. With > more correctable errors, the polling interval is decreased to a > threshold, then further action needs to be taken (at least I would raise > an error message). The idea is uncorrectable error proceeds from > increasing correctable errors. Is that always the case? I wouldn't be so sure. > I would use per CPU data structure so we know which core has > increasing errors. For embedded system, we may shutdown the core(s) > with error to protect the system from critical failure. Similarly but > differently, L2 cache is shared on the same cluster. All this implies is that the cores are the most prone to generate errors. I wouldn't be so sure. This is almost never the case on x86, for example. The main "fun" there is DRAM. The cores almost never. So think about first which hw component really needs RAS protection and why exactly. > We may have to shutdown the whole cluster if we have excessive > correctable errors. Why? Correctable errors are normally corrected by hardware and they don't have any effect on system state. Why shut down? You probably want to slow down operation or do some other actions to improve the situation first. Immediately shutting down is not really RAS. > For server, it may simply shutdown. This is not really gracefully handling of errors. You need to gradually diminish system performance to alleviate error levels and if that doesn't work, then warn and shutdown. > Of course the decision is not made by the driver, but by RAS or other > monitoring policy. Just read up on RAS strategies and policies to get a better idea what others are doing before implementing something which might not bring a whole lot. -- Regards/Gruss, Boris. Good mailing practices for 400: avoid top-posting and trim the reply. ^ permalink raw reply [flat|nested] 25+ messages in thread
* EDAC driver for ARMv8 L1/L2 cache 2018-01-15 23:52 ` Borislav Petkov @ 2018-01-16 0:16 ` York Sun 2018-02-01 20:56 ` York Sun 1 sibling, 0 replies; 25+ messages in thread From: York Sun @ 2018-01-16 0:16 UTC (permalink / raw) To: linux-arm-kernel On 01/15/2018 03:52 PM, Borislav Petkov wrote: > On Mon, Jan 15, 2018 at 11:28:14PM +0000, York Sun wrote: >> It is generic ARM64 thing. I believe only SError interrupt is available. > > So if it is, then I'd suggest you hammer out a proper design with the > ARM folks. > >> But SError can be caused by many reasons. For L1/L2 only uncorrectable >> errors trigger SError. My first step is to deal with correctable errors >> and to raise a flag somehow (Haven't think that through yet) when >> excessive errors are reported. Next step I will deal with the >> uncorrectable errors with interrupt. > > You need to think of the use cases first and where it makes *sense* to > do recovery actions and what those actions will be. In general make the > system more resilient. Simply reporting some error numbers just for the > sake of it, doesn't mean a whole lot. The correctable errors are corrected, so no further action is needed. The uncorrectable errors cannot be corrected. It is too bad we have them. But that doesn't mean we should sit there and wait for the system to fail. Further comments below. > > And once you have those, figure out what kind of error info you need > from the hardware in order to do those recovery actions. > > From your other mail: >> I have different plan on the driver. Since I don't get interrupt on >> correctable errors, my thinking is to use dynamic polling interval. With >> more correctable errors, the polling interval is decreased to a >> threshold, then further action needs to be taken (at least I would raise >> an error message). The idea is uncorrectable error proceeds from >> increasing correctable errors. > > Is that always the case? I wouldn't be so sure. I don't have first hand experience. I was told some research showed the uncorrectable errors happen after excessive correctable errors, while the hardware degrades. If we are interested in preventing the uncorrectable errors from happening, the only indicator we may have is the trend of errors. > >> I would use per CPU data structure so we know which core has >> increasing errors. For embedded system, we may shutdown the core(s) >> with error to protect the system from critical failure. Similarly but >> differently, L2 cache is shared on the same cluster. > > All this implies is that the cores are the most prone to generate > errors. I wouldn't be so sure. This is almost never the case on x86, for > example. The main "fun" there is DRAM. The cores almost never. We already have DRAM EDAC driver. I believe CPU cache is unlikely to have errors. But when the error happens, it's better to know. I am doing this L1/L2 cache EDAC driver because we have customer showing interest. Honestly neither we nor our customer know what can be done at this moment. > > So think about first which hw component really needs RAS protection and > why exactly. > >> We may have to shutdown the whole cluster if we have excessive >> correctable errors. > > Why? L2 cache is shared by all cores on the same cluster. I don't think it is a good idea to turn off cache on the affected cores. Powering off those cores (consequently powering off the cluster) may run the system longer until a replacement can be deployed. > > Correctable errors are normally corrected by hardware and they don't > have any effect on system state. Why shut down? If we see increasing correctable errors, it may be a sign that uncorrectable errors is just around the corner. Again, it is based on the same theory above. > > You probably want to slow down operation or do some other actions to > improve the situation first. Immediately shutting down is not really > RAS. > >> For server, it may simply shutdown. > > This is not really gracefully handling of errors. > > You need to gradually diminish system performance to alleviate error > levels and if that doesn't work, then warn and shutdown. Like I said, the decision shouldn't be made by the EDAC driver. It only reports the errors. Further action can be taken by monitoring software. > >> Of course the decision is not made by the driver, but by RAS or other >> monitoring policy. > > Just read up on RAS strategies and policies to get a better idea what > others are doing before implementing something which might not bring a > whole lot. > Agree. I am new on this topic even I am not new to Linux driver. York ^ permalink raw reply [flat|nested] 25+ messages in thread
* EDAC driver for ARMv8 L1/L2 cache 2018-01-15 23:52 ` Borislav Petkov 2018-01-16 0:16 ` York Sun @ 2018-02-01 20:56 ` York Sun 2018-02-08 15:31 ` James Morse 1 sibling, 1 reply; 25+ messages in thread From: York Sun @ 2018-02-01 20:56 UTC (permalink / raw) To: linux-arm-kernel On 01/15/2018 03:52 PM, Borislav Petkov wrote: > On Mon, Jan 15, 2018 at 11:28:14PM +0000, York Sun wrote: >> It is generic ARM64 thing. I believe only SError interrupt is available. > > So if it is, then I'd suggest you hammer out a proper design with the > ARM folks. > I made some progress and need some help on coding. On the platform I am working on, it has A53 cores. Each A53 core has a signal nINTERRIRQ. They are connected to one GIC interrupt. I managed to inject errors to some safe address without triggering system error and I got the interrupt. I will need to find out which core has errors by reading register on each core (and clear the interrupt). How can I do this within interrupt service routine? I tried to use smp_call_function_single() but it doesn't like the IRQ being disabled. Any suggestion is appreciated! York ^ permalink raw reply [flat|nested] 25+ messages in thread
* EDAC driver for ARMv8 L1/L2 cache 2018-02-01 20:56 ` York Sun @ 2018-02-08 15:31 ` James Morse 2018-02-08 15:53 ` York Sun 0 siblings, 1 reply; 25+ messages in thread From: James Morse @ 2018-02-08 15:31 UTC (permalink / raw) To: linux-arm-kernel Hi York, On 01/02/18 20:56, York Sun wrote: > On 01/15/2018 03:52 PM, Borislav Petkov wrote: >> On Mon, Jan 15, 2018 at 11:28:14PM +0000, York Sun wrote: >>> It is generic ARM64 thing. I believe only SError interrupt is available. >> >> So if it is, then I'd suggest you hammer out a proper design with the >> ARM folks. > I made some progress and need some help on coding. On the platform I am > working on, it has A53 cores. Each A53 core has a signal nINTERRIRQ. > They are connected to one GIC interrupt. Is this a fatal signal for the CPU that should have received it? The signals start out as being per-cpu, but configured like this you can only take the interrupt one one CPU... (is this thing edge or level triggered?) > I managed to inject errors to some safe address without triggering system > error and I got the interrupt. (okay, sounds like its a corrected error) > I will need to find out which core has errors by reading > register on each core (and clear the interrupt). How can I do this > within interrupt service routine? I tried to use > smp_call_function_single() but it doesn't like the IRQ being disabled. mm/memory-failure.c:memory_failure_queue() has an example of how you could do this. It uses 'schedule_work_on()' to re-run after the IRQ, from there you should be able to call something like on_each_cpu() or smp_call_function_many(). If this thing is level triggered you can't escape the irq-handler until its cleared. If it needs clearing by a remote CPU this is going to be a problem. Thanks, James ^ permalink raw reply [flat|nested] 25+ messages in thread
* EDAC driver for ARMv8 L1/L2 cache 2018-02-08 15:31 ` James Morse @ 2018-02-08 15:53 ` York Sun 0 siblings, 0 replies; 25+ messages in thread From: York Sun @ 2018-02-08 15:53 UTC (permalink / raw) To: linux-arm-kernel On 02/08/2018 07:33 AM, James Morse wrote: > Hi York, > > On 01/02/18 20:56, York Sun wrote: >> On 01/15/2018 03:52 PM, Borislav Petkov wrote: >>> On Mon, Jan 15, 2018 at 11:28:14PM +0000, York Sun wrote: >>>> It is generic ARM64 thing. I believe only SError interrupt is available. >>> >>> So if it is, then I'd suggest you hammer out a proper design with the >>> ARM folks. > >> I made some progress and need some help on coding. On the platform I am >> working on, it has A53 cores. Each A53 core has a signal nINTERRIRQ. >> They are connected to one GIC interrupt. > > Is this a fatal signal for the CPU that should have received it? The signals > start out as being per-cpu, but configured like this you can only take the > interrupt one one CPU... > > (is this thing edge or level triggered?) > Level > >> I managed to inject errors to some safe address without triggering system >> error and I got the interrupt. > > (okay, sounds like its a corrected error) Double-bit error. I got the interrupt nINTERRIRQ. > > >> I will need to find out which core has errors by reading >> register on each core (and clear the interrupt). How can I do this >> within interrupt service routine? I tried to use >> smp_call_function_single() but it doesn't like the IRQ being disabled. > > mm/memory-failure.c:memory_failure_queue() has an example of how you could do > this. It uses 'schedule_work_on()' to re-run after the IRQ, from there you > should be able to call something like on_each_cpu() or smp_call_function_many(). > > If this thing is level triggered you can't escape the irq-handler until its > cleared. If it needs clearing by a remote CPU this is going to be a problem. > Yeah. For now I use smp_call_function_single_async() and it looks OK. I will send out patch for review after clearing some other hardware questions with ARM support. York ^ permalink raw reply [flat|nested] 25+ messages in thread
* EDAC driver for ARMv8 L1/L2 cache 2018-01-12 17:12 ` Borislav Petkov 2018-01-12 17:17 ` York Sun @ 2018-01-12 17:23 ` Mark Rutland 1 sibling, 0 replies; 25+ messages in thread From: Mark Rutland @ 2018-01-12 17:23 UTC (permalink / raw) To: linux-arm-kernel On Fri, Jan 12, 2018 at 06:12:51PM +0100, Borislav Petkov wrote: > On Fri, Jan 12, 2018 at 04:48:05PM +0000, York Sun wrote: > > I see Stratix10 has A53 core. I am concerned on reading the > > CPUMERRSR_EL1 and L2MERRSR_EL1. The are IMPLEMENTATION DEFINED > > registers. They may not be available on all SoCs, or all time. > > Is there something like CPUID on x86, on ARM64 which denotes presence of > a certain feature? > > Or is that thing devicetree? We have ID registers, like CPUID, but for various reasons those aren't sufficient to guarantee that IMPLEMENTATION DEFINED features can be used. If we want to support this, we'd certainly need something in DT. For ACPI systems I'd expect this to be hidden behind APEI or similar, with the kernel staying well clear of these registers. Thanks, Mark. ^ permalink raw reply [flat|nested] 25+ messages in thread
* EDAC driver for ARMv8 L1/L2 cache 2018-01-12 16:48 ` York Sun 2018-01-12 17:12 ` Borislav Petkov @ 2018-01-12 17:39 ` Thor Thayer 1 sibling, 0 replies; 25+ messages in thread From: Thor Thayer @ 2018-01-12 17:39 UTC (permalink / raw) To: linux-arm-kernel On 01/12/2018 10:48 AM, York Sun wrote: > On 01/12/2018 08:35 AM, Thor Thayer wrote: >> On 01/09/2018 03:51 PM, York Sun wrote: >>> On 01/09/2018 01:43 PM, Borislav Petkov wrote: >>>> Adding some more people to CC. >>>> >>>> On Tue, Jan 09, 2018 at 08:48:43PM +0000, York Sun wrote: >>>>> Borislav, >>>>> >>>>> Are you aware of any existing (or in development) EDAC driver for ARMv8 >>>>> L1/L2 cache? I am thinking to write one if not available yet. >>>> >>>> no I'm not but I see two EDAC drivers for ARM64: thunderx and xgene. >>>> >>>> Please synchronize with their authors and ARM people what would be the >>>> best thing to do and try extracting shared functionality from them >>>> into a common compilation unit instead of duplicating it. I don't want >>>> separate drivers per functional unit. >>>> >>> >>> Thanks for the pointer. Thunderx and xgene's drivers have different >>> implementation on the hardware. I found one patch closer to what I >>> expect, https://emea01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fpatchwork.kernel.org%2Fpatch%2F7513231%2F&data=02%7C01%7Cyork.sun%40nxp.com%7C29cd81c44431418e2a6308d559da7d3b%7C686ea1d3bc2b4c6fa92cd99c5c301635%7C0%7C0%7C636513717317947616&sdata=flIZL8mxGc4VWbTpl5nPjkRXaB8uNjcc382HcXO08qE%3D&reserved=0. I don't see >>> activities after 2015. I will reach out to the author. >>> >>> York >>> >> >> Hi York, >> >> I'll be adding EDAC support for our Stratix10 (Quad ARM Cortex A-53) at >> some point but I'm not sure when it is scheduled - probably in the next >> few months. I'll be going through the same process of comparing our >> architecture to others. Below are links to the Stratix10. >> > > I see Stratix10 has A53 core. I am concerned on reading the > CPUMERRSR_EL1 and L2MERRSR_EL1. The are IMPLEMENTATION DEFINED > registers. They may not be available on all SoCs, or all time. The > CPUMERRSR_EL1 needs to be read on each core, the L2MERRSR_EL1 needs to > be read on each cluster. I am not sure how to handle them in EDAC driver > yet. I am hoping Mark Rutland can shed some light as he commented on > similar patches [1][2] before. > > York > > [1] https://patchwork.kernel.org/patch/7460391/ > [2] https://patchwork.kernel.org/patch/7513231/ I haven't started digging into EDAC for the Stratix10 yet so I don't have a comment but I'll check out your links. Thanks! > -- > To unsubscribe from this list: send the line "unsubscribe linux-edac" in > the body of a message to majordomo at vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > ^ permalink raw reply [flat|nested] 25+ messages in thread
end of thread, other threads:[~2018-02-08 15:53 UTC | newest] Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- [not found] <VI1PR04MB2078DBC9381EA574CE5DA4B09A100@VI1PR04MB2078.eurprd04.prod.outlook.com> 2018-01-09 21:43 ` EDAC driver for ARMv8 L1/L2 cache Borislav Petkov 2018-01-09 21:51 ` York Sun 2018-01-09 22:00 ` Borislav Petkov 2018-01-12 16:38 ` Thor Thayer 2018-01-12 16:48 ` York Sun 2018-01-12 17:12 ` Borislav Petkov 2018-01-12 17:17 ` York Sun 2018-01-12 17:38 ` Mark Rutland 2018-01-12 17:44 ` York Sun 2018-01-12 18:00 ` Mark Rutland 2018-01-12 18:16 ` York Sun 2018-01-13 11:31 ` Borislav Petkov 2018-01-15 14:21 ` Mark Rutland 2018-01-15 14:32 ` Borislav Petkov 2018-01-15 14:49 ` Mark Rutland 2018-01-15 16:19 ` York Sun 2018-01-15 23:23 ` Borislav Petkov 2018-01-15 23:28 ` York Sun 2018-01-15 23:52 ` Borislav Petkov 2018-01-16 0:16 ` York Sun 2018-02-01 20:56 ` York Sun 2018-02-08 15:31 ` James Morse 2018-02-08 15:53 ` York Sun 2018-01-12 17:23 ` Mark Rutland 2018-01-12 17:39 ` Thor Thayer
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.