util-linux.vger.kernel.org archive mirror
* Re: FW: edac driver initialization, interrupt, & debug
       [not found]   ` <CAChUvXMp6S6MBY_LmrfgdPcctQw70FoyxbiHeFqK+5fQx5omCw@mail.gmail.com>
@ 2018-11-16 17:07     ` Tracy Smith
  2018-11-17 14:05       ` Borislav Petkov
  0 siblings, 1 reply; 24+ messages in thread
From: Tracy Smith @ 2018-11-16 17:07 UTC
  To: linux-edac; +Cc: backports, linux-newbie, util-linux, linux-mmc

I’m attempting to insmod/modprobe the layerscape_edac_mod.ko driver.
It seems the driver enters layerscape_edac.c fsl_ddr_mc_init() and
completes successfully. But there are no EDAC boot messages and no
/proc/interrupts entry for the EDAC. I’m backporting the EDAC from the
LSDK to the SDK 2.0.

I have set CONFIG_EDAC_DEBUG and set edac_debug_level to 4, but I
don’t see any debug messages other than printk()s that I add to
fsl_ddr_mc_init() in layerscape_edac.c. No debug messages appear in
any logs from fsl_ddr_edac.c.

1. How can I enable debug information? Is debugfs required to print
the debug messages for the edac_debug_level and CONFIG_EDAC_DEBUG in
the 4.1.35-rt41 kernel for drivers/edac?

2. fsl_ddr_mc_init() defaults to EDAC_OPSTATE_INT and
platform_driver_register() succeeds. But I don’t see any printk()
messages from fsl_mc_err_probe() in fsl_ddr_edac.c. No errors appear
in any /var/log/*.

3. I don’t see any interrupts, so why would there not be an edac
interrupt in /proc/interrupts?  Do I need to inject an error before
seeing an edac interrupt in /proc/interrupts?

lsmod
module: layerscape_edac_mod    12594  0

4. To inject an error I can use the fsl_mc_inject …. routines in
fsl_ddr_edac.c and write to the registers. But is there a utility that
already uses these routines that can be used to inject an error
(FSL_MC_ECC_ERR_INJECT, FSL_MC_DATA_ERR_INJECT_LO,

Thank you for any assistance. I am instrumenting the driver now to see
if I can trace through it.

thx, Tracy Smith


* Re: FW: edac driver initialization, interrupt, & debug
  2018-11-16 17:07     ` FW: edac driver initialization, interrupt, & debug Tracy Smith
@ 2018-11-17 14:05       ` Borislav Petkov
  2018-11-17 23:22         ` Tracy Smith
  2018-11-19 15:55         ` York Sun
  0 siblings, 2 replies; 24+ messages in thread
From: Borislav Petkov @ 2018-11-17 14:05 UTC
  To: Tracy Smith, York Sun
  Cc: linux-edac, backports, linux-newbie, util-linux, linux-mmc

+ York.

On Fri, Nov 16, 2018 at 11:07:50AM -0600, Tracy Smith wrote:
> I’m attempting to insmod/modprobe the layerscape_edac_mod.ko driver.
> It seems the driver enters layerscape_edac.c fsl_ddr_mc_init() and
> completes successfully. But there is no EDAC boot messages and no
> /proc/interrupts entry for the EDAC. I’m backporting the EDAC from the
> LSDK to the SDK 2.0.
> 
> I have set CONFIG_EDAC_DEBUG and set edac_debug_level to 4, but I
> don’t see any debug messages other than printk()s that I add to
> fsl_ddr_mc_init() in layerscape_edac.c. No debug messages appear in
> any logs from fsl_ddr_edac.c.
> 
> 1. How can I enable debug information? Is debugfs required to print
> the debug messages for the edac_debug_level and CONFIG_EDAC_DEBUG in
> the 4.1.35-rt41 kernel for drivers/edac?

No, just slap printks before every return statement, like:

        if (!devres_open_group(&op->dev, fsl_mc_err_probe, GFP_KERNEL)) {
                pr_err("%s: Error devres_open_group()\n", __func__);
                return -ENOMEM;
        }

so that you can get closer to the place where it fails.

> 2. The default EDAC_OPSTATE_INT in fsl_ddr_mc_init() and the
> platform_driver_register() is successful. But I don’t see any printk()
> messages in fsl_mc_err_probe() within fsl_ddr_edac.c. No errors appear
> in any /var/log/*.

Yeah, see if it even gets called at all:

int fsl_mc_err_probe(struct platform_device *op)
{
        struct mem_ctl_info *mci;
        struct edac_mc_layer layers[2];
        struct fsl_mc_pdata *pdata;
        struct resource r;
        u32 sdram_ctl;
        int res;

        pr_err("%s: entry\n", __func__);


> 3. I don’t see any interrupts, so why would there not be an edac
> interrupt in /proc/interrupts?

Probably because it doesn't reach the point where it registers an IRQ
handler...

> Do I need to inject an error before seeing an edac interrupt in
> /proc/interrupts?

You should, AFAICT, if it loads and registers stuff properly.

> lsmod
> module: layerscape_edac_mod    12594  0
> 
> 4. To inject an error I can use the fsl_mc_inject …. routines in
> fsl_ddr_edac.c and write to the registers. But is there a utility that
> already uses these routines that can be used to inject an error
> (FSL_MC_ECC_ERR_INJECT, FSL_MC_DATA_ERR_INJECT_LO,

You should be able to simply write to *sysfs*. Somewhere under
/sys/devices/system/edac/...

fsl_mc_inject_data_{lo,hi}_store simply writes the low and high inject
register.

Btw, looking at it, York, this whole injection functionality needs to
be behind CONFIG_EDAC_DEBUG because a production driver shouldn't have
injection capability.

Hmmm.

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.


* Re: FW: edac driver initialization, interrupt, & debug
  2018-11-17 14:05       ` Borislav Petkov
@ 2018-11-17 23:22         ` Tracy Smith
  2018-11-18  1:05           ` Steve Inkpen
  2018-11-19 16:24           ` FW: edac driver initialization, interrupt, & debug York Sun
  2018-11-19 15:55         ` York Sun
  1 sibling, 2 replies; 24+ messages in thread
From: Tracy Smith @ 2018-11-17 23:22 UTC
  To: bp; +Cc: york.sun, linux-edac, backports, linux-newbie, util-linux, linux-mmc

Thank you Boris for the information.  It is helpful.

>> 2. The default EDAC_OPSTATE_INT in fsl_ddr_mc_init() and the
>> platform_driver_register() is successful. But I don’t see any printk()
>> messages in fsl_mc_err_probe() within fsl_ddr_edac.c. No errors appear
>> in any /var/log/*.

> Yeah, see if it even gets called at all:

I did a grep on /var/log/* and don't see any printk's from
fsl_mc_err_probe(). So, it's not being called.

1) What would cause the probe function not to be called?

2) Were changes made in how .probe functions were called between
different kernel releases of the edac?

3) How should I go about root causing the reason for the .probe to
fail since I may have to backport any changes made?

4) Possibly a patch exists for .probe changes after 4.1.35-rt41?

static struct platform_driver fsl_ddr_mc_err_driver = {
        .probe = fsl_mc_err_probe,
        .remove = fsl_mc_err_remove,
        .driver = {
                .name = "fsl_ddr_mc_err",
                .of_match_table = fsl_ddr_mc_err_of_match,
        },
};

int fsl_mc_err_probe(struct platform_device *op)
{
        struct mem_ctl_info *mci;
        struct edac_mc_layer layers[2];
        struct fsl_mc_pdata *pdata;
        struct resource r;
        u32 sdram_ctl;
        int res;

        pr_err("%s: entry\n", __func__);
        printk("entered fsl_mc_err_probe!\n");

Any assistance greatly appreciated.


-- 
Confidentiality notice: This e-mail message, including any
attachments, may contain legally privileged and/or confidential
information. If you are not the intended recipient(s), please
immediately notify the sender and delete this e-mail message.


* Re: edac driver initialization, interrupt, & debug
  2018-11-17 23:22         ` Tracy Smith
@ 2018-11-18  1:05           ` Steve Inkpen
  2018-11-19 16:37             ` Tracy Smith
  2018-11-19 16:24           ` FW: edac driver initialization, interrupt, & debug York Sun
  1 sibling, 1 reply; 24+ messages in thread
From: Steve Inkpen @ 2018-11-18  1:05 UTC
  To: Tracy Smith
  Cc: bp, york.sun, linux-edac, backports, linux-newbie, util-linux, linux-mmc

Are you using a device tree? If yes, make sure there is an entry for this device.

From your snippet of code, there appears to be a match entry in of_match_table?

Steve


* Re: FW: edac driver initialization, interrupt, & debug
  2018-11-17 14:05       ` Borislav Petkov
  2018-11-17 23:22         ` Tracy Smith
@ 2018-11-19 15:55         ` York Sun
  1 sibling, 0 replies; 24+ messages in thread
From: York Sun @ 2018-11-19 15:55 UTC
  To: Borislav Petkov, Tracy Smith
  Cc: linux-edac, backports, linux-newbie, util-linux, linux-mmc

On 11/17/18 6:05 AM, Borislav Petkov wrote:
> 
> Btw, looking at it, York, this whole injection functionality needs to
> be behind CONFIG_EDAC_DEBUG because a production driver shouldn't have
> injection capability.

I can make the change.

York


* Re: FW: edac driver initialization, interrupt, & debug
  2018-11-17 23:22         ` Tracy Smith
  2018-11-18  1:05           ` Steve Inkpen
@ 2018-11-19 16:24           ` York Sun
  1 sibling, 0 replies; 24+ messages in thread
From: York Sun @ 2018-11-19 16:24 UTC
  To: Tracy Smith, bp
  Cc: linux-edac, backports, linux-newbie, util-linux, linux-mmc

On 11/17/18 3:22 PM, Tracy Smith wrote:
> 
> I did a grep on /var/log/* and don't see any printk's from
> fsl_mc_err_probe(). So, it's not being called.
> 
> 1) What would cause the probe function not to be called?
> 

Tracy,

You said you were backporting to an older release. Please double check
your device tree _after_ Linux reaches prompt. You can find it at
/proc/device-tree. You can also check the device tree in U-Boot using
bootm command step-by-step, after the fixup.

As long as you have correct "compatible", the probe function should be
called. It may quit if ECC is not detected, or not register interrupt if
not found in device tree.

York


* Re: edac driver initialization, interrupt, & debug
  2018-11-18  1:05           ` Steve Inkpen
@ 2018-11-19 16:37             ` Tracy Smith
  2018-11-19 16:48               ` York Sun
  0 siblings, 1 reply; 24+ messages in thread
From: Tracy Smith @ 2018-11-19 16:37 UTC
  To: steve
  Cc: bp, york.sun, linux-edac, backports, linux-newbie, util-linux, linux-mmc

Steve, you were correct: there wasn't a device tree entry for the
qoriq memory controller in
arch/arm64/boot/dts/freescale/fsl-ls1043a.dtsi. I added one identical
to the entry in fsl-ls1046a.dtsi, since the ls1046a should have the
same memory controller as the ls1043a, but the probe function still
isn't called. I'm now checking the mpc85xx_edac.c dtsi entry for
comparison, since York used the mpc85xx as the basis for the
layerscape driver. Something else must be missing that prevents the
probe function from being called.

@York
What is your entry for
/proc/device-tree/soc/ifc@1530000/board-control@1,0/compatible

@York
cat /proc/device-tree/compatible prints the following. Is this correct?
fsl,ls1043a-rdbfsl,ls1043a

                ddr: memory-controller@1080000 {
                        compatible = "fsl,qoriq-memory-controller";
                        reg = <0x0 0x1080000 0x0 0x1000>;
                        interrupts = <0 144 0x4>;
                        big-endian;
                };

                ifc: ifc@1530000 {
                        compatible = "fsl,ifc", "simple-bus";
                        reg = <0x0 0x1530000 0x0 0x10000>;
                        interrupts = <0 43 0x4>;
                };

I haven't had to change the edac code to compile it, so it is what is
in drivers/edac. ECC is enabled in U-Boot and supported by the DDR4
memory controller. The layerscape_edac_mod.mod.c file generated by the
build is attached at the end.

I see fsl_ddr_mc_err_of_match[] for the memory controller, and the
driver references it through .of_match_table = fsl_ddr_mc_err_of_match:

 static const struct of_device_id fsl_ddr_mc_err_of_match[] = {
         { .compatible = "fsl,qoriq-memory-controller", },
         {},
 };
 MODULE_DEVICE_TABLE(of, fsl_ddr_mc_err_of_match);

 static struct platform_driver fsl_ddr_mc_err_driver = {
         .probe = fsl_mc_err_probe,
         .remove = fsl_mc_err_remove,
         .driver = {
                 .name = "fsl_ddr_mc_err",
                 .of_match_table = fsl_ddr_mc_err_of_match,
         },
 };

Beyond this, I only see of_match_device() calls in altera_edac.c and
the highbank drivers below, but not in any other driver.

cd drivers/edac
grep of_match_device * | more
altera_edac.c:  id = of_match_device(altr_sdram_ctrl_of_match, &pdev->dev);
highbank_l2_edac.c:     id = of_match_device(hb_l2_err_of_match, &pdev->dev);
highbank_mc_edac.c:     id = of_match_device(hb_ddr_ctrl_of_match, &pdev->dev);

The .of_match_table entry appears correct for the layerscape_edac.c.
York took this entry from the mpc85xx_edac.c.

 layerscape_edac.c:              .of_match_table = fsl_ddr_mc_err_of_match,
 mpc85xx_edac.c:         .of_match_table = mpc85xx_l2_err_of_match,
 mpc85xx_edac.c:         .of_match_table = mpc85xx_mc_err_of_match,
 ppc4xx_edac.c:          .of_match_table = ppc4xx_edac_match,
 synopsys_edac.c:                   .of_match_table = synps_edac_match,
 xgene_edac.c:           .of_match_table = xgene_edac_of_match,

-- 
 layerscape_edac_mod.mod.c
 #include <linux/vermagic.h>
 #include <linux/compiler.h>

MODULE_INFO(vermagic, VERMAGIC_STRING);

__visible struct module __this_module
 __attribute__((section(".gnu.linkonce.this_module"))) = {
         .name = KBUILD_MODNAME,
         .init = init_module,

 #ifdef CONFIG_MODULE_UNLOAD
         .exit = cleanup_module,
 #endif

        .arch = MODULE_ARCH_INIT,
 };

 MODULE_INFO(intree, "Y");

 static const struct modversion_info ____versions[]
 __used
 __attribute__((section("__versions"))) = {
         { 0xf41fc8a9, __VMLINUX_SYMBOL_STR(module_layout) },
         { 0x8294e3fc, __VMLINUX_SYMBOL_STR(edac_mc_add_mc_with_groups) },
         { 0x1fdc7df2, __VMLINUX_SYMBOL_STR(_mcount) },
         { 0x51eafc8e, __VMLINUX_SYMBOL_STR(param_ops_int) },
         { 0x69a358a6, __VMLINUX_SYMBOL_STR(iomem_resource) },
         { 0x91715312, __VMLINUX_SYMBOL_STR(sprintf) },
         { 0x26d622f, __VMLINUX_SYMBOL_STR(__platform_driver_register) },
         { 0x60ea2d6, __VMLINUX_SYMBOL_STR(kstrtoull) },
         { 0x11089ac7, __VMLINUX_SYMBOL_STR(_ctype) },
         { 0xb51fbd64, __VMLINUX_SYMBOL_STR(edac_op_state) },
         { 0x27e1a049, __VMLINUX_SYMBOL_STR(printk) },
         { 0x1e614c08, __VMLINUX_SYMBOL_STR(of_find_property) },
         { 0x1215bb3b, __VMLINUX_SYMBOL_STR(devres_open_group) },
         { 0x91f27fc0, __VMLINUX_SYMBOL_STR(edac_mc_handle_error) },
         { 0x643ce492, __VMLINUX_SYMBOL_STR(edac_mc_free) },
         { 0x9b69ee39, __VMLINUX_SYMBOL_STR(edac_debug_level) },
         { 0x9c26e7ff, __VMLINUX_SYMBOL_STR(edac_mc_alloc) },
         { 0x3c5bfeaa, __VMLINUX_SYMBOL_STR(__devm_request_region) },
         { 0x8229211c, __VMLINUX_SYMBOL_STR(devm_ioremap) },
         { 0x159dd96a, __VMLINUX_SYMBOL_STR(edac_mc_del_mc) },
         { 0xb3d3cbde, __VMLINUX_SYMBOL_STR(devres_remove_group) },
         { 0x48fb5a26, __VMLINUX_SYMBOL_STR(of_address_to_resource) },
         { 0xb2904f0, __VMLINUX_SYMBOL_STR(platform_get_irq) },
         { 0xfb3ca8db, __VMLINUX_SYMBOL_STR(platform_driver_unregister) },
         { 0x69e0f942, __VMLINUX_SYMBOL_STR(devm_request_threaded_irq) },
         { 0xec4e8f9b, __VMLINUX_SYMBOL_STR(devres_release_group) },
 };

 static const char __module_depends[]
 __used
 __attribute__((section(".modinfo"))) =
 "depends=";

Question: is this MODULE_ALIAS "of:N*T*Cfsl" correct?

MODULE_ALIAS("of:N*T*Cfsl,qoriq-memory-controller*");

thx,
Tracy



* Re: edac driver initialization, interrupt, & debug
  2018-11-19 16:37             ` Tracy Smith
@ 2018-11-19 16:48               ` York Sun
  2018-11-21 17:01                 ` Tracy Smith
  0 siblings, 1 reply; 24+ messages in thread
From: York Sun @ 2018-11-19 16:48 UTC
  To: Tracy Smith, steve
  Cc: bp, linux-edac, backports, linux-newbie, util-linux, linux-mmc

On 11/19/18 8:38 AM, Tracy Smith wrote:
> Steve, you were correct, there wasn't a device tree entry for the
> qoriq memory controller in
> arch/arm64/boot/dts/freescale/fsl-ls1043a.dtsi.  I added it making it
> identical to the fsl-ls1046a.dtsi, which should have the same memory
> controller and entry as the ls1043a.  I added this but it didn't make
> a difference as far as being able to call the probe function. I'm now
> checking the mpc85xx_edac.c dtsi entry for comparison since York used
> the mpc85xx as the basis for the layerscape, but there is something
> else missing preventing the probe function from being called.
> 
> @York
> What is your entry for
> /proc/device-tree/soc/ifc@1530000/board-control@1,0/compatible

EDAC driver doesn't check IFC. Are you debugging EDAC for memory controller?

> 
> @York
> cat /proc/device-tree/compatible entry is this, is this correct?
> fsl,ls1043a-rdbfsl,ls1043a

Once again, you are using your modified code on your own board. So it is
not ls1043ardb. This compatible has nothing to do with EDAC driver.

I cannot help you with ls1043ardb because the real ls1043ardb board
doesn't support ECC. The closest board I have is ls1046ardb.

> 
>                 ddr: memory-controller@1080000 {
>                          compatible = "fsl,qoriq-memory-controller";
>                          reg = <0x0 0x1080000 0x0 0x1000>;
>                          interrupts = <0 144 0x4>;
>                          big-endian;
>                  };

This is your source code, not your final device tree. Please learn to
use "fdt" command under U-Boot to dump your device tree before booting
Linux, or check after Linux is up. For your reference, on my ls1046ardb,
I have

# cat /proc/device-tree/soc/memory-controller@1080000/compatible
fsl,qoriq-memory-controller
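
The same check can be scripted. This is only a minimal sketch in
Python, not something taken from the driver or the LSDK; the helper
names read_compatible and has_qoriq_ddr_node and the root parameter
are made up for illustration. It relies on the fact that a
"compatible" property is a list of NUL-terminated strings, which is
also why a plain cat of /proc/device-tree/compatible prints its
entries run together (fsl,ls1043a-rdbfsl,ls1043a).

```python
from pathlib import Path

# Compatible string quoted in this thread for the DDR controller node.
QORIQ_DDR = "fsl,qoriq-memory-controller"


def read_compatible(node_dir, root="/proc/device-tree"):
    """Return the compatible strings of a device tree node as a list.

    node_dir is relative to root, e.g. "soc/memory-controller@1080000".
    """
    raw = (Path(root) / node_dir / "compatible").read_bytes()
    # The property is a sequence of NUL-terminated strings.
    return [s.decode("ascii") for s in raw.split(b"\0") if s]


def has_qoriq_ddr_node(node_dir, root="/proc/device-tree"):
    """True if the node exists and lists the QorIQ DDR controller."""
    try:
        return QORIQ_DDR in read_compatible(node_dir, root)
    except FileNotFoundError:
        # Node (or its compatible property) is absent in the live tree.
        return False
```

On a running board, has_qoriq_ddr_node("soc/memory-controller@1080000")
returning False would suggest the node never made it into the final
device tree, whatever the .dtsi source says.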

York

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: edac driver initialization, interrupt, & debug
  2018-11-19 16:48               ` York Sun
@ 2018-11-21 17:01                 ` Tracy Smith
  2018-11-21 22:02                   ` Tracy Smith
  2018-11-28 18:48                   ` edac driver injection of uncorrected errors & utils Tracy Smith
  0 siblings, 2 replies; 24+ messages in thread
From: Tracy Smith @ 2018-11-21 17:01 UTC (permalink / raw)
  To: york.sun
  Cc: steve, bp, linux-edac, backports, linux-newbie, util-linux, linux-mmc

The EDAC driver not being probed turned out to be a device tree issue,
as Steve suspected. Thanks to both Steve and York, this has been
resolved and the backport is now logging ECC errors after injection. I
added the ddr qoriq-memory-controller entry since we used a different
.dtsi file.

arch/arm64/boot/dts/freescale/...ls1043a.dtsi

ddr: memory-controller@1080000 {
        compatible = "fsl,qoriq-memory-controller";
        reg = <0x0 0x1080000 0x0 0x1000>;
        interrupts = <0 144 0x4>;
        big-endian;
};
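
For readers decoding the node above, here is a minimal Python sketch
of what the reg and interrupts cells mean. It assumes the parent node
uses #address-cells = <2> and #size-cells = <2>, and that the
interrupt parent is an ARM GIC with the usual 3-cell specifier; both
are assumptions about the board's device tree, and the function names
are made up for illustration.

```python
def decode_reg(cells, address_cells=2, size_cells=2):
    """Split a 'reg' property (list of 32-bit cells) into (base, size)."""
    def join(words):
        value = 0
        for w in words:
            value = (value << 32) | w
        return value

    base = join(cells[:address_cells])
    size = join(cells[address_cells:address_cells + size_cells])
    return base, size


def decode_gic_interrupt(cells):
    """Decode a 3-cell ARM GIC interrupt specifier (assumed binding)."""
    kind, number, flags = cells
    # Cell 0: 0 = SPI, 1 = PPI. SPI IDs start at 32, PPI IDs at 16.
    gic_id = number + (32 if kind == 0 else 16)
    triggers = {1: "edge-rising", 2: "edge-falling",
                4: "level-high", 8: "level-low"}
    return {
        "type": "SPI" if kind == 0 else "PPI",
        "number": number,
        "gic_id": gic_id,
        "trigger": triggers.get(flags & 0xF, "unknown"),
    }
```

Under those assumptions, reg = <0x0 0x1080000 0x0 0x1000> describes a
4 KiB register window at 0x1080000, and interrupts = <0 144 0x4> is
SPI 144 (GIC interrupt ID 176), level-sensitive, active high.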

I now need to collect and report CE and UE ECC errors and extend the
existing logging and reporting function that I currently see. After
reviewing the following document, the system logging appears different
from that given in the kernel EDAC document. I need the level of
granularity described in the edac.txt file.

https://www.mjmwired.net/kernel/Documentation/edac.txt#173 (the same
text as Documentation/edac.txt in the kernel tree)

1)  Can I gather the system logging described below in the edac.txt
file for layerscape?

2)  Is there anything similar to the edac-utils but for ARM, or does
sysfs replace the edac-utils, or something else?

3)  What is currently used for collecting and reporting ECC errors for
ARM/EDAC beyond the kernel log and messages?
https://github.com/grondo/edac-utils

4) How is RAS reporting integrated into EDAC for error collection and reporting?

5) Has there been a patch to prevent EDAC sysfs API from reporting bogus values?
See http://lkml.iu.edu/hypermail/linux/kernel/1205.3/02249.html

- The EDAC sysfs API will still report bogus values. So, userspace
tools like edac-utils will still use the bogus data;

- Add a new tracepoint-based way to get the binary information about
the errors.

This is the logging I currently see with the layerscape EDAC driver. I
need something that explains these fields.

[ 407.612311] EDAC FSL_DDR MC0: Err Detect Register: 0x80000004
[ 407.618182] EDAC FSL_DDR MC0: Faulty Data bit: 0
[ 407.622793] EDAC FSL_DDR MC0: Expected Data / ECC: 0x40c50901_40c50900 / 0x800000f0
[ 407.630443] EDAC FSL_DDR MC0: Captured Data / ECC: 0x40c50900_40c50901 / 0xf0
[ 407.637571] EDAC FSL_DDR MC0: Err addr: 0x3e0bfff50
[ 407.642440] EDAC FSL_DDR MC0: PFN: 0x003e0bff
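
Two of these fields can be related with simple arithmetic. The Python
sketch below is an illustration only: it assumes 4 KiB pages
(PAGE_SHIFT of 12), and the ERR_DETECT bit names and values are
written from memory of the Freescale DDR EDAC header, so they should
be checked against fsl_ddr_edac.h in the tree being backported.

```python
PAGE_SHIFT = 12  # assumption: 4 KiB pages

# Assumed ERR_DETECT bit meanings; verify against fsl_ddr_edac.h.
ERR_DETECT_BITS = {
    0x80000000: "MME (multiple memory errors)",
    0x00000008: "MBE (multi-bit error)",
    0x00000004: "SBE (single-bit error)",
}


def split_address(err_addr, page_shift=PAGE_SHIFT):
    """Split a physical error address into (page frame number, offset)."""
    pfn = err_addr >> page_shift
    offset = err_addr & ((1 << page_shift) - 1)
    return pfn, offset


def decode_err_detect(value):
    """List the assumed flag names set in an Err Detect Register value."""
    return [name for bit, name in ERR_DETECT_BITS.items() if value & bit]
```

With the values logged above, split_address(0x3e0bfff50) gives PFN
0x3e0bff and offset 0xf50, which agrees with the "PFN" line. If the
assumed bit layout is right, 0x80000004 would read as a single-bit
error with the multiple-errors flag also set.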

This is the level of detail I need:

SYSTEM LOGGING
--------------

If logging for UEs and CEs is enabled, then system logs will contain
information indicating that errors have been detected:

EDAC MC0: CE page 0x283, offset 0xce0, grain 8, syndrome 0x6ec3, row 0,
channel 1 "DIMM_B1": amd76x_edac

EDAC MC0: CE page 0x1e5, offset 0xfb0, grain 8, syndrome 0xb741, row 0,
channel 1 "DIMM_B1": amd76x_edac

The structure of the message is:
    the memory controller             (MC0)
    Error type                        (CE)
    memory page                       (0x283)
    offset in the page                (0xce0)
    the byte granularity              (grain 8)
        or resolution of the error
    the error syndrome                (0xb741)
    memory row                        (row 0)
    memory channel                    (channel 1)
    DIMM label, if set prior          (DIMM_B1)
    and then an optional, driver-specific message that may
            have additional information.

Both UEs and CEs with no info will lack all but memory controller, error
type, a notice of "no info" and then an optional, driver-specific error
message.


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: edac driver initialization, interrupt, & debug
  2018-11-21 17:01                 ` Tracy Smith
@ 2018-11-21 22:02                   ` Tracy Smith
  2018-11-28 18:48                   ` edac driver injection of uncorrected errors & utils Tracy Smith
  1 sibling, 0 replies; 24+ messages in thread
From: Tracy Smith @ 2018-11-21 22:02 UTC (permalink / raw)
  To: york.sun
  Cc: steve, bp, linux-edac, backports, linux-newbie, util-linux, linux-mmc

Please ignore the first question. I now see the expected EDAC message
in the kernel log:

EDAC MC0: 1 CE fsl_mc_err on mc#0csrow#0channel#0 (csrow:0 channel:0
page:0x5df1f offset:0xe40 grain:8 syndrome:0xe0e0)
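
For collecting such messages from the kernel log, a small parser is
one option. The Python sketch below is only an illustration built
around the single message shown above (joined onto one line); the
regular expression and the parse_edac_line name are assumptions, and
other kernels or drivers may format the line differently.

```python
import re

# Pattern derived from the one example line in this message.
EDAC_RE = re.compile(
    r"EDAC MC(?P<mc>\d+): (?P<count>\d+) (?P<type>CE|UE) (?P<msg>.*?) on "
    r"(?P<label>\S+) \((?P<details>[^)]*)\)"
)


def parse_edac_line(line):
    """Parse an 'EDAC MCn: ...' log line into a dict, or return None."""
    m = EDAC_RE.search(line)
    if m is None:
        return None
    event = {
        "mc": int(m["mc"]),
        "count": int(m["count"]),
        "type": m["type"],
        "msg": m["msg"],
        "label": m["label"],
    }
    # Details are space-separated key:value pairs, e.g. page:0x5df1f.
    for item in m["details"].split():
        key, _, value = item.partition(":")
        try:
            event[key] = int(value, 0)  # base 0 accepts 0x... and decimal
        except ValueError:
            event[key] = value  # keep anything non-numeric as text
    return event
```

Applied to the line above, this yields type "CE", page 0x5df1f,
offset 0xe40, grain 8 and syndrome 0xe0e0, which could then be
counted or forwarded by whatever collection tool is in use.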

1)  Is there anything similar to the edac-utils but for ARM instead of
x86, or does sysfs replace the edac-utils, or is there something else
for ARM?

2)  What is currently used for collecting and reporting ECC errors for
ARM/EDAC beyond the kernel log and messages?

https://github.com/grondo/edac-utils

3) How is RAS/rasdaemon reporting integrated into EDAC for error
collection and reporting?

4) Has there been a patch to prevent EDAC sysfs API from reporting bogus values?
See http://lkml.iu.edu/hypermail/linux/kernel/1205.3/02249.html

^ permalink raw reply	[flat|nested] 24+ messages in thread

* edac driver injection of uncorrected errors & utils
  2018-11-21 17:01                 ` Tracy Smith
  2018-11-21 22:02                   ` Tracy Smith
@ 2018-11-28 18:48                   ` Tracy Smith
  2018-11-28 19:06                     ` York Sun
  1 sibling, 1 reply; 24+ messages in thread
From: Tracy Smith @ 2018-11-28 18:48 UTC (permalink / raw)
  To: linux-edac, york.sun; +Cc: util-linux

Can I inject an uncorrected error, or only corrected errors, using the
layerscape edac driver injection via sysfs?

Is this the expected output for the edac-util on layerscape when
injecting errors?

root@ls1043ardb:~# edac-util -v
mc0: 0 Uncorrected Errors with no DIMM info
mc0: 0 Corrected Errors with no DIMM info
mc0: csrow0: 0 Uncorrected Errors
mc0: csrow0: mc#0csrow#0channel#0: 643 Corrected Errors

root@ls1043ardb:~# edac-util -vs
edac-util: EDAC drivers are loaded. 1 MC detected:
mc0:fsl_mc_err

root@ls1043ardb:~# edac-util
mc0: csrow0: mc#0csrow#0channel#0: 2700 Corrected Errors
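
The same counters can also be read straight from the EDAC sysfs tree,
independent of edac-util. The Python sketch below is a minimal
illustration; it assumes the csrow-based layout (mcN/ce_count,
mcN/ue_count, mcN/csrowM/ce_count, ue_count and size_mb) described in
the kernel's EDAC documentation, and the edac_counts name and root
parameter are made up here.

```python
from pathlib import Path


def read_int(path):
    """Read a sysfs attribute as an int; None if missing or unparsable."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None


def edac_counts(root="/sys/devices/system/edac/mc"):
    """Collect CE/UE counters per memory controller and per csrow."""
    result = {}
    for mc in sorted(Path(root).glob("mc[0-9]*")):
        entry = {
            "ce": read_int(mc / "ce_count"),
            "ue": read_int(mc / "ue_count"),
            "csrows": {},
        }
        for row in sorted(mc.glob("csrow[0-9]*")):
            entry["csrows"][row.name] = {
                "ce": read_int(row / "ce_count"),
                "ue": read_int(row / "ue_count"),
                "size_mb": read_int(row / "size_mb"),
            }
        result[mc.name] = entry
    return result
```

Comparing the per-csrow size_mb values returned here with what the
tools print may help narrow down whether a 0 MB display comes from
the driver's sysfs data or from the userspace tool.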

Does edac-ctl function on ARM based platforms or only on x86 and why
might it show 0MB for the memory layout for DDR4 as below?

/run/media/nvme0n1p1/tls/neo_mcu-kernel/drivers/edac-utils# edac-ctl --layout
readline() on closed filehandle IN at /usr/sbin/edac-ctl line 514.
Use of uninitialized value $size in sprintf at /usr/sbin/edac-ctl line 533.
Use of uninitialized value $size in sprintf at /usr/sbin/edac-ctl line 533.
Use of uninitialized value $size in sprintf at /usr/sbin/edac-ctl line 533.
Use of uninitialized value $size in sprintf at /usr/sbin/edac-ctl line 533.
          +-----------------------------------------------+
          |                      mc0                      |
          |  csrow0   |  csrow1   |  csrow2   |  csrow3   |
----------+-----------------------------------------------+
channel0: |     0 MB  |     0 MB  |     0 MB  |     0 MB  |
----------+-----------------------------------------------+

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: edac driver injection of uncorrected errors & utils
  2018-11-28 18:48                   ` edac driver injection of uncorrected errors & utils Tracy Smith
@ 2018-11-28 19:06                     ` York Sun
  2018-11-28 19:11                       ` Tracy Smith
  0 siblings, 1 reply; 24+ messages in thread
From: York Sun @ 2018-11-28 19:06 UTC (permalink / raw)
  To: Tracy Smith, linux-edac; +Cc: util-linux

Tracy,

You can inject multiple-bit errors, but doing that will crash the
system. I can't comment on edac-util.

York



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: edac driver injection of uncorrected errors & utils
  2018-11-28 19:06                     ` York Sun
@ 2018-11-28 19:11                       ` Tracy Smith
  2018-11-28 19:24                         ` York Sun
  0 siblings, 1 reply; 24+ messages in thread
From: Tracy Smith @ 2018-11-28 19:11 UTC (permalink / raw)
  To: york.sun; +Cc: linux-edac, util-linux

Thanks York. Why will injecting multi-bit errors crash linux?  Is this
the case only for layerscape?  Is there a way to harden against this?


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: edac driver injection of uncorrected errors & utils
  2018-11-28 19:11                       ` Tracy Smith
@ 2018-11-28 19:24                         ` York Sun
  2018-11-28 22:14                           ` Tracy Smith
  0 siblings, 1 reply; 24+ messages in thread
From: York Sun @ 2018-11-28 19:24 UTC (permalink / raw)
  To: Tracy Smith; +Cc: linux-edac, util-linux

Tracy,

This DDR controller doesn't have the capability to inject limited
errors. As soon as you enable the error injection, all memory
transactions will carry the error. Since multi-bit errors are not
correctable, I don't expect Linux to work properly with these errors.

York



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: edac driver injection of uncorrected errors & utils
  2018-11-28 19:24                         ` York Sun
@ 2018-11-28 22:14                           ` Tracy Smith
  2018-11-28 23:44                             ` Borislav Petkov
  0 siblings, 1 reply; 24+ messages in thread
From: Tracy Smith @ 2018-11-28 22:14 UTC (permalink / raw)
  To: york.sun; +Cc: linux-edac, util-linux, lkml

Nothing appears in the logs or from edac-util indicating there was a
multi-bit UE (uncorrected error). There was just a crash, and even
then I'm not 100% certain it was caused by multi-bit errors without
debugging the crash. It happened when writing a 1 to
inject_data_lo/inject_data_hi and 0x100 to inject_ctrl.
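
One way to reason about what such a write asks for is to count the
bits in the injection mask. The Python sketch below is only an
illustration and rests on two assumptions that are not confirmed in
this thread: that inject_data_hi/inject_data_lo form a 64-bit mask in
which each set bit flips one data bit, and that the controller uses a
SEC-DED code (corrects one bit, detects two). The function names are
made up.

```python
def flipped_bits(inject_data_hi, inject_data_lo):
    """Count data bits set in the assumed 64-bit injection mask."""
    mask = ((inject_data_hi & 0xFFFFFFFF) << 32) | (inject_data_lo & 0xFFFFFFFF)
    return bin(mask).count("1")


def classify_injection(inject_data_hi, inject_data_lo):
    """Classify a mask as none, single-bit, or multi-bit (SEC-DED assumed)."""
    n = flipped_bits(inject_data_hi, inject_data_lo)
    if n == 0:
        return "none"
    if n == 1:
        return "single-bit (correctable with SEC-DED)"
    return "multi-bit (not correctable with SEC-DED)"
```

Under those assumptions, writing 1 to both registers flips two bits
per 64-bit word, which would be a multi-bit error, while setting a
single bit in only one of the two registers would request a
single-bit error.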

Is there another way of creating an uncorrected error without crashing
Linux using the layerscape driver? I would like to see a UE error
collected without a Linux crash scenario because I need to validate
UEs are being collected.

Do the AMD platform or other memory controllers crash Linux on
multi-bit errors and fail to collect uncorrected errors?  This is a
concern in the field since there is no way of knowing that multi-bit
errors occurred and that multi-bit errors caused the crash.

For production and in the field, we can't have the Linux kernel or
the layerscape driver crashing on multi-bit errors without giving any
information in the kernel log about what caused the crash. First, it
could cost millions in highly critical use cases. Second, it should
be preventable.

So two concerns/questions:

1. Need a way to validate UE errors are captured without crashing the kernel
2. On multi-bit errors need a way to catch a UE before a kernel crash
and ideally prevent the kernel from crashing on multi-bit errors

Any recommendations?

Scenario produced on an ARM layerscape board.

echo 1 > /sys/devices/system/edac/mc/mc0/inject_data_lo
echo 1 > /sys/devices/system/edac/mc/mc0/inject_data_hi
echo 0x100 > /sys/devices/system/edac/mc/mc0/inject_ctrl

[  495.327720] CPU: 3 PID: 1239 Comm: sh Not tainted 4.1.35-rt41 #1
[  495.327723] EDAC FSL_DDR MC0: Err Detect Register: 0x80000008
[  495.327725] Hardware name: LS1043A Board (DT)
[  495.327735] task: ffff800063dd3300 ti: ffff800073358000 task.ti: ffff800073358000
[  495.327740] PC is at 0x42cf80
[  495.327742] LR is at 0x42d20c
[  495.327745] pc : [<000000000042cf80>] lr : [<000000000042d20c>] pstate: 20000000
[  495.327746] sp : ffff80007335bff0
[  495.327751] x29: 0000ffffd1f0b6e0 x28: 00000000004e0000
[  495.327756] x27: 000000003cdf81b0 x26: 00000000004d8000
[  495.327760] x25: 00000000004aea80 x24: 00000000004aea88
[  495.327764] x23: 00000000004e1000 x22: 00000000004c0e10
[  495.327768] x21: 00000000004aed98 x20: 00000000004ae868
[  495.327772] x19: 00000000004ae868 x18: 0000000000000015
[  495.327776] x17: 0000ffff7a24fb48 x16: 00000000004d8638
[  495.327781] x15: 002372c270000000 x14: ffffffffffffffff
[  495.327785] x13: 0000000000000018 x12: 0000000000000028
[  495.327789] x11: 0000000000000038 x10: 0101010101010101
[  495.327793] x9 : fefefefefefefeff x8 : 000000003ce19f50
[  495.327797] x7 : 0000ffffd1f0b9e8 x6 : 0000000000000000
[  495.327801] x5 : 00000000004e1dd0 x4 : 000000003ce19e50
[  495.327805] x3 : 0000000000000000 x2 : 0000ffffd1f0b7f0
[  495.327809] x1 : 0000ffffd1f0b7e0 x0 : 00000000004ae868
[  495.327810]
[  495.327817] Unhandled fault: synchronous external abort (0x96000210) at 0xffff800000e1ec10
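
The value in parentheses on the last line is the ARMv8 exception
syndrome (ESR), and its fields can be pulled apart with a few shifts.
The Python sketch below is an illustration, not kernel code; it
follows the ESR_ELx layout (EC in bits 31:26, IL in bit 25, ISS in
bits 24:0, fault status code in ISS bits 5:0) and only names the few
exception classes listed in the table, which is an abbreviated
selection.

```python
# Abbreviated selection of ESR_ELx exception classes (EC field).
EC_NAMES = {
    0x20: "instruction abort from a lower exception level",
    0x21: "instruction abort from the current exception level",
    0x24: "data abort from a lower exception level",
    0x25: "data abort from the current exception level",
}

# Fault status code: synchronous external abort, not on a table walk.
DFSC_SYNC_EXTERNAL_ABORT = 0x10


def decode_esr(esr):
    """Split an ESR_ELx value into EC, IL, ISS and fault status code."""
    ec = (esr >> 26) & 0x3F
    iss = esr & 0x1FFFFFF
    dfsc = iss & 0x3F
    return {
        "ec": ec,
        "ec_name": EC_NAMES.get(ec, "other"),
        "il": (esr >> 25) & 1,
        "iss": iss,
        "dfsc": dfsc,
        "sync_external_abort": dfsc == DFSC_SYNC_EXTERNAL_ABORT,
    }
```

For 0x96000210 this gives EC 0x25 (a data abort) with fault status
0x10, matching the "synchronous external abort" text in the log. An
external abort on a data access is consistent with the memory system
reporting an error on a read, although the log alone does not prove
that the injected ECC error was the cause.
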

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: edac driver injection of uncorrected errors & utils
  2018-11-28 22:14                           ` Tracy Smith
@ 2018-11-28 23:44                             ` Borislav Petkov
  2018-12-05 16:37                               ` Tracy Smith
  0 siblings, 1 reply; 24+ messages in thread
From: Borislav Petkov @ 2018-11-28 23:44 UTC (permalink / raw)
  To: Tracy Smith; +Cc: york.sun, linux-edac, util-linux, lkml

On Wed, Nov 28, 2018 at 04:14:24PM -0600, Tracy Smith wrote:
> Is there another way of creating an uncorrected error without crashing
> Linux using the layerscape driver? I would like to see a UE error
> collected without a Linux crash scenario because I need to validate
> UEs are being collected.

It depends on whether the hardware is causing the crash on uncorrectable
error to prevent data corruption or the error handler is calling panic()
or somesuch. If it is the former, then you need to disable that feature
- if at all possible (no clue what that platform does).

If it is the latter, you can comment out the panic() for testing
purposes only and then inject. For an example of what x86 does, see
"tolerant" here:

Documentation/x86/x86_64/machinecheck

HTH.

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: edac driver injection of uncorrected errors & utils
  2018-11-28 23:44                             ` Borislav Petkov
@ 2018-12-05 16:37                               ` Tracy Smith
  2018-12-05 17:12                                 ` Borislav Petkov
  2018-12-05 17:59                                 ` York Sun
  0 siblings, 2 replies; 24+ messages in thread
From: Tracy Smith @ 2018-12-05 16:37 UTC (permalink / raw)
  To: bp; +Cc: york.sun, linux-edac, util-linux, lkml

This was very helpful. Tracing through the code, it doesn't do a panic
before Linux crashes from multi-bit errors because as York has
indicated, this type of memory controller doesn't limit the number of
errors.

I do have a general question about single bit errors.  The EDAC driver
corrects single bit errors by doing a scrub, is this correct?  The
edac code does not do periodic scrubs, but I see scrubs when a
correctable error is found (edac_mc_scrub_block and edac_atomic_scrub
in edac_mc.c)?

This is more directed toward York for layerscape. I see some edac code
that seem to do periodic scrubs based on intervals or scrub rate, but
that is not needed for the layerscape driver to correct errors because
errors are scrubbed when found by the edac scrub block or is it
because the memory controller itself does the correction/scrubbing
when an error is found?

thx,
Tracy







^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: edac driver injection of uncorrected errors & utils
  2018-12-05 16:37                               ` Tracy Smith
@ 2018-12-05 17:12                                 ` Borislav Petkov
  2018-12-05 17:59                                 ` York Sun
  1 sibling, 0 replies; 24+ messages in thread
From: Borislav Petkov @ 2018-12-05 17:12 UTC (permalink / raw)
  To: Tracy Smith; +Cc: york.sun, linux-edac, util-linux, lkml

On Wed, Dec 05, 2018 at 10:37:52AM -0600, Tracy Smith wrote:
> This was very helpful.

I'm glad.

Can you do me a favor pls and not top-post when replying on a mailing
list?

Thx.

> Tracing through the code, it doesn't do a panic
> before Linux crashes from multi-bit errors because as York has
> indicated, this type of memory controller doesn't limit the number of
> errors.
> 
> I do have a general question about single bit errors.  The EDAC driver
> corrects single bit errors by doing a scrub, is this correct?  The
> edac code does not do periodic scrubs, but I see scrubs when a
> correctable error is found (edac_mc_scrub_block and edac_atomic_scrub
> in edac_mc.c)?
> 
> This is more directed toward York for layerscape.

Yes, this is all platform-specific as you can see that some arches
implement that atomic scrubbing thing. Also, not every driver sets

  mci->scrub_mode = SCRUB_SW_SRC

in order to even do the scrubbing.

HTH.

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: edac driver injection of uncorrected errors & utils
  2018-12-05 16:37                               ` Tracy Smith
  2018-12-05 17:12                                 ` Borislav Petkov
@ 2018-12-05 17:59                                 ` York Sun
  2018-12-05 21:59                                   ` Patrol scrub questions Tracy Smith
  1 sibling, 1 reply; 24+ messages in thread
From: York Sun @ 2018-12-05 17:59 UTC (permalink / raw)
  To: Tracy Smith, bp; +Cc: linux-edac, util-linux, lkml

On 12/5/18 8:38 AM, Tracy Smith wrote:
> This is more directed toward York for layerscape. I see some edac code
> that seem to do periodic scrubs based on intervals or scrub rate, but
> that is not needed for the layerscape driver to correct errors because
> errors are scrubbed when found by the edac scrub block or is it
> because the memory controller itself does the correction/scrubbing
> when an error is found?

Single-bit errors are corrected by memory controller without involving
software.

York

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Patrol scrub questions
  2018-12-05 17:59                                 ` York Sun
@ 2018-12-05 21:59                                   ` Tracy Smith
  2018-12-05 22:12                                     ` York Sun
  0 siblings, 1 reply; 24+ messages in thread
From: Tracy Smith @ 2018-12-05 21:59 UTC (permalink / raw)
  To: york.sun; +Cc: bp, linux-edac, util-linux, lkml

>Single-bit errors are corrected by memory controller without involving software.

Sorry for being verbose, but I need to explain the reason for the
questions below since I need to determine if a memory scrub is
required on layerscape and why.  There are multiple layers to the
problem of ECC.

First layer, there is the immediate 'correction' of a flipped bit.

This does not 'fix' the source of the error but corrects the flipped
bit for use by the processor.

Most bit flips will be due to either a transitory noise problem on the
bus, which will not be associated with any given memory cell, OR it
will be due to a cosmic-ray induced bit flip in the memory cell which
will stay 'flipped' until the location has been written to again.

The safe action is to write the ECC corrected data back to the same
'error' location in memory. Does the layerscape memory controller
without software intervention do this?

Question 1) Does the layerscape memory controller automatically
perform a write of the corrected data back to the 'error' location to
make a correction?  If not, is a patrol scrub required to do this?

Second layer, there is the risk of a double bit flip in memory.

Statistically this is very rare, but the odds significantly increase
that a double bit flip will occur in a single word when a single bit
flip goes uncorrected, giving more time for another cosmic ray induced
bit flip to occur in that word.

The layerscape memory controller can only detect a bit-flip when a
given location is read, correct?  This is different from normal DRAM
refresh routines.

If a location is not normally read, it can go 'unserviced'
indefinitely, allowing multiple bit flips to accumulate.

By periodically (once a day should be more than sufficient overkill)
reading each location in the DRAM and writing that same (automatically
ECC corrected if correction was needed) value back into the DRAM, we
drastically reduce the potential for an uncorrectable multiple bit
error to accumulate in any given word in memory.
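
A rough way to put numbers on this argument, as a sketch only: assume
bit flips hit a given word as a Poisson process, and that each scrub
pass returns the word to zero accumulated flips. The flip rate, memory
size and time horizon below are made-up illustration values, not
Layerscape or DRAM vendor figures, and the function names are
hypothetical.

```python
import math


def p_multi_flip(flip_rate, interval_hours):
    """Probability that one word collects two or more flips within one
    scrub interval, for a Poisson process: 1 - e^-x * (1 + x)."""
    x = flip_rate * interval_hours
    if x < 1e-3:
        # Series expansion avoids cancellation for very small x.
        return x * x / 2.0 - x ** 3 / 3.0
    return 1.0 - math.exp(-x) * (1.0 + x)


def expected_multi_flip_words(flip_rate, interval_hours, words, horizon_hours):
    """Expected number of words that collect two or more flips over the
    horizon, when every word is scrubbed once per interval."""
    intervals = horizon_hours / interval_hours
    return words * intervals * p_multi_flip(flip_rate, interval_hours)


# Hypothetical inputs: 2**27 words (1 GiB of 64-bit words), one year,
# and an assumed rate of 1e-12 flips per word per hour.
WORDS = 2 ** 27
YEAR = 24 * 365
RATE = 1e-12

for interval in (24, 24 * 30, YEAR):
    ue = expected_multi_flip_words(RATE, interval, WORDS, YEAR)
    print(f"scrub every {interval:5d} h -> expected multi-bit words/year: {ue:.3e}")
```

For small rates the per-interval probability grows with the square of
the interval while the number of intervals shrinks linearly, so under
these assumptions the expected count of multi-bit words scales roughly
linearly with the scrub interval.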

Question 2) Again this would require the EDAC layerscape driver to do
a patrol scrub, correct?  If not, how is this handled by the memory
controller to avoid the need for a patrol scrub?

Third layer, there is how the memory controller handles UEs. My
understanding is that the layerscape memory controller can detect
whether an error is a single-bit (correctable) error or a multi-bit
error that is not correctable. Is this the case?
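
To illustrate the single-bit versus multi-bit distinction, here is a toy
SECDED (single error correct, double error detect) code: an extended
Hamming code over 4 data bits. This is only a sketch of the general
technique. The code actually used by the layerscape memory controller
is not stated in this thread, real controllers typically protect much
wider words, and the function names are hypothetical.

```python
def secded_encode(nibble):
    """Encode 4 data bits into an 8-bit extended Hamming codeword.
    Index 0 holds overall parity; indexes 1, 2 and 4 hold Hamming parity;
    indexes 3, 5, 6 and 7 hold the data bits."""
    code = [0] * 8
    code[3], code[5], code[6], code[7] = [(nibble >> i) & 1 for i in range(4)]
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) % 2  # makes the total number of 1 bits even
    return code


def secded_decode(received):
    """Return (status, data): status is "OK", "CE" (one flip, corrected)
    or "UE" (two flips detected, data not recoverable, returned as None)."""
    code = list(received)  # work on a copy
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos  # XOR of the positions of all set bits
    parity_odd = sum(code) % 2 == 1
    if parity_odd:
        # Odd overall parity: assume a single flip at index `syndrome`
        # (syndrome 0 means the overall parity bit itself flipped).
        code[syndrome] ^= 1
        status = "CE"
    elif syndrome != 0:
        # Even overall parity but non-zero syndrome: two bits flipped.
        return "UE", None
    else:
        status = "OK"
    data = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return status, data


word = secded_encode(0b1011)
one_flip = list(word)
one_flip[5] ^= 1
two_flips = list(one_flip)
two_flips[6] ^= 1
for label, cw in (("clean", word), ("1 flip", one_flip), ("2 flips", two_flips)):
    print(label, secded_decode(cw))
```

Three or more flips in one word can look like a single flip to a code
of this kind and be "corrected" to the wrong data, which is one more
reason not to let flips accumulate.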

An uncorrectable error in the data or the software will have
consequences ranging from negligible to critical.  From a hardware
standpoint it can't tell if it is critical so it must assume it is.

Question 3) Because the memory controller or layerscape platform must
assume a UE is critical, will a single UE on layerscape cause a WDT to
be triggered and a reset to occur?

Question 4) If so, will a panic ever be called if there is a hardware
uncorrectable memory failure?

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Patrol scrub questions
  2018-12-05 21:59                                   ` Patrol scrub questions Tracy Smith
@ 2018-12-05 22:12                                     ` York Sun
  2018-12-05 22:53                                       ` Layerscape behavior when a UE is detected Tracy Smith
  0 siblings, 1 reply; 24+ messages in thread
From: York Sun @ 2018-12-05 22:12 UTC (permalink / raw)
  To: Tracy Smith; +Cc: bp, linux-edac, util-linux, lkml

On 12/5/18 2:00 PM, Tracy Smith wrote:
>> Single-bit errors are corrected by memory controller without involving software.
> 
> Sorry for being verbose, but I need to explain the reason for the
> questions below since I need to determine if a memory scrub is
> required on layerscape and why.  There are multiple layers to the
> problem of ECC.
> 
> First layer, there is the immediate 'correction' of a flipped bit.
> 
> This does not 'fix' the source of the error but corrects the flipped
> bit for use by the processor.
> 
> Most bit flips will be due to either a transitory noise problem on the
> bus, which will not be associated with any given memory cell, OR it
> will be due to a cosmic-ray induced bit flip in the memory cell which
> will stay 'flipped' until the location has been written to again.
> 
> The safe action is to write the ECC corrected data back to the same
> 'error' location in memory. Does the layerscape memory controller
> without software intervention do this?
> 
> Question 1) Does the layerscape memory controller automatically
> perform a write of the corrected data back to the 'error' location to
> make a correction?  If not, is a patrol scrub required to do this?
> 

Tracy,

Layerscape SoCs have a feature that fixes any detected single-bit error.
It is not part of the EDAC driver. The error is still counted, so the EDAC
driver can "see" it. You can refer to the SoC reference manual.
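
As a sketch of how those counts can be checked from user space: the EDAC
core exposes per-controller counters in sysfs, so a small helper can read
them before and after a test. The default directory below assumes the
controller is registered as mc0 (as in the edac-ctl output earlier in
this thread), and the helper name is hypothetical.

```python
from pathlib import Path


def read_edac_counts(mc_dir="/sys/devices/system/edac/mc/mc0"):
    """Read the correctable (CE) and uncorrectable (UE) error counters
    that the EDAC core exposes for one memory controller."""
    mc = Path(mc_dir)
    return {
        name: int((mc / name).read_text().strip())
        for name in ("ce_count", "ue_count")
    }


if __name__ == "__main__":
    try:
        print(read_edac_counts())
    except OSError as exc:
        # No EDAC driver loaded, or a different controller index.
        print(f"EDAC counters not available: {exc}")
```

Comparing ce_count before and after an injected single-bit error is one
way to confirm the driver sees what the hardware corrected.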

> Question 3) Because the memory controller or layerscape platform must
> assume a UE is critical, will a single UE on layerscape cause a WDT to
> be triggered and a reset to occur?

No.

> 
> Question 4) If so, will a panic ever be called if there is a hardware
> uncorrectable memory failure?

No. It is up to upper layer of EDAC driver. Layerscape driver only
reports CEs and UEs.

York

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Layerscape behavior when a UE is detected
  2018-12-05 22:12                                     ` York Sun
@ 2018-12-05 22:53                                       ` Tracy Smith
  2018-12-05 22:57                                         ` York Sun
  0 siblings, 1 reply; 24+ messages in thread
From: Tracy Smith @ 2018-12-05 22:53 UTC (permalink / raw)
  To: york.sun; +Cc: bp, linux-edac, util-linux, lkml

>> Question 4) If so, will a panic ever be called if there is a hardware
>> uncorrectable memory failure?

>No. It is up to upper layer of EDAC driver. Layerscape driver only reports CEs and UEs.

Just to be clear, the upper layer of the EDAC driver will or will not
panic when a UE is detected on layerscape?

If there is no panic by the upper layer and no reset triggered by the
layerscape CPLD or memory controller, what happens on layerscape when
a UE is detected by the memory controller?

Forcing a UE by grounding a dataline caused a reset on layerscape
after a few seconds, but no panic. It is unclear why it reset, but it
appears as though a WDT was tripped. The UE was reported by EDAC and
seen in the log.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Layerscape behavior when a UE is detected
  2018-12-05 22:53                                       ` Layerscape behavior when a UE is detected Tracy Smith
@ 2018-12-05 22:57                                         ` York Sun
  2018-12-05 23:41                                           ` Layerscape UE detected and no EDAC panic Tracy Smith
  0 siblings, 1 reply; 24+ messages in thread
From: York Sun @ 2018-12-05 22:57 UTC (permalink / raw)
  To: Tracy Smith; +Cc: bp, linux-edac, util-linux, lkml

On 12/5/18 2:54 PM, Tracy Smith wrote:
>>> Question 4) If so, will a panic ever be called if there is a hardware
>>> uncorrectable memory failure?
> 
>> No. It is up to upper layer of EDAC driver. Layerscape driver only reports CEs and UEs.
> 
> Just to be clear, the upper layer of the EDAC driver will or will not
> panic when a UE is detected on layerscape?
> 
> If there is no panic by the upper layer and no reset triggered by the
> layerscape CPLD or memory controller, what happens on layerscape when
> a UE is detected by the memory controller?
> 
> Forcing a UE by grounding a dataline caused a reset on layerscape
> after a few seconds, but no panic. It is unclear why it reset, but it
> appears as though a WDT was tripped. The UE was reported by EDAC and
> seen in the log.
> 
I can't help you on that. I never tried to force errors by grounding the
signals. You have read the driver. Do you see panic? The idea is to
report the error and let upper layer to decide what to do. Sometimes
limping forward is better than reset or panic. Again, it is not driver's
responsibility.

York

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Layerscape UE detected and no EDAC panic
  2018-12-05 22:57                                         ` York Sun
@ 2018-12-05 23:41                                           ` Tracy Smith
  0 siblings, 0 replies; 24+ messages in thread
From: Tracy Smith @ 2018-12-05 23:41 UTC (permalink / raw)
  To: york.sun; +Cc: bp, linux-edac, util-linux, lkml

> I can't help you on that. I never tried to force errors by grounding the
> signals. You have read the driver. Do you see panic? The idea is to
> report the error and let upper layer to decide what to do. Sometimes
> limping forward is better than reset or panic. Again, it is not driver's
> responsibility.

Thanks for the clarification York. Yes there is panic code in the EDAC
upper layer, but no panic occurred.  A UE was printed on the serial
console, and the layerscape board reset.

It did not panic because edac_mc_panic_on_ue has to be set at runtime.
I just validated that setting it causes a panic on a UE. A memory UE by
itself should not reset the board, so the reset came from grounding the
data line, i.e. from how I was testing, not from the UE itself.

This is the way to force a panic on a UE:

  echo "1" > /sys/module/edac_core/parameters/edac_mc_panic_on_ue

MODULE_PARM_DESC(edac_mc_panic_on_ue, "Panic on uncorrected error: 0=off 1=on");
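
For scripted test runs, the same toggle can be wrapped in a small helper,
sketched here. It assumes the parameter lives at the path used in the
echo command above and that the script runs with enough privilege to
write it; the function names are hypothetical.

```python
from pathlib import Path

# Path taken from the echo command above; it may differ if the EDAC
# core is built or named differently on another kernel.
PANIC_ON_UE = "/sys/module/edac_core/parameters/edac_mc_panic_on_ue"


def get_panic_on_ue(path=PANIC_ON_UE):
    """Return True if the EDAC core is set to panic on an uncorrected error."""
    return Path(path).read_text().strip() == "1"


def set_panic_on_ue(enabled, path=PANIC_ON_UE):
    """Write 1 (panic on UE) or 0 (report only) to the module parameter."""
    Path(path).write_text("1" if enabled else "0")


if __name__ == "__main__":
    try:
        print("panic on UE:", get_panic_on_ue())
    except OSError as exc:
        print(f"parameter not available: {exc}")
```

Reading the value back before injecting a UE makes it clear which of the
two behaviours (panic or report only) a given test run is exercising.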

So, this is validated. Produced a UE and was able to avoid a panic and
I was able to induce a panic on a UE.  I'm satisfied with this. thanks
again!!

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2018-12-05 23:41 UTC | newest]

Thread overview: 24+ messages
     [not found] <BYAPR02MB431115EC4735AE5B7E29F2CEF6DC0@BYAPR02MB4311.namprd02.prod.outlook.com>
     [not found] ` <BYAPR02MB43110062F32BFDEA712AB371F6DC0@BYAPR02MB4311.namprd02.prod.outlook.com>
     [not found]   ` <CAChUvXMp6S6MBY_LmrfgdPcctQw70FoyxbiHeFqK+5fQx5omCw@mail.gmail.com>
2018-11-16 17:07     ` FW: edac driver initialization, interrupt, & debug Tracy Smith
2018-11-17 14:05       ` Borislav Petkov
2018-11-17 23:22         ` Tracy Smith
2018-11-18  1:05           ` Steve Inkpen
2018-11-19 16:37             ` Tracy Smith
2018-11-19 16:48               ` York Sun
2018-11-21 17:01                 ` Tracy Smith
2018-11-21 22:02                   ` Tracy Smith
2018-11-28 18:48                   ` edac driver injection of uncorrected errors & utils Tracy Smith
2018-11-28 19:06                     ` York Sun
2018-11-28 19:11                       ` Tracy Smith
2018-11-28 19:24                         ` York Sun
2018-11-28 22:14                           ` Tracy Smith
2018-11-28 23:44                             ` Borislav Petkov
2018-12-05 16:37                               ` Tracy Smith
2018-12-05 17:12                                 ` Borislav Petkov
2018-12-05 17:59                                 ` York Sun
2018-12-05 21:59                                   ` Patrol scrub questions Tracy Smith
2018-12-05 22:12                                     ` York Sun
2018-12-05 22:53                                       ` Layerscape behavior when a UE is detected Tracy Smith
2018-12-05 22:57                                         ` York Sun
2018-12-05 23:41                                           ` Layerscape UE detected and no EDAC panic Tracy Smith
2018-11-19 16:24           ` FW: edac driver initialization, interrupt, & debug York Sun
2018-11-19 15:55         ` York Sun
