util-linux.vger.kernel.org archive mirror
From: Tracy Smith <tlsmith3777@gmail.com>
To: york.sun@nxp.com
Cc: linux-edac@vger.kernel.org, util-linux@vger.kernel.org,
	lkml <linux-kernel@vger.kernel.org>
Subject: Re: edac driver injection of uncorrected errors & utils
Date: Wed, 28 Nov 2018 16:14:24 -0600
Message-ID: <CAChUvXNz=XR37EeVRJFnF195BGgjPgCeFv6nbjR3yV2_mQwxBQ@mail.gmail.com>
In-Reply-To: <AM0PR04MB3971FFCCDAF29E7DF0F0EE159AD10@AM0PR04MB3971.eurprd04.prod.outlook.com>

Nothing appears in the logs or in the edac-util output to indicate that
a multi-bit UE (uncorrected error) occurred. There is just a crash, and
even then I'm not 100% certain it was caused by multi-bit errors without
debugging the crash. It happened when writing 1 to inject_data_lo and
inject_data_hi and 0x100 to inject_ctrl.

Is there another way to create an uncorrected error with the layerscape
driver that does not crash Linux? I would like to see a UE collected
without a crash, because I need to validate that UEs are being
collected.

Do AMD platforms or other memory controllers also crash Linux on
multi-bit errors and fail to collect uncorrected errors? This is a
concern in the field, since there is no way of knowing that multi-bit
errors occurred or that they caused the crash.

In production and in the field, we can't have the Linux kernel or the
layerscape driver crashing on multi-bit errors without leaving any
information in the kernel log about what caused the crash. First, it
could cost millions in highly critical use cases. Second, it should be
preventable.

So two concerns/questions:

1. I need a way to validate that UEs are captured without crashing the kernel.
2. On multi-bit errors, I need a way to catch a UE before a kernel crash
and ideally prevent the kernel from crashing on multi-bit errors.

Any recommendations?
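
For concern 1, the per-controller counters that the EDAC core exposes in
sysfs are the usual place to check whether an error was recorded. The
following is a minimal sketch, assuming the standard ce_count and
ue_count files exist under the mc0 directory used in this thread; the
helper name edac_counts is made up for illustration and is not part of
edac-utils.

```shell
# Print the corrected/uncorrected error counters for one memory controller.
# $1 is the controller's sysfs directory; the default (mc0) is an assumption
# taken from the paths used in this thread.
edac_counts() {
    mc_dir="${1:-/sys/devices/system/edac/mc/mc0}"
    for name in ce_count ue_count; do
        if [ -r "$mc_dir/$name" ]; then
            # Counter file is readable: report its current value.
            printf '%s=%s\n' "$name" "$(cat "$mc_dir/$name")"
        else
            # Counter file is missing or unreadable on this system.
            printf '%s=unavailable\n' "$name"
        fi
    done
}
```

Calling edac_counts once before and once after an injection shows
whether ue_count moved. On the board described here the second call may
never run if the kernel aborts first, which is exactly the gap raised in
concern 2.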

Scenario produced on an ARM layerscape board.

echo 1 > /sys/devices/system/edac/mc/mc0/inject_data_lo
echo 1 > /sys/devices/system/edac/mc/mc0/inject_data_hi
echo 0x100 > /sys/devices/system/edac/mc/mc0/inject_ctrl

[495.327720] CPU: 3 PID: 1239 Comm: sh Not tainted 4.1.35-rt41#1
[  495.327723] EDAC FSL_DDR MC0: Err Detect Register: 0x80000008
[  495.327725] Hardware name: LS1043A Board (DT)
[  495.327735] task: ffff800063dd3300 ti: ffff800073358000 task.ti: ffff800073358000
[  495.327740] PC is at 0x42cf80
[  495.327742] LR is at 0x42d20c
[  495.327745] pc : [<000000000042cf80>] lr : [<000000000042d20c>] pstate: 20000000
[  495.327746] sp : ffff80007335bff0
[  495.327751] x29: 0000ffffd1f0b6e0 x28: 00000000004e0000
[  495.327756] x27: 000000003cdf81b0 x26: 00000000004d8000
[  495.327760] x25: 00000000004aea80 x24: 00000000004aea88
[  495.327764] x23: 00000000004e1000 x22: 00000000004c0e10
[  495.327768] x21: 00000000004aed98 x20: 00000000004ae868
[  495.327772] x19: 00000000004ae868 x18: 0000000000000015
[  495.327776] x17: 0000ffff7a24fb48 x16: 00000000004d8638
[  495.327781] x15: 002372c270000000 x14: ffffffffffffffff
[  495.327785] x13: 0000000000000018 x12: 0000000000000028
[  495.327789] x11: 0000000000000038 x10: 0101010101010101
[  495.327793] x9 : fefefefefefefeff x8 : 000000003ce19f50
[  495.327797] x7 : 0000ffffd1f0b9e8 x6 : 0000000000000000
[  495.327801] x5 : 00000000004e1dd0 x4 : 000000003ce19e50
[  495.327805] x3 : 0000000000000000 x2 : 0000ffffd1f0b7f0
[  495.327809] x1 : 0000ffffd1f0b7e0 x0 : 00000000004ae868
[  495.327810]
[  495.327817] Unhandled fault: synchronous external abort (0x96000210) at 0xffff800000e1ec10

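
The value in parentheses on the "Unhandled fault" line is the ARMv8
exception syndrome (ESR). The sketch below splits it into the exception
class (bits 31:26) and the data fault status code (bits 5:0); the helper
name decode_esr is hypothetical. It only restates what the log already
says and does not show where the abort originated.

```shell
# Split an ARMv8 ESR value into its exception class (EC) and
# data fault status code (DFSC) fields.
# $1 is the ESR as printed in the log, e.g. 0x96000210.
decode_esr() {
    esr=$(( $1 ))
    ec=$(( (esr >> 26) & 0x3f ))   # bits 31:26, exception class
    dfsc=$(( esr & 0x3f ))         # bits 5:0, fault status code
    printf 'EC=0x%02x DFSC=0x%02x\n' "$ec" "$dfsc"
}
```

For the log above, decode_esr 0x96000210 prints EC=0x25 DFSC=0x10. In
the Arm architecture encoding, EC 0x25 is a data abort taken without a
change in exception level and DFSC 0x10 is a synchronous external abort
not on a translation table walk, which matches the kernel's message.
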
On Wed, Nov 28, 2018 at 1:24 PM York Sun <york.sun@nxp.com> wrote:
>
> Tracy,
>
> This DDR controller doesn't have the capability to inject limited
> errors. As soon as you enable the error injection, all memory
> transactions will carry the error. Since multi-bit errors are not
> correctable, I don't expect Linux to work properly with these errors.
>
> York
>
>
> On 11/28/18 1:11 PM, Tracy Smith wrote:
> > Thanks York. Why will injecting multi-bit errors crash linux?  Is this
> > the case only for layerscape?  Is there a way to harden against this?
> >
> > On Wed, Nov 28, 2018 at 1:06 PM York Sun <york.sun@nxp.com> wrote:
> >>
> >> Tracy,
> >>
> >> You can inject multiple-bit errors. You will crash the system for doing
> >> that. I can't comment on edac-util.
> >>
> >> York
> >>
> >>
> >> On 11/28/18 12:49 PM, Tracy Smith wrote:
> >>> Can I inject an uncorrected error or only corrected errors using the
> >>> layerscape edac driver injection via sysfs?
> >>>
> >>> Is this the expected output for the edac-util on layerscape when
> >>> injecting errors?
> >>>
> >>> root@ls1043ardb:~# edac-util -v
> >>> mc0: 0 Uncorrected Errors with no DIMM info
> >>> mc0: 0 Corrected Errors with no DIMM info
> >>> mc0: csrow0: 0 Uncorrected Errors
> >>> mc0: csrow0: mc#0csrow#0channel#0: 643 Corrected Errors
> >>>
> >>> root@ls1043ardb:~# edac-util -vs
> >>> edac-util: EDAC drivers are loaded. 1 MC detected:
> >>> mc0:fsl_mc_err
> >>>
> >>> root@ls1043ardb:~# edac-util
> >>> mc0: csrow0: mc#0csrow#0channel#0: 2700 Corrected Errors
> >>>
> >>> Does edac-ctl function on ARM based platforms or only on x86 and why
> >>> might it show 0MB for the memory layout for DDR4 as below?
> >>>
> >>> /run/media/nvme0n1p1/tls/neo_mcu-kernel/drivers/edac-utils# edac-ctl --layout
> >>> readline() on closed filehandle IN at /usr/sbin/edac-ctl line 514.
> >>> Use of uninitialized value $size in sprintf at /usr/sbin/edac-ctl line 533.
> >>> Use of uninitialized value $size in sprintf at /usr/sbin/edac-ctl line 533.
> >>> Use of uninitialized value $size in sprintf at /usr/sbin/edac-ctl line 533.
> >>> Use of uninitialized value $size in sprintf at /usr/sbin/edac-ctl line 533.
> >>>           +-----------------------------------------------+
> >>>           |                      mc0                      |
> >>>           |  csrow0   |  csrow1   |  csrow2   |  csrow3   |
> >>> ----------+-----------------------------------------------+
> >>> channel0: |     0 MB  |     0 MB  |     0 MB  |     0 MB  |
> >>> ----------+-----------------------------------------------+
> >>>
> >>
> >
> >
> > --
> > Confidentiality notice: This e-mail message, including any
> > attachments, may contain legally privileged and/or confidential
> > information. If you are not the intended recipient(s), please
> > immediately notify the sender and delete this e-mail message.
> >
>



