* Re: edac driver injection of uncorrected errors & utils
       [not found] ` <AM0PR04MB3971FFCCDAF29E7DF0F0EE159AD10@AM0PR04MB3971.eurprd04.prod.outlook.com>
@ 2018-11-28 22:14   ` Tracy Smith
  2018-11-28 23:44     ` Borislav Petkov
  0 siblings, 1 reply; 10+ messages in thread
From: Tracy Smith @ 2018-11-28 22:14 UTC (permalink / raw)
  To: york.sun; +Cc: linux-edac, util-linux, lkml

Nothing appears in the logs or from edac-util indicating there was a
multi-bit UE (uncorrected error). Just a crash, and even then I'm not
100% certain it is caused by multi-bit errors without debugging the
crash.

It happened when writing a 1 to inject_data_lo/inject_data_hi and 0x100
to inject_ctrl.

Is there another way of creating an uncorrected error without crashing
Linux using the layerscape driver? I would like to see a UE error
collected without a Linux crash scenario because I need to validate
UEs are being collected.

Do the AMD platform or other memory controllers crash Linux on
multi-bit errors and fail to collect uncorrected errors? This is a
concern in the field since there is no way of knowing that multi-bit
errors occurred and that multi-bit errors caused the crash.

For production and in the field, we can't have the Linux kernel or
layerscape driver crashing the kernel when there are multi-bit errors
and not giving any information on what caused the crash in the kernel
log. First, it could cost millions in highly critical use cases.
Second, it should be preventable.

So two concerns/questions:

1. Need a way to validate UE errors are captured without crashing the
   kernel
2. On multi-bit errors need a way to catch a UE before a kernel crash
   and ideally prevent the kernel from crashing on multi-bit errors

Any recommendations?

Scenario produced on an ARM layerscape board.

echo 1 > /sys/devices/system/edac/mc/mc0/inject_data_lo
echo 1 > /sys/devices/system/edac/mc/mc0/inject_data_hi
echo 0x100 > /sys/devices/system/edac/mc/mc0/inject_ctrl

[  495.327720] CPU: 3 PID: 1239 Comm: sh Not tainted 4.1.35-rt41#1
[  495.327723] EDAC FSL_DDR MC0: Err Detect Register: 0x80000008
[  495.327725] Hardware name: LS1043A Board (DT)
[  495.327735] task: ffff800063dd3300 ti: ffff800073358000 task.ti: ffff800073358000
[  495.327740] PC is at 0x42cf80
[  495.327742] LR is at 0x42d20c
[  495.327745] pc : [<000000000042cf80>] lr : [<000000000042d20c>] pstate: 20000000
[  495.327746] sp : ffff80007335bff0
[  495.327751] x29: 0000ffffd1f0b6e0 x28: 00000000004e0000
[  495.327756] x27: 000000003cdf81b0 x26: 00000000004d8000
[  495.327760] x25: 00000000004aea80 x24: 00000000004aea88
[  495.327764] x23: 00000000004e1000 x22: 00000000004c0e10
[  495.327768] x21: 00000000004aed98 x20: 00000000004ae868
[  495.327772] x19: 00000000004ae868 x18: 0000000000000015
[  495.327776] x17: 0000ffff7a24fb48 x16: 00000000004d8638
[  495.327781] x15: 002372c270000000 x14: ffffffffffffffff
[  495.327785] x13: 0000000000000018 x12: 0000000000000028
[  495.327789] x11: 0000000000000038 x10: 0101010101010101
[  495.327793] x9 : fefefefefefefeff x8 : 000000003ce19f50
[  495.327797] x7 : 0000ffffd1f0b9e8 x6 : 0000000000000000
[  495.327801] x5 : 00000000004e1dd0 x4 : 000000003ce19e50
[  495.327805] x3 : 0000000000000000 x2 : 0000ffffd1f0b7f0
[  495.327809] x1 : 0000ffffd1f0b7e0 x0 : 00000000004ae868
[  495.327810]
[  495.327817] Unhandled fault: synchronous external abort (0x96000210) at 0xffff800000e1ec10

On Wed, Nov 28, 2018 at 1:24 PM York Sun <york.sun@nxp.com> wrote:
>
> Tracy,
>
> This DDR controller doesn't have the capability to inject limited
> errors. As soon as you enable the error injection, all memory
> transactions will carry the error. Since multi-bit errors are not
> correctable. I don't expect Linux to work properly with these errors.
>
> York
>
>
> On 11/28/18 1:11 PM, Tracy Smith wrote:
> > Thanks York. Why will injecting multi-bit errors crash linux? Is this
> > the case only for layerscape? Is there a way to harden against this?
> >
> > On Wed, Nov 28, 2018 at 1:06 PM York Sun <york.sun@nxp.com> wrote:
> >>
> >> Tracy,
> >>
> >> You can inject multiple-bit errors. You will crash the system for doing
> >> that. I can't comment on edac-util.
> >>
> >> York
> >>
> >>
> >> On 11/28/18 12:49 PM, Tracy Smith wrote:
> >>> Can I inject a uncorrected error or only corrected errors using the
> >>> layerscape edac driver injection via sysfs?
> >>>
> >>> Is this the expected output for the edac-util on layerscape when
> >>> injecting errors?
> >>>
> >>> root@ls1043ardb:~# edac-util -v
> >>> mc0: 0 Uncorrected Errors with no DIMM info
> >>> mc0: 0 Corrected Errors with no DIMM info
> >>> mc0: csrow0: 0 Uncorrected Errors
> >>> mc0: csrow0: mc#0csrow#0channel#0: 643 Corrected Errors
> >>>
> >>> root@ls1043ardb:~# edac-util -vs
> >>> edac-util: EDAC drivers are loaded. 1 MC detected:
> >>> mc0:fsl_mc_err
> >>>
> >>> root@ls1043ardb:~# edac-util
> >>> mc0: csrow0: mc#0csrow#0channel#0: 2700 Corrected Errors
> >>>
> >>> Does edac-ctl function on ARM based platforms or only on x86 and why
> >>> might it show 0MB for the memory layout for DDR4 as below?
> >>>
> >>> /run/media/nvme0n1p1/tls/neo_mcu-kernel/drivers/edac-utils# edac-ctl --layout
> >>> readline() on closed filehandle IN at /usr/sbin/edac-ctl line 514.
> >>> Use of uninitialized value $size in sprintf at /usr/sbin/edac-ctl line 533.
> >>> Use of uninitialized value $size in sprintf at /usr/sbin/edac-ctl line 533.
> >>> Use of uninitialized value $size in sprintf at /usr/sbin/edac-ctl line 533.
> >>> Use of uninitialized value $size in sprintf at /usr/sbin/edac-ctl line 533.
> >>>           +-----------------------------------------------+
> >>>           |                      mc0                      |
> >>>           | csrow0    | csrow1    | csrow2    | csrow3    |
> >>> ----------+-----------------------------------------------+
> >>> channel0: |      0 MB |      0 MB |      0 MB |      0 MB |
> >>> ----------+-----------------------------------------------+
> >>>
> >>
> >
> >
> > --
> > Confidentiality notice: This e-mail message, including any
> > attachments, may contain legally privileged and/or confidential
> > information. If you are not the intended recipient(s), please
> > immediately notify the sender and delete this e-mail message.
> >
>

--
Confidentiality notice: This e-mail message, including any
attachments, may contain legally privileged and/or confidential
information. If you are not the intended recipient(s), please
immediately notify the sender and delete this e-mail message.

^ permalink raw reply	[flat|nested] 10+ messages in thread
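
[Editorial sketch] One way to check whether a UE was actually recorded is to compare the EDAC counters before and after an event. The snippet below is a minimal sketch, assuming the standard EDAC sysfs counter files `ce_count` and `ue_count` are present next to the inject files under the `mc0` directory used in this message. The function name `edac_counts` and its optional directory argument are hypothetical, added only so the helper can be tried against a scratch directory. It only reads counters; it does not avoid the crash described above.

```shell
#!/bin/sh
# Print the corrected/uncorrected error totals of one memory controller.
# $1: controller directory (assumed default: the mc0 path used in this thread)
edac_counts() {
    mc_dir="${1:-/sys/devices/system/edac/mc/mc0}"
    # ce_count and ue_count are per-controller totals exported by the EDAC core.
    # If a file cannot be read, report "n/a" instead of failing.
    ce=$(cat "$mc_dir/ce_count" 2>/dev/null) || ce="n/a"
    ue=$(cat "$mc_dir/ue_count" 2>/dev/null) || ue="n/a"
    echo "CE=$ce UE=$ue"
}
```

Calling `edac_counts` once before and once after an event and comparing the two lines shows whether the UE total moved, independent of what edac-util prints.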
* Re: edac driver injection of uncorrected errors & utils
  2018-11-28 22:14 ` edac driver injection of uncorrected errors & utils Tracy Smith
@ 2018-11-28 23:44   ` Borislav Petkov
  2018-12-05 16:37     ` Tracy Smith
  0 siblings, 1 reply; 10+ messages in thread
From: Borislav Petkov @ 2018-11-28 23:44 UTC (permalink / raw)
  To: Tracy Smith; +Cc: york.sun, linux-edac, util-linux, lkml

On Wed, Nov 28, 2018 at 04:14:24PM -0600, Tracy Smith wrote:
> Is there another way of creating an uncorrected error without crashing
> Linux using the layerscape driver? I would like to see a UE error
> collected without a Linux crash scenario because I need to validate
> UEs are being collected.

It depends on whether the hardware is causing the crash on uncorrectable
error to prevent data corruption or the error handler is calling panic()
or somesuch. If it is the former, then you need to disable that feature
- if at all possible (no clue what that platform does).

If it is the latter, you can comment out the panic() for testing
purposes only and inject then. For an example what x86 does, see
"tolerant" here:

Documentation/x86/x86_64/machinecheck

HTH.

--
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

^ permalink raw reply	[flat|nested] 10+ messages in thread
* Re: edac driver injection of uncorrected errors & utils
  2018-11-28 23:44 ` Borislav Petkov
@ 2018-12-05 16:37   ` Tracy Smith
  2018-12-05 17:12     ` Borislav Petkov
  2018-12-05 17:59     ` York Sun
  0 siblings, 2 replies; 10+ messages in thread
From: Tracy Smith @ 2018-12-05 16:37 UTC (permalink / raw)
  To: bp; +Cc: york.sun, linux-edac, util-linux, lkml

This was very helpful. Tracing through the code, it doesn't do a panic
before Linux crashes from multi-bit errors because, as York has
indicated, this type of memory controller doesn't limit the number of
errors.

I do have a general question about single bit errors. The EDAC driver
corrects single bit errors by doing a scrub, is this correct? The
edac code does not do periodic scrubs, but I see scrubs when a
correctable error is found (edac_mc_scrub_block and edac_atomic_scrub
in edac_mc.c)?

This is more directed toward York for layerscape. I see some edac code
that seems to do periodic scrubs based on intervals or scrub rate. Is
that not needed for the layerscape driver to correct errors because
errors are scrubbed when found by the edac scrub block, or is it
because the memory controller itself does the correction/scrubbing
when an error is found?

thx,
Tracy

On Wed, Nov 28, 2018 at 5:44 PM Borislav Petkov <bp@alien8.de> wrote:
>
> On Wed, Nov 28, 2018 at 04:14:24PM -0600, Tracy Smith wrote:
> > Is there another way of creating an uncorrected error without crashing
> > Linux using the layerscape driver? I would like to see a UE error
> > collected without a Linux crash scenario because I need to validate
> > UEs are being collected.
>
> It depends on whether the hardware is causing the crash on uncorrectable
> error to prevent data corruption or the error handler is calling panic()
> or somesuch. If it is the former, then you need to disable that feature
> - if at all possible (no clue what that platform does).
>
> If it is the latter, you can comment out the panic() for testing
> purposes only and inject then. For an example what x86 does, see
> "tolerant" here:
>
> Documentation/x86/x86_64/machinecheck
>
> HTH.
>
> --
> Regards/Gruss,
>     Boris.
>
> Good mailing practices for 400: avoid top-posting and trim the reply.

--
Confidentiality notice: This e-mail message, including any
attachments, may contain legally privileged and/or confidential
information. If you are not the intended recipient(s), please
immediately notify the sender and delete this e-mail message.

^ permalink raw reply	[flat|nested] 10+ messages in thread
* Re: edac driver injection of uncorrected errors & utils
  2018-12-05 16:37 ` Tracy Smith
@ 2018-12-05 17:12   ` Borislav Petkov
  2018-12-05 17:59   ` York Sun
  1 sibling, 0 replies; 10+ messages in thread
From: Borislav Petkov @ 2018-12-05 17:12 UTC (permalink / raw)
  To: Tracy Smith; +Cc: york.sun, linux-edac, util-linux, lkml

On Wed, Dec 05, 2018 at 10:37:52AM -0600, Tracy Smith wrote:
> This was very helpful.

I'm glad. Can you do me a favor pls and not top-post when replying on a
mailing list?

Thx.

> Tracing through the code, it doesn't do a panic
> before Linux crashes from multi-bit errors because as York has
> indicated, this type of memory controller doesn't limit the number of
> errors.
>
> I do have a general question about single bit errors. The EDAC driver
> corrects single bit errors by doing a scrub, is this correct? The
> edac code does not do periodic scrubs, but I see scrubs when a
> correctable error is found (edac_mc_scrub_block and edac_atomic_scrub
> in edac_mc.c)?
>
> This is more directed toward York for layerscape.

Yes, this is all platform-specific as you can see that some arches
implement that atomic scrubbing thing. Also, not every driver sets

	mci->scrub_mode == SCRUB_SW_SRC

in order to even do the scrubbing.

HTH.

--
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

^ permalink raw reply	[flat|nested] 10+ messages in thread
* Re: edac driver injection of uncorrected errors & utils
  2018-12-05 16:37 ` Tracy Smith
  2018-12-05 17:12   ` Borislav Petkov
@ 2018-12-05 17:59   ` York Sun
  2018-12-05 21:59     ` Patrol scrub questions Tracy Smith
  1 sibling, 1 reply; 10+ messages in thread
From: York Sun @ 2018-12-05 17:59 UTC (permalink / raw)
  To: Tracy Smith, bp; +Cc: linux-edac, util-linux, lkml

On 12/5/18 8:38 AM, Tracy Smith wrote:
> This is more directed toward York for layerscape. I see some edac code
> that seem to do periodic scrubs based on intervals or scrub rate, but
> that is not needed for the layerscape driver to correct errors because
> errors are scrubbed when found by the edac scrub block or is it
> because the memory controller itself does the correction/scrubbing
> when an error is found?

Single-bit errors are corrected by memory controller without involving
software.

York

^ permalink raw reply	[flat|nested] 10+ messages in thread
* Patrol scrub questions
  2018-12-05 17:59 ` York Sun
@ 2018-12-05 21:59   ` Tracy Smith
  2018-12-05 22:12     ` York Sun
  0 siblings, 1 reply; 10+ messages in thread
From: Tracy Smith @ 2018-12-05 21:59 UTC (permalink / raw)
  To: york.sun; +Cc: bp, linux-edac, util-linux, lkml

>Single-bit errors are corrected by memory controller without involving software.

Sorry for being verbose, but I need to explain the reason for the
questions below since I need to determine if a memory scrub is
required on layerscape and why. There are multiple layers to the
problem of ECC.

First layer, there is the immediate 'correction' of a flipped bit.

This does not 'fix' the source of the error but corrects the flipped
bit for use by the processor.

Most bit flips will be due to either a transitory noise problem on the
bus, which will not be associated with any given memory cell, OR it
will be due to a cosmic-ray induced bit flip in the memory cell which
will stay 'flipped' until the location has been written to again.

The safe action is to write the ECC corrected data back to the same
'error' location in memory. Does the layerscape memory controller
without software intervention do this?

Question 1) Does the layerscape memory controller automatically
perform a write of the corrected data back to the 'error' location to
make a correction? If not, is a patrol scrub required to do this?

Second layer, there is the risk of a double bit flip in memory.

Statistically this is very rare, but the odds significantly increase
that a double bit flip will occur in a single word when a single bit
flip goes uncorrected, giving more time for another cosmic ray induced
bit flip to occur in that word.

The layerscape memory controller can only detect a bit-flip when a
given location is read, correct? This is different from normal DRAM
refresh routines. If a location is not normally read, it can go
'unserviced' indefinitely, allowing multiple bit flips to accumulate.

By periodically (once a day should be more than sufficient overkill)
reading each location in the DRAM and writing that same (automatically
ECC corrected if correction was needed) value back into the DRAM, we
drastically reduce the potential for an uncorrectable multiple bit
error to accumulate in any given word in memory.

Question 2) Again this would require the EDAC layerscape driver to do
a patrol scrub, correct? If not, how is this handled by the memory
controller to avoid the need for a patrol scrub?

Third layer, there is how the memory controller handles UE errors.

My understanding is that the layerscape memory controller can detect
if it is a single bit (correctable) error or a multi-bit error that is
not correctable. Is this the case?

An uncorrectable error in the data or the software will have
consequences ranging from negligible to critical. From a hardware
standpoint it can't tell if it is critical so it must assume it is.

Question 3) Because the memory controller or layerscape platform must
assume a UE is critical, will a single UE on layerscape cause a WDT to
be triggered and a reset to occur?

Question 4) If so, will a panic ever be called if there is a hardware
uncorrectable memory failure?

^ permalink raw reply	[flat|nested] 10+ messages in thread
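
[Editorial sketch] The "second layer" argument above can be made slightly more concrete. The following is a rough model under stated assumptions, not a measured result: bit flips in one ECC word are assumed to arrive independently at a constant rate $\lambda$ (a hypothetical symbol, no value is given in this thread), and $T$ is the time between two visits to that word that write back corrected data.

```latex
% Number of flips in one word during an interval T: N ~ Poisson(\lambda T)
P(N \ge 2) = 1 - e^{-\lambda T}\,(1 + \lambda T)
           \approx \frac{(\lambda T)^2}{2}
           \qquad \text{for } \lambda T \ll 1 .

% Over a long observation time D there are D/T such intervals, so the
% expected number of double-flip events in that word is roughly
\frac{D}{T}\cdot\frac{(\lambda T)^2}{2} = \frac{\lambda^2 D\,T}{2} .
```

Under this model the exposure to an uncorrectable double flip grows linearly with the interval $T$, so halving the scrub interval roughly halves it, and a word that is never read or scrubbed has no bound on $T$ at all.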
* Re: Patrol scrub questions
  2018-12-05 21:59 ` Patrol scrub questions Tracy Smith
@ 2018-12-05 22:12   ` York Sun
  2018-12-05 22:53     ` Layerscape behavior when a UE is detected Tracy Smith
  0 siblings, 1 reply; 10+ messages in thread
From: York Sun @ 2018-12-05 22:12 UTC (permalink / raw)
  To: Tracy Smith; +Cc: bp, linux-edac, util-linux, lkml

On 12/5/18 2:00 PM, Tracy Smith wrote:
>> Single-bit errors are corrected by memory controller without involving software.
>
> Sorry for being verbose, but I need to explain the reason for the
> questions below since I need to determine if a memory scrub is
> required on layerscape and why. There are multiple layers to the
> problem of ECC.
>
> First layer, there is the immediate 'correction' of a flipped bit.
>
> This does not 'fix' the source of the error but corrects the flipped
> bit for use by the processor.
>
> Most bit flips will be due to either a transitory noise problem on the
> bus, which will not be associated with any given memory cell, OR it
> will be due to a cosmic-ray induced bit flip in the memory cell which
> will stay 'flipped' until the location has been written to again.
>
> The safe action is to write the ECC corrected data back to the same
> 'error' location in memory. Does the layerscape memory controller
> without software intervention do this?
>
> Question 1) Does the layerscape memory controller automatically
> perform a write of the corrected data back to the 'error' location to
> make a correction? If not, is a patrol scrub required to do this?
>

Tracy,

Layerscape SoCs have the feature to fix any detected single-bit errors.
It is not part of EDAC driver. The error is still counted so EDAC driver
can "see" this error. You can refer to SoC reference manual.

> Question 3) Because the memory controller or layerscape platform must
> assume a UE is critical, will a single UE on layerscape cause a WDT to
> be triggered and a reset to occur?

No.

>
> Question 4) If so, will a panic ever be called if there is a hardware
> uncorrectable memory failure?

No. It is up to upper layer of EDAC driver. Layerscape driver only
reports CEs and UEs.

York

^ permalink raw reply	[flat|nested] 10+ messages in thread
* Layerscape behavior when a UE is detected
  2018-12-05 22:12 ` York Sun
@ 2018-12-05 22:53   ` Tracy Smith
  2018-12-05 22:57     ` York Sun
  0 siblings, 1 reply; 10+ messages in thread
From: Tracy Smith @ 2018-12-05 22:53 UTC (permalink / raw)
  To: york.sun; +Cc: bp, linux-edac, util-linux, lkml

>> Question 4) If so, will a panic ever be called if there is a hardware
>> uncorrectable memory failure?

>No. It is up to upper layer of EDAC driver. Layerscape driver only reports CEs and UEs.

Just to be clear, the upper layer of the EDAC driver will or will not
panic when a UE is detected on layerscape?

If there is no panic by the upper layer and no reset triggered by the
layerscape CPLD or memory controller, what happens on layerscape when
a UE is detected by the memory controller?

Forcing a UE by grounding a dataline caused a reset on layerscape
after a few seconds, but no panic. It is unclear why it reset, but it
appears as though a WDT was tripped. The UE was reported by EDAC and
seen in the log.

^ permalink raw reply	[flat|nested] 10+ messages in thread
* Re: Layerscape behavior when a UE is detected
  2018-12-05 22:53 ` Layerscape behavior when a UE is detected Tracy Smith
@ 2018-12-05 22:57   ` York Sun
  2018-12-05 23:41     ` Layerscape UE detected and no EDAC panic Tracy Smith
  0 siblings, 1 reply; 10+ messages in thread
From: York Sun @ 2018-12-05 22:57 UTC (permalink / raw)
  To: Tracy Smith; +Cc: bp, linux-edac, util-linux, lkml

On 12/5/18 2:54 PM, Tracy Smith wrote:
>>> Question 4) If so, will a panic ever be called if there is a hardware
>>> uncorrectable memory failure?
>
>> No. It is up to upper layer of EDAC driver. Layerscape driver only reports CEs and UEs.
>
> Just to be clear, the upper layer of the EDAC driver will or will not
> panic when a UE is detected on layerscape?
>
> If there is no panic by the upper layer and no reset triggered by the
> layerscape CPLD or memory controller, what happens on layerscape when
> a UE is detected by the memory controller?
>
> Forcing a UE by grounding a dataline caused a reset on layerscape
> after a few seconds, but no panic. It is unclear why it reset, but it
> appears as though a WDT was tripped. The UE was reported by EDAC and
> seen in the log.
>

I can't help you on that. I never tried to force errors by grounding the
signals. You have read the driver. Do you see panic? The idea is to
report the error and let upper layer to decide what to do. Sometimes
limping forward is better than reset or panic. Again, it is not driver's
responsibility.

York

^ permalink raw reply	[flat|nested] 10+ messages in thread
* Layerscape UE detected and no EDAC panic
  2018-12-05 22:57 ` York Sun
@ 2018-12-05 23:41   ` Tracy Smith
  0 siblings, 0 replies; 10+ messages in thread
From: Tracy Smith @ 2018-12-05 23:41 UTC (permalink / raw)
  To: york.sun; +Cc: bp, linux-edac, util-linux, lkml

> I can't help you on that. I never tried to force errors by grounding the
> signals. You have read the driver. Do you see panic? The idea is to
> report the error and let upper layer to decide what to do. Sometimes
> limping forward is better than reset or panic. Again, it is not driver's
> responsibility.

Thanks for the clarification York. Yes, there is panic code in the EDAC
upper layer, but no panic occurred. A UE was printed on the serial
console, and the layerscape board reset.

The reason it did not panic is that edac_mc_panic_on_ue has to be set
at runtime. Just validated this will cause a panic when set. No memory
UE should reset the board, so the reset was caused by grounding the
data line, an issue with how I'm testing for a UE rather than with the
UE itself.

echo "1" > /sys/module/edac_core/parameters/edac_mc_panic_on_ue

This is the way to force a panic on a UE error.

MODULE_PARM_DESC(edac_mc_panic_on_ue, "Panic on uncorrected error: 0=off 1=on");

So, this is validated. Produced a UE and was able to avoid a panic, and
I was able to induce a panic on a UE. I'm satisfied with this.

thanks again!!

^ permalink raw reply	[flat|nested] 10+ messages in thread
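
[Editorial sketch] The runtime switch described in this last message can be wrapped so that the value is validated and read back after the write. This is a minimal sketch, assuming the parameter path quoted above exists on the target kernel and is writable by root. The function name `set_panic_on_ue` and its optional second argument are hypothetical, added so the helper can be exercised against an ordinary file.

```shell
#!/bin/sh
# Set the EDAC "panic on UE" switch and print the value read back.
# $1: 0 (report the UE only) or 1 (panic on uncorrected error)
# $2: parameter file (assumed default: the path quoted in the message above)
set_panic_on_ue() {
    value="$1"
    param="${2:-/sys/module/edac_core/parameters/edac_mc_panic_on_ue}"
    case "$value" in
        0|1) ;;
        *) echo "usage: set_panic_on_ue 0|1 [param-file]" >&2; return 2 ;;
    esac
    # The write fails if the file is missing or the caller lacks permission.
    echo "$value" > "$param" || return 1
    # Read the value back so the caller sees what is actually in effect.
    cat "$param"
}
```

With the switch at 0 a reported UE is logged without a panic from the EDAC core, and with it at 1 the same UE triggers the panic, which matches the two cases validated above.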