From: James Morse <james.morse@arm.com>
To: "Hawa, Hanna" <hhhawa@amazon.com>
Cc: robh+dt@kernel.org, mark.rutland@arm.com, bp@alien8.de, mchehab@kernel.org,
    davem@davemloft.net, gregkh@linuxfoundation.org, nicolas.ferre@microchip.com,
    paulmck@linux.ibm.com, dwmw@amazon.co.uk, benh@amazon.com, ronenk@amazon.com,
    talel@amazon.com, jonnyc@amazon.com, hanochu@amazon.com,
    linux-edac@vger.kernel.org, devicetree@vger.kernel.org,
    linux-kernel@vger.kernel.org
Subject: Re: [PATCH 2/2] edac: add support for Amazon's Annapurna Labs EDAC
Date: Wed, 19 Jun 2019 18:22:37 +0100
Message-ID: <44da6863-eb79-a61b-a4bf-9e8c6cacc2b8@arm.com>
In-Reply-To: <bbb9b41d-8ffa-d4c5-c199-2400695cce8d@amazon.com>

Hi Hawa,

On 17/06/2019 14:00, Hawa, Hanna wrote:
>> I don't think it can, on a second reading, it looks to be even more complicated than I
>> thought! That bit is described as disabling forwarding of uncorrected data, but it looks
>> like the uncorrected data never actually reaches the other end. (I'm unsure what 'flush'
>> means in this context.)
>> I was looking for reasons you could 'know' that any reported error was corrected. This was
>> just a bad suggestion!

> Is there interrupt for un-correctable error?

The answer here is somewhere between 'not really' and 'maybe'.
There is a signal you may have wired up as an interrupt, but it's not usable from Linux.

A.8.2 "Asynchronous error signals" of the A57 TRM [0] has:
| nINTERRIRQ output Error indicator for an L2 RAM double-bit ECC error.

("7.6 Asynchronous errors" has more on this).

Errors cause L2ECTLR[30] to get set, and this value is output as a signal; you may have
wired it up as an interrupt.

If you did, beware it's level-sensitive, and can only be cleared by writing to L2ECTLR_EL1.
You shouldn't allow Linux to access this register as it could mess with the L2
configuration, which could also affect your EL3 and any secure-world software.

The arrival of this interrupt doesn't tell you which L2 tripped the error, and you can
only clear it if you write to L2ECTLR_EL1 on a CPU attached to the right L2. So this isn't
actually a shared (peripheral) interrupt.

This stuff is expected to be used by firmware, which can know the affinity constraints of
signals coming in as interrupts.

> Does 'asynchronous errors' in L2 used to report UE?

From "7.2.4 Error correction code", single-bit errors are always corrected.
A.8.2 quoted above gives the behaviour for double-bit errors.

> In case no interrupt, can we use die-notifier subsystem to check if any error had occur
> while system shutdown?

notify_die() would imply a synchronous exception that killed a thread. SErrors are a whole
lot worse. Before v8.2 these are all treated as 'uncontained': unknown memory corruption.
Which in your L2 case is exactly what happened. The arch code will panic().

If your driver can print something useful to help debug the panic(), then a
panic_notifier sounds appropriate. But you can't rely on these notifiers being called, as
kdump has some hooks that affect if/when they run.

(KVM will 'contain' SErrors that come from a guest to the guest, as we know a distinct set
of memory was in use. You may see fatal error counters increasing without the system
panic()ing.)

contained/uncontained is part of the terminology from the v8.2 RAS spec [1].

Thanks,

James

[0] http://infocenter.arm.com/help/topic/com.arm.doc.ddi0488c/DDI0488C_cortex_a57_mpcore_r1p0_trm.pdf
[1] https://static.docs.arm.com/ddi0587/ca/ARM_DDI_0587C_a_RAS.pdf?_ga=2.148234679.1686960568.1560964184-897392434.1556719556
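
As an illustration of the L2ECTLR[30] handling described in the message above, here is a minimal sketch of the bit arithmetic that firmware (not Linux) might apply to a value it has already read from L2ECTLR_EL1. The macro and function names are hypothetical, and the sketch assumes the error bit is cleared by writing the register back with bit 30 as zero while leaving the other fields untouched; check the TRM before relying on that. The register access itself, and running on a CPU attached to the right L2, are deliberately left out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit position taken from the message: L2ECTLR[30]. The macro name is hypothetical. */
#define L2ECTLR_ASYNC_ERR_BIT (UINT64_C(1) << 30)

/* Returns true if a sampled L2ECTLR value has the error indicator set. */
static inline bool l2ectlr_error_pending(uint64_t l2ectlr)
{
	return (l2ectlr & L2ECTLR_ASYNC_ERR_BIT) != 0;
}

/*
 * Returns the value to write back in order to drop the level-sensitive
 * signal, preserving every other field of the register.
 * Assumption: the bit is cleared by writing it as zero.
 */
static inline uint64_t l2ectlr_clear_error(uint64_t l2ectlr)
{
	return l2ectlr & ~L2ECTLR_ASYNC_ERR_BIT;
}
```

Preserving the other bits matters here because, as the message notes, the same register carries L2 configuration that EL3 and secure-world software depend on.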