From: James Morse <james.morse@arm.com>
To: "Baicar, Tyler" <tbaicar@codeaurora.org>
Cc: Marc Zyngier <marc.zyngier@arm.com>,
    Catalin Marinas <catalin.marinas@arm.com>,
    Will Deacon <will.deacon@arm.com>,
    kvmarm@lists.cs.columbia.edu,
    Wang Xiongfeng <wangxiongfeng2@huawei.com>,
    linux-arm-kernel@lists.infradead.org
Subject: Re: [PATCH v2 10/16] arm64: kernel: Survive corrected RAS errors notified by SError
Date: Thu, 14 Sep 2017 13:58:54 +0100
Message-ID: <59BA7D0E.6020500@arm.com>
In-Reply-To: <eba92679-bbb7-6cb7-843c-7cfdbc793b6b@codeaurora.org>

Hi Tyler,

On 13/09/17 21:52, Baicar, Tyler wrote:
> On 7/28/2017 8:10 AM, James Morse wrote:
>> On v8.0, SError is an uncontainable fatal exception. The v8.2 RAS
>> extensions use SError to notify software about RAS errors, these can be
>> contained by the ESB instruction.
>>
>> An ACPI system with firmware-first may use SError as its 'SEI'
>> notification. Future patches may add code to 'claim' this SError as
>> notification.
>>
>> Other systems can distinguish these RAS errors from the SError ESR and
>> use the AET bits and additional data from RAS-Error registers to handle
>> the error. Future patches may add this kernel-first handling.
>>
>> In the meantime, on both kinds of system we can safely ignore corrected
>> errors.

> Here you just have corrected and restartable errors being ignored and all other
> errors panic. For corrected and restartable errors, we should at least be
> logging that an error happened and provide the syndrome info (address, context,
> etc.).

Yes, that would be great, but it's all wrapped up in 'kernel first handling'
for RAS... which we don't yet have.

This series is 'fixing' the kernel's SError mask behaviour so that the SEI
firmware-first mechanism can (almost) always deliver its notifications, and
has somewhere to hook the APEI code into (like you did for do_sea()).

Of course not all systems will have this firmware, so if we took a v8.2 RAS
SError on bare-metal we need to do something. This selective-ignoring is an
interim fudge to avoid bringing the machine down for something that isn't
(yet?) a problem.

From the commit message:
> Future patches may add this kernel-first handling.
> In the meantime, on both kinds of system we can safely ignore corrected
> errors.


> We also should be triggering a trace event to notify the user space that
> an error happened so that tools like RAS Daemon can report the error. This will
> involve a new trace event since the current ones are based of the CPER
> structures from the firmware-first case.

Hmm, so RAS Daemon is going to end up knowing whether an error was handled
kernel-first or firmware-first. That is unfortunate for RAS Daemon (more
code) and means we have duplicate trace points.


> Recoverable UEs should not need to trigger the panic, we should be able to do
> the recovery similar to the memory fault handling in mm/memory-failure.c code.
> The recoverable UEs should also trigger a trace event to user space since they
> won't cause a panic as well.

I agree, but only once we have code to dig in v8.2's RAS ERR registers to
pick out the class of error and affected component or address. Until then we
can't know the component or address, so can't handle the error.

This is still an improvement over a non-v8.2-RAS aware kernel, as that would
panic() for corrected errors too (depending on when they arrived ... the
SError masking is somewhat broken).


Thanks,

James
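
To make the 'selective-ignoring' discussed above concrete, here is a rough,
stand-alone sketch of that kind of decision. It is not the code from the
patch. The helper name serror_is_fatal() and the macro names are made up for
illustration; the field positions (IDS at bit 24, AET at bits [12:10], DFSC
0x11 for an asynchronous SError) follow the architecture's ESR_ELx layout
for SError. Treating an implementation-defined or unrecognised syndrome as
fatal is an assumption of this sketch.

```c
#include <stdbool.h>
#include <stdint.h>

/* ESR_ELx fields for an SError on a CPU with the v8.2 RAS extensions. */
#define ESR_IDS          (UINT32_C(1) << 24) /* implementation-defined syndrome */
#define ESR_DFSC_MASK    UINT32_C(0x3f)
#define ESR_DFSC_SERROR  UINT32_C(0x11)      /* asynchronous SError interrupt */
#define ESR_AET_SHIFT    10
#define ESR_AET_MASK     (UINT32_C(7) << ESR_AET_SHIFT)

/* Asynchronous Error Type encodings. */
enum serror_aet {
	AET_UC  = 0, /* uncontainable */
	AET_UEU = 1, /* unrecoverable */
	AET_UEO = 2, /* restartable */
	AET_UER = 3, /* recoverable */
	AET_CE  = 6, /* corrected */
};

/*
 * Hypothetical helper: returns true if the SError should bring the
 * machine down, false if it can be ignored for now.
 */
static bool serror_is_fatal(uint32_t esr)
{
	/* Assumption: if we cannot decode the syndrome, assume the worst. */
	if (esr & ESR_IDS)
		return true;
	if ((esr & ESR_DFSC_MASK) != ESR_DFSC_SERROR)
		return true;

	switch ((esr & ESR_AET_MASK) >> ESR_AET_SHIFT) {
	case AET_CE:  /* corrected: nothing was lost */
	case AET_UEO: /* restartable: state is intact, we can carry on */
		return false;
	default:
		/*
		 * Includes recoverable errors: without reading the RAS ERR
		 * registers we do not know the component or address.
		 */
		return true;
	}
}
```

With kernel-first handling, the default case is where a recoverable error
would instead be passed on for logging and recovery.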