During the safety analysis done in the context of the ELISA project by
the safety architecture working group, some incorrect behaviors were
spotted. This patchset proposes some fixes.

Signed-off-by: Gabriele Paoloni <gabriele.paoloni@intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>

Gabriele Paoloni (4):
  x86/mce: do not overwrite no_way_out if mce_end() fails
  x86/mce: move the mce_panic() call and kill_it assignments at the
    right places
  x86/mce: for LMCE panic only if mca_cfg.tolerant < 3
  x86/mce: remove redundant call to irq_work_queue()

 arch/x86/kernel/cpu/mce/core.c | 28 +++++++++++-----------------
 1 file changed, 11 insertions(+), 17 deletions(-)

-- 
2.20.1

---------------------------------------------------------------------
INTEL CORPORATION ITALIA S.p.A. con unico socio
Sede: Milanofiori Palazzo E 4
CAP 20094 Assago (MI)
Capitale Sociale Euro 104.000,00 interamente versato
Partita I.V.A. e Codice Fiscale 04236760155
Repertorio Economico Amministrativo n. 997124
Registro delle Imprese di Milano nr. 183983/5281/33
Soggetta ad attivita' di direzione e coordinamento di
INTEL CORPORATION, USA

This e-mail and any attachments may contain confidential material for
the sole use of the intended recipient(s). Any review or distribution
by others is strictly prohibited. If you are not the intended
recipient, please contact the sender and delete all copies.

-=-=-=-=-=-=-=-=-=-=-=-
Links: You receive all messages sent to this group.
View/Reply Online (#175): https://lists.elisa.tech/g/linux-safety/message/175
Mute This Topic: https://lists.elisa.tech/mt/78342499/5278000
Group Owner: linux-safety+owner@lists.elisa.tech
Unsubscribe: https://lists.elisa.tech/g/linux-safety/unsub [linux-safety@archiver.kernel.org]
-=-=-=-=-=-=-=-=-=-=-=-

Currently if mce_end() fails no_way_out is set equal to worst.
worst is the worst severirty that was found in the MCA banks
associated to the current CPU; however at this point no_way_out
could be already set by mca_start() by looking at all severities
of all CPUs that entered the MCE handler.
if mce_end() fails we first check if no_way_out is already set and
if so we stick to it, otherwise we use the local worst value

Signed-off-by: Gabriele Paoloni <gabriele.paoloni@intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
---
 arch/x86/kernel/cpu/mce/core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 4102b866e7c0..b990892c6766 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -1385,7 +1385,7 @@ noinstr void do_machine_check(struct pt_regs *regs)
 	 */
 	if (!lmce) {
 		if (mce_end(order) < 0)
-			no_way_out = worst >= MCE_PANIC_SEVERITY;
+			no_way_out = no_way_out ? no_way_out : worst >= MCE_PANIC_SEVERITY;
 	} else {
 		/*
 		 * If there was a fatal machine check we should have
-- 
2.20.1

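The effect of the one-line change above can be looked at in isolation. What follows is a standalone sketch, not kernel code: PANIC_SEVERITY_SKETCH and both function names are hypothetical, and the threshold value is an arbitrary assumption standing in for MCE_PANIC_SEVERITY.

```c
#include <assert.h>

/* Hypothetical stand-in for MCE_PANIC_SEVERITY; the value is arbitrary. */
#define PANIC_SEVERITY_SKETCH 6

/*
 * Before the patch: when mce_end() fails, whatever was already stored in
 * no_way_out (possibly gathered from other CPUs) is dropped and only the
 * local worst severity decides.
 */
int nwo_before_patch(int no_way_out, int worst)
{
	assert(worst >= 0);	/* sketch assumption: severities are non-negative */
	(void)no_way_out;	/* ignored here, which is the reported problem */
	return worst >= PANIC_SEVERITY_SKETCH;
}

/* After the patch: an already-set no_way_out is kept as it is. */
int nwo_after_patch(int no_way_out, int worst)
{
	assert(worst >= 0);
	return no_way_out ? no_way_out : worst >= PANIC_SEVERITY_SKETCH;
}
```

With no_way_out already set and a harmless local worst of 0, nwo_before_patch() clears the flag while nwo_after_patch() keeps it. That is the overwrite the commit message describes.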
Right now for local MCEs we panic(),if needed, right after lmce is
set. For global MCEs mce_reign() takes care of calling mce_panic().
Hence this patch:
- improves readibility by moving the conditional evaluation of
  tolerant up to when kill_it is set first
- moves the mce_panic() call up into the statement where mce_end()
  fails

Signed-off-by: Gabriele Paoloni <gabriele.paoloni@intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
---
 arch/x86/kernel/cpu/mce/core.c | 21 +++++++++------------
 1 file changed, 9 insertions(+), 12 deletions(-)

diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index b990892c6766..e025ff04438f 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -1350,8 +1350,7 @@ noinstr void do_machine_check(struct pt_regs *regs)
 	 * severity is MCE_AR_SEVERITY we have other options.
 	 */
 	if (!(m.mcgstatus & MCG_STATUS_RIPV))
-		kill_it = 1;
-
+		kill_it = (cfg->tolerant == 3) ? 0 : 1;
 	/*
 	 * Check if this MCE is signaled to only this logical processor,
 	 * on Intel, Zhaoxin only.
@@ -1384,8 +1383,15 @@ noinstr void do_machine_check(struct pt_regs *regs)
 	 * When there's any problem use only local no_way_out state.
 	 */
 	if (!lmce) {
-		if (mce_end(order) < 0)
+		if (mce_end(order) < 0) {
 			no_way_out = no_way_out ? no_way_out : worst >= MCE_PANIC_SEVERITY;
+			/*
+			 * mce_reign() has probably failed hence evaluate if we need
+			 * to panic
+			 */
+			if (no_way_out && mca_cfg.tolerant < 3)
+				mce_panic("Fatal machine check on current CPU", &m, msg);
+		}
 	} else {
 		/*
 		 * If there was a fatal machine check we should have
@@ -1401,15 +1407,6 @@ noinstr void do_machine_check(struct pt_regs *regs)
 		}
 	}
 
-	/*
-	 * If tolerant is at an insane level we drop requests to kill
-	 * processes and continue even when there is no way out.
-	 */
-	if (cfg->tolerant == 3)
-		kill_it = 0;
-	else if (no_way_out)
-		mce_panic("Fatal machine check on current CPU", &m, msg);
-
 	if (worst > 0)
 		irq_work_queue(&mce_irq_work);
 
-- 
2.20.1

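The first hunk folds the tolerant check into the place where kill_it is first assigned. A minimal sketch of that assignment follows. Assumptions: kill_it_sketch() is a hypothetical name, the RIPV bit is passed in as a plain bool instead of being read from m.mcgstatus, kill_it starts at 0, and tolerant lies in the range 0..3.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * ripv:     true if MCG_STATUS_RIPV was set for this machine check
 * tolerant: configured tolerance level; 3 is the level at which requests
 *           to kill processes are dropped (0..3 range assumed here)
 */
int kill_it_sketch(bool ripv, int tolerant)
{
	int kill_it = 0;	/* assumed initial value */

	assert(tolerant >= 0 && tolerant <= 3);

	/* Same expression as the '+' line in the first hunk. */
	if (!ripv)
		kill_it = (tolerant == 3) ? 0 : 1;

	return kill_it;
}
```

In this model kill_it is 1 only when RIPV is clear and tolerant is below 3. That matches the result of the removed block, which forced kill_it back to 0 when tolerant == 3.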
Right now for LMCE, if no_way_out is set, mce_panic() is called
regardless of mca_cfg.tolerant.
This is not correct: if mca_cfg.tolerant = 3 we should never panic.

Signed-off-by: Gabriele Paoloni <gabriele.paoloni@intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
---
 arch/x86/kernel/cpu/mce/core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index e025ff04438f..d16cbb05b09c 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -1367,7 +1367,7 @@ noinstr void do_machine_check(struct pt_regs *regs)
 	 * to see it will clear it.
 	 */
 	if (lmce) {
-		if (no_way_out)
+		if (no_way_out && mca_cfg.tolerant < 3)
 			mce_panic("Fatal local machine check", &m, msg);
 	} else {
 		order = mce_start(&no_way_out);
-- 
2.20.1

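The LMCE panic condition before and after this patch can be compared with a small standalone sketch. Both function names are hypothetical, and the 0..3 range for tolerant is an assumption of the sketch.

```c
#include <assert.h>
#include <stdbool.h>

/* Before the patch: tolerant is not consulted on the LMCE path. */
bool lmce_panics_before(bool no_way_out, int tolerant)
{
	assert(tolerant >= 0 && tolerant <= 3);
	return no_way_out;
}

/* After the patch: tolerant == 3 suppresses the panic. */
bool lmce_panics_after(bool no_way_out, int tolerant)
{
	assert(tolerant >= 0 && tolerant <= 3);
	return no_way_out && tolerant < 3;
}
```

The two functions differ only for no_way_out set with tolerant == 3, which is the case the commit message calls out.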
Right now in do_machine_check() we have:

  __mc_scan_banks()->mce_log()->irq_work_queue(&mce_irq_work)

hence the call of irq_work_queue() below after __mc_scan_banks()
seems redundant. Just remove it.

Signed-off-by: Gabriele Paoloni <gabriele.paoloni@intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
---
 arch/x86/kernel/cpu/mce/core.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index d16cbb05b09c..f2f7bfc60c67 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -1407,9 +1407,6 @@ noinstr void do_machine_check(struct pt_regs *regs)
 		}
 	}
 
-	if (worst > 0)
-		irq_work_queue(&mce_irq_work);
-
 	if (worst != MCE_AR_SEVERITY && !kill_it)
 		goto out;
 
-- 
2.20.1

On Wed, Nov 18, 2020 at 03:15:49PM +0000, Gabriele Paoloni wrote:
> Currently if mce_end() fails no_way_out is set equal to worst.
> worst is the worst severirty that was found in the MCA banks
                     ^^^^^^^^^

Please introduce a spellchecker into your patch creation workflow.

> associated to the current CPU; however at this point no_way_out
             ^
             with

> could be already set by mca_start() by looking at all severities

I think you mean "could have been already set" here

> of all CPUs that entered the MCE handler.
> if mce_end() fails we first check if no_way_out is already set and

Please use passive voice in your commit message: no "we" or "I", etc.

Also, pls start new sentences with a capital letter and end them with a
fullstop.

> if so we stick to it, otherwise we use the local worst value

So basically you're trying to say here that no_way_out might have been
already set and other CPUs could overwrite it and that should not
happen.

Is that what you mean?

> Signed-off-by: Gabriele Paoloni <gabriele.paoloni@intel.com>
> Reviewed-by: Tony Luck <tony.luck@intel.com>
> ---
>  arch/x86/kernel/cpu/mce/core.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
> index 4102b866e7c0..b990892c6766 100644
> --- a/arch/x86/kernel/cpu/mce/core.c
> +++ b/arch/x86/kernel/cpu/mce/core.c
> @@ -1385,7 +1385,7 @@ noinstr void do_machine_check(struct pt_regs *regs)
>  	 */
>  	if (!lmce) {
>  		if (mce_end(order) < 0)
> -			no_way_out = worst >= MCE_PANIC_SEVERITY;
> +			no_way_out = no_way_out ? no_way_out : worst >= MCE_PANIC_SEVERITY;

I had to stare at this a bit to figure out what you're doing. So how
about simplifying this:

	if (!no_way_out)
		no_way_out = worst >= MCE_PANIC_SEVERITY;

?

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

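The suggested if-form and the ternary from the patch can be checked against each other over a small input range. This is a standalone sketch: PANIC_SEVERITY_SKETCH is an arbitrary, hypothetical stand-in for MCE_PANIC_SEVERITY, all function names are made up, and only the inner assignment is modelled, not the surrounding mce_end() check.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for MCE_PANIC_SEVERITY; the value is arbitrary. */
#define PANIC_SEVERITY_SKETCH 6

/* Form used in the posted patch. */
int nwo_ternary_form(int no_way_out, int worst)
{
	return no_way_out ? no_way_out : worst >= PANIC_SEVERITY_SKETCH;
}

/* Form suggested in the review. */
int nwo_if_form(int no_way_out, int worst)
{
	if (!no_way_out)
		no_way_out = worst >= PANIC_SEVERITY_SKETCH;
	return no_way_out;
}

/* Compare both forms for no_way_out in {0, 1} and a range of severities. */
bool nwo_forms_agree(void)
{
	for (int nwo = 0; nwo <= 1; nwo++)
		for (int worst = 0; worst <= 2 * PANIC_SEVERITY_SKETCH; worst++)
			if (nwo_ternary_form(nwo, worst) != nwo_if_form(nwo, worst))
				return false;
	return true;
}
```

Both forms leave a non-zero no_way_out untouched and otherwise fall back to the local worst severity, so the choice between them is about readability only.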
Hi Boris

> -----Original Message-----
> From: Borislav Petkov <bp@alien8.de>
> Sent: Friday, November 20, 2020 6:08 PM
> To: Paoloni, Gabriele <gabriele.paoloni@intel.com>
> Cc: Luck, Tony <tony.luck@intel.com>; tglx@linutronix.de;
> mingo@redhat.com; x86@kernel.org; hpa@zytor.com;
> linux-edac@vger.kernel.org; linux-kernel@vger.kernel.org;
> linux-safety@lists.elisa.tech
> Subject: Re: [PATCH 1/4] x86/mce: do not overwrite no_way_out if
> mce_end() fails
>
> On Wed, Nov 18, 2020 at 03:15:49PM +0000, Gabriele Paoloni wrote:
> > Currently if mce_end() fails no_way_out is set equal to worst.
> > worst is the worst severirty that was found in the MCA banks
>                        ^^^^^^^^^
>
> Please introduce a spellchecker into your patch creation workflow.
>
> > associated to the current CPU; however at this point no_way_out
>                ^
>                with
>
> > could be already set by mca_start() by looking at all severities
>
> I think you mean "could have been already set" here
>
> > of all CPUs that entered the MCE handler.
> > if mce_end() fails we first check if no_way_out is already set and
>
> Please use passive voice in your commit message: no "we" or "I", etc.
>
> Also, pls start new sentences with a capital letter and end them with a
> fullstop.

Sorry about the grammar errors above; I'll pay more attention in future.

> > if so we stick to it, otherwise we use the local worst value
>
> So basically you're trying to say here that no_way_out might have been
> already set and other CPUs could overwrite it and that should not
> happen.
>
> Is that what you mean?

I mean that on this CPU thread at this point mce_start() already cached
global_nwo and hence could accumulate fatal severities of other CPUs.

Now here if mce_end() fails we only consider the local 'worst' severity
and we overwrite those already cached.

> > Signed-off-by: Gabriele Paoloni <gabriele.paoloni@intel.com>
> > Reviewed-by: Tony Luck <tony.luck@intel.com>
> > ---
> >  arch/x86/kernel/cpu/mce/core.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
> > index 4102b866e7c0..b990892c6766 100644
> > --- a/arch/x86/kernel/cpu/mce/core.c
> > +++ b/arch/x86/kernel/cpu/mce/core.c
> > @@ -1385,7 +1385,7 @@ noinstr void do_machine_check(struct pt_regs *regs)
> >  	 */
> >  	if (!lmce) {
> >  		if (mce_end(order) < 0)
> > -			no_way_out = worst >= MCE_PANIC_SEVERITY;
> > +			no_way_out = no_way_out ? no_way_out : worst >= MCE_PANIC_SEVERITY;
>
> I had to stare at this a bit to figure out what you're doing. So how
> about simplifying this:
>
> 	if (!no_way_out)
> 		no_way_out = worst >= MCE_PANIC_SEVERITY;
>
> ?

Yes, that works as well and improves readability.
If ok I will fix the grammar and rewrite this code in v2.

Many Thanks
Gab

On Wed, Nov 18, 2020 at 03:15:49PM +0000, Gabriele Paoloni wrote:
> Currently if mce_end() fails no_way_out is set equal to worst.
> worst is the worst severirty that was found in the MCA banks
> associated to the current CPU; however at this point no_way_out
> could be already set by mca_start() by looking at all severities
> of all CPUs that entered the MCE handler.
> if mce_end() fails we first check if no_way_out is already set and
> if so we stick to it, otherwise we use the local worst value
>
> Signed-off-by: Gabriele Paoloni <gabriele.paoloni@intel.com>
> Reviewed-by: Tony Luck <tony.luck@intel.com>

Also, this very likely wants Cc: stable, I'd say, considering the
severity.

Thx.

-- 
Regards/Gruss,
    Boris.

On Fri, Nov 20, 2020 at 05:31:32PM +0000, Paoloni, Gabriele wrote:
> I mean that on this CPU thread at this point mce_start() already cached
> global_nwo and hence could accumulate fatal severities of other CPUs.
>
> Now here if mce_end() fails we only consider the local 'worst' severity
> and we overwrite those already cached.

Yap, we're on the same page. :)

> If ok I will fix the grammar and rewrite this code in v2.

Sure, lemme go through the rest first.

Thx.

-- 
Regards/Gruss,
    Boris.

[...]

> Also, this very likely wants Cc: stable, I'd say, considering the
> severity.

Sure, will add stable in v2.

Thanks
Gab

On Wed, Nov 18, 2020 at 03:15:50PM +0000, Gabriele Paoloni wrote:
> Right now for local MCEs we panic(),if needed, right after lmce is
> set. For global MCEs mce_reign() takes care of calling mce_panic().
> Hence this patch:
> - improves readibility by moving the conditional evaluation of
>   tolerant up to when kill_it is set first
> - moves the mce_panic() call up into the statement where mce_end()
>   fails

Pls avoid using "this patch does this and that" in the commit message
but say directly what it does:

- Improve readability ...

- Move mce_panic()...

and so on.

> Signed-off-by: Gabriele Paoloni <gabriele.paoloni@intel.com>
> Reviewed-by: Tony Luck <tony.luck@intel.com>
> ---
>  arch/x86/kernel/cpu/mce/core.c | 21 +++++++++------------
>  1 file changed, 9 insertions(+), 12 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
> index b990892c6766..e025ff04438f 100644
> --- a/arch/x86/kernel/cpu/mce/core.c
> +++ b/arch/x86/kernel/cpu/mce/core.c
> @@ -1350,8 +1350,7 @@ noinstr void do_machine_check(struct pt_regs *regs)
>  	 * severity is MCE_AR_SEVERITY we have other options.
>  	 */
>  	if (!(m.mcgstatus & MCG_STATUS_RIPV))
> -		kill_it = 1;
> -
> +		kill_it = (cfg->tolerant == 3) ? 0 : 1;

So you just set kill_it using cfg->tolerant...

>  	/*
>  	 * Check if this MCE is signaled to only this logical processor,
>  	 * on Intel, Zhaoxin only.
> @@ -1384,8 +1383,15 @@ noinstr void do_machine_check(struct pt_regs *regs)
>  	 * When there's any problem use only local no_way_out state.
>  	 */
>  	if (!lmce) {
> -		if (mce_end(order) < 0)
> +		if (mce_end(order) < 0) {
>  			no_way_out = no_way_out ? no_way_out : worst >= MCE_PANIC_SEVERITY;
> +			/*
> +			 * mce_reign() has probably failed hence evaluate if we need
> +			 * to panic
> +			 */
> +			if (no_way_out && mca_cfg.tolerant < 3)

... but here you're testing cfg->tolerant again.

why not

	if (no_way_out && kill_it)

?

Thx.

-- 
Regards/Gruss,
    Boris.

On Fri, Nov 20, 2020 at 06:33:42PM +0100, Borislav Petkov wrote:
> Sure, lemme go through the rest first.

Done, thx.

-- 
Regards/Gruss,
    Boris.

Hi Boris

> -----Original Message-----
> From: Borislav Petkov <bp@alien8.de>
> Sent: Monday, November 23, 2020 3:28 PM
> To: Paoloni, Gabriele <gabriele.paoloni@intel.com>
> Cc: Luck, Tony <tony.luck@intel.com>; tglx@linutronix.de;
> mingo@redhat.com; x86@kernel.org; hpa@zytor.com;
> linux-edac@vger.kernel.org; linux-kernel@vger.kernel.org;
> linux-safety@lists.elisa.tech
> Subject: Re: [PATCH 2/4] x86/mce: move the mce_panic() call and kill_it
> assignments at the right places
>
> On Wed, Nov 18, 2020 at 03:15:50PM +0000, Gabriele Paoloni wrote:
> > Right now for local MCEs we panic(),if needed, right after lmce is
> > set. For global MCEs mce_reign() takes care of calling mce_panic().
> > Hence this patch:
> > - improves readibility by moving the conditional evaluation of
> >   tolerant up to when kill_it is set first
> > - moves the mce_panic() call up into the statement where mce_end()
> >   fails
>
> Pls avoid using "this patch does this and that" in the commit message
> but say directly what it does:
>
> - Improve readability ...
>
> - Move mce_panic()...
>
> and so on.

Thanks, I'll fix it in v2

> > Signed-off-by: Gabriele Paoloni <gabriele.paoloni@intel.com>
> > Reviewed-by: Tony Luck <tony.luck@intel.com>
> > ---
> >  arch/x86/kernel/cpu/mce/core.c | 21 +++++++++------------
> >  1 file changed, 9 insertions(+), 12 deletions(-)
> >
> > diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
> > index b990892c6766..e025ff04438f 100644
> > --- a/arch/x86/kernel/cpu/mce/core.c
> > +++ b/arch/x86/kernel/cpu/mce/core.c
> > @@ -1350,8 +1350,7 @@ noinstr void do_machine_check(struct pt_regs *regs)
> >  	 * severity is MCE_AR_SEVERITY we have other options.
> >  	 */
> >  	if (!(m.mcgstatus & MCG_STATUS_RIPV))
> > -		kill_it = 1;
> > -
> > +		kill_it = (cfg->tolerant == 3) ? 0 : 1;
>
> So you just set kill_it using cfg->tolerant...

Well, I first see if RIPV is not set; then I check the tolerance level to
see if we need to kill the user space app...

> >  	/*
> >  	 * Check if this MCE is signaled to only this logical processor,
> >  	 * on Intel, Zhaoxin only.
> > @@ -1384,8 +1383,15 @@ noinstr void do_machine_check(struct pt_regs *regs)
> >  	 * When there's any problem use only local no_way_out state.
> >  	 */
> >  	if (!lmce) {
> > -		if (mce_end(order) < 0)
> > +		if (mce_end(order) < 0) {
> >  			no_way_out = no_way_out ? no_way_out : worst >= MCE_PANIC_SEVERITY;
> > +			/*
> > +			 * mce_reign() has probably failed hence evaluate if we need
> > +			 * to panic
> > +			 */
> > +			if (no_way_out && mca_cfg.tolerant < 3)
>
> ... but here you're testing cfg->tolerant again.

Yes because the tolerant flag tells me if I need to take action...

> why not
>
> 	if (no_way_out && kill_it)
>
> ?

From my understanding no_way_out and kill_it are different in principles:
no_way_out is telling that an error occurred 'somewhere' in some CPU bank
that requires the system to panic (e.g. PCC=1); kill_it is saying that the execution
cannot be restarted where it left for the local CPU and hence we need to find
an alternative solution as part of the recovery action. In practice it seems to
me that kill_it is used to replace kill_me_maybe with kill_me_now in case
the exception happened in user mode.

So if I were using the statement "if (no_way_out && kill_it)" I would fail
to panic, for example, in cases where no_way_out captured a fatal error
somewhere in other CPUs but RIPV is set for the local CPU...

Thanks
Gab

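The case described in the reply above (a fatal error recorded from another CPU while RIPV is set locally) can be written down as a small sketch. Both function names are hypothetical. The only inputs modelled are no_way_out, kill_it and tolerant, and the 0..3 range for tolerant is assumed.

```c
#include <assert.h>
#include <stdbool.h>

/* Condition proposed in the review question. */
bool panic_if_kill_it(bool no_way_out, bool kill_it)
{
	return no_way_out && kill_it;
}

/* Condition used in the posted patch. */
bool panic_if_tolerant(bool no_way_out, int tolerant)
{
	assert(tolerant >= 0 && tolerant <= 3);
	return no_way_out && tolerant < 3;
}
```

Take no_way_out set because of another CPU's fatal error, RIPV set locally (so kill_it stays 0), and tolerant at 1. Then panic_if_kill_it(true, false) is false and panic_if_tolerant(true, 1) is true, so the kill_it-based check would skip a panic that the tolerant-based check performs.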
On Mon, Nov 23, 2020 at 05:06:31PM +0000, Paoloni, Gabriele wrote:
> From my understanding no_way_out and kill_it are different in principles:
> no_way_out is telling that an error occurred 'somewhere' in some CPU bank
> that requires the system to panic (e.g. PCC=1); kill_it is saying that the execution
> cannot be restarted where it left for the local CPU and hence we need to find
> an alternative solution as part of the recovery action. In practice it seems to
> me that kill_it is used to replace kill_me_maybe with kill_me_now in case
> the exception happened in user mode.

Bah, I got confused, sorry about that - you're right.

Btw, that kill_it should probably be called "kill_current_task" or so to
make it more clear.

Thx.

-- 
Regards/Gruss,
    Boris.

> -----Original Message-----
> From: Borislav Petkov <bp@alien8.de>
> Sent: Monday, November 23, 2020 6:19 PM
> To: Paoloni, Gabriele <gabriele.paoloni@intel.com>
> Cc: Luck, Tony <tony.luck@intel.com>; tglx@linutronix.de;
> mingo@redhat.com; x86@kernel.org; hpa@zytor.com;
> linux-edac@vger.kernel.org; linux-kernel@vger.kernel.org;
> linux-safety@lists.elisa.tech
> Subject: Re: [PATCH 2/4] x86/mce: move the mce_panic() call and kill_it
> assignments at the right places
>
> On Mon, Nov 23, 2020 at 05:06:31PM +0000, Paoloni, Gabriele wrote:
> > From my understanding no_way_out and kill_it are different in principles:
> > no_way_out is telling that an error occurred 'somewhere' in some CPU bank
> > that requires the system to panic (e.g. PCC=1); kill_it is saying that the execution
> > cannot be restarted where it left for the local CPU and hence we need to find
> > an alternative solution as part of the recovery action. In practice it seems to
> > me that kill_it is used to replace kill_me_maybe with kill_me_now in case
> > the exception happened in user mode.
>
> Bah, I got confused, sorry about that - you're right.

Well it is not the easiest code to decode 😊

> Btw, that kill_it should probably be called "kill_current_task" or so to
> make it more clear.

Sure I can add another patch to the set to rename it.

Gab

On Mon, Nov 23, 2020 at 05:40:21PM +0000, Paoloni, Gabriele wrote:
> Well it is not the easiest code to decode 😊

Tell me about it - that's decades worth of crap being piled on top. :-)

> Sure I can add another patch to the set to rename it.

Yeah, only if you really want to - that was more a note-to-self to take
care of it eventually.

Thx.

-- 
Regards/Gruss,
    Boris.