All of lore.kernel.org
 help / color / mirror / Atom feed
* Injecting SLB miltihit crashes kernel 5.9.0-rc5
@ 2020-09-15  8:43 Michal Suchánek
  2020-09-15 12:54 ` Michael Ellerman
                   ` (2 more replies)
  0 siblings, 3 replies; 10+ messages in thread
From: Michal Suchánek @ 2020-09-15  8:43 UTC (permalink / raw)
  To: linuxppc-dev

Hello,

Using the SLB mutihit injection test module (which I did not write so I
do not want to post it here) to verify updates on my 5.3 frankernekernel
I found that the kernel crashes with Oops: kernel bad access.

I tested on latest upstream kernel build that I have at hand and the
result is te same (minus the message - nothing was logged and the kernel
simply rebooted).

Since the whole effort to write a real mode MCE handler was supposed to
prevent this maybe the SLB injection module should be added to the
kernel selftests?

Thanks

Michal

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Injecting SLB miltihit crashes kernel 5.9.0-rc5
  2020-09-15  8:43 Injecting SLB miltihit crashes kernel 5.9.0-rc5 Michal Suchánek
@ 2020-09-15 12:54 ` Michael Ellerman
  2020-09-16  0:51   ` Nicholas Piggin
  2020-09-15 18:06   ` Michal Suchanek
  2020-09-16  7:12 ` Injecting SLB miltihit crashes kernel 5.9.0-rc5 Mahesh Jagannath Salgaonkar
  2 siblings, 1 reply; 10+ messages in thread
From: Michael Ellerman @ 2020-09-15 12:54 UTC (permalink / raw)
  To: Michal Suchánek, linuxppc-dev, mahesh

Michal Suchánek <msuchanek@suse.de> writes:
> Hello,
>
> Using the SLB mutihit injection test module (which I did not write so I
> do not want to post it here) to verify updates on my 5.3 frankernekernel
> I found that the kernel crashes with Oops: kernel bad access.
>
> I tested on latest upstream kernel build that I have at hand and the
> result is te same (minus the message - nothing was logged and the kernel
> simply rebooted).

That's disappointing.

> Since the whole effort to write a real mode MCE handler was supposed to
> prevent this maybe the SLB injection module should be added to the
> kernel selftests?

Yes I'd like to see it upstream. I think it should be integrated into
LKDTM, which contains other dangerous things like that and is designed
for testing how the kernel handles/recovers from bad conditions.

cheers

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCH] Revert "powerpc/64s: machine check interrupt update NMI accounting"
  2020-09-15  8:43 Injecting SLB miltihit crashes kernel 5.9.0-rc5 Michal Suchánek
@ 2020-09-15 18:06   ` Michal Suchanek
  2020-09-15 18:06   ` Michal Suchanek
  2020-09-16  7:12 ` Injecting SLB miltihit crashes kernel 5.9.0-rc5 Mahesh Jagannath Salgaonkar
  2 siblings, 0 replies; 10+ messages in thread
From: Michal Suchanek @ 2020-09-15 18:06 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Michal Suchanek, Michael Ellerman, Benjamin Herrenschmidt,
	Paul Mackerras, Nicholas Piggin, Santosh Sivaraj,
	Christophe Leroy, Ganesh Goudar, Mahesh Salgaonkar,
	Alistair Popple, Jordan Niethe, Mike Rapoport, Peter Zijlstra,
	Aneesh Kumar K.V, Greg Kroah-Hartman, linux-kernel, stable

This reverts commit 116ac378bb3ff844df333e7609e7604651a0db9d.

This commit causes the kernel to oops and reboot when injecting a SLB
multihit which causes a MCE.

Before this commit a SLB multihit was corrected by the kernel and the
system continued to operate normally.

cc: stable@vger.kernel.org
Fixes: 116ac378bb3f ("powerpc/64s: machine check interrupt update NMI accounting")
Signed-off-by: Michal Suchanek <msuchanek@suse.de>
---
 arch/powerpc/kernel/mce.c   |  7 -------
 arch/powerpc/kernel/traps.c | 18 +++---------------
 2 files changed, 3 insertions(+), 22 deletions(-)

diff --git a/arch/powerpc/kernel/mce.c b/arch/powerpc/kernel/mce.c
index ada59f6c4298..2e13528dcc92 100644
--- a/arch/powerpc/kernel/mce.c
+++ b/arch/powerpc/kernel/mce.c
@@ -591,14 +591,10 @@ EXPORT_SYMBOL_GPL(machine_check_print_event_info);
 long notrace machine_check_early(struct pt_regs *regs)
 {
 	long handled = 0;
-	bool nested = in_nmi();
 	u8 ftrace_enabled = this_cpu_get_ftrace_enabled();
 
 	this_cpu_set_ftrace_enabled(0);
 
-	if (!nested)
-		nmi_enter();
-
 	hv_nmi_check_nonrecoverable(regs);
 
 	/*
@@ -607,9 +603,6 @@ long notrace machine_check_early(struct pt_regs *regs)
 	if (ppc_md.machine_check_early)
 		handled = ppc_md.machine_check_early(regs);
 
-	if (!nested)
-		nmi_exit();
-
 	this_cpu_set_ftrace_enabled(ftrace_enabled);
 
 	return handled;
diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c
index d1ebe152f210..7853b770918d 100644
--- a/arch/powerpc/kernel/traps.c
+++ b/arch/powerpc/kernel/traps.c
@@ -827,19 +827,7 @@ void machine_check_exception(struct pt_regs *regs)
 {
 	int recover = 0;
 
-	/*
-	 * BOOK3S_64 does not call this handler as a non-maskable interrupt
-	 * (it uses its own early real-mode handler to handle the MCE proper
-	 * and then raises irq_work to call this handler when interrupts are
-	 * enabled).
-	 *
-	 * This is silly. The BOOK3S_64 should just call a different function
-	 * rather than expecting semantics to magically change. Something
-	 * like 'non_nmi_machine_check_exception()', perhaps?
-	 */
-	const bool nmi = !IS_ENABLED(CONFIG_PPC_BOOK3S_64);
-
-	if (nmi) nmi_enter();
+	nmi_enter();
 
 	__this_cpu_inc(irq_stat.mce_exceptions);
 
@@ -865,7 +853,7 @@ void machine_check_exception(struct pt_regs *regs)
 	if (check_io_access(regs))
 		goto bail;
 
-	if (nmi) nmi_exit();
+	nmi_exit();
 
 	die("Machine check", regs, SIGBUS);
 
@@ -876,7 +864,7 @@ void machine_check_exception(struct pt_regs *regs)
 	return;
 
 bail:
-	if (nmi) nmi_exit();
+	nmi_exit();
 }
 
 void SMIException(struct pt_regs *regs)
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH] Revert "powerpc/64s: machine check interrupt update NMI accounting"
@ 2020-09-15 18:06   ` Michal Suchanek
  0 siblings, 0 replies; 10+ messages in thread
From: Michal Suchanek @ 2020-09-15 18:06 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Aneesh Kumar K.V, Santosh Sivaraj, Alistair Popple,
	Greg Kroah-Hartman, Peter Zijlstra, Mahesh Salgaonkar,
	Nicholas Piggin, linux-kernel, Paul Mackerras, Ganesh Goudar,
	stable, Michal Suchanek, Jordan Niethe, Mike Rapoport

This reverts commit 116ac378bb3ff844df333e7609e7604651a0db9d.

This commit causes the kernel to oops and reboot when injecting a SLB
multihit which causes a MCE.

Before this commit a SLB multihit was corrected by the kernel and the
system continued to operate normally.

cc: stable@vger.kernel.org
Fixes: 116ac378bb3f ("powerpc/64s: machine check interrupt update NMI accounting")
Signed-off-by: Michal Suchanek <msuchanek@suse.de>
---
 arch/powerpc/kernel/mce.c   |  7 -------
 arch/powerpc/kernel/traps.c | 18 +++---------------
 2 files changed, 3 insertions(+), 22 deletions(-)

diff --git a/arch/powerpc/kernel/mce.c b/arch/powerpc/kernel/mce.c
index ada59f6c4298..2e13528dcc92 100644
--- a/arch/powerpc/kernel/mce.c
+++ b/arch/powerpc/kernel/mce.c
@@ -591,14 +591,10 @@ EXPORT_SYMBOL_GPL(machine_check_print_event_info);
 long notrace machine_check_early(struct pt_regs *regs)
 {
 	long handled = 0;
-	bool nested = in_nmi();
 	u8 ftrace_enabled = this_cpu_get_ftrace_enabled();
 
 	this_cpu_set_ftrace_enabled(0);
 
-	if (!nested)
-		nmi_enter();
-
 	hv_nmi_check_nonrecoverable(regs);
 
 	/*
@@ -607,9 +603,6 @@ long notrace machine_check_early(struct pt_regs *regs)
 	if (ppc_md.machine_check_early)
 		handled = ppc_md.machine_check_early(regs);
 
-	if (!nested)
-		nmi_exit();
-
 	this_cpu_set_ftrace_enabled(ftrace_enabled);
 
 	return handled;
diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c
index d1ebe152f210..7853b770918d 100644
--- a/arch/powerpc/kernel/traps.c
+++ b/arch/powerpc/kernel/traps.c
@@ -827,19 +827,7 @@ void machine_check_exception(struct pt_regs *regs)
 {
 	int recover = 0;
 
-	/*
-	 * BOOK3S_64 does not call this handler as a non-maskable interrupt
-	 * (it uses its own early real-mode handler to handle the MCE proper
-	 * and then raises irq_work to call this handler when interrupts are
-	 * enabled).
-	 *
-	 * This is silly. The BOOK3S_64 should just call a different function
-	 * rather than expecting semantics to magically change. Something
-	 * like 'non_nmi_machine_check_exception()', perhaps?
-	 */
-	const bool nmi = !IS_ENABLED(CONFIG_PPC_BOOK3S_64);
-
-	if (nmi) nmi_enter();
+	nmi_enter();
 
 	__this_cpu_inc(irq_stat.mce_exceptions);
 
@@ -865,7 +853,7 @@ void machine_check_exception(struct pt_regs *regs)
 	if (check_io_access(regs))
 		goto bail;
 
-	if (nmi) nmi_exit();
+	nmi_exit();
 
 	die("Machine check", regs, SIGBUS);
 
@@ -876,7 +864,7 @@ void machine_check_exception(struct pt_regs *regs)
 	return;
 
 bail:
-	if (nmi) nmi_exit();
+	nmi_exit();
 }
 
 void SMIException(struct pt_regs *regs)
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH] Revert "powerpc/64s: machine check interrupt update NMI accounting"
  2020-09-15 18:06   ` Michal Suchanek
@ 2020-09-15 18:16     ` peterz
  -1 siblings, 0 replies; 10+ messages in thread
From: peterz @ 2020-09-15 18:16 UTC (permalink / raw)
  To: Michal Suchanek
  Cc: linuxppc-dev, Michael Ellerman, Benjamin Herrenschmidt,
	Paul Mackerras, Nicholas Piggin, Santosh Sivaraj,
	Christophe Leroy, Ganesh Goudar, Mahesh Salgaonkar,
	Alistair Popple, Jordan Niethe, Mike Rapoport, Aneesh Kumar K.V,
	Greg Kroah-Hartman, linux-kernel, stable

On Tue, Sep 15, 2020 at 08:06:59PM +0200, Michal Suchanek wrote:
> This reverts commit 116ac378bb3ff844df333e7609e7604651a0db9d.
> 
> This commit causes the kernel to oops and reboot when injecting a SLB
> multihit which causes a MCE.
> 
> Before this commit a SLB multihit was corrected by the kernel and the
> system continued to operate normally.
> 
> cc: stable@vger.kernel.org
> Fixes: 116ac378bb3f ("powerpc/64s: machine check interrupt update NMI accounting")
> Signed-off-by: Michal Suchanek <msuchanek@suse.de>

Ever since 69ea03b56ed2 ("hardirq/nmi: Allow nested nmi_enter()")
nmi_enter() supports nesting natively.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] Revert "powerpc/64s: machine check interrupt update NMI accounting"
@ 2020-09-15 18:16     ` peterz
  0 siblings, 0 replies; 10+ messages in thread
From: peterz @ 2020-09-15 18:16 UTC (permalink / raw)
  To: Michal Suchanek
  Cc: Aneesh Kumar K.V, Santosh Sivaraj, Alistair Popple,
	Greg Kroah-Hartman, Mahesh Salgaonkar, Nicholas Piggin,
	linux-kernel, Jordan Niethe, Paul Mackerras, Ganesh Goudar,
	stable, linuxppc-dev, Mike Rapoport

On Tue, Sep 15, 2020 at 08:06:59PM +0200, Michal Suchanek wrote:
> This reverts commit 116ac378bb3ff844df333e7609e7604651a0db9d.
> 
> This commit causes the kernel to oops and reboot when injecting a SLB
> multihit which causes a MCE.
> 
> Before this commit a SLB multihit was corrected by the kernel and the
> system continued to operate normally.
> 
> cc: stable@vger.kernel.org
> Fixes: 116ac378bb3f ("powerpc/64s: machine check interrupt update NMI accounting")
> Signed-off-by: Michal Suchanek <msuchanek@suse.de>

Ever since 69ea03b56ed2 ("hardirq/nmi: Allow nested nmi_enter()")
nmi_enter() supports nesting natively.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Injecting SLB miltihit crashes kernel 5.9.0-rc5
  2020-09-15 12:54 ` Michael Ellerman
@ 2020-09-16  0:51   ` Nicholas Piggin
  0 siblings, 0 replies; 10+ messages in thread
From: Nicholas Piggin @ 2020-09-16  0:51 UTC (permalink / raw)
  To: linuxppc-dev, mahesh, Michael Ellerman, Michal Suchánek

Excerpts from Michael Ellerman's message of September 15, 2020 10:54 pm:
> Michal Suchánek <msuchanek@suse.de> writes:
>> Hello,
>>
>> Using the SLB mutihit injection test module (which I did not write so I
>> do not want to post it here) to verify updates on my 5.3 frankernekernel
>> I found that the kernel crashes with Oops: kernel bad access.
>>
>> I tested on latest upstream kernel build that I have at hand and the
>> result is te same (minus the message - nothing was logged and the kernel
>> simply rebooted).
> 
> That's disappointing.

It seems to work okay with qemu and mambo injection on upstream
(powernv_defconfig). I wonder why that nmi_enter is crashing.
Can you post the output of a successful test with the patch
reverted?


qemu injection test output - 
[  195.279885][    C0] Disabling lock debugging due to kernel taint
[  195.280891][    C0] MCE: CPU0: machine check (Warning) Host SLB Multihit DAR: 00000000deadbeef [Recovered]
[  195.282117][    C0] MCE: CPU0: NIP: [c00000000003c2b4] isa300_idle_stop_mayloss+0x68/0x6c
[  195.283631][    C0] MCE: CPU0: Initiator CPU
[  195.284432][    C0] MCE: CPU0: Probable Software error (some chance of hardware cause)
[  220.711577][   T90] MCE: CPU0: machine check (Warning) Host SLB Multihit DAR: 00000000deadbeef [Recovered]
[  220.712805][   T90] MCE: CPU0: PID: 90 Comm: yes NIP: [00007fff7fdac2e0]
[  220.713553][   T90] MCE: CPU0: Initiator CPU
[  220.714021][   T90] MCE: CPU0: Probable Software error (some chance of hardware cause)

Thanks,
Nick

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Injecting SLB miltihit crashes kernel 5.9.0-rc5
  2020-09-15  8:43 Injecting SLB miltihit crashes kernel 5.9.0-rc5 Michal Suchánek
  2020-09-15 12:54 ` Michael Ellerman
  2020-09-15 18:06   ` Michal Suchanek
@ 2020-09-16  7:12 ` Mahesh Jagannath Salgaonkar
  2 siblings, 0 replies; 10+ messages in thread
From: Mahesh Jagannath Salgaonkar @ 2020-09-16  7:12 UTC (permalink / raw)
  To: Michal Suchánek, linuxppc-dev

On 9/15/20 2:13 PM, Michal Suchánek wrote:
> Hello,
> 
> Using the SLB mutihit injection test module (which I did not write so I
> do not want to post it here) to verify updates on my 5.3 frankernekernel
> I found that the kernel crashes with Oops: kernel bad access.
> 
> I tested on latest upstream kernel build that I have at hand and the
> result is te same (minus the message - nothing was logged and the kernel
> simply rebooted).


Yes, SLB multihit recovery is broken upstream. Fix is on the way.


> 
> Since the whole effort to write a real mode MCE handler was supposed to
> prevent this maybe the SLB injection module should be added to the
> kernel selftests?

Yes. We are working on adding SLB injection selftest patches will be
posted soon.

Thanks,
-Mahesh.

> 
> Thanks
> 
> Michal
> 


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] Revert "powerpc/64s: machine check interrupt update NMI accounting"
  2020-09-15 18:16     ` peterz
@ 2020-09-16  9:00       ` Michal Suchánek
  -1 siblings, 0 replies; 10+ messages in thread
From: Michal Suchánek @ 2020-09-16  9:00 UTC (permalink / raw)
  To: peterz
  Cc: linuxppc-dev, Michael Ellerman, Benjamin Herrenschmidt,
	Paul Mackerras, Nicholas Piggin, Santosh Sivaraj,
	Christophe Leroy, Ganesh Goudar, Mahesh Salgaonkar,
	Alistair Popple, Jordan Niethe, Mike Rapoport, Aneesh Kumar K.V,
	Greg Kroah-Hartman, linux-kernel, stable

On Tue, Sep 15, 2020 at 08:16:42PM +0200, peterz@infradead.org wrote:
> On Tue, Sep 15, 2020 at 08:06:59PM +0200, Michal Suchanek wrote:
> > This reverts commit 116ac378bb3ff844df333e7609e7604651a0db9d.
> > 
> > This commit causes the kernel to oops and reboot when injecting a SLB
> > multihit which causes a MCE.
> > 
> > Before this commit a SLB multihit was corrected by the kernel and the
> > system continued to operate normally.
> > 
> > cc: stable@vger.kernel.org
> > Fixes: 116ac378bb3f ("powerpc/64s: machine check interrupt update NMI accounting")
> > Signed-off-by: Michal Suchanek <msuchanek@suse.de>
> 
> Ever since 69ea03b56ed2 ("hardirq/nmi: Allow nested nmi_enter()")
> nmi_enter() supports nesting natively.

And this patch was merged in parallel with this native nesting support
and conflicted with it - hence the explicit nesting in the hunk that did
not conflict.

Either way the bug is present on kernels both with and without
69ea03b56ed2. So besides the conflict 69ea03b56ed2 does not affect this
problem.

Thanks

Michal

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] Revert "powerpc/64s: machine check interrupt update NMI accounting"
@ 2020-09-16  9:00       ` Michal Suchánek
  0 siblings, 0 replies; 10+ messages in thread
From: Michal Suchánek @ 2020-09-16  9:00 UTC (permalink / raw)
  To: peterz
  Cc: Aneesh Kumar K.V, Santosh Sivaraj, Alistair Popple,
	Greg Kroah-Hartman, Mahesh Salgaonkar, Nicholas Piggin,
	linux-kernel, Jordan Niethe, Paul Mackerras, Ganesh Goudar,
	stable, linuxppc-dev, Mike Rapoport

On Tue, Sep 15, 2020 at 08:16:42PM +0200, peterz@infradead.org wrote:
> On Tue, Sep 15, 2020 at 08:06:59PM +0200, Michal Suchanek wrote:
> > This reverts commit 116ac378bb3ff844df333e7609e7604651a0db9d.
> > 
> > This commit causes the kernel to oops and reboot when injecting a SLB
> > multihit which causes a MCE.
> > 
> > Before this commit a SLB multihit was corrected by the kernel and the
> > system continued to operate normally.
> > 
> > cc: stable@vger.kernel.org
> > Fixes: 116ac378bb3f ("powerpc/64s: machine check interrupt update NMI accounting")
> > Signed-off-by: Michal Suchanek <msuchanek@suse.de>
> 
> Ever since 69ea03b56ed2 ("hardirq/nmi: Allow nested nmi_enter()")
> nmi_enter() supports nesting natively.

And this patch was merged in parallel with this native nesting support
and conflicted with it - hence the explicit nesting in the hunk that did
not conflict.

Either way the bug is present on kernels both with and without
69ea03b56ed2. So besides the conflict 69ea03b56ed2 does not affect this
problem.

Thanks

Michal

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2020-09-16  9:02 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-09-15  8:43 Injecting SLB miltihit crashes kernel 5.9.0-rc5 Michal Suchánek
2020-09-15 12:54 ` Michael Ellerman
2020-09-16  0:51   ` Nicholas Piggin
2020-09-15 18:06 ` [PATCH] Revert "powerpc/64s: machine check interrupt update NMI accounting" Michal Suchanek
2020-09-15 18:06   ` Michal Suchanek
2020-09-15 18:16   ` peterz
2020-09-15 18:16     ` peterz
2020-09-16  9:00     ` Michal Suchánek
2020-09-16  9:00       ` Michal Suchánek
2020-09-16  7:12 ` Injecting SLB miltihit crashes kernel 5.9.0-rc5 Mahesh Jagannath Salgaonkar

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.