* [breakage] panic() does not halt arm64 systems under certain conditions
@ 2019-09-17  1:35 Xogium
  2019-09-17 10:45   ` Will Deacon
  0 siblings, 1 reply; 13+ messages in thread
From: Xogium @ 2019-09-17  1:35 UTC (permalink / raw)
  To: linux-arm-kernel; +Cc: Will Deacon

On arm64, userspace can in some situations keep running even after a panic.
Any userspace watchdog daemon keeps pinging, service managers keep running
and in certain cases keep displaying messages, and it is possible to log in
over ssh to the now unstable system and do almost anything except reboot or
power off. Setting CONFIG_PREEMPT=n in the kernel's configuration makes the
issue go away. I have reproduced the very same behavior with linux 4.19, 5.2
and 5.3. On x86/x86_64 the issue does not seem to be present at all.

Also, kernels without commit 8341f2f222d729688014ce8306727fdb9798d37e don't
trigger a broken panic with 'echo c > /proc/sysrq-trigger'. Instead of
calling the panic() function directly, they call die() through the memory
manager, which works as intended because it causes an oops that ends in a
panic. By patching the poweroff sysrq handler to call panic() I can confirm
the issue is definitely present in kernel 4.19 on qemu. On actual hardware I
used a Marvell ESPRESSObin with linux 5.2.14.

The issue seemed to be quite random at first, but it can be triggered 100%
of the time by adding nosmp to the kernel command line. Adding e.g. panic=30
to the kernel command line also works around the problem entirely, with or
without nosmp.
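
Since the behaviour depends on CONFIG_PREEMPT, it can help to check the
configuration of the kernel under test first. The following is only a minimal
sketch: the helper name is_fully_preemptible and the default path .config are
assumptions made for illustration, not something used in this report.

```shell
#!/bin/sh
# Hypothetical helper (name is an assumption): succeed if the given kernel
# config file selects full preemption, which this report ties to the bug.
is_fully_preemptible() {
    # Match the exact option so CONFIG_PREEMPT_VOLUNTARY, CONFIG_PREEMPT_NONE
    # and other CONFIG_PREEMPT_* options are not counted.
    grep -q '^CONFIG_PREEMPT=y$' "$1"
}

# Assumed path; point this at the config of the kernel being tested.
config="${1:-.config}"
if [ -r "$config" ] && is_fully_preemptible "$config"; then
    echo "$config: CONFIG_PREEMPT=y, the behaviour described above may apply"
else
    echo "$config: CONFIG_PREEMPT=y not found"
fi
```

The exact-match pattern is deliberate: a plain search for CONFIG_PREEMPT would
also match the voluntary and non-preemptible settings.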

The easiest way to reproduce this is with qemu and an initramfs containing busybox and the following init script:

    #!/bin/sh
    busybox mkdir /proc
    busybox mount -t proc none /proc
    # Launch some programs to run in the background
    while true; do echo "Ping 1!"; busybox sleep 1; done >/dev/console&
    while true; do echo "Ping 2!"; busybox sleep 2; done >/dev/console&
    echo c > /proc/sysrq-trigger
    # Nothing should be running from here on out
    echo "Running a shell now!"
    exec busybox sh

A copy of the initramfs and a 5.2 arm64 defconfig kernel can be found at:
http://novena.jookia.org/arm64bug/mycpio
http://novena.jookia.org/arm64bug/Image

You can run it in qemu using:
qemu-system-aarch64 -machine virt-4.0 -cpu cortex-a53 -m 256 -kernel Image -initrd mycpio -nographic
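
To compare the baseline with the nosmp and panic=30 cases mentioned earlier,
kernel command-line arguments can be passed through qemu's -append option.
The sketch below only prints the command lines instead of running them.
qemu_cmdline is a made-up helper name, and these exact invocations are an
assumption rather than something taken from the report.

```shell
#!/bin/sh
# Print (not run) the qemu command from the report, optionally adding
# kernel command-line arguments through qemu's -append option.
qemu_cmdline() {
    extra_args="$1"   # e.g. "nosmp" or "panic=30"; empty for the baseline
    cmd="qemu-system-aarch64 -machine virt-4.0 -cpu cortex-a53 -m 256"
    cmd="$cmd -kernel Image -initrd mycpio -nographic"
    if [ -n "$extra_args" ]; then
        cmd="$cmd -append \"$extra_args\""
    fi
    printf '%s\n' "$cmd"
}

qemu_cmdline ""          # baseline, as given in the report
qemu_cmdline "nosmp"     # reported to trigger the problem every time
qemu_cmdline "panic=30"  # reported to work around the problem
```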

As an example, running it with linux 5.2.15 with the arm64 defconfig in qemu gives this:

    [    1.841502] Run /init as init process
    [    1.970386] sysrq: Trigger a crash
    [    1.970967] Kernel panic - not syncing: sysrq triggered crash
    [    1.971693] CPU: 0 PID: 1 Comm: init Not tainted 5.2.15 #1
    [    1.972096] Hardware name: linux,dummy-virt (DT)
    [    1.972661] Call trace:
    [    1.972919]  dump_backtrace+0x0/0x148
    [    1.973271]  show_stack+0x14/0x20
    [    1.973472]  dump_stack+0xa0/0xc4
    [    1.973699]  panic+0x140/0x32c
    [    1.973897]  sysrq_handle_reboot+0x0/0x20
    [    1.974161]  __handle_sysrq+0x124/0x190
    [    1.974422]  write_sysrq_trigger+0x64/0x88
    [    1.974715]  proc_reg_write+0x60/0xa8
    [    1.974973]  __vfs_write+0x18/0x40
    [    1.975224]  vfs_write+0xa4/0x1b8
    [    1.975474]  ksys_write+0x64/0xf0
    [    1.975739]  __arm64_sys_write+0x14/0x20
    [    1.976021]  el0_svc_common.constprop.0+0xb0/0x168
    [    1.976375]  el0_svc_handler+0x28/0x78
    [    1.976661]  el0_svc+0x8/0xc
    [    1.977383] Kernel Offset: disabled
    [    1.977895] CPU features: 0x0002,24002004
    [    1.978241] Memory Limit: none
    [    1.979169] ---[ end Kernel panic - not syncing: sysrq triggered crash ]---
    Ping 2!
    Ping 1!
    Ping 1!
    Ping 2!
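
The symptom in the log above is console output appearing after the
"end Kernel panic" marker. A captured console log could be checked for that
automatically. This is a minimal sketch under that assumption; the function
name lines_after_panic is hypothetical, and any kernel message printed after
the marker would be counted as well.

```shell
#!/bin/sh
# Hypothetical check: read a console log on stdin and print how many
# non-empty lines appear after the "end Kernel panic" marker.
lines_after_panic() {
    awk '
        /---\[ end Kernel panic/ { seen = 1; next }   # marker closes the panic report
        seen && NF               { count++ }          # later output is unexpected
        END                      { print count + 0 }  # prints 0 when nothing followed
    '
}

# Example with the tail of the log above: two lines follow the marker.
printf '%s\n' \
    '[    1.979169] ---[ end Kernel panic - not syncing: sysrq triggered crash ]---' \
    'Ping 2!' \
    'Ping 1!' | lines_after_panic
```

A non-zero count on a system that should have halted indicates that
something kept running after the panic.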


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [breakage] panic() does not halt arm64 systems under certain conditions
  2019-09-17  1:35 [breakage] panic() does not halt arm64 systems under certain conditions Xogium
@ 2019-09-17 10:45   ` Will Deacon
  0 siblings, 0 replies; 13+ messages in thread
From: Will Deacon @ 2019-09-17 10:45 UTC (permalink / raw)
  To: Xogium
  Cc: linux-arm-kernel, tglx, mingo, bp, gregkh, linux-arch,
	linux-kernel, linux

Hi,

[Expanding CC list; original message is here:
 https://lore.kernel.org/linux-arm-kernel/BX1W47JXPMR8.58IYW53H6M5N@dragonstone/]

On Mon, Sep 16, 2019 at 09:35:36PM -0400, Xogium wrote:
> On arm64 in some situations userspace will continue running even after a
> panic. This means any userspace watchdog daemon will continue pinging,
> that service managers will keep running and displaying messages in certain
> cases, and that it is possible to enter via ssh in the now unstable system
> and to do almost anything except reboot/power off and etc. If
> CONFIG_PREEMPT=n is set in the kernel's configuration, the issue is fixed.
> I have reproduced the very same behavior with linux 4.19, 5.2 and 5.3. On
> x86/x86_64 the issue does not seem to be present at all.

I've managed to reproduce this under both 32-bit and 64-bit ARM kernels.
The issue is that the infinite loop at the end of panic() can run with
preemption enabled (particularly when invoked by echoing 'c' to
/proc/sysrq-trigger), so we end up rescheduling user tasks. On x86, this
doesn't happen because smp_send_stop() disables the local APIC in
native_stop_other_cpus() and so interrupts are effectively masked while
spinning.

A straightforward fix is to disable preemption explicitly on the panic()
path (diff below). I've expanded the cc list both to see what others think
and in case smp_send_stop() is supposed to have the side-effect of
disabling interrupt delivery for the local CPU.
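
The conditions described in this message can be written down as a small truth
table. This is a toy model, not kernel code, and it simplifies heavily: it
assumes the spinning task can only be preempted when the kernel is built with
CONFIG_PREEMPT, interrupts are being delivered, and the preempt count is zero.
The function name panic_loop_preemptible is made up for illustration.

```shell
#!/bin/sh
# Toy model (not kernel code): can the task spinning in the final panic()
# loop be preempted? Arguments are 0/1 flags plus the preempt count.
panic_loop_preemptible() {
    config_preempt="$1"   # 1 if the kernel is built with CONFIG_PREEMPT=y
    irqs_delivered="$2"   # 1 if interrupts reach the CPU while it spins
    preempt_count="$3"    # 0 means preemption is currently allowed
    if [ "$config_preempt" -eq 1 ] && [ "$irqs_delivered" -eq 1 ] \
        && [ "$preempt_count" -eq 0 ]; then
        echo yes
    else
        echo no
    fi
}

panic_loop_preemptible 1 1 0   # reported case: prints "yes"
panic_loop_preemptible 1 1 1   # preemption disabled on the panic() path: prints "no"
panic_loop_preemptible 0 1 0   # CONFIG_PREEMPT=n: prints "no"
panic_loop_preemptible 1 0 0   # interrupts effectively masked: prints "no"
```

In this model, disabling preemption on the panic() path changes only the
third argument, so interrupts can stay enabled while the loop spins.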

Will

--->8

diff --git a/kernel/panic.c b/kernel/panic.c
index 057540b6eee9..02d0de31c42d 100644
--- a/kernel/panic.c
+++ b/kernel/panic.c
@@ -179,6 +179,7 @@ void panic(const char *fmt, ...)
	 * after setting panic_cpu) from invoking panic() again.
	 */
	local_irq_disable();
+	preempt_disable_notrace();
 
	/*
	 * It's possible to come here directly from a panic-assertion and

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [breakage] panic() does not halt arm64 systems under certain conditions
  2019-09-17 10:45   ` Will Deacon
  (?)
@ 2019-09-17 10:51     ` Russell King - ARM Linux admin
  -1 siblings, 0 replies; 13+ messages in thread
From: Russell King - ARM Linux admin @ 2019-09-17 10:51 UTC (permalink / raw)
  To: Will Deacon
  Cc: Xogium, linux-arm-kernel, tglx, mingo, bp, gregkh, linux-arch,
	linux-kernel

On Tue, Sep 17, 2019 at 11:45:19AM +0100, Will Deacon wrote:
> Hi,
> 
> [Expanding CC list; original message is here:
>  https://lore.kernel.org/linux-arm-kernel/BX1W47JXPMR8.58IYW53H6M5N@dragonstone/]
> 
> On Mon, Sep 16, 2019 at 09:35:36PM -0400, Xogium wrote:
> > On arm64 in some situations userspace will continue running even after a
> > panic. This means any userspace watchdog daemon will continue pinging,
> > that service managers will keep running and displaying messages in certain
> > cases, and that it is possible to enter via ssh in the now unstable system
> > and to do almost anything except reboot/power off and etc. If
> > CONFIG_PREEMPT=n is set in the kernel's configuration, the issue is fixed.
> > I have reproduced the very same behavior with linux 4.19, 5.2 and 5.3. On
> > x86/x86_64 the issue does not seem to be present at all.
> 
> I've managed to reproduce this under both 32-bit and 64-bit ARM kernels.
> The issue is that the infinite loop at the end of panic() can run with
> preemption enabled (particularly when invoking by echoing 'c' to
> /proc/sysrq-trigger), so we end up rescheduling user tasks. On x86, this
> doesn't happen because smp_send_stop() disables the local APIC in
> native_stop_other_cpus() and so interrupts are effectively masked while
> spinning.
> 
> A straightforward fix is to disable preemption explicitly on the panic()
> path (diff below), but I've expanded the cc list to see both what others
> think,

Yep, and it looks like this bug goes back into the dim and distant past.
At least to the start of modern git history, 2.6.12-rc2.

> but also in case smp_send_stop() is supposed to have the side-effect
> of disabling interrupt delivery for the local CPU.

That can't fix it.  Consider a preemptive non-SMP kernel.
smp_send_stop() becomes a no-op there.

I'd suggest that a preemptive UP kernel on x86 hardware will suffer
this same issue - it will be able to preempt out of this loop and
continue running userspace.
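
The argument above can be sketched as a toy model. It is not kernel code and
only encodes what has been said in this thread: smp_send_stop() is a no-op on
a non-SMP kernel, and on x86 SMP it disables the local APIC. The function
name stop_masks_local_irqs is hypothetical.

```shell
#!/bin/sh
# Toy model of the argument (not kernel code): does smp_send_stop() leave
# local interrupt delivery effectively masked?
stop_masks_local_irqs() {
    arch="$1"         # "x86" or "arm64"
    config_smp="$2"   # 1 for an SMP kernel, 0 for a UP kernel
    if [ "$config_smp" -eq 0 ]; then
        # smp_send_stop() is a no-op on UP, so it has no side effects at all.
        echo no
    elif [ "$arch" = "x86" ]; then
        # Per this thread, the x86 SMP path disables the local APIC.
        echo yes
    else
        echo no
    fi
}

stop_masks_local_irqs x86 1     # prints "yes"
stop_masks_local_irqs x86 0     # prints "no"
stop_masks_local_irqs arm64 1   # prints "no"
```

The second call is the preemptive UP x86 case: nothing masks interrupts
there, so the final loop is as exposed as it is on arm64.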

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 12.1Mbps down 622kbps up
According to speedtest.net: 11.9Mbps down 500kbps up

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [breakage] panic() does not halt arm64 systems under certain conditions
  2019-09-17 10:51     ` Russell King - ARM Linux admin
  (?)
@ 2019-09-17 11:05       ` Will Deacon
  -1 siblings, 0 replies; 13+ messages in thread
From: Will Deacon @ 2019-09-17 11:05 UTC (permalink / raw)
  To: Russell King - ARM Linux admin
  Cc: Xogium, linux-arm-kernel, tglx, mingo, bp, gregkh, linux-arch,
	linux-kernel

On Tue, Sep 17, 2019 at 11:51:36AM +0100, Russell King - ARM Linux admin wrote:
> On Tue, Sep 17, 2019 at 11:45:19AM +0100, Will Deacon wrote:
> > [Expanding CC list; original message is here:
> >  https://lore.kernel.org/linux-arm-kernel/BX1W47JXPMR8.58IYW53H6M5N@dragonstone/]
> > 
> > On Mon, Sep 16, 2019 at 09:35:36PM -0400, Xogium wrote:
> > > On arm64 in some situations userspace will continue running even after a
> > > panic. This means any userspace watchdog daemon will continue pinging,
> > > that service managers will keep running and displaying messages in certain
> > > cases, and that it is possible to enter via ssh in the now unstable system
> > > and to do almost anything except reboot/power off and etc. If
> > > CONFIG_PREEMPT=n is set in the kernel's configuration, the issue is fixed.
> > > I have reproduced the very same behavior with linux 4.19, 5.2 and 5.3. On
> > > x86/x86_64 the issue does not seem to be present at all.
> > 
> > I've managed to reproduce this under both 32-bit and 64-bit ARM kernels.
> > The issue is that the infinite loop at the end of panic() can run with
> > preemption enabled (particularly when invoking by echoing 'c' to
> > /proc/sysrq-trigger), so we end up rescheduling user tasks. On x86, this
> > doesn't happen because smp_send_stop() disables the local APIC in
> > native_stop_other_cpus() and so interrupts are effectively masked while
> > spinning.
> > 
> > A straightforward fix is to disable preemption explicitly on the panic()
> > path (diff below), but I've expanded the cc list to see both what others
> > think,
> 
> Yep, and it looks like this bug goes back into the dim and distant past.
> At least to the start of modern git history, 2.6.12-rc2.
> 
> > but also in case smp_send_stop() is supposed to have the side-effect
> > of disabling interrupt delivery for the local CPU.
> 
> That can't fix it.  Consider a preemptive non-SMP kernel.
> smp_send_stop() becomes a no-op there.
> 
> I'd suggest that a preemptive UP kernel on x86 hardware will suffer
> this same issue - it will be able to preempt out of this loop and
> continue running userspace.

You're right; I managed to reproduce this locally on my xeon box.

Will

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [breakage] panic() does not halt arm64 systems under certain conditions
  2019-09-17 10:45   ` Will Deacon
@ 2019-09-20  4:25     ` Jookia
  -1 siblings, 0 replies; 13+ messages in thread
From: Jookia @ 2019-09-20  4:25 UTC (permalink / raw)
  To: Will Deacon
  Cc: Xogium, linux-arch, gregkh, linux-kernel, linux, mingo, bp, tglx,
	linux-arm-kernel

On Tue, Sep 17, 2019 at 11:45:19AM +0100, Will Deacon wrote:
> Hi,
> 
> [Expanding CC list; original message is here:
>  https://lore.kernel.org/linux-arm-kernel/BX1W47JXPMR8.58IYW53H6M5N@dragonstone/]
> 
> On Mon, Sep 16, 2019 at 09:35:36PM -0400, Xogium wrote:
> > On arm64 in some situations userspace will continue running even after a
> > panic. This means any userspace watchdog daemon will continue pinging,
> > that service managers will keep running and displaying messages in certain
> > cases, and that it is possible to enter via ssh in the now unstable system
> > and to do almost anything except reboot/power off and etc. If
> > CONFIG_PREEMPT=n is set in the kernel's configuration, the issue is fixed.
> > I have reproduced the very same behavior with linux 4.19, 5.2 and 5.3. On
> > x86/x86_64 the issue does not seem to be present at all.
> 
> I've managed to reproduce this under both 32-bit and 64-bit ARM kernels.
> The issue is that the infinite loop at the end of panic() can run with
> preemption enabled (particularly when invoking by echoing 'c' to
> /proc/sysrq-trigger), so we end up rescheduling user tasks. On x86, this
> doesn't happen because smp_send_stop() disables the local APIC in
> native_stop_other_cpus() and so interrupts are effectively masked while
> spinning.
> 
> A straightforward fix is to disable preemption explicitly on the panic()
> path (diff below), but I've expanded the cc list to see both what others
> think, but also in case smp_send_stop() is supposed to have the side-effect
> of disabling interrupt delivery for the local CPU.
> 
> Will
> 
> --->8
> 
> diff --git a/kernel/panic.c b/kernel/panic.c
> index 057540b6eee9..02d0de31c42d 100644
> --- a/kernel/panic.c
> +++ b/kernel/panic.c
> @@ -179,6 +179,7 @@ void panic(const char *fmt, ...)
> 	 * after setting panic_cpu) from invoking panic() again.
> 	 */
> 	local_irq_disable();
> +	preempt_disable_notrace();
>  
> 	/*
> 	 * It's possible to come here directly from a panic-assertion and

When you run with panic=... it will send you to a loop earlier in the
panic code before local_irq_enable() is hit, working around the bug.
A patch like this would make the behaviour the same:

diff --git a/kernel/panic.c b/kernel/panic.c
index 4d9f55bf7d38..92abbb5f8d38 100644
--- a/kernel/panic.c
+++ b/kernel/panic.c
@@ -331,7 +331,6 @@ void panic(const char *fmt, ...)

        /* Do not scroll important messages printed above */
        suppress_printk = 1;
-       local_irq_enable();
        for (i = 0; ; i += PANIC_TIMER_STEP) {
                touch_softlockup_watchdog();
                if (i >= i_next) {

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [breakage] panic() does not halt arm64 systems under certain conditions
  2019-09-20  4:25     ` Jookia
@ 2019-09-30 13:53       ` Will Deacon
  -1 siblings, 0 replies; 13+ messages in thread
From: Will Deacon @ 2019-09-30 13:53 UTC (permalink / raw)
  To: Jookia
  Cc: Xogium, linux-arch, gregkh, linux-kernel, linux, mingo, bp, tglx,
	linux-arm-kernel

On Fri, Sep 20, 2019 at 02:25:01PM +1000, Jookia wrote:
> On Tue, Sep 17, 2019 at 11:45:19AM +0100, Will Deacon wrote:
> > A straightforward fix is to disable preemption explicitly on the panic()
> > path (diff below), but I've expanded the cc list to see both what others
> > think, but also in case smp_send_stop() is supposed to have the side-effect
> > of disabling interrupt delivery for the local CPU.
> > 
> > diff --git a/kernel/panic.c b/kernel/panic.c
> > index 057540b6eee9..02d0de31c42d 100644
> > --- a/kernel/panic.c
> > +++ b/kernel/panic.c
> > @@ -179,6 +179,7 @@ void panic(const char *fmt, ...)
> > 	 * after setting panic_cpu) from invoking panic() again.
> > 	 */
> > 	local_irq_disable();
> > +	preempt_disable_notrace();
> >  
> > 	/*
> > 	 * It's possible to come here directly from a panic-assertion and
> > 
> When you run with panic=..., the kernel enters a delay loop earlier in the
> panic code, before local_irq_enable() is hit, which works around the bug.
> A patch like this would make the behaviour the same:
> 
> diff --git a/kernel/panic.c b/kernel/panic.c
> index 4d9f55bf7d38..92abbb5f8d38 100644
> --- a/kernel/panic.c
> +++ b/kernel/panic.c
> @@ -331,7 +331,6 @@ void panic(const char *fmt, ...)
> 
>         /* Do not scroll important messages printed above */
>         suppress_printk = 1;
> -       local_irq_enable();
>         for (i = 0; ; i += PANIC_TIMER_STEP) {
>                 touch_softlockup_watchdog();
>                 if (i >= i_next) {

I kept irqs enabled because I figured they might be useful for magic sysrq
keyboard interrupts (e.g. if you wanted to reboot the box).

With 'panic=', the reboot happens automatically, so there's no issue there
afaict.

Will
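
The trade-off between the two patches in this thread can be sketched as a
small userspace model. It is not kernel code: irq_can_be_taken() and
may_schedule_userspace() are hypothetical names, and the preemption rule is
a simplification of what a CONFIG_PREEMPT=y kernel does when it returns from
an interrupt to kernel code.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * An interrupt (timer tick, sysrq keypress) is only taken while interrupts
 * are enabled on the local CPU.
 */
static bool irq_can_be_taken(bool irqs_enabled)
{
	return irqs_enabled;
}

/*
 * Simplified rule for a CONFIG_PREEMPT=y kernel: on return from an
 * interrupt to kernel code, the scheduler may switch to another task only
 * if preemption is not disabled (preempt_count == 0) and a reschedule is
 * pending (need_resched).
 */
static bool may_schedule_userspace(bool irqs_enabled, int preempt_count,
				   bool need_resched)
{
	assert(preempt_count >= 0);

	return irq_can_be_taken(irqs_enabled) &&
	       preempt_count == 0 && need_resched;
}
```

In this model, (true, 0, true) stands for the unpatched final loop, where
userspace can still be scheduled. (true, 1, true) stands for the
preempt_disable_notrace() patch: interrupts such as sysrq are still taken,
but no task switch follows. (false, 0, true) stands for dropping
local_irq_enable(): nothing is scheduled, but sysrq interrupts never arrive
either.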

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2019-09-30 13:53 UTC | newest]

Thread overview: 13+ messages
2019-09-17  1:35 [breakage] panic() does not halt arm64 systems under certain conditions Xogium
2019-09-17 10:45 ` Will Deacon
2019-09-17 10:45   ` Will Deacon
2019-09-17 10:51   ` Russell King - ARM Linux admin
2019-09-17 10:51     ` Russell King - ARM Linux admin
2019-09-17 10:51     ` Russell King - ARM Linux admin
2019-09-17 11:05     ` Will Deacon
2019-09-17 11:05       ` Will Deacon
2019-09-17 11:05       ` Will Deacon
2019-09-20  4:25   ` Jookia
2019-09-20  4:25     ` Jookia
2019-09-30 13:53     ` Will Deacon
2019-09-30 13:53       ` Will Deacon
