* Cascading crash on ECC error
From: Meelis Roos @ 2016-08-23  8:59 UTC
  To: sparclinux

This happens on my trusty Ultra 5. The root cause seems to be a failing
DIMM. Where it gets interesting is how this failure is detected and how
it escalates into a full crash, all the way to a RED state exception.

What should actually happen when an uncorrectable memory error occurs? If
it happens in IRQ context rather than process context, it should cause a
kernel panic, right?

But why do we detect this error in IRQ context? Is it just chance, or
do we get an error interrupt and therefore always detect this in IRQ
context, and always get a kernel panic?

Second, why do we end up in a RED state exception from here?

CPU[0]: Uncorrectable Error AFSR[180300000] AFAR[468980] UDBL[8c000] UDBH[560] TT[a] TL>1[0]
CPU[0]: UDBH Syndrome[48] Memory Module "DIMM1"
              \|/ ____ \|/
              "@'/ .. \`@"
              /_| \__/ |_\
                 \__U_/
swapper(0): UE [#1]
CPU: 0 PID: 0 Comm: swapper Tainted: G        W       4.7.0-rc6-dirty #45
task: 0000000000aacd08 ti: 0000000000a9c000 task.ti: 0000000000a9c000
TSTATE: 0000009980e01600 TPC: 0000000000468980 TNPC: 0000000000468990 Y: 00000000    Tainted: G        W      
TPC: <prepare_signal+0x180/0x260>
g0: 0000000000b20400 g1: 0000000000000000 g2: 0000000000000000 g3: 0000000000000000
g4: 0000000000aacd08 g5: 0000000000000000 g6: 0000000000a9c000 g7: 0000000000000000
o0: 0000000000000000 o1: 0000000000000000 o2: 0000000000000000 o3: 0000000000000000
o4: 0000000001833c00 o5: 0000000000af3800 sp: 0000000000a9eb11 ret_pc: 0000000000498f14
RPC: <lock_release+0xd4/0x520>
l0: 000000000000000e l1: 0000000001833c00 l2: 000000000000000e l3: 0000000000b1f000
l4: 0000000000aacd08 l5: 0000000000000018 l6: 0000000000000000 l7: 0000000000000000
i0: 000000000000000e i1: fffff8001f5e00c0 i2: 0000000000000000 i3: fffff8001e657580
i4: 000000000000000d i5: 0000000000b1e000 i6: 0000000000a9ebd1 i7: 0000000000468d2c
I7: <__send_signal.constprop.19+0x4c/0x400>
Call Trace:
 [0000000000468d2c] __send_signal.constprop.19+0x4c/0x400
 [000000000046a088] do_send_sig_info+0x28/0x60
 [000000000046a530] group_send_sig_info+0x150/0x180
 [000000000046a6d8] kill_pid_info+0xd8/0x180
 [00000000004b793c] it_real_fn+0x15c/0x180
 [00000000004b61a0] __hrtimer_run_queues.constprop.21+0x320/0x580
 [00000000004b6edc] hrtimer_interrupt+0x9c/0x1c0
 [000000000095eb88] timer_interrupt+0x68/0xa0
 [0000000000426b7c] sys_call_table+0x5dc/0x760
 [000000000042c454] arch_cpu_idle+0x34/0x80
 [000000000048f924] default_idle_call+0x44/0x60
 [000000000048fb7c] cpu_startup_entry+0x23c/0x320
 [0000000000955c00] rest_init+0x180/0x1a0
 [0000000000b30a20] start_kernel+0x40c/0x41c
 [0000000000b31f00] start_early_boot+0x248/0x258
 [0000000000955a60] tlb_fixup_done+0x4c/0x6c
Caller[0000000000468d2c]: __send_signal.constprop.19+0x4c/0x400
Caller[000000000046a088]: do_send_sig_info+0x28/0x60
Caller[000000000046a530]: group_send_sig_info+0x150/0x180
Caller[000000000046a6d8]: kill_pid_info+0xd8/0x180
Caller[00000000004b793c]: it_real_fn+0x15c/0x180
Caller[00000000004b61a0]: __hrtimer_run_queues.constprop.21+0x320/0x580
Caller[00000000004b6edc]: hrtimer_interrupt+0x9c/0x1c0
Caller[000000000095eb88]: timer_interrupt+0x68/0xa0
Caller[0000000000426b7c]: sys_call_table+0x5dc/0x760
Caller[000000000042c448]: arch_cpu_idle+0x28/0x80
Caller[000000000048f924]: default_idle_call+0x44/0x60
Caller[000000000048fb7c]: cpu_startup_entry+0x23c/0x320
Caller[0000000000955c00]: rest_init+0x180/0x1a0
Caller[0000000000b30a20]: start_kernel+0x40c/0x41c
Caller[0000000000b31f00]: start_early_boot+0x248/0x258
Caller[0000000000955a60]: tlb_fixup_done+0x4c/0x6c
Caller[0000000000000000]:           (null)
Instruction DUMP: 82102012  c45e6480  8530901c <8088a001> 1260002a  82102001  c45e6488  8530901c  8088a001 
Kernel panic - not syncing: Aiee, killing interrupt handler!
Press Stop-A (L1-A) to return to the boot prom
---[ end Kernel panic - not syncing: Aiee, killing interrupt handler!
Kernel unaligned access at TPC[494a30] validate_chain.isra.21+0x7b0/0x1740
Unable to handle kernel NULL pointer dereference in mna handler
 at virtual address 00000000000000da
current->{active_,}mm->context = 00000000000002b5
current->{active_,}mm->pgd = fffff8001e7f6000
              \|/ ____ \|/
              "@'/ .. \`@"
              /_| \__/ |_\
                 \__U_/
swapper(0): Oops [#2]
CPU: 0 PID: 0 Comm: swapper Tainted: G      D W       4.7.0-rc6-dirty #45
task: 0000000000aacd08 ti: 0000000000a9c000 task.ti: 0000000000a9c000
TSTATE: 0000009980e01603 TPC: 0000000000494a30 TNPC: 0000000000494a34 Y: 00000000    Tainted: G      D W      
TPC: <validate_chain.isra.21+0x7b0/0x1740>
g0: 0000000000af8918 g1: 0000000000000000 g2: 00000000dead4ead g3: 0000000000000000
g4: 0000000000aacd08 g5: 0000000000000000 g6: 0000000000a9c000 g7: 2c93541a0dde210a
o0: fffff8001f178200 o1: 000000000000000c o2: 0000000000000001 o3: 0000000000000000
o4: 0000000000aaf368 o5: fffff8001f1782a0 sp: 0000000000a9e131 ret_pc: 00000000004600d0
RPC: <irq_exit+0x10/0xc0>
l0: 0000000000b1f000 l1: 0000000000000046 l2: 0000000000000001 l3: 000000000000000a
l4: 0000000000000001 l5: 0000000001827400 l6: 0000000000000000 l7: 0000000000000000
i0: 0000000000000000 i1: 0000000000a9c3f8 i2: 000000000042f628 i3: 0000000000000000
i4: 0000000000af3000 i5: 0000000000aacd08 i6: 0000000000a9e1e1 i7: 000000000095ead0
I7: <handler_irq+0x90/0xe0>
Call Trace:
 [000000000095ead0] handler_irq+0x90/0xe0
 [0000000000426b4c] sys_call_table+0x5ac/0x760
 [000000000042fda4] __delay+0x24/0x60
 [000000000042fdec] udelay+0xc/0x20
 [000000000052ef60] panic+0x260/0x270
 [000000000045e48c] do_exit+0x6c/0xc40

RED State Exception

TL=0000.0000.0000.0005 TT=0000.0000.0000.0034
   TPC=0000.0000.0040.4a40 TnPC=0000.0000.0040.4a44 TSTATE=0000.0000.8000.1502
TL=0000.0000.0000.0004 TT=0000.0000.0000.01ff
   TPC=ffff.ffff.ffff.fffc TnPC=ffff.ffff.ffff.fffc TSTATE=0000.00ef.ff0f.f305
TL=0000.0000.0000.0003 TT=0000.0000.0000.0068
   TPC=0000.0000.f000.4e70 TnPC=0000.0000.f000.4e74 TSTATE=0000.0080.5804.1406
TL=0000.0000.0000.0002 TT=0000.0000.0000.0010
   TPC=0000.0000.0042.0c60 TnPC=0000.0000.0042.0c64 TSTATE=0000.0000.8000.1502
TL=0000.0000.0000.0001 TT=0000.0000.0000.0063
   TPC=0000.0000.0042.8f48 TnPC=0000.0000.0042.8f4c TSTATE=0000.0000.8000.1602


-- 
Meelis Roos (mroos@linux.ee)


* Re: Cascading crash on ECC error
From: David Miller @ 2016-08-23 18:19 UTC
  To: sparclinux

From: Meelis Roos <mroos@linux.ee>
Date: Tue, 23 Aug 2016 13:28:00 +0300 (EEST)

> This happens on my trusty Ultra 5. The root cause seems to be a failing
> DIMM. Where it gets interesting is how this failure is detected and how
> it escalates into a full crash, all the way to a RED state exception.
> 
> What should actually happen when an uncorrectable memory error occurs? If
> it happens in IRQ context rather than process context, it should cause a
> kernel panic, right?
> 
> But why do we detect this error in IRQ context? Is it just chance, or
> do we get an error interrupt and therefore always detect this in IRQ
> context, and always get a kernel panic?
> 
> Second, why do we end up in a RED state exception from here?

We're in an hrtimer, that's why we're in an interrupt.  This CPU was
in the idle loop and took a timer interrupt, then tried to deliver
a signal to a user process from the timer interrupt.
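
To make that reading of the trace concrete, here is a rough Python sketch
that splits the first oops's call trace at the timer interrupt entry.  The
trace lines are copied from the report; the list of "interrupt entry"
symbols and the helper names are assumptions made for this illustration,
not anything the kernel provides.

```python
import re

# Top of the call trace from the first oops, copied from the report.
TRACE = """\
 [0000000000468d2c] __send_signal.constprop.19+0x4c/0x400
 [000000000046a088] do_send_sig_info+0x28/0x60
 [000000000046a530] group_send_sig_info+0x150/0x180
 [000000000046a6d8] kill_pid_info+0xd8/0x180
 [00000000004b793c] it_real_fn+0x15c/0x180
 [00000000004b61a0] __hrtimer_run_queues.constprop.21+0x320/0x580
 [00000000004b6edc] hrtimer_interrupt+0x9c/0x1c0
 [000000000095eb88] timer_interrupt+0x68/0xa0
 [0000000000426b7c] sys_call_table+0x5dc/0x760
 [000000000042c454] arch_cpu_idle+0x34/0x80
"""

# Assumption: symbols treated as interrupt entry points for this trace only.
IRQ_ENTRY_SYMBOLS = {"hrtimer_interrupt", "timer_interrupt"}


def trace_symbols(text):
    """Return the symbol names from '[address] symbol+0xoff/0xlen' lines."""
    pattern = r"^\s*\[[0-9a-f]{16}\]\s+([\w.]+)\+0x"
    return [m.group(1) for m in re.finditer(pattern, text, re.MULTILINE)]


def split_at_irq_entry(symbols):
    """Split into (frames run by the interrupt, frames that were interrupted)."""
    hits = [i for i, sym in enumerate(symbols) if sym in IRQ_ENTRY_SYMBOLS]
    if not hits:
        return [], symbols
    outermost = max(hits)  # the trace is listed innermost frame first
    return symbols[: outermost + 1], symbols[outermost + 1:]


in_irq, interrupted = split_at_irq_entry(trace_symbols(TRACE))
print("ran in interrupt context:", in_irq)
print("interrupted code:", interrupted)
```

Everything from it_real_fn down to __send_signal lands in the first list,
and arch_cpu_idle ends the second one, which is the idle loop being
interrupted.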

This one is really hard to recover from, because the address that took
the error is in the code the CPU was executing: the AFAR and the TPC
below are identical.

> CPU[0]: Uncorrectable Error AFSR[180300000] AFAR[468980] UDBL[8c000] UDBH[560] TT[a] TL>1[0]

AFAR 0x468980

> TSTATE: 0000009980e01600 TPC: 0000000000468980 TNPC: 0000000000468990 Y: 00000000    Tainted: G        W      

TPC 0x468980
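
The same comparison can be done mechanically.  The following is only a
sketch in Python: the two log lines are copied from the oops, while the
helper names and the regular expressions are assumptions made for this
example and are not part of any kernel tooling.

```python
import re

# Two lines copied from the first oops in the report.
UE_LINE = ("CPU[0]: Uncorrectable Error AFSR[180300000] AFAR[468980] "
           "UDBL[8c000] UDBH[560] TT[a] TL>1[0]")
REG_LINE = ("TSTATE: 0000009980e01600 TPC: 0000000000468980 "
            "TNPC: 0000000000468990 Y: 00000000")


def bracket_fields(line):
    """Parse NAME[hex] pairs, e.g. 'AFAR[468980]', into {name: int}."""
    pairs = re.findall(r"([\w>]+)\[([0-9a-fA-F]+)\]", line)
    return {name: int(value, 16) for name, value in pairs}


def colon_fields(line):
    """Parse 'NAME: hex' pairs, e.g. 'TPC: 0000000000468980', into {name: int}."""
    pairs = re.findall(r"(\w+): ([0-9a-fA-F]+)", line)
    return {name: int(value, 16) for name, value in pairs}


ue = bracket_fields(UE_LINE)
regs = colon_fields(REG_LINE)

# True when the faulting address is the instruction being executed.
fault_is_at_pc = ue["AFAR"] == regs["TPC"]
print(f"AFAR={ue['AFAR']:#x} TPC={regs['TPC']:#x} same={fault_is_at_pc}")
```

For this report both values come out as 0x468980, so the bad memory
holds kernel text that was being executed.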
