* [BUG] machine check Oops on Alpha
@ 2016-04-17 21:05 Bob Tracy
  2016-04-18  1:32 ` Maciej W. Rozycki
  0 siblings, 1 reply; 9+ messages in thread
From: Bob Tracy @ 2016-04-17 21:05 UTC (permalink / raw)
  To: linux-kernel; +Cc: debian-alpha, mcree, jay.estabrook, mattst88

[-- Attachment #1: Type: text/plain, Size: 2004 bytes --]

Apologies in advance for the "poor" quality of this bug report.  No idea
how to proceed, because the issue historically has been intermittent to
nonexistent for reasons unknown.

Within 24 hours of booting my Alpha (PWS 433au), I'm pretty much
guaranteed to see a "machine check" Oops which typically will occur
during a period of high disk activity (for example, during an "apt-get
update / upgrade").  If I want a huge mess to clean up afterward, "git
pull" on the kernel source tree will generally suffice as well :-(.

As long as the "Oops" trace doesn't include evidence of filesystem write
activity (calls to ext3/4 functions), the machine is perfectly stable
afterward for as long as I care to let it run -- days, weeks, whatever
-- no further Oopses will occur, regardless of how hard I flog the
machine.  A "bad" Oops will cause an immediate system lockup if any
process attempts to access the region of disk that was active at the
time the Oops occurred.

While a "machine check" is normally indicative of an underlying hardware
issue, the fact this is a one-time-per-boot issue has me thinking
otherwise.  I suspect a code path being traversed prior to the Oops that
gets bypassed afterward.  As previously mentioned, there have been months-
long intervals in the past where the issue has either been masked or non-
existent.  Currently, the issue has persisted through several 4.X kernel
release candidates and releases.

Attached is an example of precisely what I'm talking about as far as a
"good" Oops.  It occurred within a day of the last reboot, and the
machine has been running fine since.  Been flogging the devil out of it,
too: lots of updates (hundreds of megabytes), kernel builds, etc.

While any and all help tracking this down will be appreciated, please
know that kernel rebuilds (to turn on debugging or for whatever reason)
are an overnight affair on this system.  In other words, turnaround time
on diagnostic iterations involving kernel modifications will be slow.

--Bob

[-- Attachment #2: good_oops --]
[-- Type: text/plain, Size: 3715 bytes --]

Apr  9 21:40:15 smirkin kernel: Unable to handle kernel paging request at virtual address 0000000000000010
Apr  9 21:40:15 smirkin kernel: dpkg-deb(19404): Oops 0
Apr  9 21:40:15 smirkin kernel: pc = [<fffffc0000316174>]  ra = [<fffffc000031df78>]  ps = 0007    Not tainted
Apr  9 21:40:15 smirkin kernel: pc is at process_mcheck_info+0x54/0x370
Apr  9 21:40:15 smirkin kernel: ra is at cia_machine_check+0x98/0xb0
Apr  9 21:40:15 smirkin kernel: v0 = 0000000000000004  t0 = 0000000000000000  t1 = 0000000000000001
Apr  9 21:40:15 smirkin kernel: t2 = 0000000000000630  t3 = fffffc0000d405f0  t4 = fffffc0000acf166
Apr  9 21:40:15 smirkin kernel: t5 = 00000000001fffff  t6 = 00000000ffffffff  t7 = fffffc005cf38000
Apr  9 21:40:15 smirkin kernel: s0 = 0000000000000000  s1 = fffffc0000c61750  s2 = 0000000000000000
Apr  9 21:40:15 smirkin kernel: s3 = 0000000000000000  s4 = fffffc0000cbcef0  s5 = fffffc0000d405d0
Apr  9 21:40:15 smirkin kernel: s6 = fffffc0000c7ef70
Apr  9 21:40:15 smirkin kernel: a0 = 0000000000000630  a1 = fffffc0000aca965  a2 = 0000000000000630
Apr  9 21:40:15 smirkin kernel: a3 = 0000000000000000  a4 = 0000000000000000  a5 = 0000000000000000
Apr  9 21:40:15 smirkin kernel: t8 = 000000000000001f  t9 = fffffc0000acbb38  t10= fffffc0000d40608
Apr  9 21:40:15 smirkin kernel: t11= 0000000000000000  pv = fffffc0000316120  at = 0000000000800000
Apr  9 21:40:15 smirkin kernel: gp = fffffc0000cabb38  sp = fffffc005cf3b978
Apr  9 21:40:15 smirkin kernel: Disabling lock debugging due to kernel taint
Apr  9 21:40:15 smirkin kernel: Trace:
Apr  9 21:40:15 smirkin kernel: [<fffffc000031df78>] cia_machine_check+0x98/0xb0
Apr  9 21:40:15 smirkin kernel: [<fffffc0000316100>] do_entInt+0x1c0/0x1e0
Apr  9 21:40:15 smirkin kernel: [<fffffc0000311340>] ret_from_sys_call+0x0/0x10
Apr  9 21:40:15 smirkin kernel: [<fffffc0000398ea4>] get_page_from_freelist+0x504/0xa10
Apr  9 21:40:15 smirkin kernel: [<fffffc00005aa410>] clear_page+0x0/0xc4
Apr  9 21:40:15 smirkin kernel: [<fffffc00005aa428>] clear_page+0x18/0xc4
Apr  9 21:40:15 smirkin kernel: [<fffffc000039949c>] __alloc_pages_nodemask+0xec/0xa00
Apr  9 21:40:15 smirkin kernel: [<fffffc00003b70a0>] wp_page_copy.isra.100+0x3c0/0x620
Apr  9 21:40:15 smirkin kernel: [<fffffc00003b6d3c>] wp_page_copy.isra.100+0x5c/0x620
Apr  9 21:40:15 smirkin kernel: [<fffffc00003b8828>] do_wp_page.isra.102+0x128/0x640
Apr  9 21:40:15 smirkin kernel: [<fffffc00003b8758>] do_wp_page.isra.102+0x58/0x640
Apr  9 21:40:15 smirkin kernel: [<fffffc000036377c>] current_fs_time+0x4c/0x70
Apr  9 21:40:15 smirkin kernel: [<fffffc00003bac6c>] handle_mm_fault+0x73c/0x1180
Apr  9 21:40:15 smirkin kernel: [<fffffc00003bb4f8>] handle_mm_fault+0xfc8/0x1180
Apr  9 21:40:15 smirkin kernel: [<fffffc000036bbe0>] timekeeping_update+0x130/0x200
Apr  9 21:40:15 smirkin kernel: [<fffffc0000365790>] hrtimer_run_queues+0x50/0x210
Apr  9 21:40:15 smirkin kernel: [<fffffc000031ec30>] do_page_fault+0x150/0x500
Apr  9 21:40:15 smirkin kernel: [<fffffc00003bde68>] find_vma+0x28/0xc0
Apr  9 21:40:15 smirkin kernel: [<fffffc000031ebb4>] do_page_fault+0xd4/0x500
Apr  9 21:40:15 smirkin kernel: [<fffffc00003734fc>] tick_periodic.constprop.17+0x3c/0xc0
Apr  9 21:40:15 smirkin kernel: [<fffffc000031eb9c>] do_page_fault+0xbc/0x500
Apr  9 21:40:15 smirkin kernel: [<fffffc0000328244>] __do_softirq+0x184/0x310
Apr  9 21:40:15 smirkin kernel: [<fffffc0000310f7c>] entMM+0x9c/0xc0
Apr  9 21:40:15 smirkin kernel: [<fffffc0000315e8c>] handle_irq+0x8c/0xf0
Apr  9 21:40:15 smirkin kernel: [<fffffc0000315f9c>] do_entInt+0x5c/0x1e0
Apr  9 21:40:15 smirkin kernel: 
Apr  9 21:40:15 smirkin kernel: Code: a53e0008  a55e0010  23de0020  6bfa8001  a55de018  47f00412 <a2890010> 261dffe2 

* Re: [BUG] machine check Oops on Alpha
  2016-04-17 21:05 [BUG] machine check Oops on Alpha Bob Tracy
@ 2016-04-18  1:32 ` Maciej W. Rozycki
  2016-04-18  3:58   ` Bob Tracy
  0 siblings, 1 reply; 9+ messages in thread
From: Maciej W. Rozycki @ 2016-04-18  1:32 UTC (permalink / raw)
  To: Bob Tracy; +Cc: linux-kernel, debian-alpha, mcree, jay.estabrook, mattst88

On Sun, 17 Apr 2016, Bob Tracy wrote:

> While a "machine check" is normally indicative of an underlying hardware
> issue, the fact this is a one-time-per-boot issue has me thinking
> otherwise.  I suspect a code path being traversed prior to the Oops that
> gets bypassed afterward.  As previously mentioned, there have been months-
> long intervals in the past where the issue has either been masked or non-
> existent.  Currently, the issue has persisted through several 4.X kernel
> release candidates and releases.

 It may or may not be a hardware issue, it would seem; there's this comment 
in `process_mcheck_info':

	/*
	 * See if the machine check is due to a badaddr() and if so,
	 * ignore it.
	 */

> Attached is an example of precisely what I'm talking about as far as a
> "good" Oops.  It occurred within a day of the last reboot, and the
> machine has been running fine since.  Been flogging the devil out of it,
> too: lots of updates (hundreds of megabytes), kernel builds, etc.

 So from this dump it looks like the immediate problem is not the machine 
check itself but rather a null pointer dereference (offset by 0x10, so 
likely a structure member access):

Unable to handle kernel paging request at virtual address 0000000000000010

which happens at:

pc is at process_mcheck_info+0x54/0x370

and the offending instruction is:

	10 00 89 a2 	ldl	a4,16(s0)

and s0 is indeed null.  To me it looks like we're here:

	printk(KERN_CRIT "%s machine check: vector=0x%lx pc=0x%lx code=0x%x\n",
	       machine, vector, get_irq_regs()->pc, mchk_header->code);

(so not a benign MCE after all) trying to fetch `mchk_header->code', which 
means `la_ptr' is null for some reason.  This value is passed down from 
`cia_machine_check', from `do_entInt', and originally comes from PALcode, 
supposed to point to the logout area.
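 The decoding above can be double-checked with a rough sketch like the one 
below.  It splits the faulting word from the "Code:" line (a2890010) using 
the Alpha memory instruction format, and compares the displacement with the 
offset of a `code' member.  The struct here is only an illustrative stand-in 
(its name and field order are an assumption, not copied from the kernel 
headers), and the helper names are made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fields of an Alpha memory-format instruction word. */
struct alpha_mem_insn {
	unsigned int opcode;	/* bits 31..26 */
	unsigned int ra;	/* bits 25..21, destination register for loads */
	unsigned int rb;	/* bits 20..16, base register */
	int disp;		/* bits 15..0, sign-extended byte displacement */
};

/* Split a 32-bit instruction word into its memory-format fields. */
static void alpha_decode_mem(uint32_t word, struct alpha_mem_insn *out)
{
	int disp = (int)(word & 0xffff);

	assert(out != NULL);
	if (disp & 0x8000)	/* sign-extend the 16-bit displacement */
		disp -= 0x10000;

	out->opcode = (word >> 26) & 0x3f;
	out->ra = (word >> 21) & 0x1f;
	out->rb = (word >> 16) & 0x1f;
	out->disp = disp;
}

/*
 * Hypothetical stand-in for the logout frame header.  Five 32-bit
 * members precede nothing else here; `code' is the fifth, so it sits
 * at byte offset 16.
 */
struct mchk_header_sketch {
	uint32_t size;
	uint32_t flags;
	uint32_t proc_offset;
	uint32_t sys_offset;
	uint32_t code;
	uint32_t frame_rev;
};
```

 Fed 0xa2890010, this gives opcode 0x28 (LDL), ra = 20 (a4), rb = 9 (s0) and 
a displacement of 16, which matches both the disassembly and the faulting 
address 0x10 when s0 is zero.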

 The SCB vector, still present in a0 it would seem, is 0x630, which looks 
legitimate: it means "Processor correctable machine check" and is used for 
signalling Istream or Dstream correctable ECC errors.  These are dealt 
with IIUC by PALcode before the machine check is dispatched, which would 
explain why, except for the Oops observed, the system continues to operate 
normally.
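 For reference, a minimal sketch of how the machine check vectors could be 
told apart.  Only 0x630 is confirmed by this thread; the other three values 
are quoted from memory of the Alpha SCB layout and should be treated as an 
assumption until checked against the architecture manual.  The function 
names are hypothetical.

```c
/* Map an SCB vector to a description; values other than 0x630 are
 * an assumption, recalled rather than taken from this thread. */
static const char *scb_mcheck_name(unsigned long vector)
{
	switch (vector) {
	case 0x620: return "system correctable machine check";
	case 0x630: return "processor correctable machine check";
	case 0x660: return "system machine check";
	case 0x670: return "processor machine check";
	default:    return "not a machine check vector";
	}
}

/* Correctable errors are the ones reported after the fact, once the
 * error itself has already been dealt with. */
static int scb_mcheck_is_correctable(unsigned long vector)
{
	return vector == 0x620 || vector == 0x630;
}
```

 With the a0 value from the dump, scb_mcheck_is_correctable(0x630) is true, 
consistent with the system carrying on normally afterwards.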

 So the question is whether it's PALcode doing something weird or a 
register getting corrupted due to a bug somewhere, either in our code or 
GCC.  Hmm...

 I'd be tempted to run with the patch below to see what's the value of 
`la_ptr' early on in processing (`entInt' code in entry.S looks sane to 
me, doesn't touch a2).  NB a rebuild doesn't have to be costly if you only 
poke at a single file or a few which aren't e.g. headers included from 
everywhere.

  Maciej

diff --git a/arch/alpha/kernel/irq_alpha.c b/arch/alpha/kernel/irq_alpha.c
index 1c8625c..6773bab 100644
--- a/arch/alpha/kernel/irq_alpha.c
+++ b/arch/alpha/kernel/irq_alpha.c
@@ -46,6 +46,9 @@ do_entInt(unsigned long type, unsigned long vector,
 {
 	struct pt_regs *old_regs;
 
+	if (type == 2)
+		printk(KERN_CRIT "machine check: LA: %016lx\n", la_ptr);
+
 	/*
 	 * Disable interrupts during IRQ handling.
 	 * Note that there is no matching local_irq_enable() due to

* Re: [BUG] machine check Oops on Alpha
  2016-04-18  1:32 ` Maciej W. Rozycki
@ 2016-04-18  3:58   ` Bob Tracy
  2016-04-18 12:31     ` Bob Tracy
  0 siblings, 1 reply; 9+ messages in thread
From: Bob Tracy @ 2016-04-18  3:58 UTC (permalink / raw)
  To: Maciej W. Rozycki
  Cc: linux-kernel, debian-alpha, mcree, jay.estabrook, mattst88

On Mon, Apr 18, 2016 at 02:32:54AM +0100, Maciej W. Rozycki wrote:
>  I'd be tempted to run with the patch below to see what's the value of 
> `la_ptr' early on in processing (`entInt' code in entry.S looks sane to 
> me, doesn't touch a2).  NB a rebuild doesn't have to be costly if you only 
> poke at a single file or a few which aren't e.g. headers included from 
> everywhere.

Applied.  Build started.  Report to follow in a day or so: I've applied
other patches to my kernel source tree in the meantime, so a full build
is unavoidable at this point...  I'll hold off applying any updates
after this to minimize what must be rebuilt while this issue is being
worked.  Thank you for your time and trouble!

--Bob

* Re: [BUG] machine check Oops on Alpha
  2016-04-18  3:58   ` Bob Tracy
@ 2016-04-18 12:31     ` Bob Tracy
  2016-04-18 13:47       ` Maciej W. Rozycki
  0 siblings, 1 reply; 9+ messages in thread
From: Bob Tracy @ 2016-04-18 12:31 UTC (permalink / raw)
  To: Maciej W. Rozycki
  Cc: linux-kernel, debian-alpha, mcree, jay.estabrook, mattst88

On Sun, Apr 17, 2016 at 10:58:48PM -0500, Bob Tracy wrote:
> On Mon, Apr 18, 2016 at 02:32:54AM +0100, Maciej W. Rozycki wrote:
> >  I'd be tempted to run with the patch below to see what's the value of 
> > `la_ptr' early on in processing (`entInt' code in entry.S looks sane to 
> > me, doesn't touch a2).  NB a rebuild doesn't have to be costly if you only 
> > poke at a single file or a few which aren't e.g. headers included from 
> > everywhere.
> 
> Applied.  Build started.  Report to follow in a day or so: I've applied
> other patches to my kernel source tree in the meantime, so a full build
> is unavoidable at this point...  I'll hold off applying any updates
> after this to minimize what must be rebuilt while this issue is being
> worked.  Thank you for your time and trouble!

Build delayed slightly.  Ran into "fs/binfmt_em86.o" build failure
patched by Daniel Wagner back in February (incompatible-pointer-types
warning treated as error by compiler).  Is Daniel's patch queued for
incorporation into the main kernel source tree?

--Bob

* Re: [BUG] machine check Oops on Alpha
  2016-04-18 12:31     ` Bob Tracy
@ 2016-04-18 13:47       ` Maciej W. Rozycki
  2016-04-19  2:52         ` Bob Tracy
  0 siblings, 1 reply; 9+ messages in thread
From: Maciej W. Rozycki @ 2016-04-18 13:47 UTC (permalink / raw)
  To: Bob Tracy; +Cc: linux-kernel, debian-alpha, mcree, jay.estabrook, mattst88

On Mon, 18 Apr 2016, Bob Tracy wrote:

> Build delayed slightly.  Ran into "fs/binfmt_em86.o" build failure
> patched by Daniel Wagner back in February (incompatible-pointer-types
> warning treated as error by compiler).  Is Daniel's patch queued for
> incorporation into the main kernel source tree?

 No idea.  I've had a peek at the patch though and it groups unrelated 
changes together and also mixes obvious semantics fixes (missing `const' 
qualifier) with semantic changes (`i_arg' removal) which may need further 
consideration.  I think splitting that proposal into ~3 self-contained 
changes may raise the likelihood of at least the critical parts being 
accepted.

  Maciej

* Re: [BUG] machine check Oops on Alpha
  2016-04-18 13:47       ` Maciej W. Rozycki
@ 2016-04-19  2:52         ` Bob Tracy
  2016-04-19 23:56           ` Bob Tracy
  0 siblings, 1 reply; 9+ messages in thread
From: Bob Tracy @ 2016-04-19  2:52 UTC (permalink / raw)
  To: Maciej W. Rozycki
  Cc: linux-kernel, debian-alpha, mcree, jay.estabrook, mattst88

On Mon, Apr 18, 2016 at 02:47:40PM +0100, Maciej W. Rozycki wrote:
> On Mon, 18 Apr 2016, Bob Tracy wrote:
> 
> > Build delayed slightly.  Ran into "fs/binfmt_em86.o" build failure
> > patched by Daniel Wagner back in February (incompatible-pointer-types
> > warning treated as error by compiler).  Is Daniel's patch queued for
> > incorporation into the main kernel source tree?
> 
> No idea.  I've had a peek at the patch though and it groups unrelated 
> changes together and also mixes obvious semantics fixes (missing `const' 
> qualifier) with semantic changes (`i_arg' removal) which may need further 
> consideration.  I think splitting that proposal into ~3 self-contained 
> changes may raise the likelihood of at least the critical parts being 
> accepted.

4.6.0-rc4 build complete, including suggested (by Alan Young) "Verbose
Machine Checks" option set to level 2 by default.  System rebooted, and
now we wait...  Thanks for everyone's continued patience.

--Bob

* Re: [BUG] machine check Oops on Alpha
  2016-04-19  2:52         ` Bob Tracy
@ 2016-04-19 23:56           ` Bob Tracy
  2016-04-20  0:46             ` Maciej W. Rozycki
  0 siblings, 1 reply; 9+ messages in thread
From: Bob Tracy @ 2016-04-19 23:56 UTC (permalink / raw)
  To: Maciej W. Rozycki
  Cc: linux-kernel, debian-alpha, mcree, jay.estabrook, mattst88

[-- Attachment #1: Type: text/plain, Size: 599 bytes --]

On Mon, Apr 18, 2016 at 09:52:43PM -0500, Bob Tracy wrote:
> 4.6.0-rc4 build complete, including suggested (by Alan Young) "Verbose
> Machine Checks" option set to level 2 by default.  System rebooted, and
> now we wait...  Thanks for everyone's continued patience.

Within three minutes of rebooting, I got a machine check, but perhaps
significantly, no "Oops".  I'm guessing the only reason I'm seeing the
ECC errors now (haven't seen them before) is because of the stepped-up
debug output.  Syslog output attached...

Machine has been stable since the machine check.  Kernel is 4.6.0-rc4.

--Bob

[-- Attachment #2: machine_check --]
[-- Type: text/plain, Size: 5586 bytes --]

Apr 18 23:51:29 smirkin kernel: machine check: LA: fffffc0000006000
Apr 18 23:51:29 smirkin kernel: CIA machine check NOT expected!!!
Apr 18 23:51:29 smirkin kernel: CIA machine check: vector=0x630 pc=0xfffffc00005b66ac code=0x86
Apr 18 23:51:29 smirkin kernel: machine check type: correctable ECC error (retryable)
Apr 18 23:51:29 smirkin kernel: pc = [<fffffc00005b66ac>]  ra = [<fffffc00003a5324>]  ps = 0000    Not tainted
Apr 18 23:51:29 smirkin kernel: pc is at clear_page+0x3c/0xc4
Apr 18 23:51:29 smirkin kernel: ra is at get_page_from_freelist+0x524/0xa20
Apr 18 23:51:29 smirkin kernel: v0 = 0000000000000007  t0 = 6db6db6db6db6db7  t1 = fffffc0000000000
Apr 18 23:51:29 smirkin kernel: t2 = fffffc005c3786e0  t3 = fffffc0000d445f0  t4 = fffffc0000ad0a38
Apr 18 23:51:29 smirkin kernel: t5 = 00000000001fffff  t6 = 00000000ffffffff  t7 = fffffc005cf54000
Apr 18 23:51:29 smirkin kernel: a0 = fffffc00025ede40  a1 = 0000000000000000  a2 = 000000000000032e
Apr 18 23:51:29 smirkin kernel: a3 = fffffc0000c83a50  a4 = 0000000000000000  a5 = 0000000000000000
Apr 18 23:51:29 smirkin kernel: t8 = fffffc0000ad0aa0  t9 = fffffc0000ad0aa0  t10= fffffc0000d44608
Apr 18 23:51:29 smirkin kernel: t11= 0000000000000000  pv = fffffc00005b6670  at = 0000000000000400
Apr 18 23:51:29 smirkin kernel: gp = fffffc0000cb0aa0  sp = fffffc005cf57ba0
Apr 18 23:51:29 smirkin kernel:   +       0 8000000000000068 0000003800000018
Apr 18 23:51:29 smirkin kernel:   +      10 0000000000000086 ffffff00025edddf
Apr 18 23:51:29 smirkin kernel:   +      20 0000000000005700 fffffff0c5ffffff
Apr 18 23:51:29 smirkin kernel:   +      30 0000000100000000 0000000000000000
Apr 18 23:51:29 smirkin kernel:   +      40 0000000000000000 0000000000000000
Apr 18 23:51:29 smirkin kernel:   +      50 0000000000000000 0000000000000000
Apr 18 23:51:29 smirkin kernel:   +      60 0000000000000000 00000000000002c0
Apr 18 23:51:30 smirkin kernel: machine check: LA: fffffc0000006000
Apr 18 23:51:30 smirkin kernel: CIA machine check NOT expected!!!
Apr 18 23:51:30 smirkin kernel: CIA machine check: vector=0x630 pc=0x1200124c8 code=0x86
Apr 18 23:51:30 smirkin kernel: machine check type: correctable ECC error (retryable)
Apr 18 23:51:30 smirkin kernel: pc = [<00000001200124c8>]  ra = [<0000000120012480>]  ps = 0008    Not tainted
Apr 18 23:51:30 smirkin kernel: pc is at 0x1200124c8
Apr 18 23:51:30 smirkin kernel: ra is at 0x120012480
Apr 18 23:51:30 smirkin kernel: v0 = 0000000120a51c90  t0 = 0000000000000000  t1 = 0000000000000001
Apr 18 23:51:30 smirkin kernel: t2 = 0000000120a530f0  t3 = 0000000120255638  t4 = 0000000000000000
Apr 18 23:51:30 smirkin kernel: t5 = 0000000120255620  t6 = 0000000000000001  t7 = 0000000000000020
Apr 18 23:51:30 smirkin kernel: a0 = 0000000120a51d40  a1 = 0000000000000000  a2 = 0000000000000008
Apr 18 23:51:30 smirkin kernel: a3 = 0000020000038010  a4 = 0000000000000000  a5 = 000000000000003e
Apr 18 23:51:30 smirkin kernel: t8 = 0000000000000001  t9 = 1999999999999999  t10= 0000000000000000
Apr 18 23:51:30 smirkin kernel: t11= 0000000000000001  pv = 0000000000000000  at = 0000000000000400
Apr 18 23:51:30 smirkin kernel: gp = 00000001200440f0  sp = fffffc005cf58000
Apr 18 23:51:30 smirkin kernel:   +       0 8000000000000068 0000003800000018
Apr 18 23:51:30 smirkin kernel:   +      10 0000000000000086 ffffff00025edddf
Apr 18 23:51:30 smirkin kernel:   +      20 0000000000005700 fffffff0c5ffffff
Apr 18 23:51:30 smirkin kernel:   +      30 0000000100000000 0000000000000000
Apr 18 23:51:30 smirkin kernel:   +      40 0000000000000000 0000000000000000
Apr 18 23:51:30 smirkin kernel:   +      50 0000000000000000 0000000000000000
Apr 18 23:51:30 smirkin kernel:   +      60 0000000000000000 00000000000002c0
Apr 18 23:51:30 smirkin kernel: machine check: LA: fffffc0000006000
Apr 18 23:51:30 smirkin kernel: CIA machine check NOT expected!!!
Apr 18 23:51:30 smirkin kernel: CIA machine check: vector=0x630 pc=0x120007138 code=0x86
Apr 18 23:51:30 smirkin kernel: machine check type: correctable ECC error (retryable)
Apr 18 23:51:30 smirkin kernel: pc = [<0000000120007138>]  ra = [<000000012001253c>]  ps = 0008    Not tainted
Apr 18 23:51:30 smirkin kernel: pc is at 0x120007138
Apr 18 23:51:30 smirkin kernel: ra is at 0x12001253c
Apr 18 23:51:30 smirkin kernel: v0 = 0000000120a51c90  t0 = 0000000000000000  t1 = 0000000000000001
Apr 18 23:51:30 smirkin kernel: t2 = 0000000120a530f0  t3 = 0000000120255638  t4 = 0000000000000000
Apr 18 23:51:30 smirkin kernel: t5 = 0000000120255620  t6 = 0000000000000001  t7 = 0000000000000020
Apr 18 23:51:30 smirkin kernel: a0 = 0000000000000000  a1 = 0000000000000000  a2 = 0000000000000008
Apr 18 23:51:30 smirkin kernel: a3 = 0000020000038010  a4 = 0000000000000000  a5 = 000000000000003e
Apr 18 23:51:30 smirkin kernel: t8 = 0000000000000001  t9 = 1999999999999999  t10= 0000000000000000
Apr 18 23:51:30 smirkin kernel: t11= 0000000000000001  pv = 0000000000000000  at = 0000000000000400
Apr 18 23:51:30 smirkin kernel: gp = 00000001200440f0  sp = fffffc005cf58000
Apr 18 23:51:30 smirkin kernel:   +       0 8000000000000068 0000003800000018
Apr 18 23:51:30 smirkin kernel:   +      10 0000000000000086 ffffff00025edddf
Apr 18 23:51:30 smirkin kernel:   +      20 0000000000005700 fffffff0c5ffffff
Apr 18 23:51:30 smirkin kernel:   +      30 0000000100000000 0000000000000000
Apr 18 23:51:30 smirkin kernel:   +      40 0000000000000000 0000000000000000
Apr 18 23:51:30 smirkin kernel:   +      50 0000000000000000 0000000000000000
Apr 18 23:51:30 smirkin kernel:   +      60 0000000000000000 00000000000002c0

* Re: [BUG] machine check Oops on Alpha
  2016-04-19 23:56           ` Bob Tracy
@ 2016-04-20  0:46             ` Maciej W. Rozycki
  2016-04-20  3:57               ` Bob Tracy
  0 siblings, 1 reply; 9+ messages in thread
From: Maciej W. Rozycki @ 2016-04-20  0:46 UTC (permalink / raw)
  To: Bob Tracy; +Cc: linux-kernel, debian-alpha, mcree, jay.estabrook, mattst88

On Tue, 19 Apr 2016, Bob Tracy wrote:

> > 4.6.0-rc4 build complete, including suggested (by Alan Young) "Verbose
> > Machine Checks" option set to level 2 by default.  System rebooted, and
> > now we wait...  Thanks for everyone's continued patience.
> 
> Within three minutes of rebooting, I got a machine check, but perhaps
> significantly, no "Oops".  I'm guessing the only reason I'm seeing the
> ECC errors now (haven't seen them before) is because of the stepped-up
> debug output.  Syslog output attached...

 If this is a code generation bug, which I now suspect even more strongly 
than before, then the debug verbosity configuration change may well have 
made the compiler behave.  As you can see from the log, the logout 
area pointer is not null:

machine check: LA: fffffc0000006000

(of course the lone insertion of this `printk' call may have covered the 
bug, regardless of the debug verbosity change).  Consequently further 
information is printed -- the:

CIA machine check: vector=0x630 pc=0xfffffc00005b66ac code=0x86

line would have been printed anyway -- in fact the Oops previously 
happened in an attempt to retrieve `code' to print with this line.
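 The first three quadwords of the logout area dump in the attached log can 
be picked apart with a sketch like this one.  The field layout is an 
assumption, written from memory of the kernel's `struct el_common' on a 
little-endian machine (size, a flags word with the retry bit on top, two 
offsets, then `code' and the frame revision); the struct and function names 
below are made up for illustration.

```c
#include <stdint.h>

/* Decoded view of the logout frame header (assumed layout). */
struct logout_header_sketch {
	uint32_t size;		/* bytes in the logout frame */
	int retry;		/* top bit of the second 32-bit word */
	int second_err;		/* next bit down */
	uint32_t proc_offset;	/* offset of processor-specific data */
	uint32_t sys_offset;	/* offset of system-specific data */
	uint32_t code;		/* machine check code, at byte offset 16 */
	uint32_t frame_rev;
};

/* Decode the header from the first three quadwords as dumped,
 * i.e. the values printed at +0, +8 and +0x10. */
static struct logout_header_sketch
decode_logout_header(const uint64_t q[3])
{
	struct logout_header_sketch h;

	h.size = (uint32_t)q[0];
	h.retry = (int)((q[0] >> 63) & 1);
	h.second_err = (int)((q[0] >> 62) & 1);
	h.proc_offset = (uint32_t)q[1];
	h.sys_offset = (uint32_t)(q[1] >> 32);
	h.code = (uint32_t)q[2];
	h.frame_rev = (uint32_t)(q[2] >> 32);
	return h;
}
```

 Applied to 8000000000000068, 0000003800000018 and 0000000000000086 from the 
dump, this yields size 0x68, retry set, offsets 0x18 and 0x38, and code 0x86, 
which agrees with the "code=0x86" and "(retryable)" lines in the log.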

 I can see if I can find anything suspicious there if you send me original 
copies (i.e. those that oopsed) of arch/alpha/kernel/irq_alpha.o and 
arch/alpha/kernel/core_cia.o.

> Machine has been stable since the machine check.  Kernel is 4.6.0-rc4.

 Yeah, it was a correctable error after all.

  Maciej

* Re: [BUG] machine check Oops on Alpha
  2016-04-20  0:46             ` Maciej W. Rozycki
@ 2016-04-20  3:57               ` Bob Tracy
  0 siblings, 0 replies; 9+ messages in thread
From: Bob Tracy @ 2016-04-20  3:57 UTC (permalink / raw)
  To: Maciej W. Rozycki
  Cc: linux-kernel, debian-alpha, mcree, jay.estabrook, mattst88

On Wed, Apr 20, 2016 at 01:46:13AM +0100, Maciej W. Rozycki wrote:
>  I can see if I can find anything suspicious there if you send me original 
> copies (i.e. those that oopsed) of arch/alpha/kernel/irq_alpha.o and 
> arch/alpha/kernel/core_cia.o.
> 
> > Machine has been stable since the machine check.  Kernel is 4.6.0-rc4.
> 
>  Yeah, it was a correctable error after all.

:-)

Regrettably, the constituent object files of the 4.5.0 kernel that was
generating the "Oopses" are no longer available.  I *do* have the
"vmlinux.gz" image, the corresponding "System.map-4.5.0", and all the
related modules if those would be of any use.  With a bit of guidance, I
could probably extract the desired objects from the kernel image.
Alternatively, if there's an upload location where I could leave you the
image and map files, that might work as well.

Pending your reply, I'll see if I can figure out how to dump/extract the
requested object code from the kernel image file.
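As a starting point, here is a small sketch of the System.map side of that:
parsing an "address type name" line and working out where a pc falls within
the symbol.  The map line used in the note below is reconstructed from the
Oops (pc minus 0x54), not copied from the actual System.map-4.5.0, and the
struct and function names are invented for the example.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* One System.map entry: "address type name". */
struct map_sym {
	uint64_t addr;
	char type;
	char name[128];
};

/* Parse one map line; returns 1 on success, 0 if the line is malformed. */
static int parse_map_line(const char *line, struct map_sym *sym)
{
	return sscanf(line, "%" SCNx64 " %c %127s",
		      &sym->addr, &sym->type, sym->name) == 3;
}

/* Offset of pc within sym, or -1 if pc lies below the symbol start. */
static int64_t offset_in_symbol(uint64_t pc, const struct map_sym *sym)
{
	return pc >= sym->addr ? (int64_t)(pc - sym->addr) : -1;
}
```

Given "fffffc0000316120 T process_mcheck_info" and the pc from the Oops
(fffffc0000316174), the offset comes out as 0x54, matching the trace.  With
the start address and the 0x370 size from the Oops, an Alpha-capable objdump
run with --start-address and --stop-address on the uncompressed image should
then be able to disassemble just that function.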

--Bob
