linux-kernel.vger.kernel.org archive mirror
From: "Maciej W. Rozycki" <macro@linux-mips.org>
To: Bob Tracy <rct@gherkin.frus.com>
Cc: linux-kernel@vger.kernel.org, debian-alpha@lists.debian.org,
	mcree@orcon.net.nz, jay.estabrook@gmail.com, mattst88@gmail.com
Subject: Re: [BUG] machine check Oops on Alpha
Date: Mon, 18 Apr 2016 02:32:54 +0100 (BST)
Message-ID: <alpine.LFD.2.20.1604172254540.28036@eddie.linux-mips.org>
In-Reply-To: <20160417210532.GA27208@gherkin.frus.com>

On Sun, 17 Apr 2016, Bob Tracy wrote:

> While a "machine check" is normally indicative of an underlying hardware
> issue, the fact this is a one-time-per-boot issue has me thinking
> otherwise.  I suspect a code path being traversed prior to the Oops that
> gets bypassed afterward.  As previously mentioned, there have been months-
> long intervals in the past where the issue has either been masked or non-
> existent.  Currently, the issue has persisted through several 4.X kernel
> release candidates and releases.

 It may or may not be a hardware issue, it would seem; there's this comment 
in `process_mcheck_info':

	/*
	 * See if the machine check is due to a badaddr() and if so,
	 * ignore it.
	 */

> Attached is an example of precisely what I'm talking about as far as a
> "good" Oops.  It occurred within a day of the last reboot, and the
> machine has been running fine since.  Been flogging the devil out of it,
> too: lots of updates (hundreds of megabytes), kernel builds, etc.

 So from this dump it looks like the immediate problem is not the machine 
check itself but rather a null pointer dereference (offset by 0x10, so 
likely a structure member access):

Unable to handle kernel paging request at virtual address 0000000000000010

which happens at:

pc is at process_mcheck_info+0x54/0x370

and the offending instruction is:

	10 00 89 a2 	ldl	a4,16(s0)

and s0 is indeed null.  To me it looks like we're here:

	printk(KERN_CRIT "%s machine check: vector=0x%lx pc=0x%lx code=0x%x\n",
	       machine, vector, get_irq_regs()->pc, mchk_header->code);

(so not a benign MCE after all) trying to fetch `mchk_header->code', which 
means `la_ptr' is null for some reason.  This value is passed down from 
`cia_machine_check', which gets it from `do_entInt', and it originally 
comes from PALcode, where it is supposed to point to the logout area.

 The SCB vector, still present in a0 it would seem, is 630, which looks 
legitimate: it means "Processor correctable machine check" and is used for 
signalling Istream or Dstream correctable ECC errors.  These are dealt 
with IIUC by PALcode before the machine check is dispatched, which would 
explain why, except for the Oops observed, the system continues to operate 
normally.

 So the question is whether it's PALcode doing something weird or a 
register getting corrupted due to a bug somewhere, either in our code or 
in GCC.  Hmm...

 I'd be tempted to run with the patch below to see what the value of 
`la_ptr' is early on in processing (the `entInt' code in entry.S looks 
sane to me, it doesn't touch a2).  NB a rebuild doesn't have to be costly 
if you only poke at a single file or a few which aren't e.g. headers 
included from everywhere.

  Maciej

diff --git a/arch/alpha/kernel/irq_alpha.c b/arch/alpha/kernel/irq_alpha.c
index 1c8625c..6773bab 100644
--- a/arch/alpha/kernel/irq_alpha.c
+++ b/arch/alpha/kernel/irq_alpha.c
@@ -46,6 +46,9 @@ do_entInt(unsigned long type, unsigned long vector,
 {
 	struct pt_regs *old_regs;
 
+	if (type == 2)
+		printk(KERN_CRIT "machine check: LA: %016lx\n", la_ptr);
+
 	/*
 	 * Disable interrupts during IRQ handling.
 	 * Note that there is no matching local_irq_enable() due to

