All of lore.kernel.org
 help / color / mirror / Atom feed
From: Aravind Gopalakrishnan <aravind.gopalakrishnan@amd.com>
To: Borislav Petkov <bp@alien8.de>
Cc: <tglx@linutronix.de>, <mingo@redhat.com>, <hpa@zytor.com>,
	<tony.luck@intel.com>, <jiang.liu@linux.intel.com>,
	<yinghai@kernel.org>, <x86@kernel.org>, <dvlasenk@redhat.com>,
	<JBeulich@suse.com>, <slaoub@gmail.com>, <luto@amacapital.net>,
	<dave.hansen@linux.intel.com>, <oleg@redhat.com>,
	<rostedt@goodmis.org>, <rusty@rustcorp.com.au>,
	<prarit@redhat.com>, <linux@rasmusvillemoes.dk>,
	<jroedel@suse.de>, <andriy.shevchenko@linux.intel.com>,
	<macro@linux-mips.org>, <wangnan0@huawei.com>,
	<linux-kernel@vger.kernel.org>, <linux-edac@vger.kernel.org>,
	Robert Richter <rric@kernel.org>
Subject: Re: [PATCH 2/4] x86/mce/amd: Introduce deferred error interrupt handler
Date: Mon, 4 May 2015 12:08:16 -0500	[thread overview]
Message-ID: <5547A780.8080800@amd.com> (raw)
In-Reply-To: <20150504154652.GF3829@pd.tnic>

On 5/4/2015 10:46 AM, Borislav Petkov wrote:
> On Mon, May 04, 2015 at 10:29:50AM -0500, Aravind Gopalakrishnan wrote:
>>> What's the family check for? for BIOSes which don't set the LVT offset
>>> to 2, as they should?
>>>
>>> If so, we probably should say
>>>
>>> 	pr_err(FW_BUG "Your BIOS is not setting up LVT offset 0x2 for deferred error IRQs correctly.\n");
>>>
>>> or similar...
>> Yeah. I meant to provide a comment at least for this.
>> Forgot to do that.
>>
>> I'll print out a error message as you suggested (considering we do this in
>> other places like threshold setup or IBS setup..)
> lvt_off_valid() does that already. Adding Robert.

Not sure if lvt_off_valid() can be reused for deferred error interrupt 
setup.
It expects some some of  info to be in struct threshold_block which is 
fine for threshold errors and the shifts for offset are different too.

For deferred errors, the workaround is a little different as it applies 
to only the given family/model right now.
If the workaround needs to be applied for future processors, we can 
extend the family check for those right?


>> Right. I think a __log_error() is a good idea.
>> Except, in amd_threshold_interrupt(), we have-
>> m.misc = ((u64)high << 32) | low;
>>
>> which, is actually useless as we don't use m.misc anywhere in
>> amd_decode_mce() or anywhere else in the decoding pipeline AFAICT.
>> We only print out if 'misc' is valid and we only need status bits for that-
>> ((m->status & MCI_STATUS_MISCV) ? "MiscV" : "-"),
>>
>> But, more importantly, we don't setup 'm.addr' here (in
>> amd_threshold_interrupt() or in amd_deferred_error_interrupt())
>> Which means anytime we pass an error to be decoded from the interrupt
>> handlers, we don't get any info about the error address.
> So what are we reporting with a deferred error if it is not a
> full-fledged MCE? We better fix that otherwise we probably shouldn't
> even report those. I mean, userspace is supposed to do some error
> handling based on error info but if that info's missing, we might just
> as well panic right then and there, right?

Oh no.. It is a proper MCE.
I was simply saying that when we look at dmesg logs after an error 
happens, we would not see any useful info regarding the error address.
Here's an example-
[ 1314.651485] [Hardware Error]: Deferred error.
[ 1314.651611] [Hardware Error]: CPU:0 (15:60:0) 
MC4_STATUS[Over|CE|MiscV|-|AddrV|Deferred|-|UECC]: 0xdc04b00005080813
[ 1314.651898] [Hardware Error]: MC4 Error Address: 0x0000000000000000
                             ^^^^^^^^^^^^^^^^^^^^

The Error Address will always be logged as 0x0 as m->addr in 
amd_decode_mce() is 0x0.
If we setup 'm.addr' in amd_threshold_interrupt() and 
amd_deferred_error_interrupt() properly, then amd_decode_mce() would 
actually have
some value in m->addr to report.

I didn't mean to say HW doesn't provide us the information in the addr 
and/or the misc registers.

>> So, we can do one of these-
>> 1. Remove m.misc setup in amd_threshold_interrupt() and
>> rdmsrl(MSR_IA32_MCx_ADDR(bank), m.addr) before we call mce_log()
>> 2. Since we have mce_read_aux() that reads misc and addr registers, we can
>> move the mce_[rd|wr]msrl wrappers and mce_read_aux() into mce.h and use it
>> here in mce_amd.c
>>
>> Thoughts?
> Makes sense but you need to first check though, which registers are
> valid in the hw when a threshold/deferred error happens and collect
> them. Only then we can do proper recovery.
>
>

The addr, misc registers are still valid for threshold, deferred errors.
(Of course, misc is valid only if m->status & MCI_STATUS_MISCV)

My point was, in __log_error(), we can read relevant status and addr 
MSRs to be passed to mce_log() as those are the only pieces of 
information we use in the decoding chain; and discard the m.misc 
assignment we do for threshold errors.

If userspace tools absolutely need 'misc' info too, we can go with 
option (2) as mentioned above..

Thanks,
-Aravind.


  reply	other threads:[~2015-05-04 17:08 UTC|newest]

Thread overview: 26+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2015-04-30 14:49 [PATCH 0/4] Enable deferred error interrupts Aravind Gopalakrishnan
2015-04-30 14:49 ` [PATCH 1/4] x86/mce: Define 'SUCCOR' cpuid bit Aravind Gopalakrishnan
2015-05-01 10:25   ` Borislav Petkov
2015-05-01 14:54     ` Aravind Gopalakrishnan
2015-05-03  9:01       ` Borislav Petkov
2015-05-01 15:09   ` Dave Hansen
2015-05-01 16:20     ` Borislav Petkov
2015-04-30 14:49 ` [PATCH 2/4] x86/mce/amd: Introduce deferred error interrupt handler Aravind Gopalakrishnan
2015-04-30 20:41   ` Andy Lutomirski
2015-05-01  4:16     ` Aravind Gopalakrishnan
2015-05-01  9:36       ` Borislav Petkov
2015-05-01 14:50         ` Aravind Gopalakrishnan
2015-05-03  9:22   ` Borislav Petkov
2015-05-04 15:29     ` Aravind Gopalakrishnan
2015-05-04 15:46       ` Borislav Petkov
2015-05-04 17:08         ` Aravind Gopalakrishnan [this message]
2015-05-04 18:46           ` Borislav Petkov
2015-05-04 19:06             ` Aravind Gopalakrishnan
2015-05-04 19:14               ` Borislav Petkov
2015-05-05 18:39             ` Aravind Gopalakrishnan
2015-05-05 20:28               ` Luck, Tony
2015-05-05 20:33                 ` Aravind Gopalakrishnan
2015-04-30 14:49 ` [PATCH 3/4] x86, irq: Cleanup ordering of vector numbers Aravind Gopalakrishnan
2015-04-30 14:49 ` [PATCH 4/4] x86/mce/amd: Rename setup_APIC_mce Aravind Gopalakrishnan
2015-05-01  7:18 ` [PATCH 0/4] Enable deferred error interrupts Ingo Molnar
2015-05-01 14:50   ` Aravind Gopalakrishnan

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=5547A780.8080800@amd.com \
    --to=aravind.gopalakrishnan@amd.com \
    --cc=JBeulich@suse.com \
    --cc=andriy.shevchenko@linux.intel.com \
    --cc=bp@alien8.de \
    --cc=dave.hansen@linux.intel.com \
    --cc=dvlasenk@redhat.com \
    --cc=hpa@zytor.com \
    --cc=jiang.liu@linux.intel.com \
    --cc=jroedel@suse.de \
    --cc=linux-edac@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux@rasmusvillemoes.dk \
    --cc=luto@amacapital.net \
    --cc=macro@linux-mips.org \
    --cc=mingo@redhat.com \
    --cc=oleg@redhat.com \
    --cc=prarit@redhat.com \
    --cc=rostedt@goodmis.org \
    --cc=rric@kernel.org \
    --cc=rusty@rustcorp.com.au \
    --cc=slaoub@gmail.com \
    --cc=tglx@linutronix.de \
    --cc=tony.luck@intel.com \
    --cc=wangnan0@huawei.com \
    --cc=x86@kernel.org \
    --cc=yinghai@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.