linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [GIT PULL] RAS update for 3.20 (one more thing)
@ 2015-02-02 17:41 Luck, Tony
  2015-02-09  7:53 ` Ingo Molnar
  2015-02-09 10:13 ` Ingo Molnar
  0 siblings, 2 replies; 6+ messages in thread
From: Luck, Tony @ 2015-02-02 17:41 UTC (permalink / raw)
  To: x86; +Cc: linux-kernel, linux-edac, Borislav Petkov

The following changes since commit 26bc420b59a38e4e6685a73345a0def461136dce:

  Linux 3.19-rc6 (2015-01-25 20:04:41 -0800)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras.git tags/please-pull-fixmcelog

for you to fetch changes up to 728b6f14abaa7f36a8c6d41c6d6fe0320d32d5e9:

  x86, mce: Kernel does full decoding for AMD, others still need /dev/mcelog reports (2015-01-30 11:25:46 -0800)

----------------------------------------------------------------
Long standing regression - functions registered on the mce decoder
chain can declare that they have completely dealt with an event.
True for AMD (says Boris), since the kernel fully decodes the
machine check bank information. Not true for Intel processors. Full
decode is done in user space, so we need to make the log visible
via /dev/mcelog

----------------------------------------------------------------
Tony Luck (1):
      x86, mce: Kernel does full decoding for AMD, others still need /dev/mcelog reports

 arch/x86/kernel/cpu/mcheck/mce.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [GIT PULL] RAS update for 3.20 (one more thing)
  2015-02-02 17:41 [GIT PULL] RAS update for 3.20 (one more thing) Luck, Tony
@ 2015-02-09  7:53 ` Ingo Molnar
  2015-02-09 10:13 ` Ingo Molnar
  1 sibling, 0 replies; 6+ messages in thread
From: Ingo Molnar @ 2015-02-09  7:53 UTC (permalink / raw)
  To: Luck, Tony; +Cc: x86, linux-kernel, linux-edac, Borislav Petkov


* Luck, Tony <tony.luck@intel.com> wrote:

> The following changes since commit 26bc420b59a38e4e6685a73345a0def461136dce:
> 
>   Linux 3.19-rc6 (2015-01-25 20:04:41 -0800)
> 
> are available in the git repository at:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras.git tags/please-pull-fixmcelog
> 
> for you to fetch changes up to 728b6f14abaa7f36a8c6d41c6d6fe0320d32d5e9:
> 
>   x86, mce: Kernel does full decoding for AMD, others still need /dev/mcelog reports (2015-01-30 11:25:46 -0800)
> 
> ----------------------------------------------------------------
> Long standing regression - functions registered on the mce decoder
> chain can declare that they have completely dealt with an event.
> True for AMD (says Boris), since the kernel fully decodes the
> machine check bank information. Not true for Intel processors. Full
> decode is done in user space, so we need to make the log visible
> via /dev/mcelog
> 
> ----------------------------------------------------------------
> Tony Luck (1):
>       x86, mce: Kernel does full decoding for AMD, others still need /dev/mcelog reports
> 
>  arch/x86/kernel/cpu/mcheck/mce.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)

Pulled into tip:x86/ras, thanks Tony!

	Ingo

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [GIT PULL] RAS update for 3.20 (one more thing)
  2015-02-02 17:41 [GIT PULL] RAS update for 3.20 (one more thing) Luck, Tony
  2015-02-09  7:53 ` Ingo Molnar
@ 2015-02-09 10:13 ` Ingo Molnar
  2015-02-09 11:01   ` Borislav Petkov
  1 sibling, 1 reply; 6+ messages in thread
From: Ingo Molnar @ 2015-02-09 10:13 UTC (permalink / raw)
  To: Luck, Tony; +Cc: x86, linux-kernel, linux-edac, Borislav Petkov


* Luck, Tony <tony.luck@intel.com> wrote:

> The following changes since commit 26bc420b59a38e4e6685a73345a0def461136dce:
> 
>   Linux 3.19-rc6 (2015-01-25 20:04:41 -0800)
> 
> are available in the git repository at:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras.git tags/please-pull-fixmcelog
> 
> for you to fetch changes up to 728b6f14abaa7f36a8c6d41c6d6fe0320d32d5e9:
> 
>   x86, mce: Kernel does full decoding for AMD, others still need /dev/mcelog reports (2015-01-30 11:25:46 -0800)
> 
> ----------------------------------------------------------------
> Long standing regression - functions registered on the mce decoder
> chain can declare that they have completely dealt with an event.
> True for AMD (says Boris), since the kernel fully decodes the
> machine check bank information. Not true for Intel processors. Full
> decode is done in user space, so we need to make the log visible
> via /dev/mcelog

So I'm having second thoughts about this:

This kind of vendor specific hard coding is really ugly:

        if (c->x86_vendor == X86_VENDOR_AMD && ret == NOTIFY_STOP)
                return;

Instead we should fix the Intel side to do a proper decode 
as well - by the time mcelog is running it might be too 
late, attempting an intelligent printk is way better...

So what would we need to make the Intel side just as good 
as the AMD side?

I've undone the pull for now.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [GIT PULL] RAS update for 3.20 (one more thing)
  2015-02-09 10:13 ` Ingo Molnar
@ 2015-02-09 11:01   ` Borislav Petkov
  2015-02-09 18:00     ` [PATCHv3] x86/mce: Fix regression. All error records should report via /dev/mcelog Tony Luck
  0 siblings, 1 reply; 6+ messages in thread
From: Borislav Petkov @ 2015-02-09 11:01 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Luck, Tony, x86, linux-kernel, linux-edac

On Mon, Feb 09, 2015 at 11:13:16AM +0100, Ingo Molnar wrote:
> This kind of vendor specific hard coding is really ugly:
> 
>         if (c->x86_vendor == X86_VENDOR_AMD && ret == NOTIFY_STOP)
>                 return;

Yeah, I raised the same concern but we already have a bunch of vendor
checks in mce.c anyway so one more shouldn't hurt.

> Instead we should fix the Intel side to do a proper decode 
> as well - by the time mcelog is running it might be too 
> late, attempting an intelligent printk is way better...
> 
> So what would we need to make the Intel side just as good 
> as the AMD side?

The only thing that needs to be done is actually code it, methinks.
Once that's done, we would finally carry ready and decoded information
through a tracepoint to userspace instead of the /dev/mcelog thing.

I'm afraid this won't happen quickly though so we probably would need
this fix for the interim while we're working on the decoding.

Both sides (AMD/Intel) would need to be taught to funnel the decoded
strings through a tracepoint - something which we've talked about in the
past but keeps getting pushed down the TODO list. :-\

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.
--

^ permalink raw reply	[flat|nested] 6+ messages in thread

* [PATCHv3] x86/mce: Fix regression. All error records should report via /dev/mcelog
  2015-02-09 11:01   ` Borislav Petkov
@ 2015-02-09 18:00     ` Tony Luck
  2015-02-11 18:11       ` Tony Luck
  0 siblings, 1 reply; 6+ messages in thread
From: Tony Luck @ 2015-02-09 18:00 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: x86, linux-kernel, linux-edac

I'm getting complaints from validation teams that have updated their
Linux kernels from ancient versions to current. They don't see the
error logs they expect. I tell the to unload any EDAC drivers[1], and
things start working again.  The problem is that we short-circuit
the logging process if any function on the decoder chain claims to
have dealt with the problem:

	ret = atomic_notifier_call_chain(&x86_mce_decoder_chain, 0, m);
	if (ret == NOTIFY_STOP)
		return;

The logic we used when we added this code was that we did not want
to confuse users with double reports of the same error.

But it turns out users are not confused - they are upset that they
don't see a log where their tools used to find a log.

I could also get into a long description of how the consumer of this
log does more than just decode model specific details of the error.
It keeps counts, tracks thresholds, takes actions and runs scripts
that can alert administrators to problems.

[1] We've recently compounded the problem because the acpi_extlog
driver also registers for this notifier and also returns NOTIFY_STOP.

Signed-off-by: Tony Luck <tony.luck@intel.com>
---
v3: Drop v2 changes that included a vendor specific check. Go back to v1.
    New commit message that better describes the problem.

 arch/x86/kernel/cpu/mcheck/mce.c | 5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index d2c611699cd9..f439c429c133 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -150,14 +150,11 @@ static struct mce_log mcelog = {
 void mce_log(struct mce *mce)
 {
 	unsigned next, entry;
-	int ret = 0;
 
 	/* Emit the trace record: */
 	trace_mce_record(mce);
 
-	ret = atomic_notifier_call_chain(&x86_mce_decoder_chain, 0, mce);
-	if (ret == NOTIFY_STOP)
-		return;
+	atomic_notifier_call_chain(&x86_mce_decoder_chain, 0, mce);
 
 	mce->finished = 0;
 	wmb();
-- 
2.1.0


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [PATCHv3] x86/mce: Fix regression. All error records should report via /dev/mcelog
  2015-02-09 18:00     ` [PATCHv3] x86/mce: Fix regression. All error records should report via /dev/mcelog Tony Luck
@ 2015-02-11 18:11       ` Tony Luck
  0 siblings, 0 replies; 6+ messages in thread
From: Tony Luck @ 2015-02-11 18:11 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: X86-ML, Linux Kernel Mailing List, Linux Edac Mailing List

On Mon, Feb 9, 2015 at 10:00 AM, Tony Luck <tony.luck@intel.com> wrote:
> I'm getting complaints from validation teams that have updated their
> Linux kernels from ancient versions to current. They don't see the
> error logs they expect. I tell the to unload any EDAC drivers[1], and
> things start working again.  The problem is that we short-circuit
> the logging process if any function on the decoder chain claims to
> have dealt with the problem:
>
>         ret = atomic_notifier_call_chain(&x86_mce_decoder_chain, 0, m);
>         if (ret == NOTIFY_STOP)
>                 return;
>
> The logic we used when we added this code was that we did not want
> to confuse users with double reports of the same error.
>
> But it turns out users are not confused - they are upset that they
> don't see a log where their tools used to find a log.
>
> I could also get into a long description of how the consumer of this
> log does more than just decode model specific details of the error.
> It keeps counts, tracks thresholds, takes actions and runs scripts
> that can alert administrators to problems.
>
> [1] We've recently compounded the problem because the acpi_extlog
> driver also registers for this notifier and also returns NOTIFY_STOP.

So I see that this missed the x86/ras pull request to Linus.  Will you be doing
another one this merge window?  Or should I just send this direct to Linus?

-Tony

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2015-02-11 18:11 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-02-02 17:41 [GIT PULL] RAS update for 3.20 (one more thing) Luck, Tony
2015-02-09  7:53 ` Ingo Molnar
2015-02-09 10:13 ` Ingo Molnar
2015-02-09 11:01   ` Borislav Petkov
2015-02-09 18:00     ` [PATCHv3] x86/mce: Fix regression. All error records should report via /dev/mcelog Tony Luck
2015-02-11 18:11       ` Tony Luck

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).