* Hardware Error Kernel Mini-Summit
@ 2010-05-17 18:23 Mauro Carvalho Chehab
  2010-05-17 22:41 ` Andi Kleen
                   ` (3 more replies)
  0 siblings, 4 replies; 108+ messages in thread
From: Mauro Carvalho Chehab @ 2010-05-17 18:23 UTC (permalink / raw)
  To: Linux Kernel Mailing List, bluesmoke-devel, Linux Edac Mailing List
  Cc: Thomas Gleixner, Ingo Molnar, Ben Woodard, Matt Domsch,
	Doug Thompson, Borislav Petkov, Tony Luck, Brent Young

During the last LF Collaboration Summit, we held a mini-summit [1]
intended to improve hardware error detection in the kernel, currently
provided by the MCE and EDAC subsystems.

The idea of this mini-summit came up after Thomas Gleixner and Ingo
Molnar suggested that EDAC and MCE should converge into a single error
subsystem.

I'm enclosing the minutes of the meeting so that they can be reviewed
by other kernel hackers who are interested in the topic but
unfortunately couldn't come to the meeting.

Btw, during the meeting, it was decided that the EDAC ML would work
better if moved to vger, so I'm copying here both the old and the new
EDAC mailing lists.

[1] http://events.linuxfoundation.org/lfcs2010/edac

---


	I Hardware Error Kernel Mini-Summit
	===================================
				April 15 - San Francisco, CA, US
				2010 Linux Foundation Collaboration Summit

Attendees:
	Ben Woodard   - Red Hat
	Brent Young   - Intel
	Doug Thompson - LLNL
	Mark Grondona - LLNL
	Matt Domsch   - Dell
	Mauro Chehab  - Red Hat
	Tony Luck     - Intel

After some initial description of the current state of error
handling in Linux, we moved to work on requirements and high
level design of the system going forward.

Requirements
============

First and foremost is that end-users (presumably system administrators)
need notification of the hardware component that is the source of each
error.  Ideally this should include "silk screen" markings so that the
user can identify which component is at fault.

Other requirements may vary amongst different types of end users, but
include:
+ Minimal disruption to system performance when logging corrected errors.
+ Assurance that h/w error detection mechanisms are correctly configured
  and enabled.
+ System topology information

With respect to system topology, it was pointed out that LLNL is concerned
about being sure that ECC is enabled, as some BIOSes lied about that in the
past. Also, memory topology information is needed to allow matching the silk
screen labels from the hardware with each reported DIMM.


Design - dividing the problem into logical layers
=================================================

It was agreed that a hardware error driver should be mapped onto these layers:

	[Userspace API]
	[core layer]
	[Low level drivers]

At the lowest level is the task of collecting error information from
hardware and/or firmware (LLNL calls this "harvesting").  There is
already wide diversity between architectures, and even between platforms
within the same architecture. So it makes sense for there to be some
low level, platform specific drivers that collect data in whatever
way they can - and present it to some "core" layer in the kernel
that will provide abstraction/uniformity to higher levels of the
software stack.
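
As an illustration only, the split between low level harvesters and the
core layer could be sketched in plain user-space C as below. Every name
here (struct hw_error, struct harvester, core_register, core_poll_all,
fake_poll, count_sink) is hypothetical and was not discussed at the
meeting; none of them is an existing kernel interface.

```c
#include <stddef.h>

/* Hypothetical, platform-neutral error record produced by a harvester. */
struct hw_error {
	int corrected;            /* 1 = corrected, 0 = uncorrected */
	unsigned long phys_addr;  /* address reported by the hardware */
};

/* Hypothetical low level driver: pulls errors from one platform source. */
struct harvester {
	const char *name;
	/* Returns 1 and fills *err if an error was pending, else 0. */
	int (*poll)(void *priv, struct hw_error *err);
	void *priv;
};

#define MAX_HARVESTERS 8

static struct harvester *harvesters[MAX_HARVESTERS];
static size_t nr_harvesters;

/* Core layer: register a platform specific harvester. 0 on success. */
static int core_register(struct harvester *h)
{
	if (h == NULL || h->poll == NULL || nr_harvesters >= MAX_HARVESTERS)
		return -1;
	harvesters[nr_harvesters++] = h;
	return 0;
}

/*
 * Core layer: drain every harvester and pass each record to one uniform
 * sink, which stands in for the userspace API. Returns records delivered.
 */
static size_t core_poll_all(void (*sink)(const char *source,
					 const struct hw_error *err))
{
	size_t delivered = 0;
	struct hw_error err;

	for (size_t i = 0; i < nr_harvesters; i++) {
		while (harvesters[i]->poll(harvesters[i]->priv, &err)) {
			sink(harvesters[i]->name, &err);
			delivered++;
		}
	}
	return delivered;
}

/* Example harvester: reports *priv pending corrected errors, then none. */
static int fake_poll(void *priv, struct hw_error *err)
{
	int *pending = priv;

	if (*pending <= 0)
		return 0;
	(*pending)--;
	err->corrected = 1;
	err->phys_addr = 0x1000;
	return 1;
}

/* Example sink: only counts how many records reached the upper layer. */
static size_t sink_calls;

static void count_sink(const char *source, const struct hw_error *err)
{
	(void)source;
	(void)err;
	sink_calls++;
}
```

The point of the sketch is that the sink never sees how a record was
harvested; an APEI-based source and a chipset register reader would both
sit behind the same poll callback.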

Matt Domsch pointed out that the 4.0a version of the ACPI specification
includes a chapter on "APEI" - features drawn from the WHEA (Windows
Hardware Error Architecture). These features are already implemented
by some BIOS vendors - and will become more widespread.  When this
is present (and determined to be correctly implemented) it makes
sense to use this for error harvesting. When it isn't (e.g. on
architectures not graced with ACPI support) a more traditional
memory controller driver that reads chipset registers can be used
instead.

In the middle layer - we just waved our hands a bit and said that there
was some generic core code.  Doug volunteered to re-factor other code
to create this.

The middle layer should provide ways to map some number of parameters to
a FRU (field replaceable unit). 
	f(a,b,c,d) => FRU
some examples could be:
        f( CPU socket, MC, channel, DIMM) => memory FRU
        f( phy addr) => memory FRU
        f( ? ) => processor FRU
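
A minimal sketch of the first form, f(CPU socket, MC, channel, DIMM) =>
memory FRU, assuming a simple static table. The struct names, the
function fru_lookup and the silk screen labels are all made up for
illustration; a real table would have to come from the platform.

```c
#include <stddef.h>

/* Hypothetical location tuple: the (a, b, c, d) in f(a,b,c,d) => FRU. */
struct mem_location {
	int socket;
	int mc;
	int channel;
	int dimm;
};

/* One table row: a location and the silk screen label on the board. */
struct fru_entry {
	struct mem_location loc;
	const char *label;
};

/* Made-up labels, for illustration only. */
static const struct fru_entry fru_table[] = {
	{ { 0, 0, 0, 0 }, "CPU1_DIMM_A1" },
	{ { 0, 0, 1, 0 }, "CPU1_DIMM_B1" },
	{ { 1, 0, 0, 0 }, "CPU2_DIMM_A1" },
};

/* f(socket, MC, channel, DIMM) => memory FRU label, or NULL if unknown. */
static const char *fru_lookup(const struct mem_location *loc)
{
	for (size_t i = 0; i < sizeof(fru_table) / sizeof(fru_table[0]); i++) {
		const struct mem_location *e = &fru_table[i].loc;

		if (e->socket == loc->socket && e->mc == loc->mc &&
		    e->channel == loc->channel && e->dimm == loc->dimm)
			return fru_table[i].label;
	}
	return NULL;
}
```

Returning NULL for an unknown tuple keeps "no label available" distinct
from a wrong label, which matters when the topology data is incomplete.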

The FRU is the most important thing from the customer or system
administrator's perspective. For example, they are not terribly
interested in the memory hierarchy or the way that the machine is put
together; that is really only interesting to hardware engineers. After
the machine is bought, when they have an error, they are most concerned
with finding the component that they need to replace to get the machine
fully operational again.

Some of the challenges that make this mapping difficult are:
     1. Mirrored memory
     2. Hot spare memory
     3. Logical vs. physical DIMMs
     4. Interleaved memory
     5. Complicated memory, like mainframes have.
     6. Uncertainty in where the error actually is. In other words, there
are times when the most that the hardware is able to know is that the
problem appears to be on DIMM[1-3]. How to portray that to the user is
one topic that was never resolved.
     7. Right now the memory controller is the only one that has this
information because it had to know this information when it set up the
memory controller and this information is currently only available to
BIOS writers.

The FRU needs to be broader than just DIMMs; it can be any field
replaceable component. Some examples beginning with
/sys/devices/system/EDAC/:
     * UP machine:         ...MC/DIMM[0-4]/CSROW
     * SMP machine:        ...MC/MC[x]/DIMM[0-4]/CSROW
     * Nehalem EP, AMD:    ...MC/MC[x]/CHAN[y]/DIMM
     * Nehalem EX:         NODE[z]/MC[x]/CHAN[y]/RISER[a]/DIMM

In each of these directories there will be at least two attributes,
named 'ce' and 'ue' (corrected and uncorrected error counts).
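
For a user-space consumer, reading one of these attributes could look
like the sketch below. It assumes each attribute file holds a single
unsigned decimal number, which the minutes do not specify; the function
name read_error_count is hypothetical, and the directory layout is the
proposal above rather than an existing kernel ABI.

```c
#include <stdio.h>

/*
 * Read an unsigned decimal counter from "<dir>/<attr>", e.g. attr = "ce".
 * Returns 0 on success and -1 on any error. The one-number-per-file
 * format is an assumption made for this sketch.
 */
static int read_error_count(const char *dir, const char *attr,
			    unsigned long *count)
{
	char path[512];
	FILE *f;
	int n;

	n = snprintf(path, sizeof(path), "%s/%s", dir, attr);
	if (n < 0 || (size_t)n >= sizeof(path))
		return -1;      /* path did not fit in the buffer */

	f = fopen(path, "r");
	if (f == NULL)
		return -1;

	n = fscanf(f, "%lu", count);
	fclose(f);
	return n == 1 ? 0 : -1;
}
```

A monitoring tool would call this once for 'ce' and once for 'ue' in
each FRU directory and compare the values against its previous sample.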

The upper layer presents error data to user-space.  The EDAC model of
a forest of /sys attribute files based on csrows within DIMMs is used to
provide both error counts for each object and topology information.

Andi Kleen's /dev/mcelog has met with strong opposition from Ingo Molnar
and Thomas Gleixner.  They suggested that the "performance event" mechanism
has already been extended to report numerous non-performance related
kernel events - and it would be a logical extension to include hardware
error events too.  LLNL representatives said they could code to any
reporting methodology.

Some discussion on how performance events are managed by the kernel
and the options available to user programs to register interest in
events followed.  One potential challenge is that the kernel hooks
that log events will silently drop events when there are no processes
registered to collect data.  This will require some cleverness to work
out how to log data from fatal errors that caused a system reboot - as
these must be discovered and cleared from hardware registers before any
userspace code is running.
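
One possible shape for that "cleverness", sketched in plain user-space
C: park events in a small fixed buffer while nobody is listening, and
replay them when the first listener registers. This was not decided at
the meeting; all names (struct err_event, log_event, register_consumer,
example_consumer) and the buffer size are hypothetical.

```c
#include <stddef.h>

struct err_event {
	unsigned long status;   /* raw value harvested from a register */
};

#define EARLY_SLOTS 16

static struct err_event early_buf[EARLY_SLOTS];
static size_t early_count;      /* events currently parked */
static size_t early_lost;       /* events dropped: buffer was full */

static void (*consumer)(const struct err_event *ev);

/* Logging hook: deliver now, or park the event until someone listens. */
static void log_event(const struct err_event *ev)
{
	if (consumer != NULL) {
		consumer(ev);
		return;
	}
	if (early_count < EARLY_SLOTS)
		early_buf[early_count++] = *ev;
	else
		early_lost++;
}

/* First listener registers: replay the parked events in arrival order. */
static void register_consumer(void (*fn)(const struct err_event *ev))
{
	consumer = fn;
	if (fn == NULL)
		return;
	for (size_t i = 0; i < early_count; i++)
		fn(&early_buf[i]);
	early_count = 0;
}

/* Example listener used to exercise the sketch: remembers what it saw. */
static size_t seen;
static unsigned long last_status;

static void example_consumer(const struct err_event *ev)
{
	seen++;
	last_status = ev->status;
}
```

The early_lost counter is kept so that a listener could at least be told
that events were dropped before it attached.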

Other notes
===========
The ACPI 4.0a specification also documents an error injection interface.
When supported in a BIOS (and, as usual, assuming it is correctly
implemented) this should allow for more widespread testing of the
error handling code.  It may provide some limited assurance that
hardware error detection features are enabled.

Both HPC (High Performance Computing) and FSA (Financial Services) users
have performance requirements that are intolerant of interruption by
SMI code running in the BIOS.  When these interruptions extend for
milliseconds, cluster performance suffers or trade executions miss
the profit window.  Going forward there are rules/recommendations
(I'm not sure how strict) that SMI interrupts should limit themselves
to 200 microseconds.  It is unclear whether SMIs that add this much
latency to the error harvesting overhead will cause noticeable problems
for either HPC or FSA customers. The lack of clarity is mostly because
h/w error logging is just one of many reasons why a processor may take
an SMI interrupt, and it is unknown whether hardware errors would make
up a significant percentage of total SMI events.

There is an immediate need for error reporting on NHM-EP class systems.
Mauro will work on cleaning up his EDAC code for these to be included
in some RHEL 5.x update.  It is less certain whether this will be
suitable for the 6.x series.

In the specific case of Nehalem-EX, it seems that the low level driver
won't be able to use direct access to the memory controller registers,
since the uncore now uses a register index/value pair to read or write
from the memory controller. The same pair is also used by BIOS to control
the hardware. With this design, race conditions between BIOS and the OS
may happen, so even reading data from the memory controller registers
is not possible. The driver will therefore need some logic to communicate
via BIOS, probably via ACPI 4.0 APEI.
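
The race can be shown with a small deterministic simulation in plain C.
This models a generic index/value register pair only; it is not the
Nehalem-EX register layout, and every name in it is hypothetical.

```c
#include <stdint.h>

#define NR_REGS 4

/* Simulated device: one shared index register and one data window. */
struct indexed_dev {
	uint32_t index;
	uint32_t regs[NR_REGS];
};

/* Select which register the data window exposes. */
static void write_index(struct indexed_dev *d, uint32_t idx)
{
	d->index = idx % NR_REGS;
}

/* Read whatever register is currently selected. */
static uint32_t read_data(const struct indexed_dev *d)
{
	return d->regs[d->index];
}

/* Uninterrupted two-step read: select the register, then read it. */
static uint32_t os_read(struct indexed_dev *d, uint32_t idx)
{
	write_index(d, idx);
	return read_data(d);
}

/*
 * The same read, but firmware (for example an SMI handler) reprograms
 * the index between the two steps. In this sketch the OS has no lock
 * it can share with firmware, so it silently gets the contents of the
 * wrong register.
 */
static uint32_t os_read_interrupted(struct indexed_dev *d, uint32_t idx,
				    uint32_t firmware_idx)
{
	write_index(d, idx);
	write_index(d, firmware_idx);   /* firmware runs here */
	return read_data(d);
}
```

Because the wrong value looks like any other register value, the OS
cannot detect the interleaving after the fact, which is why even reads
are unsafe.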

The Bluesmoke mailing list hosted at sourceforge has been overrun by
spammers.  Doug will talk to admins at vger.kernel.org to ask them to
host a new list there.

Next Steps
==========

It was agreed that the next steps will be:

1) Write a summary of the meeting - responsible: Ben Woodard;

2) After having the summary reviewed, produce an email to be sent to
LKML, in order to get upstream comments, especially from Thomas and
Ingo - responsible: Mauro Carvalho Chehab;

3) Write EDAC core changes - responsible: Doug Thompson (EDAC maintainer);

4) Port i7core_edac (Nehalem and Nehalem-EP) to the new EDAC core structs 
(the name of the driver will likely be changed, as it works not only with 
i7 core chips) - Responsible: Mauro Carvalho Chehab;

5) Write a driver for Nehalem-EX using the new EDAC core - Responsible: 
Tony Luck.


^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: Hardware Error Kernel Mini-Summit
  2010-05-17 18:23 Hardware Error Kernel Mini-Summit Mauro Carvalho Chehab
@ 2010-05-17 22:41 ` Andi Kleen
  2010-05-18 16:50   ` Mauro Carvalho Chehab
  2010-05-18  6:52   ` Hidetoshi Seto
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 108+ messages in thread
From: Andi Kleen @ 2010-05-17 22:41 UTC (permalink / raw)
  To: Mauro Carvalho Chehab
  Cc: Linux Kernel Mailing List, bluesmoke-devel,
	Linux Edac Mailing List, Thomas Gleixner, Ingo Molnar,
	Ben Woodard, Matt Domsch, Doug Thompson, Borislav Petkov,
	Tony Luck, Brent Young

Mauro Carvalho Chehab <mchehab@redhat.com> writes:
>
> There is an immediate need for error reporting on NHM-EP class
> systems.

Just for the innocent readers who might be misled by this:

Nehalem-EP DIMM error accounting already works fine today using
mcelog for most cases, including RHEL5.5 (with some limits) 
and RHEL6beta with no additional changes needed.

In RHEL6 the daemon does the accounting and the client reports the errors
separated for each DIMM and separated in uc and ce.  In RHEL5
the information is in a log file and can be gotten from there.

In addition the daemon supports various advanced RAS features including
predictive bad page offlining and various threshold triggers.

> In the specific case of Nehalem-EX, it seems that the low level driver
> won't be able to use direct access to the memory controller registers, 
> since the uncore now uses a register index/value pair to read or write 
> from the memory controller. The same pair is also used by BIOS to control 
> the hardware. With this design, race conditions between BIOS and the OS 
> may happen, So, even reading data from the Memory Controller registers 
> is not possible. So, it will need to use some logic to communicate via 
> BIOS, probably via ACPI 4.0 APEI.

Already done too, see
http://permalink.gmane.org/gmane.linux.acpi.devel/45743

However the interface won't give you the topology you're asking
for, just the errors.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.


* Re: Hardware Error Kernel Mini-Summit
  2010-05-17 18:23 Hardware Error Kernel Mini-Summit Mauro Carvalho Chehab
@ 2010-05-18  6:52   ` Hidetoshi Seto
  2010-05-18  6:52   ` Hidetoshi Seto
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 108+ messages in thread
From: Hidetoshi Seto @ 2010-05-18  6:52 UTC (permalink / raw)
  To: Mauro Carvalho Chehab
  Cc: Linux Kernel Mailing List, bluesmoke-devel,
	Linux Edac Mailing List, Thomas Gleixner, Ingo Molnar,
	Ben Woodard, Matt Domsch, Doug Thompson, Borislav Petkov,
	Tony Luck, Brent Young

(2010/05/18 3:23), Mauro Carvalho Chehab wrote:
> During the last LF Collaboration Summit, we've done a mini-summit [1],
> intended to improve the hardware error detection in kernel, currently
> provided by MCE and EDAC subsystems.
> 
> The idea of this mini-summit came up after Thomas Gleixner and Ingo
> Molnar suggestions that edac and mce should converge into an error
> subsystem.
> 
> I'm enclosing the minutes of the meeting, in order to allow it to be
> reviewed by other kernel hackers that are interested on the theme but
> unfortunately couldn't come to the meeting.
> 
> Btw, during the meeting, it were decided that EDAC ML could better work
> if moved to vger, so I'm copying here both the old and the new edac
> mailing lists.
> 
> [1] http://events.linuxfoundation.org/lfcs2010/edac
> 
> ---

Thank you very much for providing this report.

I agree that we should have a well organized error subsystem that
covers all error sources in the system and that provides a sufficiently
simple and powerful API for users. As one of the interested absentees,
I think I could be of some help to you (e.g. x86 low level).

It might be off-topic here, but I'd like to point out that you missed
the presence of the PCIe AER subsystem that handles hardware errors on
PCIe devices nowadays (it works well on ppc, x86 and so on).
Given that APEI also covers PCIe errors and that some systems can
map MC registers to PCI configuration space, I think there is no
way for the new error subsystem to ignore I/O device errors while
it handles errors on CPU/memory and cooperates with APEI.


Thanks,
H.Seto



* Re: Hardware Error Kernel Mini-Summit
  2010-05-17 18:23 Hardware Error Kernel Mini-Summit Mauro Carvalho Chehab
@ 2010-05-18 13:06   ` Borislav Petkov
  2010-05-18  6:52   ` Hidetoshi Seto
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 108+ messages in thread
From: Borislav Petkov @ 2010-05-18 13:06 UTC (permalink / raw)
  To: Mauro Carvalho Chehab
  Cc: Linux Kernel Mailing List, bluesmoke-devel,
	Linux Edac Mailing List, Tony Luck, Brent Young, Ingo Molnar,
	Matt Domsch, Doug Thompson, Thomas Gleixner

From: Mauro Carvalho Chehab <mchehab@redhat.com>
Date: Mon, May 17, 2010 at 02:23:17PM -0400

> Btw, during the meeting, it were decided that EDAC ML could better work
> if moved to vger, so I'm copying here both the old and the new edac
> mailing lists.

Have the old list subscribers been moved to the new list or do they
need to re-subscribe?

Thanks.

-- 
Regards/Gruss,
Boris.

Operating Systems Research Center
Advanced Micro Devices, Inc.


* Re: Hardware Error Kernel Mini-Summit
  2010-05-18  6:52   ` Hidetoshi Seto
@ 2010-05-18 16:44     ` Mauro Carvalho Chehab
  -1 siblings, 0 replies; 108+ messages in thread
From: Mauro Carvalho Chehab @ 2010-05-18 16:44 UTC (permalink / raw)
  To: Hidetoshi Seto
  Cc: Linux Kernel Mailing List, bluesmoke-devel,
	Linux Edac Mailing List, Thomas Gleixner, Ingo Molnar,
	Ben Woodard, Matt Domsch, Doug Thompson, Borislav Petkov,
	Tony Luck, Brent Young

Hidetoshi Seto wrote:
> (2010/05/18 3:23), Mauro Carvalho Chehab wrote:
>> During the last LF Collaboration Summit, we've done a mini-summit [1],
>> intended to improve the hardware error detection in kernel, currently
>> provided by MCE and EDAC subsystems.
>>
>> The idea of this mini-summit came up after Thomas Gleixner and Ingo
>> Molnar suggestions that edac and mce should converge into an error
>> subsystem.
>>
>> I'm enclosing the minutes of the meeting, in order to allow it to be
>> reviewed by other kernel hackers that are interested on the theme but
>> unfortunately couldn't come to the meeting.
>>
>> Btw, during the meeting, it were decided that EDAC ML could better work
>> if moved to vger, so I'm copying here both the old and the new edac
>> mailing lists.
>>
>> [1] http://events.linuxfoundation.org/lfcs2010/edac
>>
>> ---
> 
> Thank you very much for providing this report.
> 
> I agree that we should have a well organized error subsystem that
> covers all error sources in the system and that provides enough
> simple and powerful API for users. As one of interested absentee,
> I think I could be of some help to you (e.g. x86 low level).

Thank you for your offer. Any help is welcome.
>
> It might be off-topic here, but I'd like to point that you missed
> the presence of PCIe AER subsystem that handle hardware errors on
> PCIe devices nowadays (It works well on ppc, x86 and so on).
> Given that APEI also covers PCIe errors and that some system can
> map MC registers to PCI configuration space, I think there is no
> way for the new error subsystem to ignore I/O device errors while
> it care errors on CPU/memory and cooperate with APEI.

Yes, it makes sense to also integrate the PCIe AER subsystem. IMO, the first
step is to provide an error core integrated with perf, and then start
integrating the several error systems around it.

-- 

Cheers,
Mauro


* Re: Hardware Error Kernel Mini-Summit
  2010-05-17 22:41 ` Andi Kleen
@ 2010-05-18 16:50   ` Mauro Carvalho Chehab
  2010-05-18 18:10       ` Andi Kleen
  0 siblings, 1 reply; 108+ messages in thread
From: Mauro Carvalho Chehab @ 2010-05-18 16:50 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Linux Kernel Mailing List, bluesmoke-devel,
	Linux Edac Mailing List, Thomas Gleixner, Ingo Molnar,
	Ben Woodard, Matt Domsch, Doug Thompson, Borislav Petkov,
	Tony Luck, Brent Young

Andi Kleen wrote:
> Mauro Carvalho Chehab <mchehab@redhat.com> writes:
>> There is an immediate need for error reporting on NHM-EP class
>> systems.
> 
> Just for the innocent readers who might be mislead by this:
> 
> Nehalem-EP DIMM error accounting already works fine today using
> mcelog for most cases, including RHEL5.5 (with some limits) 
> and RHEL6beta with no additional changes needed.
> 
> In RHEL6 the daemon does the accounting and the client reports the errors
> separated for each DIMM and separated in uc and ce.  In RHEL5
> the information is in a log file and can be gotten from there.
> 
> In addition the daemon supports various advanced RAS features including
> predictive bad page offlining and various threshold triggers.

Ok. It should be clear that the main target of the mini-summit is to define
how the several subsystems will integrate into a hardware-abstracted way
to report errors from the kernel. So, we're looking at the next steps to improve
what we currently have, and to avoid having more than one subsystem
trying to get the same info, possibly using the same registers, but providing
different interfaces to userspace.

-- 

Cheers,
Mauro


* Re: Hardware Error Kernel Mini-Summit
  2010-05-18 13:06   ` Borislav Petkov
@ 2010-05-18 16:52     ` Mauro Carvalho Chehab
  -1 siblings, 0 replies; 108+ messages in thread
From: Mauro Carvalho Chehab @ 2010-05-18 16:52 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Linux Kernel Mailing List, bluesmoke-devel,
	Linux Edac Mailing List, Tony Luck, Brent Young, Ingo Molnar,
	Matt Domsch, Doug Thompson, Thomas Gleixner

Borislav Petkov wrote:
> From: Mauro Carvalho Chehab <mchehab@redhat.com>
> Date: Mon, May 17, 2010 at 02:23:17PM -0400
> 
>> Btw, during the meeting, it were decided that EDAC ML could better work
>> if moved to vger, so I'm copying here both the old and the new edac
>> mailing lists.
> 
> Have the old list subscribers been moved to the new list or do they
> need to re-subscribe?

I suspect that you'll need to subscribe on the new ML.

-- 

Cheers,
Mauro


* Re: Hardware Error Kernel Mini-Summit
  2010-05-17 18:23 Hardware Error Kernel Mini-Summit Mauro Carvalho Chehab
@ 2010-05-18 17:06   ` Mauro Carvalho Chehab
  2010-05-18  6:52   ` Hidetoshi Seto
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 108+ messages in thread
From: Mauro Carvalho Chehab @ 2010-05-18 17:06 UTC (permalink / raw)
  To: Linux Kernel Mailing List, bluesmoke-devel, Linux Edac Mailing List
  Cc: Thomas Gleixner, Ingo Molnar, Ben Woodard, Matt Domsch,
	Doug Thompson, Borislav Petkov, Tony Luck, Brent Young

The current i7core_edac driver is ready for merge upstream, using the current
edac API. It supports the following processor families:
	i7core, i5core, Lynnfield, Nehalem, Nehalem-EP and Westmere-EP

The tree is available at:

git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/i7core.git linux_next

This driver doesn't support Nehalem-EX. From the discussions we had during
the mini-summit, the MCU of the -EX family is very different, so a separate
driver will be required for it.

Please review. My plan is to submit this driver for upstream merge this week.

-- 

Cheers,
Mauro


* Re: Hardware Error Kernel Mini-Summit
  2010-05-18 16:44     ` Mauro Carvalho Chehab
  (?)
@ 2010-05-18 17:42     ` Joe Perches
  2010-05-18 17:59       ` Mauro Carvalho Chehab
                         ` (2 more replies)
  -1 siblings, 3 replies; 108+ messages in thread
From: Joe Perches @ 2010-05-18 17:42 UTC (permalink / raw)
  To: Mauro Carvalho Chehab
  Cc: Hidetoshi Seto, Linux Kernel Mailing List, bluesmoke-devel,
	Linux Edac Mailing List, Thomas Gleixner, Ingo Molnar,
	Ben Woodard, Matt Domsch, Doug Thompson, Borislav Petkov,
	Tony Luck, Brent Young

On Tue, 2010-05-18 at 13:44 -0300, Mauro Carvalho Chehab wrote:
> IMO, the first
> step is to provide an error core integrated to perf, and then start
> integrating the several error systems around it.

Why integrated to perf?



* Re: Hardware Error Kernel Mini-Summit
  2010-05-18 17:42     ` Joe Perches
@ 2010-05-18 17:59       ` Mauro Carvalho Chehab
  2010-05-18 18:45       ` Andi Kleen
  2010-05-18 18:53       ` Ingo Molnar
  2 siblings, 0 replies; 108+ messages in thread
From: Mauro Carvalho Chehab @ 2010-05-18 17:59 UTC (permalink / raw)
  To: Joe Perches
  Cc: Hidetoshi Seto, Linux Kernel Mailing List, bluesmoke-devel,
	Linux Edac Mailing List, Thomas Gleixner, Ingo Molnar,
	Ben Woodard, Matt Domsch, Doug Thompson, Borislav Petkov,
	Tony Luck, Brent Young

Joe Perches wrote:
> On Tue, 2010-05-18 at 13:44 -0300, Mauro Carvalho Chehab wrote:
>> IMO, the first
>> step is to provide an error core integrated to perf, and then start
>> integrating the several error systems around it.
> 
> Why integrated to perf?

That's the original plan. It was suggested by Ingo and Thomas on LKML. Borislav
also sent a more technical proposal about it.

It actually makes sense, since some sorts of errors may affect performance
(those non-fatal errors that are auto-recovered). Also, using debugfs and the
same kind of logic used by perf to filter errors seems pertinent for hardware
errors. As the actual patches have not been written yet, the details on how
those things will integrate will depend on further analysis.

-- 

Cheers,
Mauro


* Re: Hardware Error Kernel Mini-Summit
  2010-05-18 16:50   ` Mauro Carvalho Chehab
@ 2010-05-18 18:10       ` Andi Kleen
  0 siblings, 0 replies; 108+ messages in thread
From: Andi Kleen @ 2010-05-18 18:10 UTC (permalink / raw)
  To: Mauro Carvalho Chehab
  Cc: Andi Kleen, Linux Kernel Mailing List, bluesmoke-devel,
	Linux Edac Mailing List, Thomas Gleixner, Ingo Molnar,
	Ben Woodard, Matt Domsch, Doug Thompson, Borislav Petkov,
	Tony Luck, Brent Young

On Tue, May 18, 2010 at 01:50:36PM -0300, Mauro Carvalho Chehab wrote:
> Ok. It should be clear that the main target of the mini-summit is to define
> how the several subsystems will integrate into a hardware-abstracted way
> to report errors from kernel. So, we're looking on the next steps to improve
> what we currently have, and avoid to have more than one different subsystem
> trying to get the same info, eventually using the same registers, but providing
> different interfaces to userspace.

Well there are different use cases.

mcelog mainly deals in thresholds (including fancy ones like
per-page and per-object thresholds), events, and actions on thresholds
(= more events). All your proposals currently deal with object counts.

It does per-object counting too, but only incidentally.

I suspect there are use cases for both, although I personally suspect
for most people events, thresholds and their actions are the most useful
thing to handle by default. But one size doesn't fit all.

Anyway, it boils down to needing different interfaces for different things.

For example, there will always be events versus accounting.

You can synthesize accounting from events (that is what mcelog
does today). The other way round does not work so well unfortunately,
or at least would be rather inefficient.
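A minimal sketch of that asymmetry, not taken from mcelog itself: the
ErrorEvent record, its fields and the account() helper are hypothetical
names chosen for illustration. Folding an event stream into counters is a
single pass over the events:

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorEvent:
    # Hypothetical event record; the field names are assumptions.
    timestamp: float   # seconds since boot
    component: str     # e.g. a DIMM label
    kind: str          # "corrected" or "uncorrected"


def account(events):
    """Fold an event stream into per-(component, kind) counters."""
    return Counter((e.component, e.kind) for e in events)


events = [
    ErrorEvent(1.0, "DIMM_A1", "corrected"),
    ErrorEvent(2.5, "DIMM_A1", "corrected"),
    ErrorEvent(3.0, "DIMM_B2", "uncorrected"),
]
counts = account(events)
```

The counters no longer carry timestamps or ordering, so the individual
events cannot be rebuilt from them; a consumer would have to poll the
counters and diff them to approximate an event stream.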

Also, large parts of the actions can only usefully be done in user space, so
you need a user space component.

I am somewhat biased of course, but I think mcelog is doing a reasonably
good job today at being this user space component. It definitely
has areas that could be improved too, but a lot of the basics
are there and doing OK.

In principle mcelog could feed from another API too, but it would
definitely prefer not to have to poll it or to parse
printks.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: Hardware Error Kernel Mini-Summit
  2010-05-18 17:42     ` Joe Perches
  2010-05-18 17:59       ` Mauro Carvalho Chehab
@ 2010-05-18 18:45       ` Andi Kleen
  2010-05-18 18:57         ` Joe Perches
  2010-05-18 18:53       ` Ingo Molnar
  2 siblings, 1 reply; 108+ messages in thread
From: Andi Kleen @ 2010-05-18 18:45 UTC (permalink / raw)
  To: Joe Perches
  Cc: Mauro Carvalho Chehab, Hidetoshi Seto, Linux Kernel Mailing List,
	bluesmoke-devel, Linux Edac Mailing List, Thomas Gleixner,
	Ingo Molnar, Ben Woodard, Matt Domsch, Doug Thompson,
	Borislav Petkov, Tony Luck, Brent Young

Joe Perches <joe@perches.com> writes:

> On Tue, 2010-05-18 at 13:44 -0300, Mauro Carvalho Chehab wrote:
>> IMO, the first
>> step is to provide an error core integrated to perf, and then start
>> integrating the several error systems around it.
>
> Why integrated to perf?

For a different perspective on this see also

http://permalink.gmane.org/gmane.linux.kernel/952061

AFAIK all the issues mentioned there are still open.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: Hardware Error Kernel Mini-Summit
  2010-05-18 17:42     ` Joe Perches
  2010-05-18 17:59       ` Mauro Carvalho Chehab
  2010-05-18 18:45       ` Andi Kleen
@ 2010-05-18 18:53       ` Ingo Molnar
  2010-05-18 19:08           ` Luck, Tony
  2 siblings, 1 reply; 108+ messages in thread
From: Ingo Molnar @ 2010-05-18 18:53 UTC (permalink / raw)
  To: Joe Perches
  Cc: Mauro Carvalho Chehab, Hidetoshi Seto, Linux Kernel Mailing List,
	bluesmoke-devel, Linux Edac Mailing List, Thomas Gleixner,
	Ingo Molnar, Ben Woodard, Matt Domsch, Doug Thompson,
	Borislav Petkov, Tony Luck, Brent Young


* Joe Perches <joe@perches.com> wrote:

> On Tue, 2010-05-18 at 13:44 -0300, Mauro Carvalho Chehab wrote:
> > IMO, the first
> > step is to provide an error core integrated to perf, and then start
> > integrating the several error systems around it.
> 
> Why integrated to perf?

It makes sense to use the kernel's performance events 
logging framework when we are logging events about how the 
system performs.

Furthermore it's NMI safe, offers structured logging, has 
various streaming, multiplexing and filtering capabilities 
that come handy for RAS purposes and more.

The other option would be to use an ad-hoc logging
implementation, only used for EDAC/RAS, which couldn't be
mixed with other system events. That approach has various
obvious disadvantages, so we are aiming for a unified
approach.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: Hardware Error Kernel Mini-Summit
  2010-05-18 18:45       ` Andi Kleen
@ 2010-05-18 18:57         ` Joe Perches
  0 siblings, 0 replies; 108+ messages in thread
From: Joe Perches @ 2010-05-18 18:57 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Mauro Carvalho Chehab, Hidetoshi Seto, Linux Kernel Mailing List,
	bluesmoke-devel, Linux Edac Mailing List, Thomas Gleixner,
	Ingo Molnar, Ben Woodard, Matt Domsch, Doug Thompson,
	Borislav Petkov, Tony Luck, Brent Young

On Tue, 2010-05-18 at 20:45 +0200, Andi Kleen wrote:
> Joe Perches <joe@perches.com> writes:
> > On Tue, 2010-05-18 at 13:44 -0300, Mauro Carvalho Chehab wrote:
> >> IMO, the first
> >> step is to provide an error core integrated to perf, and then start
> >> integrating the several error systems around it.
> > Why integrated to perf?
> For a different perspective on this see also
> http://permalink.gmane.org/gmane.linux.kernel/952061
> 
> AFAIK all the issues mentioned there are still open.

Yup, that's why I asked for explanation.

What was offered lacks useful goal and design detail.

perf is a cool tool, but probably not necessary
for a HW error reporting system.



^ permalink raw reply	[flat|nested] 108+ messages in thread

* RE: Hardware Error Kernel Mini-Summit
  2010-05-18 18:53       ` Ingo Molnar
@ 2010-05-18 19:08           ` Luck, Tony
  0 siblings, 0 replies; 108+ messages in thread
From: Luck, Tony @ 2010-05-18 19:08 UTC (permalink / raw)
  To: Ingo Molnar, Joe Perches
  Cc: Mauro Carvalho Chehab, Hidetoshi Seto, Linux Kernel Mailing List,
	bluesmoke-devel, Linux Edac Mailing List, Thomas Gleixner,
	Ingo Molnar, Ben Woodard, Matt Domsch, Doug Thompson,
	Borislav Petkov, Young, Brent

> It makes sense to use the kernel's performance events 
> logging framework when we are logging events about how the 
> system performs.

Perhaps it makes more sense to say that the Linux "performance
events logging framework" has become more generic and is really
now an "event logging framework".

> Furthermore it's NMI safe, offers structured logging, has 
> various streaming, multiplexing and filtering capabilities 
> that come handy for RAS purposes and more.

Those of us present at the mini-summit were not familiar with
all the features available. One area of concern was how to be
sure that something is in fact listening to and logging the
error events.  My understanding is that if there is no process
attached to an event, the kernel will just drop it.  This is
of particular concern because the kernel's first scan of the
machine check banks occurs before there are any processes.
So errors found early in boot (which might be saved fatal
errors from before the boot) might be lost.

-Tony

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: Hardware Error Kernel Mini-Summit
  2010-05-18 19:08           ` Luck, Tony
@ 2010-05-18 19:18             ` Borislav Petkov
  -1 siblings, 0 replies; 108+ messages in thread
From: Borislav Petkov @ 2010-05-18 19:18 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Ingo Molnar, Joe Perches, Mauro Carvalho Chehab, Hidetoshi Seto,
	Linux Kernel Mailing List, bluesmoke-devel,
	Linux Edac Mailing List, Thomas Gleixner, Ingo Molnar,
	Ben Woodard, Matt Domsch, Doug Thompson, Young, Brent

From: "Luck, Tony" <tony.luck@intel.com>
Date: Tue, May 18, 2010 at 03:08:58PM -0400

> > It makes sense to use the kernel's performance events 
> > logging framework when we are logging events about how the 
> > system performs.
> 
> Perhaps it makes more sense to say that the Linux "performance
> events logging framework" has become more generic and is really
> now an "event logging framework".

Yep, that's the idea.

> > Furthermore it's NMI safe, offers structured logging, has 
> > various streaming, multiplexing and filtering capabilities 
> > that come handy for RAS purposes and more.
> 
> Those of us present at the mini-summit were not familiar with
> all the features available. One area of concern was how to be
> sure that something is in fact listening to and logging the
> error events.  My understanding is that if there is no process
> attached to an event, the kernel will just drop it.  This is
> of particular concern because the kernel's first scan of the
> machine check banks occurs before there are any processes.
> So errors found early in boot (which might be saved fatal
> errors from before the boot) might be lost.

Well, we have a trace_mce_record tracepoint in the mcheck code which
calls all the necessary callbacks when an mcheck occurs. For the time
being, the idea is to use the mce.c ring buffer for early mchecks and
copy them to the regular ftrace per-cpu buffer after the latter has been
initialized. Later, we could switch to another early bootmem buffer if
there's a need to.

Also, we want to have a userspace daemon that reads out the mces from
the trace buffer and does further processing like thresholding etc in
userspace.
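As a rough illustration of the thresholding step such a daemon might do
(a sketch only; the Threshold class, its parameters and the "page:0x1234"
key are hypothetical, and timestamps are assumed to arrive in
non-decreasing order):

```python
from collections import defaultdict, deque


class Threshold:
    """Fire when `limit` events for one key fall within `window` seconds."""

    def __init__(self, limit, window):
        self.limit = limit
        self.window = window
        self.recent = defaultdict(deque)  # key -> timestamps still in window

    def record(self, key, now):
        q = self.recent[key]
        q.append(now)
        # Forget events that have fallen out of the sliding window.
        while q and now - q[0] > self.window:
            q.popleft()
        return len(q) >= self.limit


t = Threshold(limit=3, window=60.0)
fired = [t.record("page:0x1234", ts) for ts in (0.0, 10.0, 20.0, 200.0)]
```

Here the third event trips the threshold, while the fourth arrives long
after the earlier ones have aged out and does not.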

Concerning critical errors, there we bypass the perf subsystem and
execute the smallest amount of code possible while trying to shut down
gracefully if the error type allows that.

These are the rough ideas at least...

-- 
Regards/Gruss,
Boris.

Operating Systems Research Center
Advanced Micro Devices, Inc.

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: Hardware Error Kernel Mini-Summit
  2010-05-18 19:08           ` Luck, Tony
@ 2010-05-18 19:30             ` Ingo Molnar
  -1 siblings, 0 replies; 108+ messages in thread
From: Ingo Molnar @ 2010-05-18 19:30 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Joe Perches, Mauro Carvalho Chehab, Hidetoshi Seto,
	Linux Kernel Mailing List, bluesmoke-devel,
	Linux Edac Mailing List, Thomas Gleixner, Ingo Molnar,
	Ben Woodard, Matt Domsch, Doug Thompson, Borislav Petkov, Young,
	Brent


* Luck, Tony <tony.luck@intel.com> wrote:

> > It makes sense to use the kernel's performance events 
> > logging framework when we are logging events about how the 
> > system performs.
> 
> Perhaps it makes more sense to say that the Linux 
> "performance events logging framework" has become more 
> generic and is really now an "event logging framework".

Yeah, it essentially is.

> > Furthermore it's NMI safe, offers structured logging, 
> > has various streaming, multiplexing and filtering 
> > capabilities that come handy for RAS purposes and 
> > more.
> 
> Those of us present at the mini-summit were not familiar 
> with all the features available. One area of concern was 
> how to be sure that something is in fact listening to 
> and logging the error events.  My understanding is that 
> if there is no process attached to an event, the kernel 
> will just drop it.  This is of particular concern 
> because the kernel's first scan of the machine check 
> banks occurs before there are any processes. So errors 
> found early in boot (which might be saved fatal errors 
> from before the boot) might be lost.

I proposed a (fairly straightforward) extension to which
Boris agreed: we can introduce 'persistent events', which
have task-less buffers attached to them, which will hold
(a configurable number of) events.

Those can then be picked up by a task later on and no 
event is lost.
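A toy model of the idea, not the proposed kernel interface: the
PersistentEvent class and its emit()/drain() methods are hypothetical, and
the choice to drop (and count) the newest record once the buffer is full
is an assumption made for the sketch.

```python
from collections import deque


class PersistentEvent:
    """Bounded buffer that collects records while no consumer is attached."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.buf = deque()
        self.dropped = 0   # records refused because the buffer was full

    def emit(self, record):
        if len(self.buf) >= self.capacity:
            self.dropped += 1
            return False
        self.buf.append(record)
        return True

    def drain(self):
        """Hand all buffered records to a late-attaching consumer."""
        out = list(self.buf)
        self.buf.clear()
        return out


boot_errors = PersistentEvent(capacity=2)
for rec in ("early-mce-0", "early-mce-1", "early-mce-2"):
    boot_errors.emit(rec)
picked_up = boot_errors.drain()
```

With a capacity of two, the first two early records are still there when
a task finally attaches, and the overflow counter shows that one was
refused, so "no event is lost" holds only while the configured size is
large enough.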

Would such a feature address your concern?

It would be useful not just for reliable error event
collection; it could also be used for things like the boot
tracer (which also deals with events that occur before
there are any user-space tasks to pick up events).

I.e. it fits into the whole scheme in a pretty natural, 
multi-purpose way.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: Hardware Error Kernel Mini-Summit
  2010-05-18 19:18             ` Borislav Petkov
@ 2010-05-18 19:34               ` Ingo Molnar
  -1 siblings, 0 replies; 108+ messages in thread
From: Ingo Molnar @ 2010-05-18 19:34 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Luck, Tony, Joe Perches, Mauro Carvalho Chehab, Hidetoshi Seto,
	Linux Kernel Mailing List, bluesmoke-devel,
	Linux Edac Mailing List, Thomas Gleixner, Ingo Molnar,
	Ben Woodard, Matt Domsch, Doug Thompson, Young, Brent


* Borislav Petkov <bp@amd64.org> wrote:

> Well, we have a trace_mce_record tracepoint in the 
> mcheck code which calls all the necessary callbacks when 
> an mcheck occurs. For the time being, the idea is to use 
> the mce.c ring buffer for early mchecks and copy them to 
> the regular ftrace per-cpu buffer after the latter has
> been initialized. Later, we could switch to another
> early bootmem buffer if there's a need to.

The end result would be even simpler by one more step:
with persistent events we just use them and don't need the
mce.c ring buffer at all. (Getting rid of that complication
is one of the code cleanliness benefits I see in this move
as an x86 maintainer - beyond the obvious generalization
and unification benefits.)

> Also, we want to have a userspace daemon that reads out 
> the mces from the trace buffer and does further 
> processing like thresholding etc in userspace.
> 
> Concerning critical errors, there we bypass the perf 
> subsystem and execute the smallest amount of code 
> possible while trying to shutdown gracefully if the 
> error type allows that.

Yeah. Each perf_event can have arbitrary callbacks with 
add-on (or critical) functionality. We would activate the 
event(s) during bootup and it would do its thing from that 
point on: critical functionality gets a direct path via 
the callback, and every other event that survives goes via 
the regular perf output channels, to one (or more) 
consumers/subscribers of these events.
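The split between the direct path and the regular output path can be
modelled roughly as follows. This is a sketch only, not the perf_event
API: the Event class, its callbacks and subscribers lists, and the
"severity" field are all hypothetical.

```python
class Event:
    """Toy event with inline callbacks and buffered subscribers."""

    def __init__(self, name):
        self.name = name
        self.callbacks = []     # run first, in the context of the error
        self.subscribers = []   # plain lists standing in for output buffers

    def fire(self, record):
        for cb in self.callbacks:
            if cb(record):      # a callback that returns True claims the record
                return "handled-inline"
        for sub in self.subscribers:
            sub.append(record)  # everything that survives is queued
        return "queued"


def critical_path(record):
    # Assumption: only fatal errors take the direct path.
    return record["severity"] == "fatal"


log = []
mce = Event("mce")
mce.callbacks.append(critical_path)
mce.subscribers.append(log)
r1 = mce.fire({"severity": "corrected", "bank": 4})
r2 = mce.fire({"severity": "fatal", "bank": 1})
```

The corrected error ends up in the subscriber's buffer, while the fatal
one is dealt with by the callback and never reaches the output channel.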

	Ingo

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: Hardware Error Kernel Mini-Summit
  2010-05-18 19:30             ` Ingo Molnar
  (?)
@ 2010-05-18 20:42             ` Ingo Molnar
  2010-05-18 21:37               ` Tony Luck
  -1 siblings, 1 reply; 108+ messages in thread
From: Ingo Molnar @ 2010-05-18 20:42 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Joe Perches, Mauro Carvalho Chehab, Hidetoshi Seto,
	Linux Kernel Mailing List, bluesmoke-devel,
	Linux Edac Mailing List, Thomas Gleixner, Ingo Molnar,
	Ben Woodard, Matt Domsch, Doug Thompson, Borislav Petkov, Young,
	Brent, Peter Zijlstra, Frédéric Weisbecker,
	Arnaldo Carvalho de Melo


* Ingo Molnar <mingo@elte.hu> wrote:

> > > Furthermore it's NMI safe, offers structured 
> > > logging, has various streaming, multiplexing and 
> > > filtering capabilities that come handy for RAS 
> > > purposes and more.
> > 
> > Those of us present at the mini-summit were not 
> > familiar with all the features available. One area of 
> > concern was how to be sure that something is in fact 
> > listening to and logging the error events.  My 
> > understanding is that if there is no process attached 
> > to an event, the kernel will just drop it.  This is of 
> > particular concern because the kernel's first scan of 
> > the machine check banks occurs before there are any 
> > processes. So errors found early in boot (which might 
> > be saved fatal errors from before the boot) might be 
> > lost.
> 
> I proposed a (fairly straightforward) extension to which 
> Boris agreed: we can introduce 'persistent events', 
> which have task-less buffers attached to them, which 
> will hold (a configurable number of) events.
> 
> Those can then be picked up by a task later on and no 
> event is lost.
> 
> Would such a feature address your concern?

Tony, should we accelerate the development of this 
persistent events sub-feature?

Boris posted initial patches of the new perf events based 
EDAC/MCE/RAS design direction to lkml and indicated that 
it works for him. He also indicated that he can do the 
initial work of unifying EDAC and MCE without the 
persistent events feature for now. (this all is obviously 
v2.6.36-ish material)

But if it's important, if you'd like to move ahead with 
the unification swiftly then we can certainly increase its 
priority.

Also, a few notes:

1) the new RAS tool itself might or might not be part of 
tools/perf/ - for the prototype it certainly makes sense 
to be there but otherwise feel free to start tools/ras/ 
and share code with tools/perf/ but otherwise keep a 
separate RAS tool-space.

2) There's a new perf feature (that went upstream today)
that is of EDAC/RAS interest: the ability to do live
tracing. This is basically the daemon-like,
event->policy-action based flow that RAS eventing is
about.

3) Another new perf feature of interest is 'perf inject' 
(this too went upstream today): to inject artificial 
events into the stream of events. This mechanism could be 
used to simulate rare error conditions and to test out 
policy reactions systematically - an important part of 
system error recovery testing.

4) We are working on enumerating events via sysfs, not via 
debugfs. This would make the events provided by EDAC/MCE 
more generally available. See Lin Ming's patches on lkml:

  Subject: [RFC][PATCH v2 06/11] perf: core, export pmus via sysfs

Please chime in on that thread to make sure the event_source
class is suitable to describe EDAC/MCE event sources as
well. Any event_source that is made available by drivers
can then be used by tools for event transport.

This gives us a broad platform to add various RAS events
as well, beyond raw hardware events: we could for example
add events for various system anomalies such as lockup
messages, kernel warnings/oopses, IOMMU exceptions - maybe
even pure software concepts such as fatal segmentation
fault events, etc.

That way the RAS daemon could build and utilize a complete 
and coherent set of events it wants to subscribe to - all 
via the same event transport mechanism. It would thus have 
a comprehensive 'system health' view, via a single, 
reliable mechanism, and could act in a wide range of 
scenarios, with a wide range of policy actions, based on a 
very complete picture.
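To make the event->policy-action flow concrete, here is a small sketch of
such a daemon's core loop. Everything in it is hypothetical: the
HealthDaemon class, the event type names and the actions are placeholders,
and the events are fed in by hand, much as injected test events would be.

```python
class HealthDaemon:
    """Maps subscribed event types to policy actions."""

    def __init__(self):
        self.policies = {}        # event type -> list of action callables
        self.actions_taken = []   # human-readable record of what was done

    def subscribe(self, event_type, action):
        self.policies.setdefault(event_type, []).append(action)

    def handle(self, event_type, payload):
        # Events nobody subscribed to are simply ignored.
        for action in self.policies.get(event_type, []):
            self.actions_taken.append(action(payload))


daemon = HealthDaemon()
daemon.subscribe("mce", lambda p: f"offline page {p['page']:#x}")
daemon.subscribe("oops", lambda p: f"notify admin: {p['msg']}")

synthetic = [("mce", {"page": 0x1234}), ("oops", {"msg": "BUG"}), ("iommu", {})]
for etype, payload in synthetic:
    daemon.handle(etype, payload)
```

Because hardware and software events arrive through the same handle()
path, one set of policies can be exercised with synthetic events before
it is trusted with real ones.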

Getting all those features will certainly take time and
effort, but this is the big picture the whole idea
leads us to: a genuinely more capable, more generic and
more flexible RAS implementation for Linux.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: Hardware Error Kernel Mini-Summit
  2010-05-18 20:42             ` Ingo Molnar
@ 2010-05-18 21:37               ` Tony Luck
  2010-05-18 22:00                 ` Ingo Molnar
  2010-05-19  6:39                 ` Ingo Molnar
  0 siblings, 2 replies; 108+ messages in thread
From: Tony Luck @ 2010-05-18 21:37 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Joe Perches, Mauro Carvalho Chehab, Hidetoshi Seto,
	Linux Kernel Mailing List, bluesmoke-devel,
	Linux Edac Mailing List, Thomas Gleixner, Ingo Molnar,
	Ben Woodard, Matt Domsch, Doug Thompson, Borislav Petkov, Young,
	Brent, Peter Zijlstra, Frédéric Weisbecker,
	Arnaldo Carvalho de Melo

On Tue, May 18, 2010 at 1:42 PM, Ingo Molnar <mingo@elte.hu> wrote:
>> I proposed a (fairly straightforward) extension to which
>> Boris agreed: we can introduce 'persistent events',
>> which have task-less buffers attached to them, which
>> will hold (a configurable number of) events.
>>
>> Those can then be picked up by a task later on and no
>> event is lost.
>>
>> Would such a feature address your concern?
>
> Tony, should we accelerate the development of this
> persistent events sub-feature?

The persistent event feature sounds like it will solve
the early logging issue.

> Boris posted initial patches of the new perf events based
> EDAC/MCE/RAS design direction to lkml and indicated that
> it works for him. He also indicated that he can do the
> initial work of unifying EDAC and MCE without the
> persistent events feature for now. (this all is obviously
> v2.6.36-ish material)
>
> But if it's important, if you'd like to move ahead with
> the unification swiftly then we can certainly increase its
> priority.

We've missed the deadlines for inclusion in certain
popular distributions ... so it may be OK to take a
relatively leisurely path to getting this done right
rather than rushing.

> 3) Another new perf feature of interest is 'perf inject'
> (this too went upstream today): to inject artificial
> events into the stream of events. This mechanism could be
> used to simulate rare error conditions and to test out
> policy reactions systematically - an important part of
> system error recovery testing.

Simulated errors are handy for testing the very
top level of the s/w stack. But real errors are
better. There's some APEI code in Len's tree
that can inject real errors (on systems with the
right BIOS hooks enabled).

> This gives us a broad platform to add various RAS events
> as well, beyond raw hardware events: we could for example
> add events for various system anomalies such as lockup
> messages, kernel warnings/oopses, IOMMU exceptions - maybe
> even pure software concepts such as fatal segmentation
> fault events, etc. etc.

This looks like sticky ground.  I can see the event mechanism
passing data to a user daemon working well for all kinds of
corrected and minor errors. But when you start talking about
lockups and fatal errors things get a lot trickier. Often the
main concern at this point is error containment. Making sure
that the flaky data doesn't become visible (saved to storage,
transmitted to the network, etc.). Getting from a machine check
handler through some context switches (and page
faults etc.) to a user level daemon before the error
gets recorded looks to be really hard.

> That way the RAS daemon could build and utilize a complete
> and coherent set of events it wants to subscribe to - all
> via the same event transport mechanism. It would thus have
> a comprehensive 'system health' view, via a single,
> reliable mechanism, and could act in a wide range of
> scenarios, with a wide range of policy actions, based on a
> very complete picture.

In a cluster/cloud/datacenter that daemon will need to be
networked and hooked to the system management tools
that are controlling the bigger environment. But I agree
that this looks like a worthy end goal.

-Tony

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: Hardware Error Kernel Mini-Summit
  2010-05-18 21:37               ` Tony Luck
@ 2010-05-18 22:00                 ` Ingo Molnar
  2010-05-24 17:13                   ` Russ Anderson
  2010-05-19  6:39                 ` Ingo Molnar
  1 sibling, 1 reply; 108+ messages in thread
From: Ingo Molnar @ 2010-05-18 22:00 UTC (permalink / raw)
  To: Tony Luck
  Cc: Joe Perches, Mauro Carvalho Chehab, Hidetoshi Seto,
	Linux Kernel Mailing List, bluesmoke-devel,
	Linux Edac Mailing List, Thomas Gleixner, Ingo Molnar,
	Ben Woodard, Matt Domsch, Doug Thompson, Borislav Petkov, Young,
	Brent, Peter Zijlstra, Frédéric Weisbecker,
	Arnaldo Carvalho de Melo


* Tony Luck <tony.luck@intel.com> wrote:

> > This gives us a broad platform to add various RAS 
> > events as well, beyond raw hardware events: we could 
> > for example add events for various system anomalies such 
> > as lockup messages, kernel warnings/oopses, IOMMU 
> > exceptions - maybe even pure software concepts such as 
> > fatal segmentation fault events, etc. etc.
> 
> This looks like sticky ground.  I can see the event 
> mechanism passing data to a user daemon working well for 
> all kinds of corrected and minor errors. But when you 
> start talking about lockups and fatal errors things get 
> a lot trickier. Often the main concern at this point is 
> error containment. Making sure that the flaky data 
> doesn't become visible (saved to storage, transmitted to 
> the network, etc.). [...]

I was pointing beyond the narrow hardware (memory) error 
point of view, towards a more generic 'system health' 
thinking.

In the broader view it may make sense, for example, to 
define policy over an excessive number of segfaults on a 
server system (where excessive segfaults are an anomaly), 
or a suspiciously large number of soft IO errors, etc.

But yes, of course, when it comes to hard memory errors, 
those take precedence, and handling them (and 
saving/propagating information about them while we still 
can) is a priority.

> [...] Getting from a machine check handler through some 
> context switches (and page faults etc.) to a user level 
> daemon before the error gets recorded looks to be really 
> hard.

As Boris mentioned too, critical policy action can and 
will be done straight in the kernel.

	Ingo

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: Hardware Error Kernel Mini-Summit
  2010-05-18 19:18             ` Borislav Petkov
@ 2010-05-18 22:14               ` Eric W. Biederman
  -1 siblings, 0 replies; 108+ messages in thread
From: Eric W. Biederman @ 2010-05-18 22:14 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Luck, Tony, Hidetoshi Seto, Mauro Carvalho Chehab, Young, Brent,
	Linux Kernel Mailing List, Ingo Molnar, Thomas Gleixner,
	Matt Domsch, Doug Thompson, Joe Perches, Ingo Molnar,
	bluesmoke-devel, Andi Kleen, Linux Edac Mailing List

Borislav Petkov <bp@amd64.org> writes:

> From: "Luck, Tony" <tony.luck@intel.com>
> Date: Tue, May 18, 2010 at 03:08:58PM -0400
>
>> > It makes sense to use the kernel's performance events 
>> > logging framework when we are logging events about how the 
>> > system performs.
>> 
>> Perhaps it makes more sense to say that the Linux "performance
>> events logging framework" has become more generic and is really
>> now an "event logging framework".
>
> Yep, that's the idea.
>
>> > Furthermore it's NMI safe, offers structured logging, has 
>> > various streaming, multiplexing and filtering capabilities 
>> > that come handy for RAS purposes and more.
>> 
>> Those of us present at the mini-summit were not familiar with
>> all the features available. One area of concern was how to be
>> sure that something is in fact listening to and logging the
>> error events.  My understanding is that if there is no process
>> attached to an event, the kernel will just drop it.  This is
>> of particular concern because the kernel's first scan of the
>> machine check banks occurs before there are any processes.
>> So errors found early in boot (which might be saved fatal
>> errors from before the boot) might be lost.
>
> Well, we have a trace_mce_record tracepoint in the mcheck code which
> calls all the necessary callbacks when an mcheck occurs. For the time
> being, the idea is to use the mce.c ring buffer for early mchecks and
> copy them to the regular ftrace per-cpu buffer after the latter has been
> initialized. Later, we could switch to another early bootmem buffer if
> there's a need to.
>
> Also, we want to have a userspace daemon that reads out the mces from
> the trace buffer and does further processing like thresholding etc in
> userspace.
>
> Concerning critical errors, there we bypass the perf subsystem and
> execute the smallest amount of code possible while trying to shut down
> gracefully if the error type allows that.
>
> These are the rough ideas at least...

Can someone please tell me why everyone is eager to squirrel
correctable error reports away and not report them in dmesg
(aka syslog)?

On several occasions I have had a machine with memory errors where
mcelog or the BIOS was eating the error reports and not putting them
anywhere a normal human being would look.

If your system isn't broken, correctable errors are rare.  People look
at syslog.  People look in /var/log/messages and dmesg when something
goes weird.

I have no problem with additional interfaces to provide additional
functionality but please can we put errors where people can find them.

Eric

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: Hardware Error Kernel Mini-Summit
  2010-05-18 22:14               ` Eric W. Biederman
@ 2010-05-18 22:28                 ` Andi Kleen
  -1 siblings, 0 replies; 108+ messages in thread
From: Andi Kleen @ 2010-05-18 22:28 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Borislav Petkov, Luck, Tony, Hidetoshi Seto,
	Mauro Carvalho Chehab, Young, Brent, Linux Kernel Mailing List,
	Ingo Molnar, Thomas Gleixner, Matt Domsch, Doug Thompson,
	Joe Perches, Ingo Molnar, bluesmoke-devel, Andi Kleen,
	Linux Edac Mailing List


The original motivation to put them somewhere else was
that I was sick of people reporting them as kernel bugs.

But there's more to it now:

> If your system isn't broken correctable errors are rare.  People look

Actually the more memory you have the more common they are.
And the trend is to more and more memory.

Really to do anything useful with them you need trends
and automatic actions (like predictive page offlining)

A log isn't really a good format for that.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: Hardware Error Kernel Mini-Summit
  2010-05-18 22:14               ` Eric W. Biederman
@ 2010-05-18 22:29                 ` Ingo Molnar
  -1 siblings, 0 replies; 108+ messages in thread
From: Ingo Molnar @ 2010-05-18 22:29 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Borislav Petkov, Luck, Tony, Hidetoshi Seto,
	Mauro Carvalho Chehab, Young, Brent, Linux Kernel Mailing List,
	Ingo Molnar, Thomas Gleixner, Matt Domsch, Doug Thompson,
	Joe Perches, bluesmoke-devel, Andi Kleen,
	Linux Edac Mailing List


* Eric W. Biederman <ebiederm@xmission.com> wrote:

> > [...]
> >
> > Concerning critical errors, there we bypass the perf 
> > subsystem and execute the smallest amount of code 
> > possible while trying to shut down gracefully if the 
> > error type allows that.
> >
> > These are the rough ideas at least...
> 
> Can someone please tell me why everyone is eager to 
> squirrel correctable error reports away and not report 
> them in dmesg? aka syslog.
> 
> I have had on several occasions a machine with memory 
> errors that mcelog or the BIOS was eating the error 
> reports and not putting them anywhere a normal human 
> being would look.

That's possible too - the TRACE_EVENT() of MCE events, 
beyond the record format, also includes a human-readable 
ASCII output format string:

 # tail -1 /debug/tracing/events/mce/mce_record/format

 print fmt: "CPU: %d, MCGc/s: %llx/%llx, MC%d: %016Lx, 
 ADDR/MISC: %016Lx/%016Lx, RIP: %02x:<%016Lx>, TSC: %llx, 
 PROCESSOR: %u:%x, TIME: %llu, SOCKET: %u, APIC: %x", 
 REC->cpu, REC->mcgcap, REC->mcgstatus, REC->bank, 
 REC->status, REC->addr, REC->misc, REC->cs, REC->ip, 
 REC->tsc, REC->cpuvendor, REC->cpuid, REC->walltime, 
 REC->socketid, REC->apicid

Which could be used to printk events.
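
As a rough illustration of how a user-space tool might consume that ASCII form, the Python sketch below turns one rendered line back into named fields. The regular expression follows the print fmt quoted above; the sample line and all its values are made up for the example, and `parse_mce_record` is a hypothetical helper, not part of any existing tool. In real trace output the record is a single line; it is only wrapped in this mail.

```python
import re

# One named group per conversion in the quoted print fmt.
MCE_LINE = re.compile(
    r"CPU: (?P<cpu>\d+), "
    r"MCGc/s: (?P<mcgcap>[0-9a-f]+)/(?P<mcgstatus>[0-9a-f]+), "
    r"MC(?P<bank>\d+): (?P<status>[0-9a-f]{16}), "
    r"ADDR/MISC: (?P<addr>[0-9a-f]{16})/(?P<misc>[0-9a-f]{16}), "
    r"RIP: (?P<cs>[0-9a-f]{2,}):<(?P<ip>[0-9a-f]{16})>, "
    r"TSC: (?P<tsc>[0-9a-f]+), "
    r"PROCESSOR: (?P<cpuvendor>\d+):(?P<cpuid>[0-9a-f]+), "
    r"TIME: (?P<walltime>\d+), "
    r"SOCKET: (?P<socketid>\d+), "
    r"APIC: (?P<apicid>[0-9a-f]+)"
)

# Fields printed with %d/%u/%llu; everything else is hexadecimal.
DECIMAL = {"cpu", "bank", "cpuvendor", "walltime", "socketid"}


def parse_mce_record(line):
    """Return a dict of integer fields, or None if the line does not match."""
    m = MCE_LINE.search(line)
    if m is None:
        return None
    return {k: int(v, 10 if k in DECIMAL else 16)
            for k, v in m.groupdict().items()}


# Made-up sample line, for illustration only.
sample = ("CPU: 0, MCGc/s: 806/0, MC4: 9400000000000151, "
          "ADDR/MISC: 0000000012345000/0000000000000000, "
          "RIP: 00:<0000000000000000>, TSC: 1a2b3c, "
          "PROCESSOR: 0:106a5, TIME: 1274221740, SOCKET: 0, APIC: 0")
record = parse_mce_record(sample)
```

A tool with binary access to the record would not need this step; the sketch only shows that the ASCII form stays machine-parsable while remaining readable in a log.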

Cheers,

	Ingo

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: Hardware Error Kernel Mini-Summit
  2010-05-18 22:28                 ` Andi Kleen
@ 2010-05-19  1:14                   ` Eric W. Biederman
  -1 siblings, 0 replies; 108+ messages in thread
From: Eric W. Biederman @ 2010-05-19  1:14 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Borislav Petkov, Luck, Tony, Hidetoshi Seto,
	Mauro Carvalho Chehab, Young, Brent, Linux Kernel Mailing List,
	Ingo Molnar, Thomas Gleixner, Matt Domsch, Doug Thompson,
	Joe Perches, Ingo Molnar, bluesmoke-devel,
	Linux Edac Mailing List

Andi Kleen <andi@firstfloor.org> writes:

> The original motivation to put them somewhere else was
> that I was sick of people reporting them as kernel bugs.

This suggests that to get things reported in dmesg I should
set up a cron job that pulls the latest kernel, checks to see
if things are reported into syslog, and sends you an email
if things are wrong.

I'm not ready to believe the average person who is running Linux
is too stupid to understand the difference between a hardware
error and a software error.

> But there's more to it now:
>
>> If your system isn't broken correctable errors are rare.  People look
>
> Actually the more memory you have the more common they are.
> And the trend is to more and more memory.

The error rate should not be fixed per bit but should be roughly fixed
per DIMM.  If the error rate over time is fixed per bit we are in deep
trouble.

> Really to do anything useful with them you need trends
> and automatic actions (like predictive page offlining)

Not at all, and I don't have a clue where you start thinking
predictive page offlining makes the least bit of sense.  Broken
or even weak bits are rarely the common reason for ECC errors.

> A log isn't really a good format for that

A log is a fine format for realizing you have a problem.  A
log doesn't need to be the only place errors are reported
but a log should be the default place ECC errors are reported.
We do that with hard drive errors and other kinds of hardware
errors and we have done it for years without problems.

My experience is that correctable ECC errors come in two kinds of
frequencies.

- The expected single-bit correctable error range, which is somewhere
  between once a month and once a year per DIMM.

  On the most unreasonable configuration I ever worked with (4TB of RAM
  in 1GB sticks up at Los Alamos, at 7000ft, in an environment known
  to trigger errors) I saw roughly one correctable ECC error an hour.
  Huge but just barely within the expected range.

  I can live with a log message once a month on a mundane system.

- Errors that occur frequently. That is broken hardware of one type or
  another.  I want to know about that so I can schedule down time to replace
  my memory before I get an uncorrected ECC error.  Errors of this kind
  are likely happening frequently enough to impact performance.
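
The per-DIMM arithmetic behind the first case can be checked with a short Python sketch. It assumes 4TB means 4096GB (so 4096 one-gigabyte DIMMs) and takes "once a month" and "once a year" as 30 and 365 days; the function names are made up for the example.

```python
HOURS_PER_DAY = 24


def days_between_errors_per_dimm(total_gb, dimm_gb, system_errors_per_hour):
    """Average days between correctable errors on any single DIMM."""
    dimms = total_gb // dimm_gb
    per_dimm_errors_per_hour = system_errors_per_hour / dimms
    return 1 / per_dimm_errors_per_hour / HOURS_PER_DAY


def in_expected_range(days, low_days=30, high_days=365):
    """True if the interval lies between about a month and about a year."""
    return low_days <= days <= high_days


# 4TB in 1GB sticks, roughly one correctable error per hour system-wide.
days = days_between_errors_per_dimm(
    total_gb=4096, dimm_gb=1, system_errors_per_hour=1.0)
expected = in_expected_range(days)
```

Under these assumptions each DIMM sees an error about every 171 days, which falls inside the once-a-month to once-a-year band even though the system as a whole logs one an hour.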

Eric

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: Hardware Error Kernel Mini-Summit
  2010-05-18 21:37               ` Tony Luck
  2010-05-18 22:00                 ` Ingo Molnar
@ 2010-05-19  6:39                 ` Ingo Molnar
  1 sibling, 0 replies; 108+ messages in thread
From: Ingo Molnar @ 2010-05-19  6:39 UTC (permalink / raw)
  To: Tony Luck
  Cc: Joe Perches, Mauro Carvalho Chehab, Hidetoshi Seto,
	Linux Kernel Mailing List, bluesmoke-devel,
	Linux Edac Mailing List, Thomas Gleixner, Ingo Molnar,
	Ben Woodard, Matt Domsch, Doug Thompson, Borislav Petkov, Young,
	Brent, Peter Zijlstra, Frédéric Weisbecker,
	Arnaldo Carvalho de Melo


* Tony Luck <tony.luck@intel.com> wrote:

> > 3) Another new perf feature of interest is 'perf 
> > inject' (this too went upstream today): to inject 
> > artificial events into the stream of events. This 
> > mechanism could be used to simulate rare error 
> > conditions and to test out policy reactions 
> > systematically - an important part of system error 
> > recovery testing.
> 
> Simulated errors are handy for testing the very top 
> level of the s/w stack. But real errors are better. 
> There's some APEI code in Len's tree that can inject 
> real errors (on systems with the right BIOS hooks 
> enabled).

Agreed, hardware assisted error injection is by far the 
best and most complete solution.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: Hardware Error Kernel Mini-Summit
  2010-05-19  1:14                   ` Eric W. Biederman
@ 2010-05-19  6:46                     ` Borislav Petkov
  -1 siblings, 0 replies; 108+ messages in thread
From: Borislav Petkov @ 2010-05-19  6:46 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Andi Kleen, Borislav Petkov, Luck, Tony, Hidetoshi Seto,
	Mauro Carvalho Chehab, Young, Brent, Linux Kernel Mailing List,
	Ingo Molnar, Thomas Gleixner, Matt Domsch, Doug Thompson,
	Joe Perches, Ingo Molnar, bluesmoke-devel,
	Linux Edac Mailing List

From: "Eric W. Biederman" <ebiederm@xmission.com>
Date: Tue, May 18, 2010 at 09:14:09PM -0400

> - Errors that occur frequently. That is broken hardware of one type or
>   another.  I want to know about that so I can schedule down time to replace
>   my memory before I get an uncorrected ECC error.  Errors of this kind
>   are likely happening frequently enough to impact performance.

This is exactly the reason why we need better error logging and
reporting than a log. How do you want to discover trends and count CECCs
per DIMM if you have to scan the logs all the time and grep for the DRAM
page the error happened in, the CS row it is located in, and whether this
is located in the same DIMM as the 115th error back in the log? This gets
especially tricky if you're using one of the gazillion memory interleaving
schemes.
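
A minimal Python sketch of the kind of per-DIMM bookkeeping meant here is a sliding-window counter that flags a DIMM once it collects too many corrected errors in a given period. It assumes each event already carries a DIMM label, which is exactly the hard part (address decoding, interleaving) that a plain log does not give you. `DimmErrorTracker`, the DIMM names and the threshold values are hypothetical.

```python
from collections import defaultdict, deque


class DimmErrorTracker:
    """Count corrected errors per DIMM over a sliding time window."""

    def __init__(self, threshold, window_seconds):
        self.threshold = threshold
        self.window = window_seconds
        self._times = defaultdict(deque)   # DIMM label -> recent timestamps

    def record(self, dimm, timestamp):
        """Log one corrected error; return True if the DIMM is over threshold."""
        times = self._times[dimm]
        times.append(timestamp)
        # Forget errors that have aged out of the window.
        while times and times[0] <= timestamp - self.window:
            times.popleft()
        return len(times) >= self.threshold


tracker = DimmErrorTracker(threshold=3, window_seconds=3600)
alerts = [tracker.record(dimm, t) for dimm, t in [
    ("DIMM_A1", 0), ("DIMM_B2", 10), ("DIMM_A1", 600),
    ("DIMM_A1", 1200),      # third A1 error within the hour
    ("DIMM_B2", 7200),      # B2's first error has aged out by now
]]
```

Only the fourth event trips the threshold; the occasional isolated error on DIMM_B2 never does, which is the distinction between the two error frequencies discussed above.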

Ok, and what about other errors like L3 cache errors, for example? You
want to count those too and upon reaching a threshold disable a cache
index _before_ it turns a correctable ECC into an uncorrectable error
bringing the whole system down with a critical MCE.

How about error injection? You want to test the hardware/software by
injecting real hardware errors and not simulating it all in software.

And also you want to be able to schedule different maintenance actions
depending on the severity of the error and in certain cases get away
with a clean shutdown even in the face of an uncorrectable error.

So, the whole idea entails much more than reporting errors in the syslog
but rather making the system intelligent enough to prolong its own life
and be able to warn the user that something bad is about to happen.

And we don't have that right now - right now we say that some machine
checks have been logged and with uncorrectable MCEs we freeze cowardly
and hope to be able to make a warm reset so that the MCA MSRs still
contain some valid data which we can decode painstakingly by hand.

I hope this makes our intentions a bit clearer.

-- 
Regards/Gruss,
Boris.

Operating Systems Research Center
Advanced Micro Devices, Inc.

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: Hardware Error Kernel Mini-Summit
  2010-05-19  6:46                     ` Borislav Petkov
@ 2010-05-19  7:09                       ` Ingo Molnar
  -1 siblings, 0 replies; 108+ messages in thread
From: Ingo Molnar @ 2010-05-19  7:09 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Eric W. Biederman, Andi Kleen, Luck, Tony, Hidetoshi Seto,
	Mauro Carvalho Chehab, Young, Brent, Linux Kernel Mailing List,
	Ingo Molnar, Thomas Gleixner, Matt Domsch, Doug Thompson,
	Joe Perches, bluesmoke-devel, Linux Edac Mailing List


* Borislav Petkov <bp@amd64.org> wrote:

> From: "Eric W. Biederman" <ebiederm@xmission.com>
> Date: Tue, May 18, 2010 at 09:14:09PM -0400
> 
> > - Errors that occur frequently. That is broken 
> >   hardware of one type or another.  I want to know 
> >   about that so I can schedule down time to replace my 
> >   memory before I get an uncorrected ECC error.  
> >   Errors of this kind are likely happening frequently 
> >   enough to impact performance.
> 
> This is exactly the reason why we need better error 
> logging and reporting than a log.
>
> [ ... lots of specific details snipped ... ]

Basically the idea behind the generic structured logging 
framework (the perf events kernel subsystem) is to have 
ASCII output (where desired: critical errors), and also a 
well-specified event format parsable by user-space tools.

Plus there's the need for a fast, lightweight, flexible 
event passing mechanism - which is given by the perf 
events transport, which enables arbitrary-size in-memory 
ring-buffers, poll() and epoll support, etc.
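
To illustrate the poll()-style wait pattern a daemon could use, here is a small Python sketch. It does not open a real perf event; an ordinary pipe stands in for the event file descriptor, and `drain_events` is a hypothetical helper. It assumes a Unix-like system where pipes can be polled.

```python
import os
import selectors


def drain_events(fd, timeout=1.0):
    """Block until fd is readable (or timeout), then return the lines read."""
    sel = selectors.DefaultSelector()
    sel.register(fd, selectors.EVENT_READ)
    try:
        ready = sel.select(timeout)      # sleeps in the kernel, no busy loop
    finally:
        sel.close()
    if not ready:
        return []
    return os.read(fd, 4096).decode().splitlines()


# A pipe stands in for the event source in this sketch.
read_fd, write_fd = os.pipe()
os.write(write_fd, b"mce: corrected error, bank 4\n")
events = drain_events(read_fd)
idle = drain_events(read_fd, timeout=0.01)   # nothing pending: returns []
os.close(read_fd)
os.close(write_fd)
```

The point of the pattern is that the daemon costs nothing while the system is healthy and wakes only when an event arrives.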

perf events supports all these different usecases and 
comes with a (constantly growing) set of events already 
defined upstream. We've got more than a dozen different 
upstream subsystems that have defined events and we have 
over a hundred individual events. There's a rapidly 
growing tool space that makes case by case use of these 
event sources to measure/observe various aspects of the 
system.

Regarding dmesg, there's a WIP patch on lkml that 
integrates printks into this framework as well, making 
each printk also available as a special string event.

That way a tool can have programmatic access to printk 
output (without having to interact with the syslog buffer 
itself) together with all the other structured log 
sources, while humans can also see what is happening.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: Hardware Error Kernel Mini-Summit
  2010-05-19  1:14                   ` Eric W. Biederman
@ 2010-05-19  9:03                     ` Andi Kleen
  -1 siblings, 0 replies; 108+ messages in thread
From: Andi Kleen @ 2010-05-19  9:03 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Andi Kleen, Borislav Petkov, Luck, Tony, Hidetoshi Seto,
	Mauro Carvalho Chehab, Young, Brent, Linux Kernel Mailing List,
	Ingo Molnar, Thomas Gleixner, Matt Domsch, Doug Thompson,
	Joe Perches, Ingo Molnar, bluesmoke-devel,
	Linux Edac Mailing List

Hi Eric,

> I'm not ready to believe the average person that is running linux
> is too stupid to understand the difference between a hardware
> error and a software error.

Experience disagrees with you (that is, I'm not sure about the
average, but there is at least a significant portion).

Also again today there are other reasons for it.

> 
> > But there's more to it now:
> >
> >> If your system isn't broken correctable errors are rare.  People look
> >
> > Actually the more memory you have the more common they are.
> > And the trend is to more and more memory.
> 
> The error rate should not be fixed per bit but should be roughly fixed
> per DIMM.  If the error rate over time is fixed per bit we are in deep
> trouble.

Error rates of good DIMMs scale roughly with the number of transistors.
It's not the only influence though, but a major one.

> > Really to do anything useful with them you need trends
> > and automatic actions (like predictive page offlining)
> 
> Not at all, and I don't have a clue where you start thinking
> predictive page offlining makes the least bit of sense.  Broken
> or even weak bits are rarely the common reason for ECC errors.

There are various studies that disagree with you on that.

> 
> > A log isn't really a good format for that
> 
> A log is a fine format for realizing you have a problem.  A

A low steady rate of corrected errors on a large system
is expected.  In fact if you look at the memory error log
of a large system (towards TBs) it nearly always has some 
memory-related events.

In this case a log is not really useful. What you need
is useful thresholds and a good summary.

> - Errors that occur frequently. That is broken hardware of one type or
>   another.  I want to know about that so I can schedule down time to replace
>   my memory before I get an uncorrected ECC error.  Errors of this kind
>   are likely happening frequently enough as to impact performance.

Same issue here: if something is truly broken it floods
you with errors.

First this costs a lot of time to process and it does not 
actually tell you anything useful because most errors in a flood
are similar.

Basically you don't care if you have 100 or 1000 errors, 
and you definitely don't want all of the errors filling up
your disk and using up your CPU.

Again a threshold with an action is much more useful here.
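
As a rough sketch of what "a threshold with an action" could look like
(Python, purely illustrative): count errors per component, keep only
the counts instead of one log line per error, and fire an action once
when a component crosses the threshold. The class name ErrorSummary,
the component labels and the action callback are assumptions made for
this example, not an existing interface.

```python
from collections import Counter


class ErrorSummary:
    """Collapse a flood of similar errors into per-component counts."""

    def __init__(self, threshold, action):
        self.threshold = threshold  # errors per component before acting
        self.action = action        # hypothetical callback, e.g. alert
        self.counts = Counter()

    def record(self, component):
        """Count one corrected error; nothing is logged per event."""
        self.counts[component] += 1
        # Fire exactly once, at the moment the threshold is reached.
        if self.counts[component] == self.threshold:
            self.action(component)

    def report(self):
        """Return a short summary, one line per component."""
        return ["%s: %d corrected errors" % (name, count)
                for name, count in sorted(self.counts.items())]
```

With this shape, a flood of 1000 identical errors costs one counter
and one action rather than 1000 log entries.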

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: Hardware Error Kernel Mini-Summit
  2010-05-19  7:09                       ` Ingo Molnar
@ 2010-05-19 11:54                         ` Mauro Carvalho Chehab
  -1 siblings, 0 replies; 108+ messages in thread
From: Mauro Carvalho Chehab @ 2010-05-19 11:54 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Borislav Petkov, Eric W. Biederman, Andi Kleen, Luck, Tony,
	Hidetoshi Seto, Young, Brent, Linux Kernel Mailing List,
	Ingo Molnar, Thomas Gleixner, Matt Domsch, Doug Thompson,
	Joe Perches, bluesmoke-devel, Linux Edac Mailing List

Ingo Molnar wrote:
> Regarding dmesg, there's a WIP patch on lkml that 
> integrates printks into this framework as well - makes 
> each printk also available as a special string event.
> 
> That way a tool can have both programmatic access to 
> printk output (without having to interact with the syslog 
> buffer itself) - together with all the other structured 
> log sources, while humans can also see what is happening.

Some system admins prefer to have everything in dmesg, as they
can enable a serial console and catch the logs remotely, even
when the machine crashes, for example due to a hardware failure.

So, IMHO, one feature that perf events needs is the capability
to also report errors via a serial console, or a mechanism by
which some events are sent to dmesg.

-- 

Cheers,
Mauro

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: Hardware Error Kernel Mini-Summit
  2010-05-19  1:14                   ` Eric W. Biederman
@ 2010-05-19 17:30                     ` Tony Luck
  -1 siblings, 0 replies; 108+ messages in thread
From: Tony Luck @ 2010-05-19 17:30 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Andi Kleen, Borislav Petkov, Hidetoshi Seto,
	Mauro Carvalho Chehab, Young, Brent, Linux Kernel Mailing List,
	Ingo Molnar, Thomas Gleixner, Matt Domsch, Doug Thompson,
	Joe Perches, Ingo Molnar, bluesmoke-devel,
	Linux Edac Mailing List

On Tue, May 18, 2010 at 6:14 PM, Eric W. Biederman
<ebiederm@xmission.com> wrote:
> A log is a fine format for realizing you have a problem.  A
> log doesn't need to be the only place errors are reported
> but a log should be the default place ECC errors are reported.
> We do that with hard drive errors and other kinds of hardware
> errors and we have done it for years without problems.

Hard drives aren't really a similar situation ... we don't see any
of the low level errors from a modern hard drive because its
f/w handles the retries and block re-mapping transparently.
By the time something serious enough happens that it gets
reported to the OS, we pretty much already know that there
is a real problem.

We are still in the dark ages for memory errors where the OS
is expected to look at all the errors and figure out whether they
represent any kind of meaningful pattern that requires some
action to replace h/w components.

-Tony

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: Hardware Error Kernel Mini-Summit
  2010-05-19 11:54                         ` Mauro Carvalho Chehab
@ 2010-05-20 12:37                           ` Ingo Molnar
  -1 siblings, 0 replies; 108+ messages in thread
From: Ingo Molnar @ 2010-05-20 12:37 UTC (permalink / raw)
  To: Mauro Carvalho Chehab
  Cc: Borislav Petkov, Eric W. Biederman, Andi Kleen, Luck, Tony,
	Hidetoshi Seto, Young, Brent, Linux Kernel Mailing List,
	Ingo Molnar, Thomas Gleixner, Matt Domsch, Doug Thompson,
	Joe Perches, bluesmoke-devel, Linux Edac Mailing List


* Mauro Carvalho Chehab <mchehab@redhat.com> wrote:

> Ingo Molnar wrote:
>
> > Regarding dmesg, there's a WIP patch on lkml that 
> > integrates printks into this framework as well - makes 
> > each printk also available as a special string event.
> > 
> > That way a tool can have both programmatic access to 
> > printk output (without having to interact with the 
> > syslog buffer itself) - together with all the other 
> > structured log sources, while humans can also see what 
> > is happening.
> 
> Some system admins prefer to have everything on dmesg, 
> as they can enable a serial console, and catch the logs 
> remotely, even when the machine crashes for example due 
> to a hardware failure.
> 
> So, IMHO, one feature that the perf event needs is the 
> capability to report errors via a serial console also, 
> or a mechanism where some events are sent via dmesg.

Yeah. That can be an aspect of the callback - or might 
even be integrated into the core code.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: Hardware Error Kernel Mini-Summit
  2010-05-19 17:30                     ` Tony Luck
@ 2010-05-24 15:55                       ` Russ Anderson
  -1 siblings, 0 replies; 108+ messages in thread
From: Russ Anderson @ 2010-05-24 15:55 UTC (permalink / raw)
  To: Tony Luck
  Cc: Eric W. Biederman, Andi Kleen, Borislav Petkov, Hidetoshi Seto,
	Mauro Carvalho Chehab, Young, Brent, Linux Kernel Mailing List,
	Ingo Molnar, Thomas Gleixner, Matt Domsch, Doug Thompson,
	Joe Perches, Ingo Molnar, bluesmoke-devel,
	Linux Edac Mailing List, rja

On Wed, May 19, 2010 at 10:30:17AM -0700, Tony Luck wrote:
> 
> We are still in the dark ages for memory errors where the OS
> is expected to look at all the errors and figure out whether they
> represent any kind of meaningful pattern that requires some
> action to replace h/w components.

ia64 is good at detecting & recovering from uncorrectable memory
errors.  x86 is significantly behind, due to historically not
being able to recover from uncorrectable memory errors.

ia64 had the Intel-defined MCA Spec, which defined the interaction
between SAL and the kernel.  x86 does not have a similarly
well-defined way in which errors should be handled.  It would be
good to agree on how the errors should be handled.

> -Tony

-- 
Russ Anderson, OS RAS/Partitioning Project Lead  
SGI - Silicon Graphics Inc          rja@sgi.com

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: Hardware Error Kernel Mini-Summit
  2010-05-19  9:03                     ` Andi Kleen
@ 2010-05-24 16:21                       ` Russ Anderson
  -1 siblings, 0 replies; 108+ messages in thread
From: Russ Anderson @ 2010-05-24 16:21 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Eric W. Biederman, Borislav Petkov, Luck, Tony, Hidetoshi Seto,
	Mauro Carvalho Chehab, Young, Brent, Linux Kernel Mailing List,
	Ingo Molnar, Thomas Gleixner, Matt Domsch, Doug Thompson,
	Joe Perches, Ingo Molnar, bluesmoke-devel,
	Linux Edac Mailing List, rja

On Wed, May 19, 2010 at 11:03:24AM +0200, Andi Kleen wrote:
> Hi Eric,
> 
> > I'm not ready to believe the average person that is running linux
> > is too stupid to understand the difference between a hardware
> > error and a software error.
> 
> Experience disagrees with you (that is not sure about average,
> but at least there's a significant portion) 
> 
> Also again today there are other reasons for it.

I agree with Andi.  While there is a wide range of users, the
vast majority know little about the hardware they are running
on.  Even in commercial settings, where users/admins are better
educated, there is little time to do detailed error analysis.

The more errors are detected/analyzed/corrected/recovered, the
better it is for everyone.

 
> > > Really to do anything useful with them you need trends
> > > and automatic actions (like predictive page offlining)
> > 
> > Not at all, and I don't have a clue where you start thinking
> > predictive page offlining makes the least bit of sense.  Broken
> > or even weak bits are rarely the common reason for ECC errors.
> 
> There are various studies that disagree with you on that.

Having the infrastructure to automatically off-line pages
is a good thing.  The details of where to set the predictive
threshold likely will be hardware specific (different DIMM
types failing at different rates).  It needs to be adjustable.

> > > A log isn't really a good format for that
> > 
> > A log is a fine format for realizing you have a problem.  A
> 
> A low steady rate of corrected errors on a large system
> is expected.  In fact if you look at the memory error log
> of a large system (towards TBs) it nearly always has some 
> memory related events.

Yes, there are certainly examples of that.  

> In this case a log is not really useful. What you need
> is useful thresholds and a good summary.

The larger the system the more important a good summary is.

> > - Errors that occur frequently. That is broken hardware of one type or
> >   another.  I want to know about that so I can schedule down time to replace
> >   my memory before I get an uncorrected ECC error.  Errors of this kind
> >   are likely happening frequently enough as to impact performance.
> 
> Same issue here: if something is truly broken it floods
> you with errors.
> 
> First this costs a lot of time to process and it does not 
> actually tell you anything useful because most errors in a flood
> are similar.
> 
> Basically you don't care if you have 100 or 1000 errors, 
> and you definitely don't want all of the errors filling up
> your disk and using up your CPU.
> 
> Again a threshold with an action is much more useful here.

Yes, good points.

-- 
Russ Anderson, OS RAS/Partitioning Project Lead  
SGI - Silicon Graphics Inc          rja@sgi.com

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: Hardware Error Kernel Mini-Summit
  2010-05-18 22:00                 ` Ingo Molnar
@ 2010-05-24 17:13                   ` Russ Anderson
  0 siblings, 0 replies; 108+ messages in thread
From: Russ Anderson @ 2010-05-24 17:13 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Tony Luck, Joe Perches, Mauro Carvalho Chehab, Hidetoshi Seto,
	Linux Kernel Mailing List, bluesmoke-devel,
	Linux Edac Mailing List, Thomas Gleixner, Ingo Molnar,
	Ben Woodard, Matt Domsch, Doug Thompson, Borislav Petkov, Young,
	Brent, Peter Zijlstra, Frédéric Weisbecker,
	Arnaldo Carvalho de Melo, Russ Anderson

On Wed, May 19, 2010 at 12:00:02AM +0200, Ingo Molnar wrote:
> * Tony Luck <tony.luck@intel.com> wrote:
> 
> > [...] Getting from a machine check handler through some 
> > context switches (and page faults etc.) to a user level 
> > daemon before the error gets recorded looks to be really 
> > hard.
> 
> As Boris mentioned it too, critical policy action can and 
> will be done straight in the kernel.

That is how it is done in ia64.  The MCA interrupt 
handler does the low level handling.  It makes sure
all the cpus have rendezvoused, looks at the MCA record
to determine what happened and does whatever recovery 
steps are needed, such as killing the application.

It definitely needs to be handled in the kernel.

> 	Ingo

-- 
Russ Anderson, OS RAS/Partitioning Project Lead  
SGI - Silicon Graphics Inc          rja@sgi.com

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: Hardware Error Kernel Mini-Summit
  2010-05-24 15:55                       ` Russ Anderson
@ 2010-05-24 17:35                         ` Tony Luck
  -1 siblings, 0 replies; 108+ messages in thread
From: Tony Luck @ 2010-05-24 17:35 UTC (permalink / raw)
  To: Russ Anderson
  Cc: Eric W. Biederman, Andi Kleen, Borislav Petkov, Hidetoshi Seto,
	Mauro Carvalho Chehab, Young, Brent, Linux Kernel Mailing List,
	Ingo Molnar, Thomas Gleixner, Matt Domsch, Doug Thompson,
	Joe Perches, Ingo Molnar, bluesmoke-devel,
	Linux Edac Mailing List

On Mon, May 24, 2010 at 8:55 AM, Russ Anderson <rja@sgi.com> wrote:
> ia64 had the Intel defined MCA Spec which defined the interaction
> between SAL and the kernel.  x86 does not have a similar well
> defined way of how errors should be handled.  It would be
> good to agree on how the errors should be handled.

X86 has machine check registers defined by the SDM. It also
has some f/w <-> OS interactions defined by the APEI sections
in the latest ACPI spec (chapter 17 of the 4.0a spec released
last month - see http://acpi.info). Some parts look cleaner than
the ia64 SAL spec. E.g. errors logged from before the current
OS booted are presented in the Boot Error Record Table instead
of just appearing among the stream of errors that SAL_GET_ERROR
provides to the OS without any way to distinguish current errors
from old ones.

-Tony

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: Hardware Error Kernel Mini-Summit
  2010-05-24 16:21                       ` Russ Anderson
@ 2010-05-24 18:26                         ` Andi Kleen
  -1 siblings, 0 replies; 108+ messages in thread
From: Andi Kleen @ 2010-05-24 18:26 UTC (permalink / raw)
  To: Russ Anderson
  Cc: Eric W. Biederman, Luck, Tony, Hidetoshi Seto,
	Mauro Carvalho Chehab, Young, Brent, Linux Kernel Mailing List,
	Thomas Gleixner, Matt Domsch, Doug Thompson, Joe Perches,
	Ingo Molnar, bluesmoke-devel, Linux Edac Mailing List

> Having the infrastructure to automatically off-line pages
> is a good thing.  The details of where to set the predictive

It's already there with a modern mcelog in daemon mode 
and a recent kernel that supports soft offlining.

> threshold likely will be hardware specific (different DIMM
> types failing at different rates).  It needs to be adjustable.

The current default in mcelog is 10 corrected errors per 24h
per 4k page or 1 uncorrected error on the page (if your CPU
supports recovering from that).  It is on by default. 

You can configure it to be different if you want.
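
The policy described above (N corrected errors per page within a time
window) can be sketched roughly as follows in Python. This is not
mcelog's code: the class PageErrorTracker, the offline_page callback
and the exact ">= threshold" trigger are assumptions made for the
illustration; only the 10 errors / 24h / 4k page defaults come from
the text.

```python
from collections import defaultdict, deque

PAGE_SHIFT = 12               # 4k pages
WINDOW_SECONDS = 24 * 3600    # 24h window
THRESHOLD = 10                # corrected errors per page per window


class PageErrorTracker:
    """Track corrected errors per page over a sliding time window."""

    def __init__(self, offline_page, threshold=THRESHOLD,
                 window=WINDOW_SECONDS):
        self.offline_page = offline_page  # hypothetical action callback
        self.threshold = threshold
        self.window = window
        self.history = defaultdict(deque)  # page number -> timestamps
        self.offlined = set()

    def corrected_error(self, addr, now):
        """Record one corrected error at physical address addr."""
        page = addr >> PAGE_SHIFT
        if page in self.offlined:
            return
        times = self.history[page]
        times.append(now)
        # Forget errors that have fallen out of the window.
        while times and now - times[0] >= self.window:
            times.popleft()
        if len(times) >= self.threshold:
            self.offlined.add(page)
            del self.history[page]
            self.offline_page(page)
```

Making threshold and window constructor arguments reflects the point
that the limits need to be adjustable per hardware type.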

-Andi

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: Hardware Error Kernel Mini-Summit
  2010-05-24 17:35                         ` Tony Luck
@ 2010-05-24 18:31                           ` Andi Kleen
  -1 siblings, 0 replies; 108+ messages in thread
From: Andi Kleen @ 2010-05-24 18:31 UTC (permalink / raw)
  To: Tony Luck
  Cc: Russ Anderson, Eric W. Biederman, Borislav Petkov,
	Hidetoshi Seto, Mauro Carvalho Chehab, Young, Brent,
	Linux Kernel Mailing List, Ingo Molnar, Thomas Gleixner,
	Matt Domsch, Doug Thompson, Joe Perches, Ingo Molnar,
	bluesmoke-devel, Linux Edac Mailing List

On Mon, May 24, 2010 at 10:35:21AM -0700, Tony Luck wrote:
> On Mon, May 24, 2010 at 8:55 AM, Russ Anderson <rja@sgi.com> wrote:
> > ia64 had the Intel defined MCA Spec which defined the interaction
> > between SAL and the kernel.  x86 does not have a similar well
> > defined way of how errors should be handled.  It would be
> > good to agree on how the errors should be handled.
> 
> X86 has machine check registers defined by the SDM. It also
> has some f/w <-> OS interactions defined by the APEI sections
> in the latest ACPI spec (chapter 17 of the 4.0a spec released
> last month - see http://acpi.info). Some parts look cleaner than

I should add that the Intel Software Developer's Manual has quite
precise guidelines on what to do (and the Linux MCE code implements
nearly all of that faithfully).

The ACPI spec isn't quite as precise unfortunately.

-Andi

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: Hardware Error Kernel Mini-Summit
  2010-05-19  7:09                       ` Ingo Molnar
@ 2010-06-14 10:03                         ` Nils Carlson
  -1 siblings, 0 replies; 108+ messages in thread
From: Nils Carlson @ 2010-06-14 10:03 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Borislav Petkov, Hidetoshi Seto, Luck, Tony,
	Mauro Carvalho Chehab, Young, Brent, Linux Kernel Mailing List,
	bluesmoke-devel, Andi Kleen, Eric W. Biederman, Doug Thompson,
	Joe Perches, Thomas Gleixner, Linux Edac Mailing List,
	Ingo Molnar, Matt Domsch


On May 19, 2010, at 9:09 AM, Ingo Molnar wrote:
>
> Basically the idea behind the generic structured logging
> framework (the perf events kernel subsystem) is to have
> both ASCII output (where desired: critical errors), but to
> also have well-specified event format parsable to
> user-space tools.
>
> Plus there's the need for fast, lightweight, flexible
> event passing mechanism - which is given by the perf
> events transport which enables arbitrary size in-memory
> ring-buffers, poll() and epoll support, etc.
>
> perf events supports all these different usecases and
> comes with a (constantly growing) set of events already
> defined upstream. We've got more than a dozen different
> upstream subsystems that have defined events and we have
> over a hundred individual events. There's a rapidly
> growing tool space that makes case by case use of these
> event sources to measure/observe various aspects of the
> system.
>
> Regarding dmesg, there's a WIP patch on lkml that
> integrates printks into this framework as well - makes
> each printk also available as a special string event.
>
> That way a tool can have both programmatic access to
> printk output (without having to interact with the syslog
> buffer itself) - together with all the other structured
> log sources, while humans can also see what is happening.

Just left the above for reference. How would this affect other
aspects of EDAC such as the error injection, the sysfs
entries that (in most cases) reflect the layout of dimm's, and
allow the setting of scrub rate? If we're just talking about
replacing all instances of printk (when logging single bit
errors) with perf events, I don't really see that as a problem.
But EDAC is much more than that today...

Thoughts, comments?

/Nils Carlson

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: Hardware Error Kernel Mini-Summit
  2010-06-14 10:03                         ` Nils Carlson
@ 2010-06-14 11:49                           ` Andi Kleen
  -1 siblings, 0 replies; 108+ messages in thread
From: Andi Kleen @ 2010-06-14 11:49 UTC (permalink / raw)
  To: Nils Carlson
  Cc: Ingo Molnar, Borislav Petkov, Hidetoshi Seto, Luck, Tony,
	Mauro Carvalho Chehab, Young, Brent, Linux Kernel Mailing List,
	bluesmoke-devel, Andi Kleen, Eric W. Biederman, Doug Thompson,
	Joe Perches, Thomas Gleixner, Linux Edac Mailing List,
	Ingo Molnar, Matt Domsch

> Just left the above for reference. How would this affect other
> aspects of EDAC such as the error injection, the sysfs
> entries that (in most cases) reflect the layout of dimm's, and

Some of this can probably be retained, but the way EDAC
represents layout, for example, is quite unsuitable too. It includes
a lot of internal implementation details that in some cases
you can't even get anymore on modern designs. Something
with a proper abstract interface is better.  EDAC never had this.

Also the biggest problem is still that EDAC doesn't
give you any silk screen labels, so unless you 
have motherboard schematics the layout it presents
is fairly useless -- you still don't know which DIMM
to exchange. So in theory EDAC looks great, but in practice ...

On a lot of modern systems I checked, DMI
seems reasonably accurate in terms of layout, so I suspect they can
be handled with this. For others we probably
still need some special driver, but one
with a proper interface.
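
As a rough illustration of the DMI route, the following Python sketch pulls the slot labels out of `dmidecode -t 17` style text. The sample text is made up, the function names are hypothetical, and real BIOS tables vary in which fields they fill in, so this shows the idea only.

```python
# Made-up sample in the general shape of "dmidecode -t 17" output.
SAMPLE = """\
Handle 0x1100, DMI type 17, 34 bytes
Memory Device
    Size: 4096 MB
    Locator: DIMM_A1
    Bank Locator: Not Specified

Handle 0x1101, DMI type 17, 34 bytes
Memory Device
    Size: No Module Installed
    Locator: DIMM_A2
    Bank Locator: Not Specified
"""


def parse_memory_devices(text):
    """Return one dict of fields per 'Memory Device' record in the text."""
    devices = []
    current = None
    for line in text.splitlines():
        stripped = line.strip()
        if stripped == "Memory Device":
            current = {}
            devices.append(current)
        elif not stripped or stripped.startswith("Handle "):
            current = None               # blank line or new handle ends the record
        elif current is not None and ":" in stripped:
            key, _, value = stripped.partition(":")
            current[key.strip()] = value.strip()
    return devices


def populated_slots(devices):
    """Slot labels ('Locator' fields) of the slots that report a module."""
    return [d["Locator"] for d in devices
            if "Locator" in d and d.get("Size") != "No Module Installed"]


print(populated_slots(parse_memory_devices(SAMPLE)))
```

Whether the Locator string matches the silk screen on a given board still depends on the vendor filling in the table correctly.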

For error injection: some modern systems support this
through ACPI EINJ, which has a separate non-EDAC
interface. For others I've simply been using some scripts
that twiddle the bits from user space. You can do that
with a shell script. If it were to stay in the kernel
it could probably be moved into a proper error injection
framework that is not arbitrarily tied to memory.
Lots of different devices have error injection
support, and exposing some of that in a general
framework would likely make sense.

Anyways the old EDAC drivers for this are not going
away, you can still use them. The interesting
question though is how to properly define the interface
for new hardware.

> allow the setting of scrub rate? If we're just talking about

I never quite saw the point of that one, but yes
there's no replacement for this anywhere else.

Normally scrub rate can be simply set in the BIOS,
is that not good enough? Is there a use case for
changing it dynamically? 

Note that modern hardware typically has demand scrubbing
anyways, that is when there is an error it automatically
scrubs.

> replacing all instances of printk (when logging single bit
> errors) with perf events, I don't really see that as a problem.

I don't think perf is the right tool for this, the semantics
are mostly unsuitable (it hasn't been designed as an error reporting
tool, but as a performance tool and performance events are quite
different from errors) and it doesn't provide most of the infrastructure
needed for it anyways.

> But EDAC is much more than that today...

Well it's a hodge podge of quite a lot of odd bits.
I'm not sure "more" is the right word.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: Hardware Error Kernel Mini-Summit
  2010-06-14 11:49                           ` Andi Kleen
@ 2010-06-14 19:47                             ` Nils Carlson
  -1 siblings, 0 replies; 108+ messages in thread
From: Nils Carlson @ 2010-06-14 19:47 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Ingo Molnar, Borislav Petkov, Hidetoshi Seto, Luck, Tony,
	Mauro Carvalho Chehab, Young, Brent, Linux Kernel Mailing List,
	bluesmoke-devel, Eric W. Biederman, Doug Thompson, Joe Perches,
	Thomas Gleixner, Linux Edac Mailing List, Ingo Molnar,
	Matt Domsch

On Jun 14, 2010, at 1:49 PM, Andi Kleen wrote:

>> Just left the above for reference. How would this affect other
>> aspects of EDAC such as the error injection, the sysfs
>> entries that (in most cases) reflect the layout of dimm's, and
>
> Some of this can be probably retained, about the way EDAC
> e.g. represents layout is quite unsuitable too. It includes
> a lot of internal implementation details that in some cases
> you can't even get anymore on modern design. Something
> with a proper abstract interface is better.  EDAC never had this.
>
A lot of core EDAC doesn't reflect modern motherboards, it's true.

> Also the biggest problem is still that EDAC doesn't
> give you any silk screen labels, so unless you
> have motherboard schematics the layout it presents
> is fairly useless -- you still don't know which DIMM
> to exchange. So in theory EDAC looks great, but in practice ...
>
I do have motherboard schematics, or rather, we build our own
boards. But the point is valid, a lot of people don't make their own
hardware. On the other hand, the people who do use this part of
EDAC perhaps aren't your typical home computer users?

> On a lot of modern systems I checked DMI
> seems reasonably accurate in terms of layout, so I suspect they can
> be handled with this. For others probably
> still need some special driver, but one
> with a proper interface.
>
> For error injection: some modern systems support this
> through ACPI EINJ which has a separate non EDAC
> interface. For others I've been simply using some scripts
> that twiddle the bits from user space. You can do that
> with a shell script. If it was staying in the kernel
> it could be probably moved into a proper error injection
> framework that is not arbitrarily tied to memory.
> Lots of different devices have error injection
> support and exposing some of that in a general
> framework would likely make sense.
>

This is true, and this is the way things are going on
our end as well. I guess that would mean
one driver that hooks into all frameworks though?
So you wouldn't go to the EDAC sysfs directory
to find everything to do with the same piece of hardware
anymore, but would have to go to n different
directories looking for all the pieces? I don't really
like that...

> Anyways the old EDAC drivers for this are not going
> away, you can still use them. The interesting
> question though is how to properly define the interface
> for new hardware.

But all new hardware will look the way the hardware
designers want it to, so our interface will be a moving
target? Maybe it's time to let hardware makers provide
a board specification with device tree and memory
layout? (Pure speculation)

>
>> allow the setting of scrub rate? If we're just talking about
>
> I never quite saw the point of that one, but yes
> there's no replacement for this anywhere else.
>
> Normally scrub rate can be simply set in the BIOS,
> is that not good enough? Is there a use case for
> changing it dynamically?
>
> Note that modern hardware typically has demand scrubbing
> anyways, that is when there is an error it automatically
> scrubs.
>
There is a use case. A lot has to do with how different patrol
scrubbers work: some just go through memory at a constant
speed (MB/s), others vary according to load. The thing is,
different applications want their memory scrubbed within
different time frames, and since the amount of memory on boards
varies while the BIOS setting doesn't, this implies the need for
setting the scrub rate from userspace.
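
The arithmetic behind this is simple, and a small Python sketch may help: for a scrubber that walks memory at a constant rate, the bandwidth needed to cover all of memory within a given time frame scales with the amount of memory fitted. The function name and the 24h target below are assumptions for illustration, not values from any particular board or driver.

```python
GIB = 1 << 30


def required_scrub_rate(mem_bytes, window_seconds):
    """Minimum bytes/sec for a constant-rate scrubber to cover mem_bytes in window_seconds."""
    if mem_bytes <= 0 or window_seconds <= 0:
        raise ValueError("memory size and window must be positive")
    return -(-mem_bytes // window_seconds)   # ceiling division


# Same 24h target, different amounts of memory: the fixed rate a BIOS
# picked for one configuration does not fit the other.
for size_gib in (4, 64):
    rate = required_scrub_rate(size_gib * GIB, 24 * 3600)
    print(f"{size_gib:3d} GiB -> {rate} bytes/sec")
```

EDAC drivers that support it expose an sdram_scrub_rate attribute (in bytes/sec) in the memory controller's sysfs directory, which is where a value like this would be written from userspace.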

Patrol scrubbing is normally used because it discovers errors
faster in seldom accessed memory allowing a DIMM with
too many errors to be replaced faster. Some applications
like to use demand scrubbing as well, and some consider
it to increase memory latency too much.

<snip>
>
>> But EDAC is much more than that today...
>
> Well it's a hodge podge of quite a lot of odd bits.
> I'm not sure "more" is the right word.

Oh, a hodge podge is much more than just single bit
correctable error reporting... :-) You never know what
you'll find in the sysfs directory for a given memory
controller.

/Nils Carlson

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: Hardware Error Kernel Mini-Summit
  2010-06-14 11:49                           ` Andi Kleen
@ 2010-06-14 20:06                             ` Eric W. Biederman
  -1 siblings, 0 replies; 108+ messages in thread
From: Eric W. Biederman @ 2010-06-14 20:06 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Nils Carlson, Ingo Molnar, Borislav Petkov, Hidetoshi Seto, Luck,
	Tony, Mauro Carvalho Chehab, Young, Brent,
	Linux Kernel Mailing List, bluesmoke-devel, Doug Thompson,
	Joe Perches, Thomas Gleixner, Linux Edac Mailing List,
	Ingo Molnar, Matt Domsch

Andi Kleen <andi@firstfloor.org> writes:

>> Just left the above for reference. How would this affect other
>> aspects of EDAC such as the error injection, the sysfs
>> entries that (in most cases) reflect the layout of dimm's, and
>
> Some of this can be probably retained, about the way EDAC
> e.g. represents layout is quite unsuitable too. It includes
> a lot of internal implementation details that in some cases
> you can't even get anymore on modern design. Something
> with a proper abstract interface is better.  EDAC never had this.

It sounds like you can't be bothered to understand the EDAC code,
or the fact that some users actually like to know when their hardware
is having problems.

> Also the biggest problem is still that EDAC doesn't
> give you any silk screen labels, so unless you 
> have motherboard schematics the layout it presents
> is fairly useless -- you still don't know which DIMM
> to exchange. So in theory EDAC looks great, but in practice ...

- In practice it works even without silk screen labels.
- The current EDAC code displays which DIMMs you have plugged
  in, so if you unplug one you can tell whether it was the DIMM
  you were aiming at.

> On a lot of modern systems I checked DMI
> seems reasonably accurate in terms of layout, so I suspect they can 
> be handled with this. For others probably
> still need some special driver, but one 
> with a proper interface.

DMI is great on the days it works, but there is a lot of variation
between BIOSes.  Also, if the information is decent it can be
used to inform the current EDAC code as well as anything else.

You mean an interface that doesn't report the error so people
won't complain to you about a near useless kernel error
message.

> Anyways the old EDAC drivers for this are not going
> away, you can still use them. The interesting
> question though is how to properly define the interface
> for new hardware.
>
>> allow the setting of scrub rate? If we're just talking about
>
> I never quite saw the point of that one, but yes
> there's no replacement for this anywhere else.
>
> Normally scrub rate can be simply set in the BIOS,
> is that not good enough? Is there a use case for
> changing it dynamically? 
>
> Note that modern hardware typically has demand scrubbing
> anyways, that is when there is an error it automatically
> scrubs.

Setting the scrub rate isn't half so interesting as displaying
it.

Having basic hardware information displayed in sysfs seems to be the
design of the rest of Linux.  I don't see abandoning that part of the
EDAC design as wise.

Displaying the fact that ECC is turned on in the hardware is one
of the more interesting bits.  That at least allows you to verify
that things are working.

>> replacing all instances of printk (when logging single bit
>> errors) with perf events, I don't really see that as a problem.
>
> I don't think perf is the right tool for this, the semantics
> are mostly unsuitable (it hasn't been designed as an error reporting
> tool, but as a performance tool and performance events are quite
> different from errors) and it doesn't provide most of the infrastructure
> needed for it anyways.

I will agree with that.  The argument that errors that should only
happen rarely need a high performance handler seems to indicate
there is some deep misunderstanding of the code.

>> But EDAC is much more than that today...
>
> Well it's a hodge podge of quite a lot of odd bits.
> I'm not sure "more" is the right word.

If the basic errors could be posted in some kind of NMI/machine check
safe data structure it would not be hard to get EDAC drivers to
consume them.

Eric

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: Hardware Error Kernel Mini-Summit
@ 2010-06-14 20:06                             ` Eric W. Biederman
  0 siblings, 0 replies; 108+ messages in thread
From: Eric W. Biederman @ 2010-06-14 20:06 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Hidetoshi Seto, Luck, Tony, Mauro Carvalho Chehab, Young, Brent,
	Linux Kernel Mailing List, Borislav Petkov, Ingo Molnar,
	Thomas Gleixner, Matt Domsch, Doug Thompson, Joe Perches,
	Ingo Molnar, bluesmoke-devel, Linux Edac Mailing List

Andi Kleen <andi@firstfloor.org> writes:

>> Just left the above for reference. How would this affect other
>> aspects of EDAC such as the error injection, the sysfs
>> entries that (in most cases) reflect the layout of dimm's, and
>
> Some of this can be probably retained, about the way EDAC
> e.g. represents layout is quite unsuitable too. It includes
> a lot of internal implementation details that in some cases
> you can't even get anymore on modern design. Something
> with a proper abstract interface is better.  EDAC never had this.

It sounds like you can't be bothered to understand the EDAC code,
or the fact that some users actually like to know when their hardware
is having problems.

> Also the biggest problem is still that EDAC doesn't
> give you any silk screen labels, so unless you 
> have motherboard schemantics the layout it presents
> is fairly useless -- you still don't know which DIMM
> to exchange. So in theory EDAC looks great, but in practice ...

- In practice it works even without silk screen labels.
- The current EDAC code displays which DIMMS you have plugged
  in so you can tell if you unplug one, if it was the DIMM
  you were aiming at.

> On a lot of modern systems I checked DMI
> seems reasonably accurate in terms of layout, so I suspect they can 
> be handled with this. For others probably
> still need some special driver, but one 
> with a proper interface.

DMI is great on the days it works, there is a lot of variations
between BIOS's.  Also if the information is decent it can be
used to inform the current EDAC code as well as anything else.

You mean an interface that doesn't report the error so people
won't complain to you about a near useless kernel error
message.

> Anyways the old EDAC drivers for this are not going
> away, you can still use them. The interesting
> question though is how to properly define the interface
> for new hardware.
>
>> allow the setting of scrub rate? If we're just talking about
>
> I never quite saw the point of that one, but yes
> there's no replacement for this anywhere else.
>
> Normally scrub rate can be simply set in the BIOS,
> is that not good enough? Is there a use case for
> changing it dynamically? 
>
> Note that modern hardware typically has demand scrubbing
> anyways, that is when there is an error it automatically
> scrubs.

Setting the scrub rate isn't half so interesting as displaying
it.

Having basic hardware information displayed in sysfs seems to be the
design of the rest of linux.  I don't see abandoning that part of the
EDAC design as wise.

Displaying the fact that ECC is turned on in the hardware is one
of the more interesting bits.  That at least allows you to verify
that things are working.

>> replacing all instances of printk (when logging single bit
>> errors) with perf events, I don't really see that as a problem.
>
> I don't think perf is the right tool for this, the semantics
> are mostly unsuitable (it hasn't been designed as a error reporting
> tool, but as a performance tool and performance events are quite
> different from errors) and it doesn't provide most of the infrastructure
> needed for it anyways.

I will agree with that.  The argument that errors that should only
happen rarely need a high performance handler seems to indicate
there is some deep misunderstanding of the code.

>> But EDAC is much more than that today...
>
> Well it's a hodge podge of quite a lot of odd bits.
> I'm not sure "more" is the right word.

If the basic errors could be posted in some kind of NMI/machine check
safe data structure it would not be hard to get EDAC drivers to
consume them.

Eric

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: Hardware Error Kernel Mini-Summit
  2010-06-14 19:47                             ` Nils Carlson
@ 2010-06-14 20:21                               ` Andi Kleen
  -1 siblings, 0 replies; 108+ messages in thread
From: Andi Kleen @ 2010-06-14 20:21 UTC (permalink / raw)
  To: Nils Carlson
  Cc: Andi Kleen, Ingo Molnar, Borislav Petkov, Hidetoshi Seto, Luck,
	Tony, Mauro Carvalho Chehab, Young, Brent,
	Linux Kernel Mailing List, bluesmoke-devel, Eric W. Biederman,
	Doug Thompson, Joe Perches, Thomas Gleixner,
	Linux Edac Mailing List, Ingo Molnar, Matt Domsch

On Mon, Jun 14, 2010 at 09:47:33PM +0200, Nils Carlson wrote:
>> Also the biggest problem is still that EDAC doesn't
>> give you any silk screen labels, so unless you
>> have motherboard schematics the layout it presents
>> is fairly useless -- you still don't know which DIMM
>> to exchange. So in theory EDAC looks great, but in practice ...
>>
> I do have motherboard schematics, or rather, we build our own
> boards. But the point is valid, a lot of people don't make their own

Just supply correct DMI tables then?

> hardware. On the other hand, the people who do use this part of
> EDAC perhaps aren't your typical home computer users?

Most users do not build their own boards and do not have
schematics. And those are not only home computer users.

Anyway, I think the important thing is that by default you get
something useful (including silk screen labels) without doing
any special configuration steps.

Right now DMI is the only sane option for this that I can see.
EDAC doesn't do it because it has no silk screen labels.

And yes if someone is a power user they could still override
that. Just by default it has to do something reasonable.

>
> This is true, and this is the way things are going on
> our end as well. I guess that would mean
> So you wouldn't go to the EDAC sysfs directory
> to find everything to do with the same piece of hardware
> anymore, but would have to go the n different
> directories looking for all the pieces? I don't really
> like that...

Let me try to understand that.

You want to inject errors on a random computer you don't
know anything about? Do you do that frequently? Why
are you doing this? 

Obviously there needs to be a way to identify to what
hardware an error injector belongs.

>
>> Anyways the old EDAC drivers for this are not going
>> away, you can still use them. The interesting
>> question though is how to properly define the interface
>> for new hardware.
>
> But all new hardware will look the way the hardware
> designers want it to, so our interface will be a moving
> target? Maybe it's time to let hardware makers provide

You can define relatively abstract interfaces.

It's just that EDAC is not it. Such interfaces may not be perfectly
future proof (after all, who knows what the memories of quantum
computers or whatever will look like), but hopefully they are
at least reasonably forward looking.

e.g. for memory layout imho a reasonable way
is to just define it as

DIMM  (if you need below that look at a log) 
 \-------- silk screen label (most important attribute!)
 |
abstract path. This can be an arbitrary string. e.g. MC0/Ch1/DIMM0
 |             Or MC0/BOB0/Ch1/DIMM3
 |             Parsers don't need to know any details about it.
 |
socket

You can even represent that as a flat data structure;
there is no need to really map the abstract path to directories
(that just makes parsers difficult to write -- most sysfs
parsers traditionally have trouble with varying directories).
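
A minimal sketch of such a flat representation, purely as an
illustration: the record type and the lookup helper are hypothetical
names, not an existing kernel or tool interface. The abstract path is
kept as an opaque string and only ever compared whole, never split.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DimmRecord:
    """One flat record per DIMM; all field names are illustrative."""
    socket: int   # CPU socket the DIMM hangs off
    path: str     # opaque abstract path, e.g. "MC0/Ch1/DIMM0"
    label: str    # silk screen label printed on the board


def find_label(records, socket, path):
    """Return the silk screen label for (socket, path), or None if unknown.

    The path is matched as a whole string, so the caller needs no
    knowledge of how many levels (MC, BOB, channel, ...) it contains.
    """
    for rec in records:
        if rec.socket == socket and rec.path == path:
            return rec.label
    return None
```

Because every record has the same three fields, a deeper topology such as
MC0/BOB0/Ch1/DIMM3 needs no change to the consumer.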



> a board specification with device tree and memory
> layout? (Pure speculation)

That's DMI on x86! 

Well it's not perfect, but also not too bad.


> There is a use-case. A lot has to do with how different patrol
> scrub rates work, some just go through memory at a constant
> speed (MB/s), others vary according to load. The thing is,
> different applications want their memory scrubbed within
> different time frames, and as the amount of memory on boards

What's the theory behind varying scrub rates? 
I would be interested in more details.

> Patrol scrubbing is normally used because it discovers errors
> faster in seldom accessed memory allowing a DIMM with
> too many errors to be replaced faster. Some applications

Yes, but why do you want to vary the rate?
Normally it should just depend on memory size and expected
error rate (that is the more memory the faster you scrub) 

> like to use demand scrubbing as well, and some consider
> it to increase memory latency too much.

That sounds odd -- if you have so many errors that you worry
about that you have other problems definitely? 
Is this based on some benchmarking?

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 108+ messages in thread

* RE: Hardware Error Kernel Mini-Summit
  2010-06-14 20:06                             ` Eric W. Biederman
@ 2010-06-14 20:21                               ` Luck, Tony
  -1 siblings, 0 replies; 108+ messages in thread
From: Luck, Tony @ 2010-06-14 20:21 UTC (permalink / raw)
  To: Eric W. Biederman, Andi Kleen
  Cc: Nils Carlson, Ingo Molnar, Borislav Petkov, Hidetoshi Seto,
	Mauro Carvalho Chehab, Young, Brent, Linux Kernel Mailing List,
	bluesmoke-devel, Doug Thompson, Joe Perches, Thomas Gleixner,
	Linux Edac Mailing List, Ingo Molnar, Matt Domsch

>- In practice it works even without silk screen labels.
>- The current EDAC code displays which DIMMS you have plugged
>  in so you can tell if you unplug one, if it was the DIMM
>  you were aiming at.

Potentially some unnecessary reboot cycles needed:
 - power off, pull a DIMM, power on, check with EDAC
 - repeat until you get the right DIMM

This also assumes that you can unplug just one DIMM. Some
motherboards require pairs of DIMMs to be added/removed together.

-Tony

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: Hardware Error Kernel Mini-Summit
  2010-06-14 20:06                             ` Eric W. Biederman
@ 2010-06-14 20:36                               ` Andi Kleen
  -1 siblings, 0 replies; 108+ messages in thread
From: Andi Kleen @ 2010-06-14 20:36 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Andi Kleen, Nils Carlson, Ingo Molnar, Borislav Petkov,
	Hidetoshi Seto, Luck, Tony, Mauro Carvalho Chehab, Young, Brent,
	Linux Kernel Mailing List, bluesmoke-devel, Doug Thompson,
	Joe Perches, Thomas Gleixner, Linux Edac Mailing List,
	Ingo Molnar, Matt Domsch

On Mon, Jun 14, 2010 at 01:06:59PM -0700, Eric W. Biederman wrote:

Hi Eric,

> - The current EDAC code displays which DIMMS you have plugged
>   in so you can tell if you unplug one, if it was the DIMM
>   you were aiming at.

Binary search for bad DIMMs. The way to handle memory errors in
the 21st century.

Obviously that does not really work, especially not on large
memory systems.

> > On a lot of modern systems I checked DMI
> > seems reasonably accurate in terms of layout, so I suspect they can 
> > be handled with this. For others probably
> > still need some special driver, but one 
> > with a proper interface.
> 
> DMI is great on the days it works, but there is a lot of variation
> between BIOSes.  Also, if the information is decent it can be
> used to inform the current EDAC code as well as anything else.

No, the DMI layout is unfortunately difficult to map to the EDAC layout.
That's mostly EDAC's fault actually.

A sane EDAC replacement could be fed from DMI.
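
To make "fed from DMI" a bit more concrete, here is a small sketch that
pulls the silk screen style labels out of the text that dmidecode prints
for Memory Device (type 17) records. The sample text is invented for
illustration, the function name is hypothetical, and real BIOSes may
leave the Locator fields empty or wrong, so treat it as an assumption
about well-behaved tables rather than a general solution.

```python
# Invented sample in the style of "dmidecode -t 17" output.
SAMPLE = """\
Handle 0x0040, DMI type 17, 34 bytes
Memory Device
    Size: 4096 MB
    Locator: DIMM_A1
    Bank Locator: NODE 0 CHANNEL 0

Handle 0x0041, DMI type 17, 34 bytes
Memory Device
    Size: No Module Installed
    Locator: DIMM_A2
    Bank Locator: NODE 0 CHANNEL 1
"""


def parse_memory_devices(text):
    """Return one dict per Memory Device record found in dmidecode-style text."""
    devices = []
    for block in text.split("\n\n"):          # records are separated by blank lines
        lines = [ln.strip() for ln in block.splitlines()]
        if "Memory Device" not in lines:
            continue
        fields = {}
        for ln in lines:
            key, sep, value = ln.partition(": ")
            if sep:                            # skip lines without "key: value"
                fields[key] = value
        devices.append({
            "label": fields.get("Locator"),
            "bank": fields.get("Bank Locator"),
            "size": fields.get("Size"),
        })
    return devices
```

parse_memory_devices(SAMPLE) yields two records, one per slot, with the
empty slot still enumerated.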

> You mean an interface that doesn't report the error so people
> won't complain to you about a near useless kernel error
> message.

DMI[1] does not report the errors; the errors are in machine checks
(or possibly other non-architectural registers).
DMI just gives you enumeration. It doesn't give everything,
but it's reasonably complete at least.


[1] except for the event log, but I'm not proposing to use that.
> 
> Setting the scrub rate isn't half so interesting as displaying
> it.

I still would like to understand the idea behind this varying 
at all. If you have any deeper thoughts on this please send them.

> 
> Having basic hardware information displayed in sysfs seems to be the
> design of the rest of linux.  I don't see abandoning that part of the
> EDAC design as wise.
> 
> Displaying the fact that ECC is turned on in the hardware is one
> of the more interesting bits.  That at least allows you to verify
> that things are working.

There are hundreds to thousands of BIOS level hardware knobs for memory
configuration (and if you count all BIOS knobs for everything far more) 

Why do you want to check a single bit only? (which is actually not
a single bit but also a lot of different ways to set this)

I can see there's a need to check that BIOS are doing the right
thing, but you'll never get that from a few sysfs fields.
You need a proper tool that is written for the system in question.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: Hardware Error Kernel Mini-Summit
  2010-06-14 20:21                               ` Andi Kleen
  (?)
@ 2010-06-14 21:02                               ` Nils Carlson
  -1 siblings, 0 replies; 108+ messages in thread
From: Nils Carlson @ 2010-06-14 21:02 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Hidetoshi Seto, Luck, Tony, Mauro Carvalho Chehab, Young, Brent,
	Linux Kernel Mailing List, Borislav Petkov, Ingo Molnar,
	Thomas Gleixner, Eric W. Biederman, Doug Thompson, Joe Perches,
	Ingo Molnar, Matt Domsch, bluesmoke-devel,
	Linux Edac Mailing List


[-- Attachment #1.1: Type: text/plain, Size: 3403 bytes --]

On Jun 14, 2010, at 10:21 PM, Andi Kleen wrote:

> On Mon, Jun 14, 2010 at 09:47:33PM +0200, Nils Carlson wrote:
>>> Also the biggest problem is still that EDAC doesn't
>>> give you any silk screen labels, so unless you
>>> have motherboard schematics the layout it presents
>>> is fairly useless -- you still don't know which DIMM
>>> to exchange. So in theory EDAC looks great, but in practice ...
>>>
>> I do have motherboard schematics, or rather, we build our own
>> boards. But the point is valid, a lot of people don't make their own
>
> Just supply correct DMI tables then?

Will ask the BIOS team what they're currently doing...
Unfortunately most of them have a Windows world view,
and I know nothing of what Windows does.
>
>>
>> This is true, and this is the way things are going on
>> our end as well. I guess that would mean
>> So you wouldn't go to the EDAC sysfs directory
>> to find everything to do with the same piece of hardware
>> anymore, but would have to go the n different
>> directories looking for all the pieces? I don't really
>> like that...
>
> Let me try to understand that.
>
> You want to inject errors on a random computer you don't
> know anything about? Do you do that frequently? Why
> are you doing this?

Didn't I mention above that we make our own boards?
>
>> There is a use-case. A lot has to do with how different patrol
>> scrub rates work, some just go through memory at a constant
>> speed (MB/s), others vary according to load. The thing is,
>> different applications want their memory scrubbed within
>> different time frames, and as the amount of memory on boards
>
> What's the theory behind varying scrub rates?
> I would be interested in more details.

Don't know what sort of theory you're looking for... Think you'd have
to contact Intel and AMD and ask. My guess is that they vary so that
scrubbing doesn't interfere with memory usage, that is, higher load
leads to slower scrubbing. We measure it as best we can under
different loads.

>
>> Patrol scrubbing is normally used because it discovers errors
>> faster in seldom accessed memory allowing a DIMM with
>> too many errors to be replaced faster. Some applications
>
> Yes, but why do you want to vary the rate?
> Normally it should just depend on memory size and expected
> error rate (that is the more memory the faster you scrub)

Because different applications have decided on different
maximum times for scrubbing the memory. Why do applications
decide on different times? I don't know... But the telecoms business
is all about trade-offs and guaranteeing minimal down-time, so
normally it's a combination of system capacity weighted against
who knows what. If you're really interested I can ask some people,
I mostly provide hooks, no policy.
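
For a constant-speed patrol scrubber the relation between a maximum
scrub time and the rate is simple arithmetic, sketched below. The
function names are made up for illustration, and load-dependent
scrubbers will not follow this exactly, which is why they have to be
measured.

```python
def required_scrub_rate(memory_bytes, deadline_seconds):
    """Bytes per second needed to cover all memory once within the deadline."""
    return memory_bytes / deadline_seconds


def full_pass_seconds(memory_bytes, rate_bytes_per_second):
    """Seconds a constant-rate scrubber needs for one pass over all memory."""
    return memory_bytes / rate_bytes_per_second
```

Doubling the installed memory while keeping the same deadline doubles
the required rate, which is the trade-off against memory bandwidth that
applications weigh differently.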

>
>> like to use demand scrubbing as well, and some consider
>> it to increase memory latency too much.
>
> That sounds odd -- if you have so many errors that you worry
> about that you have other problems definitely?
> Is this based on some benchmarking?
>
Demand scrubbing does increase memory latency, see for example
http://www.intel.com/products/server/chipsets/5100-memory-controller-hub/5100-memory-controller-hub-overview.htm

And other problems, well, yes. Our systems run in an embedded
environment where sending a service technician to replace a DIMM
is very expensive. So the faster errors are detected the better.

/Nils

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: Hardware Error Kernel Mini-Summit
  2010-06-14 20:36                               ` Andi Kleen
@ 2010-06-14 21:34                                 ` Tony Luck
  -1 siblings, 0 replies; 108+ messages in thread
From: Tony Luck @ 2010-06-14 21:34 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Eric W. Biederman, Nils Carlson, Ingo Molnar, Borislav Petkov,
	Hidetoshi Seto, Mauro Carvalho Chehab, Young, Brent,
	Linux Kernel Mailing List, bluesmoke-devel, Doug Thompson,
	Joe Perches, Thomas Gleixner, Linux Edac Mailing List,
	Ingo Molnar, Matt Domsch

On Mon, Jun 14, 2010 at 1:36 PM, Andi Kleen <andi@firstfloor.org> wrote:
> On Mon, Jun 14, 2010 at 01:06:59PM -0700, Eric W. Biederman wrote:
>> Displaying the fact that ECC is turned on in the hardware is one
>> of the more interesting bits.  That at least allows you to verify
>> that things are working.
>
> There are hundreds to thousands of BIOS level hardware knobs for memory
> configuration (and if you count all BIOS knobs for everything far more)
>
> Why do you want to check a single bit only? (which is actually not
> a single bit but also a lot of different ways to set this)

There was a case mentioned at the collaboration summit
meeting where a BIOS bug mis-reported whether ECC was
enabled - claiming it was on, when in fact it was off.

Error injection could be used to check for another instance
of a lying BIOS (inject an error - make sure it gets counted).
Not as direct as seeing that the right bits are enabled in the
memory controller configuration registers, but still effective.
Perhaps more so as this technique validates different pieces
of the chipset specific code against each other. An EDAC
driver that tells you that ECC is enabled might be lying too,
if it is looking at the wrong bit or the wrong register.
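
The "inject an error, make sure it gets counted" check could look
roughly like the sketch below. It assumes a per-controller ce_count
attribute as in the classic EDAC sysfs layout; the injection step is
hardware specific, so it is passed in as a caller-supplied function, and
both helper names are hypothetical.

```python
from pathlib import Path


def read_ce_count(mc_dir):
    """Read the corrected-error counter of one memory controller directory."""
    return int(Path(mc_dir, "ce_count").read_text().strip())


def injection_is_counted(mc_dir, inject):
    """Return True if the counter went up after calling inject().

    inject is whatever platform-specific routine triggers a correctable
    error; a real check may also need to wait for the driver to poll.
    """
    before = read_ce_count(mc_dir)
    inject()
    after = read_ce_count(mc_dir)
    return after > before
```

A False result means either the injection did not take effect or errors
are not being corrected and counted, which is exactly the lying-BIOS case
worth investigating.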

-Tony

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: Hardware Error Kernel Mini-Summit
  2010-06-14 21:34                                 ` Tony Luck
  (?)
@ 2010-06-14 23:46                                 ` Doug Thompson
  2010-06-15  6:56                                     ` Andi Kleen
  -1 siblings, 1 reply; 108+ messages in thread
From: Doug Thompson @ 2010-06-14 23:46 UTC (permalink / raw)
  To: Andi Kleen, Tony Luck
  Cc: Hidetoshi Seto, Mauro Carvalho Chehab, BrentYoung,
	Linux Kernel Mailing List, Borislav Petkov, Ingo Molnar,
	Thomas Gleixner, Eric W. Biederman, Doug Thompson, Joe Perches,
	Ingo Molnar, Matt Domsch, bluesmoke-devel,
	Linux Edac Mailing List


[-- Attachment #1.1: Type: text/plain, Size: 252 bytes --]


Maybe I didn't see it covered (or I missed it), but EDAC is used on more than just x86 based machines, though they are the majority by volume. We should have an abstraction that covers all the archs, like we do with other subsystems of Linux.

doug t

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: Hardware Error Kernel Mini-Summit
  2010-06-14 21:34                                 ` Tony Luck
@ 2010-06-15  6:44                                   ` Andi Kleen
  -1 siblings, 0 replies; 108+ messages in thread
From: Andi Kleen @ 2010-06-15  6:44 UTC (permalink / raw)
  To: Tony Luck
  Cc: Andi Kleen, Eric W. Biederman, Nils Carlson, Ingo Molnar,
	Borislav Petkov, Hidetoshi Seto, Mauro Carvalho Chehab, Young,
	Brent, Linux Kernel Mailing List, bluesmoke-devel, Doug Thompson,
	Joe Perches, Thomas Gleixner, Linux Edac Mailing List,
	Ingo Molnar, Matt Domsch

> There was a case mentioned at the collaboration summit
> meeting where a BIOS bug mis-reported whether ECC was
> enabled - claiming it was on, when in fact it was off.

Yes I heard about that, but since it's not a single bit setting
there are lots of different ways it could be broken in theory.

To check it you really need to have a tool that knows about
all the registers and checks them all.

It's a bit like checking if someone speaks a foreign language
by asking them a single question with a one letter answer.

> of the chipset specific code against each other. An EDAC
> driver that tells you that ECC is enabled might be lying too,
> if it is looking at the wrong bit or the wrong register.

Yep.

It's asking a question with a one word answer where you don't
know the correct answer.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: Hardware Error Kernel Mini-Summit
  2010-06-14 23:46                                 ` Doug Thompson
@ 2010-06-15  6:56                                     ` Andi Kleen
  0 siblings, 0 replies; 108+ messages in thread
From: Andi Kleen @ 2010-06-15  6:56 UTC (permalink / raw)
  To: Doug Thompson
  Cc: Andi Kleen, Tony Luck, Eric W. Biederman, Nils Carlson,
	Ingo Molnar, Borislav Petkov, Hidetoshi Seto,
	Mauro Carvalho Chehab, BrentYoung, Linux Kernel Mailing List,
	bluesmoke-devel, Doug Thompson, Joe Perches, Thomas Gleixner,
	Linux Edac Mailing List, Ingo Molnar, Matt Domsch

On Mon, Jun 14, 2010 at 04:46:40PM -0700, Doug Thompson wrote:

Hi Doug,

> 
> Maybe I didn't see it covered (or I missed it), but EDAC is used on more than just x86 based machines, though they are the majority by volume. We should have an abstraction that covers all the archs, like we do with other subsystems of Linux.

The way I envision it working is that an abstracted DIMM interface
(or edac2 or whatever you want to call it) can be fed from any reasonable
DIMM layout driver. This could be either DMI on x86 or some other
driver. There would be nothing really x86 specific about that.

That said, I think the overall focus for memory error handling
should be on smart event handling, not dumb accounting.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: Hardware Error Kernel Mini-Summit
  2010-06-15  6:56                                     ` Andi Kleen
@ 2010-06-15  8:06                                       ` Nils Carlson
  -1 siblings, 0 replies; 108+ messages in thread
From: Nils Carlson @ 2010-06-15  8:06 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Doug Thompson, Tony Luck, Eric W. Biederman, Nils Carlson,
	Ingo Molnar, Borislav Petkov, Hidetoshi Seto,
	Mauro Carvalho Chehab, BrentYoung, Linux Kernel Mailing List,
	bluesmoke-devel, Doug Thompson, Joe Perches, Thomas Gleixner,
	Linux Edac Mailing List, Ingo Molnar, Matt Domsch, Nils Carlson

On Tue, 15 Jun 2010, Andi Kleen wrote:

> On Mon, Jun 14, 2010 at 04:46:40PM -0700, Doug Thompson wrote:
>
> Hi Doug,
>
> >
> > Maybe I didn't see it covered (or I missed it), but EDAC is used on more than just x86 based machines, though they are the majority by volume. We should have an abstraction that covers all the archs, like we do with other subsystems of Linux.
>
> The way I envision it working is that an abstracted DIMM interface
> (or edac2 or whatever you want to call it) can be fed from any reasonable
> DIMM layout driver. This could be either DMI on x86 or some other
> driver. There would be nothing really x86 specific about that.

Could you maybe provide some references on how DIMM layout
could be read from DMI? I can't find anything nearly this specific,
or is it something we're expecting to happen in future BIOSes?

Also, there would probably need to be some standard describing
different DIMM layouts in general, though maybe such a thing exists.

In other words, there would have to be some way of ascertaining
that the info you read from DMI is sufficient to decode MCEs so that
a faulting DIMM can be identified. In an ideal world, this could
be tested by some simple tool that could be run by the BIOS writers
to test that they're providing the OS with sufficient info.

/Nils

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: Hardware Error Kernel Mini-Summit
  2010-06-15  8:06                                       ` Nils Carlson
@ 2010-06-15 10:01                                         ` Borislav Petkov
  -1 siblings, 0 replies; 108+ messages in thread
From: Borislav Petkov @ 2010-06-15 10:01 UTC (permalink / raw)
  To: Nils Carlson
  Cc: Andi Kleen, Doug Thompson, Tony Luck, Eric W. Biederman,
	Ingo Molnar, Hidetoshi Seto, Mauro Carvalho Chehab, BrentYoung,
	Linux Kernel Mailing List, bluesmoke-devel, Doug Thompson,
	Joe Perches, Thomas Gleixner, Linux Edac Mailing List,
	Ingo Molnar, Matt Domsch, Nils Carlson

From: Nils Carlson <nils.carlson@ludd.ltu.se>
Date: Tue, Jun 15, 2010 at 04:06:33AM -0400

> On Tue, 15 Jun 2010, Andi Kleen wrote:
> 
> > On Mon, Jun 14, 2010 at 04:46:40PM -0700, Doug Thompson wrote:
> >
> > Hi Doug,
> >
> > >
> > > Maybe I didn't see it covered (or I missed it), but EDAC is used on more than just x86 based machines, though they are the majority by volume. We should have an abstraction that covers all the archs, like we do with other subsystems of Linux.
> >
> > The way I envision it working is that an abstracted DIMM interface
> > (or edac2 or whatever you want to call it) can be fed from any reasonable
> > DIMM layout driver. This could be either DMI on x86 or some other
> > driver. There would be nothing really x86 specific about that.
> 
> Could you maybe provide some references on how DIMM layout
> could be read from DMI? I can't find anything nearly this specific,
> or is it something we're expecting to happen in future BIOS's?
> 
> Also, there would probably need to be some standard describing
> different DIMM layouts in general, though maybe such a thing exists.
> 
> In other words, there would have to be some way of ascertaining
> that the info you read from DMI is sufficient to decode MCEs so that
> a faulting DIMM can be identified. In an ideal world, this could
> be tested by some simple tool that could be run by the BIOS writers
> to test that they're providing the OS with sufficient info.

You cannot decode an ECC error to a DIMM using only DMI info - at least
on AMD you cannot. The MCE contains the physical address where the ECC
error happened, and you need EDAC to convert this to a chip select row.
Additionally, you need the error syndrome, depending on the addressing
mode the DRAM controller uses.

Now, after you have the chip select row, you need to map this to a DIMM
rank, and in order to do that you need the DIMM info which is in the
SPD ROM (one of the fields in the SPD is the DIMM rank, which is needed
to unambiguously pinpoint which DIMM is generating those errors). Then
you can use the DMI info - assuming it contains the correct silk screen
labels on the motherboard - to map to a DIMM.

What EDAC currently does is decode the ECC error to a chip select - what
we need is some I2C/SMBus code which can read the SPD ROM. I haven't had
the time to look into it yet, though.
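
The chain described above (physical address -> chip select row -> DIMM
rank -> silk screen label) could be sketched roughly as follows. This is
only an illustration under stated assumptions: the three tables are
hypothetical stand-ins for what EDAC, the SPD ROMs and DMI would supply,
chip selects are assumed to be handed out slot by slot with one per rank,
and interleaving and the error syndrome are ignored entirely.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CsrowRange:
    csrow: int
    base: int   # first physical address covered (inclusive)
    limit: int  # last physical address covered (inclusive)


# Hypothetical data. Real values would come from the memory controller
# registers (EDAC), the SPD ROM of each module, and the DMI tables.
CSROW_RANGES = [
    CsrowRange(csrow=0, base=0x0000_0000, limit=0x3FFF_FFFF),
    CsrowRange(csrow=1, base=0x4000_0000, limit=0x7FFF_FFFF),
    CsrowRange(csrow=2, base=0x8000_0000, limit=0xBFFF_FFFF),
]
SPD_RANKS = {0: 2, 1: 1}                       # slot index -> ranks on that DIMM
DMI_LABELS = {0: "DIMM_A1", 1: "DIMM_B1"}      # slot index -> silk screen label


def addr_to_csrow(addr: int) -> Optional[int]:
    """Step 1: physical address from the MCE -> chip select row."""
    for r in CSROW_RANGES:
        if r.base <= addr <= r.limit:
            return r.csrow
    return None


def csrow_to_slot(csrow: int) -> Optional[int]:
    """Step 2: chip select row -> DIMM slot, using the rank count per DIMM."""
    first = 0
    for slot in sorted(SPD_RANKS):
        ranks = SPD_RANKS[slot]
        if first <= csrow < first + ranks:
            return slot
        first += ranks
    return None


def addr_to_label(addr: int) -> Optional[str]:
    """Step 3: DIMM slot -> label, or None if any step cannot be resolved."""
    csrow = addr_to_csrow(addr)
    if csrow is None:
        return None
    slot = csrow_to_slot(csrow)
    if slot is None:
        return None
    return DMI_LABELS.get(slot)
```

Note that without the rank counts in SPD_RANKS the second step has no way
to tell whether csrow 1 belongs to the first or the second DIMM, which is
the ambiguity the SPD data is needed to resolve.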

-- 
Regards/Gruss,
Boris.

Operating Systems Research Center
Advanced Micro Devices, Inc.

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: Hardware Error Kernel Mini-Summit
  2010-06-15  8:06                                       ` Nils Carlson
@ 2010-06-15 11:41                                         ` Andi Kleen
  -1 siblings, 0 replies; 108+ messages in thread
From: Andi Kleen @ 2010-06-15 11:41 UTC (permalink / raw)
  To: Nils Carlson
  Cc: Andi Kleen, Doug Thompson, Tony Luck, Eric W. Biederman,
	Ingo Molnar, Borislav Petkov, Hidetoshi Seto,
	Mauro Carvalho Chehab, BrentYoung, Linux Kernel Mailing List,
	bluesmoke-devel, Doug Thompson, Joe Perches, Thomas Gleixner,
	Linux Edac Mailing List, Ingo Molnar, Matt Domsch, Nils Carlson

On Tue, Jun 15, 2010 at 10:06:33AM +0200, Nils Carlson wrote:

Hi Nils,

> Could you maybe provide some references on how DIMM layout
> could be read from DMI? I can't find anything nearly this specific,
> or is it something we're expecting to happen in future BIOS's?

The hardware (or BIOS) tells you the DIMM. You read the DIMMs
from DMI and map them using the locators. The locator strings
are not standardized, but there are not too many different
formats around, so they can be implemented.

Again this does not give you the full layout, but it gives
you a "path to a DIMM" and a DIMM locator.

An alternative is also to use the ACPI based reporting
mechanism which is needed on some systems. In this case
the CPER gives you a reference to the DMI object of the DIMM.

In principle DMI has more information (arrays, ranges etc.)
but in my experience that is not strong enough to really find
the DIMM on modern systems. You need hardware or BIOS help for this.

This is implemented in mcelog today.
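
As a rough illustration of the locator mapping (not the mcelog code), the
sketch below parses text in the style of `dmidecode -t 17` output and
builds a map from a (node, channel, dimm) path to the locator string. The
sample text, the bank locator format and the helper names are all
assumptions made for the example; a real tool would need one pattern per
locator format seen in the field.

```python
import re

# Illustrative sample only, written in the style of `dmidecode -t 17`.
SAMPLE = """\
Memory Device
    Size: 4096 MB
    Locator: DIMM_A1
    Bank Locator: NODE 0 CHANNEL 0 DIMM 0

Memory Device
    Size: No Module Installed
    Locator: DIMM_B1
    Bank Locator: NODE 0 CHANNEL 1 DIMM 0
"""

# One hypothetical bank locator format among the several that exist.
BANK_RE = re.compile(r"NODE (\d+) CHANNEL (\d+) DIMM (\d+)")


def parse_memory_devices(text):
    """Split the text into blank-line separated records of 'Key: value' fields."""
    devices = []
    for block in re.split(r"\n\s*\n", text):
        fields = {}
        for line in block.splitlines():
            if ":" in line:
                key, value = line.split(":", 1)
                fields[key.strip()] = value.strip()
        if "Locator" in fields:
            devices.append(fields)
    return devices


def build_path_map(devices):
    """Map (node, channel, dimm) -> locator string for populated slots only."""
    paths = {}
    for dev in devices:
        if dev.get("Size", "").startswith("No Module"):
            continue  # empty slot, nothing to report errors against
        match = BANK_RE.fullmatch(dev.get("Bank Locator", ""))
        if match:
            paths[tuple(int(g) for g in match.groups())] = dev["Locator"]
    return paths
```

With a map like this, a path reported by the hardware or BIOS can be
looked up directly; a locator that matches no known pattern simply does
not appear in the map, which is the failure mode to expect on boards with
unusual strings.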

> 
> Also, there would probably need to be some standard describing
> different DIMM layouts in general, though maybe such a thing exists.

I don't think the goal is to have a full DIMM layout. This will
never replace your schematics.

The goal is to find which DIMM has a problem. So have a path
and a locator. The path may tell you some additional information
(e.g. channel), but that's hard to generalize.

> 
> In other words, there would have to be some way of ascertaining
> that the info you read from DMI is sufficient to decode MCEs so that
> a faulting DIMM can be identified. In an ideal world, this could
> be tested by some simple tool that could be run by the BIOS writers
> to test that they're providing the OS with sufficient info.

That's difficult in a general way; you will probably always
need some system-specific test plan.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: Hardware Error Kernel Mini-Summit
  2010-06-15 11:41                                         ` Andi Kleen
@ 2010-06-15 12:21                                           ` Nils Carlson
  -1 siblings, 0 replies; 108+ messages in thread
From: Nils Carlson @ 2010-06-15 12:21 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Nils Carlson, Doug Thompson, Tony Luck, Eric W. Biederman,
	Ingo Molnar, Borislav Petkov, Hidetoshi Seto,
	Mauro Carvalho Chehab, BrentYoung, Linux Kernel Mailing List,
	bluesmoke-devel, Doug Thompson, Joe Perches, Thomas Gleixner,
	Linux Edac Mailing List, Ingo Molnar, Matt Domsch, Nils Carlson

Hi Andi,

On Tue, 15 Jun 2010, Andi Kleen wrote:

> On Tue, Jun 15, 2010 at 10:06:33AM +0200, Nils Carlson wrote:
>
> Hi Nils,
>
> > Could you maybe provide some references on how DIMM layout
> > could be read from DMI? I can't find anything nearly this specific,
> > or is it something we're expecting to happen in future BIOS's?
>
> The hardware (or BIOS) tells you the DIMM. You read the DIMMs
> from DMI and map them using the locators. The locator strings
> are not standardized, but there are not too many different
> formats around, so they can be implemented.
>
> Again this does not give you full layout, but it gives
> you a "path to a DIMM" and a DIMM locator.

Hmm.. From having a quick look at our boards I can conclude
that the information our BIOS puts in there is useless.
Will discuss further with our BIOS writers. They do
their own error detection during boot, in which they
decode to DIMMs, so obviously the information is in there
(somewhere).

> An alternative is also to use the ACPI based reporting
> mechanism which is needed on some system. In this case
> the CPER gives you a reference to the DMI object of the DIMM.
>
> In principle DMI has more information (arrays, ranges etc.)
> but in my experience that is not strong enough to really find
> the DIMM on modern systems. You need hardware or BIOS help for this.

So what are we left with? Non-standardised locator strings
that may or may not be present, at the mercy of the BIOS writer?
I'm already feeling depressed. Re-writing EDAC to try to
make sense of this information seems overly risky.

I think in general that this is one of the wonderful things
about Linux, you're not so much at the mercy of BIOS writers.
As soon as we start relying on the BIOS for functionality we're
encouraging the BIOS people to put more functionality in there,
and BIOS functionality is great, as long as there are no bugs!

But there are bugs. And correcting them is so prohibitively
expensive that I don't even want to think about it. And when
the BIOS messes up, it's the device driver writers who have to
magically work around the problems.

Could we come up with some plan that doesn't involve
trusting the goodwill (and competence) of BIOS writers?

I personally really like the device tree compiler for PowerPC.
It allows you to be explicit about what you have. Not for everyone,
but maybe there could be some way to apply the same principle? Maybe
some way of loading modules with parameters or configuring your setup
from sysfs?
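
A minimal sketch of what such an explicit, user-supplied description could
look like from user space, assuming the EDAC sysfs layout with per-csrow
`chN_dimm_label` files (check this against the running kernel before
relying on it). The board table and the helper name are hypothetical, and
the function only computes the writes so the idea can be tried without
touching sysfs.

```python
from pathlib import PurePosixPath

# Assumed location of the EDAC memory controller directories in sysfs.
EDAC_ROOT = PurePosixPath("/sys/devices/system/edac/mc")

# Hypothetical board description: (controller, csrow, channel) -> label.
BOARD_LABELS = {
    (0, 0, 0): "DIMM_A1",
    (0, 0, 1): "DIMM_B1",
    (0, 1, 0): "DIMM_A2",
}


def label_writes(labels):
    """Return (path, label) pairs; the caller decides whether to write them."""
    writes = []
    for (mc, csrow, channel), label in sorted(labels.items()):
        path = EDAC_ROOT / f"mc{mc}" / f"csrow{csrow}" / f"ch{channel}_dimm_label"
        writes.append((str(path), label))
    return writes
```

The table plays the role the device tree source plays on PowerPC: whoever
knows the board states the layout once, and nothing has to be guessed from
what the BIOS reports.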

/Nils

^ permalink raw reply	[flat|nested] 108+ messages in thread

* RE: Hardware Error Kernel Mini-Summit
  2010-06-15 12:21                                           ` Nils Carlson
@ 2010-06-15 18:15                                             ` Luck, Tony
  -1 siblings, 0 replies; 108+ messages in thread
From: Luck, Tony @ 2010-06-15 18:15 UTC (permalink / raw)
  To: Nils Carlson, Andi Kleen
  Cc: Doug Thompson, Eric W. Biederman, Ingo Molnar, Borislav Petkov,
	Hidetoshi Seto, Mauro Carvalho Chehab, Young, Brent,
	Linux Kernel Mailing List, bluesmoke-devel, Doug Thompson,
	Joe Perches, Thomas Gleixner, Linux Edac Mailing List,
	Ingo Molnar, Matt Domsch, Nils Carlson

> Could we come up with some plan that doesn't involve
> trusting the goodwill (and competence) of BIOS writers?

That would be nice - but there already exists a platform
(Xeon-7500 series a.k.a. Nehalem-EX) where the hardware
chipset registers that you would need to do your own
memory topology reverse engineering in Linux are only
accessible to SMM level code.  I've finally come to the
conclusion that an EDAC style driver just isn't possible
for this set of systems.

>I personally really like the device tree compiler for PowerPC.
>It allows you to be explicit about what you have. Not for everyone,
>but maybe there could be some way to apply the same principle? Maybe
>some way of loading modules with parameters or configuring your setup
>from sysfs?

Even when the chipset registers are accessible, it can be very
complex to do this for the general case (think of boards that
support arbitrary mixing of different size/speed DIMMs - the
BIOS may have done some interesting somersaults while computing
which interleaving modes to use).
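
To make the interleaving point concrete, here is a deliberately simplified
sketch. It assumes plain N-way channel interleaving at cache line
granularity, which is an assumption for illustration only; real memory
controllers may use other granularities or hash address bits, and the
function names are made up for this example.

```python
CACHE_LINE = 64  # bytes; interleave granularity assumed for this sketch


def channel_for(addr: int, ways: int) -> int:
    """Channel that holds the cache line containing this physical address."""
    return (addr // CACHE_LINE) % ways


def channel_local_addr(addr: int, ways: int) -> int:
    """Offset of the address within its channel once interleaving is undone."""
    line = addr // CACHE_LINE
    return (line // ways) * CACHE_LINE + addr % CACHE_LINE
```

Even in this simple model, consecutive cache lines of a single page land
on different channels, so a table of contiguous address ranges per DIMM is
not enough; the decoder has to know the exact interleaving mode the BIOS
selected.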

Even more complex on high end systems when BIOS may handle row
sparing transparently to the OS. Memory mirroring is also
becoming fashionable - how can EDAC represent this (when
the h/w view of the memory doesn't match the OS view)?

-Tony



^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: Hardware Error Kernel Mini-Summit
  2010-06-15 18:15                                             ` Luck, Tony
@ 2010-06-15 18:38                                               ` Nils Carlson
  -1 siblings, 0 replies; 108+ messages in thread
From: Nils Carlson @ 2010-06-15 18:38 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Andi Kleen, Doug Thompson, Eric W. Biederman, Ingo Molnar,
	Borislav Petkov, Hidetoshi Seto, Mauro Carvalho Chehab, Young,
	Brent, Linux Kernel Mailing List, bluesmoke-devel, Doug Thompson,
	Joe Perches, Thomas Gleixner, Linux Edac Mailing List,
	Ingo Molnar, Matt Domsch, Nils Carlson

On Jun 15, 2010, at 8:15 PM, Luck, Tony wrote:

>> Could we come up with some plan that doesn't involve
>> trusting the goodwill (and competence) of BIOS writers?
>
> That would be nice - but there already exists a platform
> (Xeon-7500 series a.k.a. Nehalem-EX) where the hardware
> chipset registers that you would need to do your own
> memory topology reverse engineering in Linux are only
> accessible to SMM level code.  I've finally come to the
> conclusion that an EDAC style driver just isn't possible
> for this set of systems.

Yes, I'm dreading the day they come to me telling me that
they've got one of those. On one end you have hardware
people who love to put functionality there, and on the other
you have applications with real-time requirements, to
whom you have to explain that the latest and greatest
processor is broken for their purposes.

One day I'll use this as an excuse to migrate everyone to
PPC where people know that a bootloader is a bootloader.

But grudges against BIOSes aside, I don't know what to do
about Nehalem-EX systems. I guess at that point we really
are at the mercy of BIOS writers.
>
>> I personally really like the device tree compiler for PowerPC.
>> It allows you to be explicit about what you have. Not for everyone,
>> but maybe there could be some way to apply the same principle? Maybe
>> some way of loading modules with parameters or configuring your setup
>> from sysfs?
>
> Even when the chip set registers are accessible, it can be very
> complex to do this for the general case (think of boards that
> support arbitrary mixing of different size/speed DIMMs - the
> BIOS may have done some interesting somersaults while computing
> which interleaving modes to use).
>
> Even more complex on high end systems when BIOS may handle row
> sparing transparently to the OS. Memory mirroring is also
> becoming fashionable - how can EDAC represent this (when
> the h/w view of the memory doesn't match the OS view)?
>
Difficult questions. But at some point I wonder who will be buying
systems where finding out which DIMM is broken is so complex
that it requires a master's degree.

/Nils

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: Hardware Error Kernel Mini-Summit
  2010-06-15 12:21                                           ` Nils Carlson
@ 2010-06-15 19:35                                             ` Andi Kleen
  -1 siblings, 0 replies; 108+ messages in thread
From: Andi Kleen @ 2010-06-15 19:35 UTC (permalink / raw)
  To: Nils Carlson
  Cc: Andi Kleen, Doug Thompson, Tony Luck, Eric W. Biederman,
	Ingo Molnar, Borislav Petkov, Hidetoshi Seto,
	Mauro Carvalho Chehab, BrentYoung, Linux Kernel Mailing List,
	bluesmoke-devel, Doug Thompson, Joe Perches, Thomas Gleixner,
	Linux Edac Mailing List, Ingo Molnar, Matt Domsch, Nils Carlson

> But there are bugs. And correcting them is so prohibitively
> expensive that I don't even want to think about it. And when

Something is wrong in your setup then.

> the BIOS messes up, it's the device driver writers who have to
> magically workaround the problems.

In this case you would need the equivalent information
of a system specific DMI table in some device driver.

Do you see how this does not fly? How should a device
driver know more about the system than the BIOS?

And if you can load some specific table into the device
driver why can't you simply update the BIOS too?

Well you can supply your own if you're a power user
anyways, but most users are not power users. So it's not an
option as a default.

Or could you imagine a standard server getting installed
and asking with a desktop window "please enter the DIMM mappings
by hand"? That simply doesn't make any sense.


> 
> Could we come up with some plan that doesn't involve
> trusting to the goodwill (and competence) of BIOS writers?

The problem is that the information is nowhere else.
If the BIOS doesn't know it, Linux certainly doesn't know it either.

On the other hand, if Linux uses this information there is certainly
an angle to get at least server vendors to fix their stuff
(and non-servers do not matter for memory errors because they
run in non-ECC mode anyways).

It's certainly in the server vendors' own interest to supply correct
information here anyways. If they don't, it will cost them in
unnecessary memory replacement costs.

BTW on the systems I have access to, DMI seems to be largely
correct these days. I guess your system is an unlucky exception.

Maybe your BIOS people will do something useful next generation.
Make sure to report it to them and if they don't fix it make fun of them.


> but maybe there could be some way to apply the same principle? Maybe
> some way of loading modules with parameters or configuring your setup
> from sysfs?

Having a DMI override is no problem at all. ACPI uses this all the time
for example.

No need at all to speak a foreign language for this, even if it's your
mother tongue.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: Hardware Error Kernel Mini-Summit
  2010-06-15 18:15                                             ` Luck, Tony
@ 2010-06-15 19:37                                               ` Andi Kleen
  -1 siblings, 0 replies; 108+ messages in thread
From: Andi Kleen @ 2010-06-15 19:37 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Nils Carlson, Andi Kleen, Doug Thompson, Eric W. Biederman,
	Ingo Molnar, Borislav Petkov, Hidetoshi Seto,
	Mauro Carvalho Chehab, Young, Brent, Linux Kernel Mailing List,
	bluesmoke-devel, Doug Thompson, Joe Perches, Thomas Gleixner,
	Linux Edac Mailing List, Ingo Molnar, Matt Domsch, Nils Carlson

> Even when the chip set registers are accessible, it can be very
> complex to do this for the general case (think of boards that
> support arbitrary mixing of different size/speed DIMMs - the
> BIOS may have done some interesting somersaults while computing
> which interleaving modes to use).

... and the numbers that come out of this may have no relation
to your motherboard labels at all. What do you do then?
Read schematics again?  Or do binary search on the DIMMs again
like Eric suggested?

For all of this a system specific mapping table is needed
and the only place to get this as a default option without 
explicit configuration for each motherboard is the BIOS.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: Hardware Error Kernel Mini-Summit
  2010-06-15 19:35                                             ` Andi Kleen
@ 2010-06-15 20:48                                               ` Nils Carlson
  -1 siblings, 0 replies; 108+ messages in thread
From: Nils Carlson @ 2010-06-15 20:48 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Doug Thompson, Tony Luck, Eric W. Biederman, Ingo Molnar,
	Borislav Petkov, Hidetoshi Seto, Mauro Carvalho Chehab,
	BrentYoung, Linux Kernel Mailing List, bluesmoke-devel,
	Doug Thompson, Joe Perches, Thomas Gleixner,
	Linux Edac Mailing List, Ingo Molnar, Matt Domsch, Nils Carlson

On Jun 15, 2010, at 9:35 PM, Andi Kleen wrote:

>> But there are bugs. And correcting them is so prohibitively
>> expensive that I don't even want to think about it. And when
>
> Something is wrong in your setup then.

No, something is wrong with the BIOS. ;-)

<long snip>

Could you maybe give me an example from the board of your choosing
of a DMI table printout, explain the format and then show how to use
it? I'd like to show it to our BIOS writers. Ideally, maybe somebody
could post a good suggestion for a standardized format? (Though that's
very optimistic, but maybe CPU vendors can make suggestions to board
makers?)

/Nils

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: Hardware Error Kernel Mini-Summit
  2010-06-15  6:56                                     ` Andi Kleen
@ 2010-06-15 22:33                                       ` Tony Luck
  -1 siblings, 0 replies; 108+ messages in thread
From: Tony Luck @ 2010-06-15 22:33 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Doug Thompson, Eric W. Biederman, Nils Carlson, Ingo Molnar,
	Borislav Petkov, Hidetoshi Seto, Mauro Carvalho Chehab,
	BrentYoung, Linux Kernel Mailing List, bluesmoke-devel,
	Doug Thompson, Joe Perches, Thomas Gleixner,
	Linux Edac Mailing List, Ingo Molnar, Matt Domsch

On Mon, Jun 14, 2010 at 11:56 PM, Andi Kleen <andi@firstfloor.org> wrote:
> The way I envision it working is that an abstracted DIMM interface
> (or edac2 or whatever you want to call it) can be fed from any reasonable
> DIMM layout driver. This could be either DMI on x86 or some other
> driver. There would be nothing really x86 specific about that.

You could go one stage further and make DIMMs just one example of
a field replaceable unit.  So the "error analysis subsystem" would keep track
of errors reported by any component (CPU, DIMM, I/O card, fan, power
supply, disk, ...).  Each category could have a different "X errors per Y
interval" parameter that made sense for it.

-Tony
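
The per-category "X errors per Y interval" idea could be sketched roughly
as below. This is only an illustration in Python, not code from mcelog or
EDAC: the ErrorRateTracker class, the category names, and the threshold
numbers are all made up for the example, and the timestamp is passed in
explicitly so the windowing logic is easy to follow.

```python
from collections import defaultdict, deque


class ErrorRateTracker:
    """Hypothetical per-category sliding-window error counter."""

    def __init__(self, thresholds):
        # thresholds maps category -> (max_errors, interval_seconds).
        # The values are assumptions chosen per category by whoever
        # configures the tracker.
        self.thresholds = dict(thresholds)
        # (category, unit) -> timestamps of errors still inside the window
        self.events = defaultdict(deque)

    def report(self, category, unit, now):
        """Record one error at time `now` (seconds).

        Returns True once the unit has reached max_errors within the
        last interval_seconds for its category.
        """
        max_errors, interval = self.thresholds[category]
        stamps = self.events[(category, unit)]
        stamps.append(now)
        # Forget errors that have aged out of the window.
        while stamps and stamps[0] <= now - interval:
            stamps.popleft()
        return len(stamps) >= max_errors


# Illustrative numbers only: 3 errors per minute for a DIMM,
# a single error per hour for a fan.
tracker = ErrorRateTracker({"dimm": (3, 60.0), "fan": (1, 3600.0)})
```

Each field replaceable unit is keyed by (category, unit name), so a noisy
DIMM does not affect the count for its neighbour, and each category keeps
its own notion of what rate is worth flagging.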

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: Hardware Error Kernel Mini-Summit
  2010-06-15 20:48                                               ` Nils Carlson
@ 2010-06-16  9:40                                                 ` Andi Kleen
  -1 siblings, 0 replies; 108+ messages in thread
From: Andi Kleen @ 2010-06-16  9:40 UTC (permalink / raw)
  To: Nils Carlson
  Cc: Andi Kleen, Doug Thompson, Tony Luck, Eric W. Biederman,
	Ingo Molnar, Borislav Petkov, Hidetoshi Seto,
	Mauro Carvalho Chehab, BrentYoung, Linux Kernel Mailing List,
	bluesmoke-devel, Doug Thompson, Joe Perches, Thomas Gleixner,
	Linux Edac Mailing List, Ingo Molnar, Matt Domsch, Nils Carlson


> Could you maybe give me an example from the board of your choosing
> of a DMI table print out, explain the format and then show how to use it?

The only requirements the current mcelog parser has are the following
(this is what it actually uses; it parses more things, but I abandoned
them):

- List of DIMMs (type 17).
- It's useful if they have the correct size for display to the user.
- Correct serial/part numbers/manufacturer are also useful (for display),
  but not strictly required.
- Locator should match the silk screen label of the DIMM on the board.
- Bank Locator is in the format prefix_Node%u_Channel%u_Dimm%u.
  The prefix can be arbitrary, but should not contain '_'.
  Node matches the SOCKETID coming from the CPU, Channel matches the
  channel, and Dimm matches the DIMM number from the CPU.
  This requirement is the only extension over the standard.

-Andi
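
As a rough illustration of the Bank Locator convention described above,
a parser might look like the following. This is a Python sketch and not
the mcelog implementation; the names DimmAddress, parse_bank_locator and
find_locator are hypothetical, and the input is assumed to be a list of
(Locator, Bank Locator) string pairs already extracted from the type 17
records by some other means.

```python
import re
from typing import NamedTuple, Optional


class DimmAddress(NamedTuple):
    prefix: str
    node: int
    channel: int
    dimm: int


# prefix_Node%u_Channel%u_Dimm%u: the prefix may be anything without '_',
# and the three fields are unsigned decimal numbers.
_BANK_LOCATOR = re.compile(r"([^_]*)_Node([0-9]+)_Channel([0-9]+)_Dimm([0-9]+)")


def parse_bank_locator(text: str) -> Optional[DimmAddress]:
    """Return the parsed address, or None if the string does not follow
    the convention (for example a BIOS that just writes "BANK 3")."""
    m = _BANK_LOCATOR.fullmatch(text.strip())
    if m is None:
        return None
    prefix, node, channel, dimm = m.groups()
    return DimmAddress(prefix, int(node), int(channel), int(dimm))


def find_locator(entries, node, channel, dimm):
    """Map a (node, channel, dimm) triple reported by the CPU to the
    Locator string, i.e. the silk screen label, or None if no type 17
    entry matches."""
    for locator, bank_locator in entries:
        addr = parse_bank_locator(bank_locator)
        if addr is not None and addr[1:] == (node, channel, dimm):
            return locator
    return None
```

Returning None for strings that do not match keeps a non-conforming BIOS
from producing a wrong label; the caller can then fall back to printing
the raw node/channel/DIMM numbers.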

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: Hardware Error Kernel Mini-Summit
@ 2010-06-16  8:57 George Spelvin
  0 siblings, 0 replies; 108+ messages in thread
From: George Spelvin @ 2010-06-16  8:57 UTC (permalink / raw)
  To: andi; +Cc: linux, linux-kernel, tony.luck

> ...and the numbers that come out of this may have no relation
> to your motherboard labels at all. What do you do then?
> Read schematics again?  Or do binary search on the DIMMs again
> like Eric suggested?

Or make a guess and see if it works, or STFW for someone else who figured it
out, or read chipset docs, or see if the SPD addresses correspond in a
sensible way, or extrapolate from another board with the same NB and
BIOS vendor, or...

One of the big advantages of Linux/*BSD over certain other x86 operating
systems is they don't try so hard to be user-friendly that they're
expert-hostile.  "Your computer is fucked up, please consult your
system administrator" isn't much help to the poor system administrator.

So while it's fine to *try* to translate to a DIMM slot using BIOS info,
please also report whatever can be figured out without trusting the BIOS.
So if I find it's in error, I can report everything needed to write
an override table entry.

Without the BIOS, it's a PITA, but the correspondence between /CSx lines
and DIMM slot locations can be figured out, and if necessary, I can publish
a web page with the mappings.
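
To make the "BIOS label plus raw data plus override" idea concrete, here
is a small Python sketch. Everything in it is an assumption made for
illustration: the describe_dimm function, the dictionary layout keyed by
(node, channel, dimm), and the message wording are hypothetical, and a
real implementation might key on chip-select lines instead.

```python
def describe_dimm(node, channel, dimm, bios_labels, overrides):
    """Build a report string for a failing DIMM.

    bios_labels: (node, channel, dimm) -> slot label taken from BIOS tables
    overrides:   (node, channel, dimm) -> slot label supplied by the admin
    The raw coordinates are always included, so a wrong BIOS label can be
    spotted and turned into an override entry.
    """
    raw = f"node {node} channel {channel} dimm {dimm}"
    key = (node, channel, dimm)
    if key in overrides:
        # An explicit override wins over whatever the BIOS claims.
        return f"{overrides[key]} (override; {raw})"
    if key in bios_labels:
        return f"{bios_labels[key]} (from BIOS, unverified; {raw})"
    return f"unknown slot ({raw})"


# Hypothetical example data.
bios_labels = {(0, 0, 0): "DIMM_A1", (0, 1, 0): "DIMM_B1"}
overrides = {(0, 1, 0): "DIMM_C1"}  # admin found the BIOS label was wrong
print(describe_dimm(0, 1, 0, bios_labels, overrides))
```

The point of the design is that the translated label is never the only
thing printed: the untranslated coordinates survive in every message.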

	"We do not trust BIOS tables, because BIOS writers are invariably
	totally incompetent crack-addicted monkeys. If they weren't,
	they wouldn't be BIOS writers. QED."		-- Linus
		http://marc.info/?l=linux-kernel&m=127498023108564

^ permalink raw reply	[flat|nested] 108+ messages in thread

end of thread, other threads:[~2010-06-16  9:40 UTC | newest]

Thread overview: 108+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-05-17 18:23 Hardware Error Kernel Mini-Summit Mauro Carvalho Chehab
2010-05-17 22:41 ` Andi Kleen
2010-05-18 16:50   ` Mauro Carvalho Chehab
2010-05-18 18:10     ` Andi Kleen
2010-05-18 18:10       ` Andi Kleen
2010-05-18  6:52 ` Hidetoshi Seto
2010-05-18  6:52   ` Hidetoshi Seto
2010-05-18 16:44   ` Mauro Carvalho Chehab
2010-05-18 16:44     ` Mauro Carvalho Chehab
2010-05-18 17:42     ` Joe Perches
2010-05-18 17:59       ` Mauro Carvalho Chehab
2010-05-18 18:45       ` Andi Kleen
2010-05-18 18:57         ` Joe Perches
2010-05-18 18:53       ` Ingo Molnar
2010-05-18 19:08         ` Luck, Tony
2010-05-18 19:08           ` Luck, Tony
2010-05-18 19:18           ` Borislav Petkov
2010-05-18 19:18             ` Borislav Petkov
2010-05-18 19:34             ` Ingo Molnar
2010-05-18 19:34               ` Ingo Molnar
2010-05-18 22:14             ` Eric W. Biederman
2010-05-18 22:14               ` Eric W. Biederman
2010-05-18 22:28               ` Andi Kleen
2010-05-18 22:28                 ` Andi Kleen
2010-05-19  1:14                 ` Eric W. Biederman
2010-05-19  1:14                   ` Eric W. Biederman
2010-05-19  6:46                   ` Borislav Petkov
2010-05-19  6:46                     ` Borislav Petkov
2010-05-19  7:09                     ` Ingo Molnar
2010-05-19  7:09                       ` Ingo Molnar
2010-05-19 11:54                       ` Mauro Carvalho Chehab
2010-05-19 11:54                         ` Mauro Carvalho Chehab
2010-05-20 12:37                         ` Ingo Molnar
2010-05-20 12:37                           ` Ingo Molnar
2010-06-14 10:03                       ` Nils Carlson
2010-06-14 10:03                         ` Nils Carlson
2010-06-14 11:49                         ` Andi Kleen
2010-06-14 11:49                           ` Andi Kleen
2010-06-14 19:47                           ` Nils Carlson
2010-06-14 19:47                             ` Nils Carlson
2010-06-14 20:21                             ` Andi Kleen
2010-06-14 20:21                               ` Andi Kleen
2010-06-14 21:02                               ` Nils Carlson
2010-06-14 20:06                           ` Eric W. Biederman
2010-06-14 20:06                             ` Eric W. Biederman
2010-06-14 20:21                             ` Luck, Tony
2010-06-14 20:21                               ` Luck, Tony
2010-06-14 20:36                             ` Andi Kleen
2010-06-14 20:36                               ` Andi Kleen
2010-06-14 21:34                               ` Tony Luck
2010-06-14 21:34                                 ` Tony Luck
2010-06-14 23:46                                 ` Doug Thompson
2010-06-15  6:56                                   ` Andi Kleen
2010-06-15  6:56                                     ` Andi Kleen
2010-06-15  8:06                                     ` Nils Carlson
2010-06-15  8:06                                       ` Nils Carlson
2010-06-15 10:01                                       ` Borislav Petkov
2010-06-15 10:01                                         ` Borislav Petkov
2010-06-15 11:41                                       ` Andi Kleen
2010-06-15 11:41                                         ` Andi Kleen
2010-06-15 12:21                                         ` Nils Carlson
2010-06-15 12:21                                           ` Nils Carlson
2010-06-15 18:15                                           ` Luck, Tony
2010-06-15 18:15                                             ` Luck, Tony
2010-06-15 18:38                                             ` Nils Carlson
2010-06-15 18:38                                               ` Nils Carlson
2010-06-15 19:37                                             ` Andi Kleen
2010-06-15 19:37                                               ` Andi Kleen
2010-06-15 19:35                                           ` Andi Kleen
2010-06-15 19:35                                             ` Andi Kleen
2010-06-15 20:48                                             ` Nils Carlson
2010-06-15 20:48                                               ` Nils Carlson
2010-06-16  9:40                                               ` Andi Kleen
2010-06-16  9:40                                                 ` Andi Kleen
2010-06-15 22:33                                     ` Tony Luck
2010-06-15 22:33                                       ` Tony Luck
2010-06-15  6:44                                 ` Andi Kleen
2010-06-15  6:44                                   ` Andi Kleen
2010-05-19  9:03                   ` Andi Kleen
2010-05-19  9:03                     ` Andi Kleen
2010-05-24 16:21                     ` Russ Anderson
2010-05-24 16:21                       ` Russ Anderson
2010-05-24 18:26                       ` Andi Kleen
2010-05-24 18:26                         ` Andi Kleen
2010-05-19 17:30                   ` Tony Luck
2010-05-19 17:30                     ` Tony Luck
2010-05-24 15:55                     ` Russ Anderson
2010-05-24 15:55                       ` Russ Anderson
2010-05-24 17:35                       ` Tony Luck
2010-05-24 17:35                         ` Tony Luck
2010-05-24 18:31                         ` Andi Kleen
2010-05-24 18:31                           ` Andi Kleen
2010-05-18 22:29               ` Ingo Molnar
2010-05-18 22:29                 ` Ingo Molnar
2010-05-18 19:30           ` Ingo Molnar
2010-05-18 19:30             ` Ingo Molnar
2010-05-18 20:42             ` Ingo Molnar
2010-05-18 21:37               ` Tony Luck
2010-05-18 22:00                 ` Ingo Molnar
2010-05-24 17:13                   ` Russ Anderson
2010-05-19  6:39                 ` Ingo Molnar
2010-05-18 13:06 ` Borislav Petkov
2010-05-18 13:06   ` Borislav Petkov
2010-05-18 16:52   ` Mauro Carvalho Chehab
2010-05-18 16:52     ` Mauro Carvalho Chehab
2010-05-18 17:06 ` Mauro Carvalho Chehab
2010-05-18 17:06   ` Mauro Carvalho Chehab
2010-06-16  8:57 George Spelvin
