linux-kernel.vger.kernel.org archive mirror
* [RCF] Linux memory error handling
@ 2005-06-15 14:30 Russ Anderson
  2005-06-15 15:08 ` Andi Kleen
                   ` (3 more replies)
  0 siblings, 4 replies; 16+ messages in thread
From: Russ Anderson @ 2005-06-15 14:30 UTC (permalink / raw)
  To: linux-kernel; +Cc: Russ Anderson

		[RCF] Linux memory error handling.

Summary: One of the most common hardware failures in a computer 
	is a memory failure.  There have been efforts in various
	architectures to support recovery from memory errors.  This
	is an attempt to define a common infrastructure in Linux
	to support memory error handling.

Background:  There has been considerable work on recovering from
	Machine Check Aborts (MCAs) in arch/ia64.  One result is
	that many memory errors encountered by user applications
	no longer cause a kernel panic.  The application is
	terminated, but Linux and other applications keep running.
	Additional improvements are becoming dependent on mainline
	linux support.  That requires involvement of lkml, not
	just linux-ia64.

Types of memory failures:

	Memory hardware failures are very hardware implementation 
	specific, but there are some general characteristics.

	    Corrected errors: Error Correction Codes (ECC) in memory 
		hardware can correct Single Bit Errors (SBEs).  

	    Uncorrected errors: Parity errors and Multiple Bit Errors (MBEs)
		are errors that hardware cannot correct.  In this case the
		data in memory is no longer valid and cannot be used.

	There are different types of memory errors:

	    Transient errors: The bit shows up bad, but re-reading the
		data returns the correct data.

	    Soft errors: A bit in memory has changed state, but the
		underlying memory cell still works.  For example,
		a particle strike can sometimes cause a bit to switch.
		In this case, re-writing the data corrects the error.

	    Hard errors:  The memory storage cell cannot hold the bit.  
		The underlying memory cell could be stuck at 0 or 1.

	A common question is whether single bit (corrected) errors will 
	turn into double bit (uncorrected) errors.  The answer is that it
	depends on the underlying cause of the memory error.  There are
	some errors that show up as single bits, especially transient 
	and soft errors, that do not degrade over time.  There are other
	failures that do degrade over time.  The details of the memory
	technology are implementation specific and beyond the scope of
	this discussion.

Handling memory errors:

	Some memory error handling functionality is common to
	most architectures.

	Corrected error handling:

	    Logging:  When ECC hardware corrects a Single Bit Error (SBE),
		an interrupt is generated to inform linux that there is 
		a corrected error record available for logging.

	    Polling Threshold:  A solid single bit error can cause a burst
		of correctable errors that can cause significant logging
		overhead.  SBE thresholding counts the number of SBEs for
		a given page; if too many SBEs are detected in a given
		period of time, the interrupt is disabled and instead
		Linux periodically polls for corrected errors.  (A rough
		sketch of this thresholding logic follows this list.)

	    Data Migration:  If a page of memory has too many single bit
		errors, it may be prudent to move the data off that
		physical page before the correctable SBE turns into an
		uncorrectable MBE. 

	    Memory handling parameters:

		Since memory failure modes are due to specific DIMM
		failure characteristics, there will be no way to
		reach agreement on one set of thresholds that will
		be appropriate for all configurations.  Therefore there
		needs to be a way to modify the thresholds.  One alternative
		is a /proc/sys/kernel/ interface to control settings, such
		as polling thresholds.  That provides an easy, standard
		way of modifying thresholds to match the characteristics
		of the specific DIMM type.
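
	As a rough illustration of the thresholding logic mentioned in
	the Polling Threshold item above -- a sketch only; the window
	size, the per-page counter, and the platform_* hooks are all
	hypothetical, not an existing kernel API:

	#include <linux/workqueue.h>
	#include <linux/jiffies.h>
	#include <linux/printk.h>

	#define SBE_THRESHOLD	5		/* hypothetical: errors per window */
	#define SBE_WINDOW	(60 * HZ)	/* hypothetical: one-minute window */

	struct sbe_stats {
		unsigned long count;		/* SBEs seen in the current window */
		unsigned long window_start;	/* jiffies when the window opened */
	};

	/* Hypothetical platform hooks -- not an existing kernel API. */
	extern void platform_disable_sbe_irq(void);
	extern int platform_read_sbe_log(void);

	static void sbe_poll(struct work_struct *work);
	static DECLARE_DELAYED_WORK(sbe_poll_work, sbe_poll);

	/* Periodic fallback once the interrupt has been shut off. */
	static void sbe_poll(struct work_struct *work)
	{
		if (platform_read_sbe_log())
			pr_info("corrected memory error (polled)\n");
		schedule_delayed_work(&sbe_poll_work, SBE_WINDOW);
	}

	/* Called from the corrected-error interrupt, with the stats
	 * of the page that took the hit. */
	static void sbe_account(struct sbe_stats *s)
	{
		if (time_after(jiffies, s->window_start + SBE_WINDOW)) {
			s->window_start = jiffies;	/* open a new window */
			s->count = 0;
		}
		if (++s->count >= SBE_THRESHOLD) {
			/* Too noisy: stop the interrupt burst, poll instead. */
			platform_disable_sbe_irq();
			schedule_delayed_work(&sbe_poll_work, SBE_WINDOW);
		}
	}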

	Uncorrected error handling:

	    Kill the application:  One recovery technique to avoid a kernel
		panic when an application process hits an uncorrectable
		memory error is to SIGKILL the application.  The page is
		marked PG_reserved to avoid re-use.  A (new) PG_hard_error
		flag would be useful to indicate that the physical page has
		a hard memory error.  (A sketch of this recovery path
		follows this list.)

	    Disable memory for next reboot:  When a hard error is detected,
		notify SAL/BIOS of the bad physical memory.  SAL/BIOS can
		save the bad addresses and, when building the EFI map after
		reset/reboot, mark the bad pages as EFI_UNUSABLE_MEMORY,
		and type = 0, so Linux will ignore granules containing these
		pages.

	    Dumping:  Dump programs should not try to dump pages with bad
		memory.  A PG_hard_error flag would indicate to dump
		programs which pages have bad memory.
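
	A sketch of the kill-and-isolate path above, assuming the task
	that owns the mapping has already been identified (e.g. via an
	rmap walk); the function itself and its callers are hypothetical:

	#include <linux/mm.h>
	#include <linux/sched.h>
	#include <linux/signal.h>
	#include <linux/printk.h>

	/* Isolate the bad frame and kill the task that touched it.
	 * Must run in process context. */
	static void handle_uncorrected_error(struct page *page,
					     struct task_struct *tsk)
	{
		/* Keep the frame out of circulation.  (As discussed
		 * later in the thread, this is not quite what
		 * PG_reserved is meant for; a dedicated PG_hard_error
		 * flag would be cleaner.) */
		SetPageReserved(page);

		pr_err("uncorrectable memory error at pfn %#lx, killing %s\n",
		       page_to_pfn(page), tsk->comm);
		send_sig(SIGKILL, tsk, 1);	/* 1 == sent by the kernel */
	}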

	Memory DIMM information & settings:

	    Use a /proc/dimm_info interface to pass DIMM information to Linux.
	    Hardware vendors could add their hardware specific settings.

Linux infrastructure:

	The following infrastructure could be added to Linux and would be
	useful to various architectures.

	Page Flags:  When a page is discarded, PG_reserved is set so that the
		page is no longer used.  A PG_hard_error flag could be added
		to indicate the physical page has bad memory.

	/proc interfaces:  Use /proc interfaces to change thresholds and
		pass information to/from BIOS/SAL.  (A sysctl sketch for
		one such threshold appears at the end of this section.)

	Pseudo task switching:  Some architectures signal memory errors via
		non-maskable interrupts, with unusual calling sequences into
		the OS.  It is often easier to process these non-maskable
		errors on a stack that is separate from the normal kernel
		stacks.  This requires non-blocking scheduler interfaces
		to obtain the current running task, to modify the pointer
		to the current running task and to reset that pointer when
		the memory error has been processed.
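
	As a sketch of the threshold tunable mentioned under "/proc
	interfaces" above (the tunable name is invented, and the
	registration call shown is the modern kernel interface, whose
	details vary by kernel version):

	#include <linux/sysctl.h>
	#include <linux/init.h>

	/* Hypothetical tunable: corrected errors per page before
	 * switching from interrupt mode to polling. */
	static int sbe_threshold = 5;

	static struct ctl_table mem_err_table[] = {
		{
			.procname	= "sbe_threshold",
			.data		= &sbe_threshold,
			.maxlen		= sizeof(int),
			.mode		= 0644,
			.proc_handler	= proc_dointvec,
		},
		{ }
	};

	static int __init mem_err_sysctl_init(void)
	{
		/* Appears as /proc/sys/kernel/sbe_threshold. */
		if (!register_sysctl("kernel", mem_err_table))
			return -ENOMEM;
		return 0;
	}
	late_initcall(mem_err_sysctl_init);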

-- 
Russ Anderson, OS RAS/Partitioning Project Lead  
SGI - Silicon Graphics Inc          rja@sgi.com

* Re: [RCF] Linux memory error handling
  2005-06-15 14:30 [RCF] Linux memory error handling Russ Anderson
@ 2005-06-15 15:08 ` Andi Kleen
  2005-06-15 16:36   ` Russ Anderson
  2005-06-15 15:26 ` Maciej W. Rozycki
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 16+ messages in thread
From: Andi Kleen @ 2005-06-15 15:08 UTC (permalink / raw)
  To: Russ Anderson; +Cc: linux-kernel

Russ Anderson <rja@sgi.com> writes:

> 		[RCF] Linux memory error handling.

RCF? RFC?

>
> Summary: One of the most common hardware failures in a computer 
> 	is a memory failure.  There have been efforts in various
> 	architectures to support recovery from memory errors.  This
> 	is an attempt to define a common support infrastructure
> 	in Linux to support memory error handling.

Yes, that is badly needed.  With rmap we can do much better than
we used to.  That code should be common though, not specific
to an architecture.

> 	Corrected error handling:
>
> 	    Logging:  When ECC hardware corrects a Single Bit Error (SBE),
> 		an interrupt is generated to inform linux that there is 
> 		a corrected error record available for logging.

I don't think it makes sense to commonize this - many platforms
want to log these errors to platform specific firmware logs (like
IA64 or PPC). Others who don't have such powerful firmware need
to do their own thing (like x86-64's mcelog). But I don't see much
commonality.

>
> 	    Polling Threshold:  A solid single bit error can cause a burst
> 		of correctable errors that can cause a significant logging
> 		overhead.  SBE thresholding counts the number of SBEs for
> 		a given page and if too many SBEs are detected in a given
> 		period of time, the interrupt is disabled and instead 
> 		linux periodically polls for corrected errors.

I don't see how this could be sanely done in common code. It is deeply
architecture specific.

>
> 	    Data Migration:  If a page of memory has too many single bit
> 		errors, it may be prudent to move the data off that
> 		physical page before the correctable SBE turns into an
> 		uncorrectable MBE. 

This should be common code indeed.

Similar for handling uncorrectable errors: e.g. swap the page
in again from disk if possible, or kill the application.  That should
all, IMHO, be common code.

I did a prototype of this some time ago, but ran out of time
and it wasn't that useful on my platform anyways so I gave it up. 

>
> 	    Memory handling parameters:
>
> 		Since memory failure modes are due to specific DIMM
> 		failure characteristics, there will be no way to 
> 		reach agreement on one set of thresholds that will
> 		be appropriate for all configurations.  Therefore there
> 		needs to be a way to modify the thresholds.  One alternative
> 		is a /proc/sys/kernel/ interface to control settings, such
> 		as polling thresholds.  That provides an easy standard
> 		way of modifying thresholds to match the characteristics
> 		of the specific DIMM type.

This is deeply architecture and even platform specific.

>
> 	Uncorrected error handling:
>
> 	    Kill the application:  One recovery technique to avoid a kernel
> 		panic when an application process hits an uncorrectable 
> 		memory error is to SIGKILL the application.  The page is 
> 		marked PG_reserved to avoid re-use.  A (new) PG_hard_error
> 		flag would be useful to indicate that the physical page has
> 		a hard memory error.

No need for a new flag, just allocate it. This should be indeed common
code using the rmap infrastructure.

> 	    Disable memory for next reboot:  When a hard error is detected,
> 		notify SAL/BIOS of the bad physical memory.  SAL/BIOS can
> 		save the bad addresses and, when building the EFI map after
> 		reset/reboot, mark the bad pages as EFI_UNUSABLE_MEMORY,
> 		and type = 0, so Linux will ignore granules containing these 
> 		pages.

Deeply hardware specific.

> 	    Dumping:  Dump programs should not try to dump pages with bad
> 		memory.  A PG_hard_error flag would indicate to dump
> 		programs which pages have bad memory.

There is no dump program in mainline. I have no problem with the flag,
but for some reason the struct page bits seem to be very contended
and 32bit will run out of them in the foreseeable future.

>
> 	Memory DIMM information & settings:
>
> 	    Use a /proc/dimm_info interface to pass DIMM information to Linux.
> 	    Hardware vendors could add their hardware specific settings.

I don't think it makes sense to put any of this in common code.

> 	Page Flags:  When a page is discarded, PG_reserved is set so that the
> 		page is no longer used.  A PG_hard_error flag could be added

That is not quite how PG_reserved works...

> 		to indicate the physical page has bad memory.
>
> 	Pseudo task switching:  Some architectures signal memory errors via
> 		non maskable interrupts, with unusual calling sequences into
> 		the OS.  It is often easier to process these non-maskable
> 		errors on a stack that is separate from the normal kernel
> 		stacks.  This requires non-blocking scheduler interfaces
> 		to obtain the current running task, to modify the pointer
> 		to the current running task and to reset that pointer when
> 		the memory error has been processed.


A "non blocking interface to obtain the current task"? aka "current"?

I sense some confusion here ;-)

Doing all the rmap process lookup etc. needed for the advanced handling
needs to take sleep locks. No way around that.

What I did in my x86-64 prototype to handle this was to raise a
"self interrupt" (kind of an IPI to the current CPU that would be
raised the next time interrupts were enabled, or immediately in user
space etc.) and then, in the self interrupt where you have a defined
context, queue work for a CPU workqueue. The workqueue would then take
the mm locks, look up the processes mapping the page, and kill
them etc.

Basically the trick is to keep the tricky fully lockless part
of the MCE handler as small as possible and "bootstrap" yourself
in multiple steps to a defined process context where you can use
the rest of the kernel sanely.
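
A minimal sketch of that multi-step bootstrap (the entry points and the
pfn plumbing are hypothetical, and the rmap walk plus process kill are
reduced to a placeholder):

	#include <linux/workqueue.h>
	#include <linux/mm.h>
	#include <linux/printk.h>

	static unsigned long bad_pfn;	/* recorded by the lockless MCE path */

	static void mce_work_fn(struct work_struct *work);
	static DECLARE_WORK(mce_work, mce_work_fn);

	/* Step 3: process context -- sleeping locks are allowed here,
	 * so this is where the rmap lookup and the kill would go. */
	static void mce_work_fn(struct work_struct *work)
	{
		struct page *page = pfn_to_page(bad_pfn);

		SetPageReserved(page);	/* placeholder: pull frame out of use */
		pr_err("memory error: isolated pfn %#lx\n", bad_pfn);
	}

	/* Step 2: the self interrupt runs in a defined interrupt
	 * context; it must not sleep, so it only queues the real work. */
	static void mce_self_interrupt(unsigned long pfn)
	{
		bad_pfn = pfn;
		schedule_work(&mce_work);	/* safe from hard-irq context */
	}

	/* Step 1 (not shown): the fully lockless MCE handler records
	 * the address and raises the self interrupt, nothing more. */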

This implies the actual machine check is processed a bit later. That
is fine, because nearly all CPUs seem to cause machine checks
asynchronously to the normal instruction stream anyways (so you are
already "too late"), and adding a bit more delay is not too different.
Complicating everything to process the MCE immediately thus does not
help too much.  For the common case of the MCE happening in user space
it will always be immediately after the exception anyways.

-Andi

* Re: [RCF] Linux memory error handling
  2005-06-15 14:30 [RCF] Linux memory error handling Russ Anderson
  2005-06-15 15:08 ` Andi Kleen
@ 2005-06-15 15:26 ` Maciej W. Rozycki
  2005-06-15 19:46   ` Russell King
                     ` (2 more replies)
  2005-06-15 20:42 ` Joel Schopp
  2005-06-16  2:54 ` Wang, Zhenyu
  3 siblings, 3 replies; 16+ messages in thread
From: Maciej W. Rozycki @ 2005-06-15 15:26 UTC (permalink / raw)
  To: Russ Anderson; +Cc: linux-kernel

On Wed, 15 Jun 2005, Russ Anderson wrote:

> Handling memory errors:
> 
> 	Some memory error handling functionality is common to
> 	most architectures.
> 
> 	Corrected error handling:
> 
> 	    Logging:  When ECC hardware corrects a Single Bit Error (SBE),
> 		an interrupt is generated to inform linux that there is 
> 		a corrected error record available for logging.
> 
> 	    Polling Threshold:  A solid single bit error can cause a burst
> 		of correctable errors that can cause a significant logging
> 		overhead.  SBE thresholding counts the number of SBEs for
> 		a given page and if too many SBEs are detected in a given
> 		period of time, the interrupt is disabled and instead 
> 		linux periodically polls for corrected errors.

 This is highly undesirable if the same interrupt is used for MBEs.  A 
page that causes an excessive number of SBEs should rather be removed
from the available pool.  Logging should probably take recent events 
into account anyway and take care of not overloading the system, e.g. by 
keeping only statistical data instead of detailed information about each 
event under load.

> 	    Data Migration:  If a page of memory has too many single bit
> 		errors, it may be prudent to move the data off that
> 		physical page before the correctable SBE turns into an
> 		uncorrectable MBE. 
> 
> 	    Memory handling parameters:
> 
> 		Since memory failure modes are due to specific DIMM
> 		failure characteristics, there will be no way to 
> 		reach agreement on one set of thresholds that will
> 		be appropriate for all configurations.  Therefore there
> 		needs to be a way to modify the thresholds.  One alternative
> 		is a /proc/sys/kernel/ interface to control settings, such
> 		as polling thresholds.  That provides an easy standard
> 		way of modifying thresholds to match the characteristics
> 		of the specific DIMM type.

 Note that scrubbing may also be required depending on hardware 
capabilities as data could have been corrected on the fly for the purpose 
of providing a correct value for the bus transaction, but memory may still 
hold corrupted data.
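
 A minimal sketch of such a scrub (heavily simplified: a real
implementation would need an atomic read-modify-write, and must ensure
the line is actually written back to DRAM, e.g. with a cache flush):

	/* Re-write the corrected value so the memory cell itself no
	 * longer holds the flipped bit; the read returns good data
	 * because ECC corrects it on the fly. */
	static void scrub_word(volatile unsigned long *addr)
	{
		unsigned long val = *addr;

		*addr = val;
	}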

 And of course not all memory is DIMM!

> 	Uncorrected error handling:
> 
> 	    Kill the application:  One recovery technique to avoid a kernel
> 		panic when an application process hits an uncorrectable 
> 		memory error is to SIGKILL the application.  The page is 
> 		marked PG_reserved to avoid re-use.  A (new) PG_hard_error
> 		flag would be useful to indicate that the physical page has
> 		a hard memory error.

 Note we have some infrastructure for that in the MIPS port -- we kill the 
triggering process, but we don't mark the problematic memory page as 
unusable (which is an area for improvement).  This is of course the case
for faults occurring synchronously in user mode -- when in kernel mode,
or when the fault happens asynchronously (e.g. because it was triggered
by a DMA transaction rather than one involving a CPU), you often cannot
determine whether killing a process is good enough for system safety,
even if you are able to narrow the fault down to a potential victim.

> 	    Disable memory for next reboot:  When a hard error is detected,
> 		notify SAL/BIOS of the bad physical memory.  SAL/BIOS can
> 		save the bad addresses and, when building the EFI map after
> 		reset/reboot, mark the bad pages as EFI_UNUSABLE_MEMORY,
> 		and type = 0, so Linux will ignore granules containing these 
> 		pages.
> 
> 	    Dumping:  Dump programs should not try to dump pages with bad
> 		memory.  A PG_hard_error flag would indicate to dump
> 		programs which pages have bad memory.
> 
> 	Memory DIMM information & settings:
> 
> 	    Use a /proc/dimm_info interface to pass DIMM information to Linux.
> 	    Hardware vendors could add their hardware specific settings.

 I'd recommend a more generic name rather than "dimm_info" if that is to 
be reused universally.

  Maciej

* Re: [RCF] Linux memory error handling
  2005-06-15 15:08 ` Andi Kleen
@ 2005-06-15 16:36   ` Russ Anderson
  0 siblings, 0 replies; 16+ messages in thread
From: Russ Anderson @ 2005-06-15 16:36 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Russ Anderson, linux-kernel

Andi Kleen wrote:
> Russ Anderson <rja@sgi.com> writes:
> 
> > 		[RCF] Linux memory error handling.
> 
> RCF? RFC?

(sigh) RFC.  I ran the document through the spellchecker and
still missed the first three letters. 

> > Summary: One of the most common hardware failures in a computer 
> > 	is a memory failure.  There have been efforts in various
> > 	architectures to support recovery from memory errors.  This
> > 	is an attempt to define a common support infrastructure
> > 	in Linux to support memory error handling.
> 
> Yes that is badly needed. With rmap we can do much better than
> we used to do. That code should be common though, not specific
> to an architecture.
>
> > 	Corrected error handling:
> >
> > 	    Logging:  When ECC hardware corrects a Single Bit Error (SBE),
> > 		an interrupt is generated to inform linux that there is 
> > 		a corrected error record available for logging.
> 
> I don't think it makes sense to commonize this - many platforms
> want to log these errors to platform specific firmware logs (like
> IA64 or PPC). Others who don't have such powerful firmware need
> to do their own thing (like x86-64's mcelog). But I don't see much
> commonality.

Sure.  It should only be common code when it makes sense.

> > 	    Polling Threshold:  A solid single bit error can cause a burst
> > 		of correctable errors that can cause a significant logging
> > 		overhead.  SBE thresholding counts the number of SBEs for
> > 		a given page and if too many SBEs are detected in a given
> > 		period of time, the interrupt is disabled and instead 
> > 		linux periodically polls for corrected errors.
> 
> I don't see how this could be sanely done in common code. It is deeply
> architecture specific.

This is what could be used to trigger the data migration (common) code.
It's the interface from arch specific to common code that has pushed
me from linux-ia64 to lkml.

> > 	    Data Migration:  If a page of memory has too many single bit
> > 		errors, it may be prudent to move the data off that
> > 		physical page before the correctable SBE turns into an
> > 		uncorrectable MBE. 
> 
> This should be common code indeed.
>
> Similar for handling uncorrectable errors; e.g. swap the page 
> in again from disk if possible or kill the application. That should
> be imho all common code

Yup.
 
> I did a prototype of this some time ago, but ran out of time
> and it wasn't that useful on my platform anyways so I gave it up. 
> 
> >
> > 	    Memory handling parameters:
> >
> > 		Since memory failure modes are due to specific DIMM
> > 		failure characteristics, there will be no way to 
> > 		reach agreement on one set of thresholds that will
> > 		be appropriate for all configurations.  Therefore there
> > 		needs to be a way to modify the thresholds.  One alternative
> > 		is a /proc/sys/kernel/ interface to control settings, such
> > 		as polling thresholds.  That provides an easy standard
> > 		way of modifying thresholds to match the characteristics
> > 		of the specific DIMM type.
> 
> This is deeply architecture and even platform specific.

The implementation is arch specific, but the external interface could
be common.  If common doesn't make sense, I'll just add it in linux-ia64
and be done with it.  :-)

> > 	Uncorrected error handling:
> >
> > 	    Kill the application:  One recovery technique to avoid a kernel
> > 		panic when an application process hits an uncorrectable 
> > 		memory error is to SIGKILL the application.  The page is 
> > 		marked PG_reserved to avoid re-use.  A (new) PG_hard_error
> > 		flag would be useful to indicate that the physical page has
> > 		a hard memory error.
> 
> No need for a new flag, just allocate it.

That is what the current code does.  Looking ahead, it would be nice to
keep track of the bad memory, so that other processes, such as a dump
program, do not try to access it.  The PG_hard_error flag is one
idea, but others may have a better idea.  Conversely, a diag program may
want to access it to do additional analysis.  The hot-plug people,
working on page migration, were wondering how to deal with pages
marked reserved.  Bad data on bad memory pages does not need to
be migrated.  They need to know what data not to migrate.

>                                             This should be indeed common
> code using the rmap infrastructure.
>
> > 	    Disable memory for next reboot:  When a hard error is detected,
> > 		notify SAL/BIOS of the bad physical memory.  SAL/BIOS can
> > 		save the bad addresses and, when building the EFI map after
> > 		reset/reboot, mark the bad pages as EFI_UNUSABLE_MEMORY,
> > 		and type = 0, so Linux will ignore granules containing these 
> > 		pages.
> 
> Deeply hardware specific.

My intent was a common interface to tell EFI of a bad address.
In ia64, I could add a SAL call to tell our SAL (SGI PROM) of
the bad address, to get this functionality.  Very platform specific.
Perhaps a more generic interface would add more value for more 
platforms.  That was my intent.  

> > 	    Dumping:  Dump programs should not try to dump pages with bad
> > 		memory.  A PG_hard_error flag would indicate to dump
> > 		programs which pages have bad memory.
> 
> There is no dump program in mainline. I have no problem with the flag,
> but for some reason the struct page bits seem to be very contended
> and 32bit will run out of them in the foreseeable future.

Add more bits. :-)  I realize that flag bits are more limited with
32bit, but adding a page flag is a lkml issue, not a linux-ia64 issue.
So I need to discuss this issue here.  Perhaps there is an alternative
way to achieve the needed functionality.

> > 	Memory DIMM information & settings:
> >
> > 	    Use a /proc/dimm_info interface to pass DIMM information to Linux.
> > 	    Hardware vendors could add their hardware specific settings.
> 
> I don't think it makes sense to put any of this in common code.
> 
> > 	Page Flags:  When a page is discarded, PG_reserved is set so that the
> > 		page is no longer used.  A PG_hard_error flag could be added
> 
> That is not quite how PG_reserved works...

That's why it needs improvement.

> > 		to indicate the physical page has bad memory.
> >
> > 	Pseudo task switching:  Some architectures signal memory errors via
> > 		non maskable interrupts, with unusual calling sequences into
> > 		the OS.  It is often easier to process these non-maskable
> > 		errors on a stack that is separate from the normal kernel
> > 		stacks.  This requires non-blocking scheduler interfaces
> > 		to obtain the current running task, to modify the pointer
> > 		to the current running task and to reset that pointer when
> > 		the memory error has been processed.
> 
> 
> A "non blocking interface to obtain the current task"? aka "current"?
> 
> I sense some confusion here ;-)

See "[RFD] Separating struct task and the kernel stacks"
http://www.gelato.unsw.edu.au/linux-ia64/0506/14426.html

Thanks,
-- 
Russ Anderson, OS RAS/Partitioning Project Lead  
SGI - Silicon Graphics Inc          rja@sgi.com

* Re: [RCF] Linux memory error handling
  2005-06-15 15:26 ` Maciej W. Rozycki
@ 2005-06-15 19:46   ` Russell King
  2005-06-15 20:28     ` [RFC] " Russ Anderson
  2005-06-15 22:09   ` Russ Anderson
  2005-06-16  1:03   ` [RCF] " Ross Biro
  2 siblings, 1 reply; 16+ messages in thread
From: Russell King @ 2005-06-15 19:46 UTC (permalink / raw)
  To: Maciej W. Rozycki, Russ Anderson; +Cc: linux-kernel

On Wed, Jun 15, 2005 at 04:26:13PM +0100, Maciej W. Rozycki wrote:
> On Wed, 15 Jun 2005, Russ Anderson wrote:
> > 	Memory DIMM information & settings:
> > 
> > 	    Use a /proc/dimm_info interface to pass DIMM information to Linux.
> > 	    Hardware vendors could add their hardware specific settings.
> 
>  I'd recommend a more generic name rather than "dimm_info" if that is to 
> be reused universally.

Agree.

I'd also suggest that there be some method to tell the kernel from
architecture code about this "dimm_info" stuff - many embedded
platforms already know their memory organisation.

BTW, Russ, could we have a better description of what information is
intended to be supplied?

-- 
Russell King
 Linux kernel    2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:  2.6 Serial core

* Re: [RFC] Linux memory error handling
  2005-06-15 19:46   ` Russell King
@ 2005-06-15 20:28     ` Russ Anderson
  2005-06-15 20:45       ` Dave Hansen
  0 siblings, 1 reply; 16+ messages in thread
From: Russ Anderson @ 2005-06-15 20:28 UTC (permalink / raw)
  To: Russell King; +Cc: linux-kernel, Russ Anderson

Russell King wrote:
> On Wed, Jun 15, 2005 at 04:26:13PM +0100, Maciej W. Rozycki wrote:
> > On Wed, 15 Jun 2005, Russ Anderson wrote:
> > > 	Memory DIMM information & settings:
> > > 
> > > 	    Use a /proc/dimm_info interface to pass DIMM information to Linux.
> > > 	    Hardware vendors could add their hardware specific settings.
> > 
> >  I'd recommend a more generic name rather than "dimm_info" if that is to 
> > be reused universally.
> 
> Agree.

I really don't care what it's called, as long as it's descriptive.
/proc/meminfo is taken.  :-)

One idea would follow the concept of /proc/bus/ and have /proc/memory/
with different memory types:  /proc/memory/dimm0, /proc/memory/dimm1,
/proc/memory/flash0.
 
> I'd also suggest that there be some method to tell the kernel from
> architecture code about this "dimm_info" stuff - many embedded
> platforms already know their memory organisation.
> 
> BTW, Russ, could we have a better description of what information is
> intended to be supplied?

Part tracking info and configuration info.  For example, we were doing
some experiments to determine the relationship between refresh rates
and memory errors.  Could increasing the refresh rate reduce the number
of memory errors, therefore making memory more reliable for customers?
Could decreasing the refresh rate in manufacturing be used to identify
questionable DIMMs?  Having a convenient interface to read the current
refresh rate setting and write a new setting would be useful.

This type of info, not necessarily in this format:
------------------------------------------------------------------------------

EEPROM     JEDEC-SPD Info           Part Number        Rev  Speed  SGI      BC
---------- ------------------------ ------------------ ---- ------ -------- --
DIMM0 N0 L CE0000000000000006071D84 M3 12L6423DT0-CB3   0D    6.0  09/02/03 00
DIMM1 N0 L CE0000000000000006051CB2 M3 12L6423DT0-CB3   0D    6.0  09/02/03 00
DIMM2 N0 L no hardware detected
DIMM3 N0 L no hardware detected


-- 
Russ Anderson, OS RAS/Partitioning Project Lead  
SGI - Silicon Graphics Inc          rja@sgi.com

* Re: [RCF] Linux memory error handling
  2005-06-15 14:30 [RCF] Linux memory error handling Russ Anderson
  2005-06-15 15:08 ` Andi Kleen
  2005-06-15 15:26 ` Maciej W. Rozycki
@ 2005-06-15 20:42 ` Joel Schopp
  2005-06-16  2:54 ` Wang, Zhenyu
  3 siblings, 0 replies; 16+ messages in thread
From: Joel Schopp @ 2005-06-15 20:42 UTC (permalink / raw)
  To: Russ Anderson; +Cc: linux-kernel

> 	A common question is whether single bit (corrected) errors will 
> 	turn into double bit (uncorrected) errors.  The answer is it
> 	depends on the underlying cause of the memory error.  There are
> 	some errors that show up as single bits, especially transient 
> 	and soft errors, that do not degrade over time.  There are other
> 	failures that do degrade over time.

This sounds like one of our primary motivations for working on memory 
hotplug remove.  Detection of recoverable errors that degrade to 
unrecoverable errors, but don't because we remove the memory before it 
gets that far.

Much PPC64 hardware/firmware already supports this detection.


* Re: [RFC] Linux memory error handling
  2005-06-15 20:28     ` [RFC] " Russ Anderson
@ 2005-06-15 20:45       ` Dave Hansen
  2005-06-15 21:27         ` Russ Anderson
  0 siblings, 1 reply; 16+ messages in thread
From: Dave Hansen @ 2005-06-15 20:45 UTC (permalink / raw)
  To: Russ Anderson; +Cc: Russell King, Linux Kernel Mailing List

On Wed, 2005-06-15 at 15:28 -0500, Russ Anderson wrote:
> Russell King wrote:
> > On Wed, Jun 15, 2005 at 04:26:13PM +0100, Maciej W. Rozycki wrote:
> > > On Wed, 15 Jun 2005, Russ Anderson wrote:
> > > > 	Memory DIMM information & settings:
> > > > 
> > > > 	    Use a /proc/dimm_info interface to pass DIMM information to Linux.
> > > > 	    Hardware vendors could add their hardware specific settings.
> > > 
> > >  I'd recommend a more generic name rather than "dimm_info" if that is to 
> > > be reused universally.
> > 
> I really don't care what it's called, as long as it's descriptive.
> /proc/meminfo is taken.  :-)
> 
> One idea would follow the concept of /proc/bus/ and have /proc/memory/
> with different memory types.  /proc/memory/dimm0 /proc/memory/dimm1
> /proc/memory/flash0 .  

Please don't do this in /proc.  If it's a piece of hardware, and it
needs to have some information about it exported, then you need to use
kobjects and sysfs.  

-- Dave


* Re: [RFC] Linux memory error handling
  2005-06-15 20:45       ` Dave Hansen
@ 2005-06-15 21:27         ` Russ Anderson
  2005-06-15 21:33           ` Dave Hansen
  0 siblings, 1 reply; 16+ messages in thread
From: Russ Anderson @ 2005-06-15 21:27 UTC (permalink / raw)
  To: Dave Hansen; +Cc: Russ Anderson, Russell King, Linux Kernel Mailing List

Dave Hansen wrote:
> On Wed, 2005-06-15 at 15:28 -0500, Russ Anderson wrote:
> > Russell King wrote:
> > > On Wed, Jun 15, 2005 at 04:26:13PM +0100, Maciej W. Rozycki wrote:
> > > > On Wed, 15 Jun 2005, Russ Anderson wrote:
> > > > > 	Memory DIMM information & settings:
> > > > > 
> > > > > 	    Use a /proc/dimm_info interface to pass DIMM information to Linux.
> > > > > 	    Hardware vendors could add their hardware specific settings.
> > > > 
> > > >  I'd recommend a more generic name rather than "dimm_info" if that is to 
> > > > be reused universally.
> > > 
> > I really don't care what it's called, as long as it's descriptive.
> > /proc/meminfo is taken.  :-)
> > 
> > One idea would follow the concept of /proc/bus/ and have /proc/memory/
> > with different memory types.  /proc/memory/dimm0 /proc/memory/dimm1
> > /proc/memory/flash0 .  
> 
> Please don't do this in /proc.  If it's a piece of hardware, and it
> needs to have some information about it exported, then you need to use
> kobjects and sysfs.  

How about /sys/devices/system/memory/dimmX with links in
/sys/devices/system/node/nodeX/ ?  Does that sound better?


-- 
Russ Anderson, OS RAS/Partitioning Project Lead  
SGI - Silicon Graphics Inc          rja@sgi.com

* Re: [RFC] Linux memory error handling
  2005-06-15 21:27         ` Russ Anderson
@ 2005-06-15 21:33           ` Dave Hansen
  2005-06-20 20:42             ` Russ Anderson
  0 siblings, 1 reply; 16+ messages in thread
From: Dave Hansen @ 2005-06-15 21:33 UTC (permalink / raw)
  To: Russ Anderson; +Cc: Russell King, Linux Kernel Mailing List

On Wed, 2005-06-15 at 16:27 -0500, Russ Anderson wrote:
> How about /sys/devices/system/memory/dimmX with links in
> /sys/devices/system/node/nodeX/ ?  Does that sound better?

Much better than /proc :)

However, we're already using /sys/devices/system/memory/ for memory
hotplug to represent Linux's view of memory, and which physical
addresses it is currently using.  I've thought about this before, and I
think that we may want to have /sys/.../memory/hardware for the DIMM
information and memory/logical for the memory hotplug controls.
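
As a rough sketch of exporting such per-DIMM hardware information with
kobjects (the attribute and its placement are hypothetical -- a real
driver would hang this off the memory subsystem's kobject rather than
/sys/kernel):

	#include <linux/kobject.h>
	#include <linux/sysfs.h>
	#include <linux/init.h>
	#include <linux/errno.h>

	/* Hypothetical attribute: the size of one DIMM. */
	static int dimm_size_mb = 1024;

	static ssize_t size_mb_show(struct kobject *kobj,
				    struct kobj_attribute *attr, char *buf)
	{
		return sysfs_emit(buf, "%d\n", dimm_size_mb);
	}

	static struct kobj_attribute size_mb_attr = __ATTR_RO(size_mb);

	static struct kobject *dimm_kobj;

	static int __init dimm_sysfs_init(void)
	{
		dimm_kobj = kobject_create_and_add("dimm0", kernel_kobj);
		if (!dimm_kobj)
			return -ENOMEM;
		return sysfs_create_file(dimm_kobj, &size_mb_attr.attr);
	}
	late_initcall(dimm_sysfs_init);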

One other minor thing.  You might want to think about referring to the
pieces of memory as things other than DIMMs.  On ppc64, for instance,
the hypervisor hands off memory in sections called LMBs (logical memory
blocks), and they're not directly related to any hardware DIMM.  The
same things will show up in other virtualized environments.

-- Dave


* Re: [RFC] Linux memory error handling
  2005-06-15 15:26 ` Maciej W. Rozycki
  2005-06-15 19:46   ` Russell King
@ 2005-06-15 22:09   ` Russ Anderson
  2005-06-16 19:42     ` Maciej W. Rozycki
  2005-06-16  1:03   ` [RCF] " Ross Biro
  2 siblings, 1 reply; 16+ messages in thread
From: Russ Anderson @ 2005-06-15 22:09 UTC (permalink / raw)
  To: Maciej W. Rozycki; +Cc: Russ Anderson, linux-kernel

Maciej W. Rozycki wrote:
> On Wed, 15 Jun 2005, Russ Anderson wrote:
> 
> > Handling memory errors:
> > 
> > 	    Polling Threshold:  A solid single bit error can cause a burst
> > 		of correctable errors that can cause a significant logging
> > 		overhead.  SBE thresholding counts the number of SBEs for
> > 		a given page and if too many SBEs are detected in a given
> > 		period of time, the interrupt is disabled and instead 
> > 		linux periodically polls for corrected errors.
> 
>  This is highly undesirable if the same interrupt is used for MBEs.  A 
> page that causes an excessive number of SBEs should rather be removed from 
> the available pool instead.

As a practical point I think you are right that if there are enough 
SBEs to cause a performance hit, migrating the data to a different 
physical page would be a prudent thing to do.  But that functionality
hasn't been implemented yet.

That may not always be the right setting for all customers.
One possible way to deal with that would be to have different
threshold settings for logging and page migration.  That would
provide flexibility.

>                              Logging should probably take recent events 
> into account anyway and take care of not overloading the system, e.g. by 
> keeping only statistical data instead of detailed information about each 
> event under load.

That's what the SBE thresholding does.  It avoids overloading the
system by switching from interrupt mode to periodic polling
mode, where detailed information can get dropped.

> > 	    Memory handling parameters:
> > 
> > 		Since memory failure modes are due to specific DIMM
> > 		failure characteristics, there will be no way to 
> > 		reach agreement on one set of thresholds that will
> > 		be appropriate for all configurations.  Therefore there
> > 		needs to be a way to modify the thresholds.  One alternative
> > 		is a /proc/sys/kernel/ interface to control settings, such
> > 		as polling thresholds.  That provides an easy standard
> > 		way of modifying thresholds to match the characteristics
> > 		of the specific DIMM type.
> 
>  Note that scrubbing may also be required depending on hardware 
> capabilities as data could have been corrected on the fly for the purpose 
> of providing a correct value for the bus transaction, but memory may still 
> hold corrupted data.

Good point.

>  And of course not all memory is DIMM!

Another good point.

> > 	Uncorrected error handling:
> > 
> > 	    Kill the application:  One recovery technique to avoid a kernel
> > 		panic when an application process hits an uncorrectable 
> > 		memory error is to SIGKILL the application.  The page is 
> > 		marked PG_reserved to avoid re-use.  A (new) PG_hard_error
> > 		flag would be useful to indicate that the physical page has
> > 		a hard memory error.
> 
>  Note we have some infrastructure for that in the MIPS port -- we kill the 
> triggering process, but we don't mark the problematic memory page as 
> unusable (which is an area for improvement). 

MIPS has some nice features when it comes to error recovery.

Thanks,
-- 
Russ Anderson, OS RAS/Partitioning Project Lead  
SGI - Silicon Graphics Inc          rja@sgi.com

* Re: [RCF] Linux memory error handling
  2005-06-15 15:26 ` Maciej W. Rozycki
  2005-06-15 19:46   ` Russell King
  2005-06-15 22:09   ` Russ Anderson
@ 2005-06-16  1:03   ` Ross Biro
  2 siblings, 0 replies; 16+ messages in thread
From: Ross Biro @ 2005-06-16  1:03 UTC (permalink / raw)
  To: Maciej W. Rozycki; +Cc: Russ Anderson, linux-kernel

On 6/15/05, Maciej W. Rozycki <macro@linux-mips.org> wrote:
> On Wed, 15 Jun 2005, Russ Anderson wrote:
> 
> >
> >           Polling Threshold:  A solid single bit error can cause a burst
> >               of correctable errors that can cause a significant logging
> >               overhead.  SBE thresholding counts the number of SBEs for
> >               a given page and if too many SBEs are detected in a given
> >               period of time, the interrupt is disabled and instead
> >               linux periodically polls for corrected errors.
> 
>  This is highly undesirable if the same interrupt is used for MBEs.  A
> page that causes an excessive number of SBEs should rather be removed from
> the available pool instead.  Logging should probably take recent events
> into account anyway and take care of not overloading the system, e.g. by
> keeping only statistical data instead of detailed information about each
> event under load.
> 

First, SBEs and MBEs are named historically and are currently called
correctable and uncorrectable errors.  Modern chip sets can often
handle many incorrect bits in a single word and still correct the
problem.  So please don't assume you can make any inferences about the
probability of an MBE because you are seeing SBEs.  Any such
inferences would need to be chip set specific.

Some common chip sets have bugs in them that can cause an excessive
number of reported SBEs.  On those chip sets, even without any error
reporting, there is a noticeable performance hit when the SBE counters
go wild.  If every SBE generated an interrupt the system would grind
to a halt.  So there need to be easy ways to disable interrupts
associated with SBEs.

Also, some memory/chip set combinations generate a significant number
of SBEs without any significant danger of an MBE, so many people will
want to ignore SBEs entirely, or only poll once in a while.

Finally, many chip sets have memory scrubbing technology that can
simultaneously generate SBEs in memory not being accessed by the
kernel and fix those errors. So don't just assume that because the
kernel isn't allowing access to a page, you won't see SBEs or MBEs
from that page.

Otherwise, anything done in this direction seems like a good idea to me.

    Ross

* Re: [RCF] Linux memory error handling
  2005-06-15 14:30 [RCF] Linux memory error handling Russ Anderson
                   ` (2 preceding siblings ...)
  2005-06-15 20:42 ` Joel Schopp
@ 2005-06-16  2:54 ` Wang, Zhenyu
  3 siblings, 0 replies; 16+ messages in thread
From: Wang, Zhenyu @ 2005-06-16  2:54 UTC (permalink / raw)
  To: Russ Anderson; +Cc: linux-kernel

On 2005.06.15 09:30:13 +0000, Russ Anderson wrote:
> 		[RCF] Linux memory error handling.
> 
> Summary: One of the most common hardware failures in a computer 
> 	is a memory failure.  There have been efforts in various
> 	architectures to support recovery from memory errors.  This
> 	is an attempt to define a common support infrastructure
> 	in Linux to support memory error handling.
> 
> Background:  There has been considerable work on recovering from
> 	Machine Check Aborts (MCAs) in arch/ia64.  One result is
> 	that many memory errors encountered by user applications
> 	no longer cause a kernel panic.  The application is 
> 	terminated, but linux and other applications keep running.
> 	Additional improvements are becoming dependent on mainline
> 	linux support.  That requires involvement of lkml, not
> 	just linux-ia64.

Good RFC!  Actually, on the x86 arch, 'bluesmoke' - http://bluesmoke.sf.net -
is already out there for some simple memory ECC error handling.  It's
inspired by the old linux-ecc project.  Its current capability is limited
to detection, reporting, configurable polling, and panic on a UE.

Bluesmoke contains a driver core which is used to host info for each
memory controller, like DIMM info, and currently only polling is used for
each registered controller.  The rest are the chipset-specific drivers,
which are mostly platform dependent, e.g. e7520, 82875P, etc.  Those
platforms have also been tested; bluesmoke's webpage describes some test
methods if you really want to try it.

NMI handling is still under work; Dave and Corey's patches are on the
SourceForge page, and:

    http://lkml.org/lkml/2004/8/19/140
    http://lkml.org/lkml/2005/3/21/11

Those NMI callbacks have not been added to the chipset drivers yet, and
some initial testing failed; we still don't know why...

thanks
-zhen

* Re: [RFC] Linux memory error handling
  2005-06-15 22:09   ` Russ Anderson
@ 2005-06-16 19:42     ` Maciej W. Rozycki
  0 siblings, 0 replies; 16+ messages in thread
From: Maciej W. Rozycki @ 2005-06-16 19:42 UTC (permalink / raw)
  To: Russ Anderson; +Cc: linux-kernel

On Wed, 15 Jun 2005, Russ Anderson wrote:

> >  This is highly undesirable if the same interrupt is used for MBEs.  A 
> > page that causes an excessive number of SBEs should rather be removed from 
> > the available pool instead.
> 
> As a practical point I think you are right that if there are enough 
> SBEs to cause a performance hit, migrating the data to a different 
> physical page would be a prudent thing to do.  But that functionality
> hasn't been implemented yet.

 There's another point actually -- if a memory location (here meaning a 
single entity covered by ECC; usually a 32-bit word or a 64-bit 
doubleword) is causing SBEs excessively, then a bit there is probably 
somewhat less reliable due to wear, damage, etc.  But the normal memory 
error statistics apply to the other bits at that location as usual, so 
the probability of an MBE is now greater: with one bit already bad, a 
single additional independent bit failure in that word is enough to 
make the error uncorrectable.  So you'd better keep your data away.

> That may not always be the right setting for all customers.
> One possible way to deal with that would be to have different
> threshold settings for logging and page migration.  That would
> provide flexibility.

 Certainly.

> >                              Logging should probably take recent events 
> > into account anyway and take care of not overloading the system, e.g. by 
> > keeping only statistical data instead of detailed information about each 
> > event under load.
> 
> That's what the SBE thresholding does.  It avoids overloading the
> system by switching from interrupt mode to periodic polling
> mode, where detailed information can get dropped.

 But as I mentioned you have to be careful not to switch the MBE interrupt 
off as a side effect.  Actually the overhead for handling the interrupt 
shouldn't be that high, unless the error triggers in the handler itself.

> >  Note we have some infrastructure for that in the MIPS port -- we kill the 
> > triggering process, but we don't mark the problematic memory page as 
> > unusable (which is an area for improvement). 
> 
> Mips has some nice features when it comes to error recovery.

 Starting with synchronous reporting when possible. :-)

  Maciej

* Re: [RFC] Linux memory error handling
  2005-06-15 21:33           ` Dave Hansen
@ 2005-06-20 20:42             ` Russ Anderson
  2005-06-20 21:07               ` Dave Hansen
  0 siblings, 1 reply; 16+ messages in thread
From: Russ Anderson @ 2005-06-20 20:42 UTC (permalink / raw)
  To: Dave Hansen; +Cc: Russ Anderson, Russell King, Linux Kernel Mailing List

Dave Hansen wrote:
> On Wed, 2005-06-15 at 16:27 -0500, Russ Anderson wrote:
> > How about /sys/devices/system/memory/dimmX with links in
> > /sys/devices/system/node/nodeX/ ?  Does that sound better?
> 
> Much better than /proc :)
> 
> However, we're already using /sys/devices/system/memory/ for memory
> hotplug to represent Linux's view of memory, and which physical
> addresses it is currently using.  I've thought about this before, and I
> think that we may want to have /sys/.../memory/hardware for the DIMM
> information and memory/logical for the memory hotplug controls.

Is there a standard for how to name hardware entries in 
/sys/devices/system (or sysfs in general)?  Seems like this
same general issue would apply to other hardware components,
cpus, nodes, etc.  
 
> One other minor thing.  You might want to think about referring to the
> pieces of memory as things other than DIMMs.  On ppc64, for instance,
> the hypervisor hands off memory in sections called LMBs (logical memory
> blocks), and they're not directly related to any hardware DIMM.  The
> same things will show up in other virtualized environments.

If we're talking about specific hardware entries, it seems like they
should be called what they are.  If we're talking about abstractions,
a more abstract name seems in order.  One of my concerns is mapping 
failures back to hardware components, hence my bias for component names.
Would having /sys/.../memory/LMB on ppc64 to correspond to 
/sys/.../memory/DIMM be a problem?  RAM would be an alternative,
but that could be confused with /sys/block/ram.  :-)

In general, I'm more concerned with getting the necessary functionality
in than what the specific entries are called, so I'll go along with
the consensus.

-- 
Russ Anderson, OS RAS/Partitioning Project Lead  
SGI - Silicon Graphics Inc          rja@sgi.com

* Re: [RFC] Linux memory error handling
  2005-06-20 20:42             ` Russ Anderson
@ 2005-06-20 21:07               ` Dave Hansen
  0 siblings, 0 replies; 16+ messages in thread
From: Dave Hansen @ 2005-06-20 21:07 UTC (permalink / raw)
  To: Russ Anderson; +Cc: Russell King, Linux Kernel Mailing List

On Mon, 2005-06-20 at 15:42 -0500, Russ Anderson wrote:
> Is there a standard for how to name hardware entries in 
> /sys/devices/system (or sysfs in general)?

For system devices, no, I don't think so.

> Seems like this same general issue would apply to other hardware
> components, cpus, nodes, etc.  

I don't think it's true for anything other than memory.  Linux doesn't
manage CPUs or nodes any differently than it manages its internal
representations of them.

But, for memory at a level any less granular than pages, Linux and the
hardware seldom have the same view. 

> > One other minor thing.  You might want to think about referring to the
> > pieces of memory as things other than DIMMs.  On ppc64, for instance,
> > the hypervisor hands off memory in sections called LMBs (logical memory
> > blocks), and they're not directly related to any hardware DIMM.  The
> > same things will show up in other virtualized environments.
> 
> If we're talking about specific hardware entries, it seems like they
> should be called what they are.  If we're talking about abstractions,
> a more abstract name seems in order.  One of my concerns is mapping 
> failures back to hardware components, hence my bias for component names.

Even with generic names, mapping back to components should be easy.

Something like memory/type could even contain the hardware type of each
RAM entry.

> Would having /sys/.../memory/LMB on ppc64 to correspond to 
> /sys/.../memory/DIMM be a problem?

No, it wouldn't really be too much of a problem.  But, it's not a very
accurate description.  There is certainly hardware that has RAM which
does not come from a single DIMM. :)

> RAM would be an alternative,
> but that could be confused with /sys/block/ram.  :-)

RAM would probably be fine.  There should be very few properties
that /sys/devices/system devices share with /sys/block, so it shouldn't
be too bad.

> In general, I'm more concerned with getting the necessary functionality
> in than what the specific entries are called, so I'll go along with
> the consensus.

I don't think there's much of a consensus.  I just want to make sure we
do something that works on all platforms consistently.  For instance,
I'd hate to have every userspace utility that examines memory look like
this:

ARCH=`uname -m`
if [ "$ARCH" == "ppc64" ]; then
	UNIT=LMB
elif [ "$ARCH" == "ia64" ]; then
	UNIT=DIMM
fi
# etc...

FILE=/sys/devices/system/memory/$UNIT$NUMBER

-- Dave


end of thread

Thread overview: 16+ messages
2005-06-15 14:30 [RCF] Linux memory error handling Russ Anderson
2005-06-15 15:08 ` Andi Kleen
2005-06-15 16:36   ` Russ Anderson
2005-06-15 15:26 ` Maciej W. Rozycki
2005-06-15 19:46   ` Russell King
2005-06-15 20:28     ` [RFC] " Russ Anderson
2005-06-15 20:45       ` Dave Hansen
2005-06-15 21:27         ` Russ Anderson
2005-06-15 21:33           ` Dave Hansen
2005-06-20 20:42             ` Russ Anderson
2005-06-20 21:07               ` Dave Hansen
2005-06-15 22:09   ` Russ Anderson
2005-06-16 19:42     ` Maciej W. Rozycki
2005-06-16  1:03   ` [RCF] " Ross Biro
2005-06-15 20:42 ` Joel Schopp
2005-06-16  2:54 ` Wang, Zhenyu
