From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S261156AbVFOPIY (ORCPT ); Wed, 15 Jun 2005 11:08:24 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S261158AbVFOPIY (ORCPT ); Wed, 15 Jun 2005 11:08:24 -0400 Received: from one.firstfloor.org ([213.235.205.2]:50664 "EHLO one.firstfloor.org") by vger.kernel.org with ESMTP id S261156AbVFOPIH (ORCPT ); Wed, 15 Jun 2005 11:08:07 -0400 To: Russ Anderson Cc: linux-kernel@vger.kernel.org Subject: Re: [RCF] Linux memory error handling References: <200506151430.j5FEUD7J1393603@clink.americas.sgi.com> From: Andi Kleen Date: Wed, 15 Jun 2005 17:08:05 +0200 In-Reply-To: <200506151430.j5FEUD7J1393603@clink.americas.sgi.com> (Russ Anderson's message of "Wed, 15 Jun 2005 09:30:13 -0500 (CDT)") Message-ID: User-Agent: Gnus/5.110002 (No Gnus v0.2) Emacs/21.3 (gnu/linux) MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Sender: linux-kernel-owner@vger.kernel.org X-Mailing-List: linux-kernel@vger.kernel.org Russ Anderson writes: > [RCF] Linux memory error handling. RCF? RFC? > > Summary: One of the most common hardware failures in a computer > is a memory failure. There has been efforts in various > architectures to support recover from memory errors. This > is an attempt to define a common support infrastructure > in Linux to support memory error handling. Yes that is badly needed. With rmap we can do much better than we used to do. That code should be common though, not specific to an architecture. > Corrected error handling: > > Logging: When ECC hardware corrects a Single Bit Error (SBE), > an interrupt is generated to inform linux that there is > a corrected error record available for logging. I don't think it makes sense to commonize this - many platforms want to log these errors to platform specific firmware logs (like IA64 or PPC). Others who don't have such powerful firmware need to do their own thing (like x86-64's mcelog). But I don't see much commodiality. > > Polling Threshold: A solid single bit error can cause a burst > of correctable errors that can cause a significant logging > overhead. SBE thresholding counts the number of SBEs for > a given page and if too many SBEs are detected in a given > period of time, the interrupt is disabled and instead > linux periodically polls for corrected errors. I don't see how this could be sanely done in common code. It is deeply architecture specific. > > Data Migration: If a page of memory has too many single bit > errors, it may be prudent to move the data off that > physical page before the correctable SBE turns into an > uncorrectable MBE. This should be common code indeed. Similar for handling uncorrectable errors; e.g. swap the page in again from disk if possible or kill the application. That should be imho all common code I did a prototype of this some time ago, but ran out of time and it wasn't that useful on my platform anyways so I gave it up. > > Memory handling parameters: > > Since memory failure modes are due to specific DIMM > failure characteristics, there is will be no way to > reach agreement on one set of thresholds that will > be appropriate for all configurations. Therefore there > needs to be a way to modify the thresholds. One alternative > is a /proc/sys/kernel/ interface to control settings, such > as polling thresholds. That provides an easy standard > way of modifying thresholds to match the characteristics > of the specific DIMM type. This is deeply architecture and even platform specific. > > Uncorrected error handling: > > Kill the application: One recovery technique to avoid a kernel > panic when an application process hits an uncorrectable > memory error is to SIGKILL the application. The page is > marked PG_reserved to avoid re-use. A (new) PG_hard_error > flag would be useful to indicate that the physical page has > a hard memory error. No need for a new flag, just allocate it. This should be indeed common code using the rmap infrastructure. > Disable memory for next reboot: When a hard error is detected, > notify SAL/BIOS of the bad physical memory. SAL/BIOS can > save the bad addresses and, when building the EFI map after > reset/reboot, mark the bad pages as EFI_UNUSABLE_MEMORY, > and type = 0, so Linux will ignore granules contains these > pages. Deeply hardware specific. > Dumping: Dump programs should not try to dump pages with bad > memory. A PG_hard_error flag would indicate to dump > programs which pages have bad memory. There is no dump program in mainline. I have no problem with the flag, but for some reason the struct page bits seem to be very contended and 32bit will run out of them in the forseeable future. > > Memory DIMM information & settings: > > Use a /proc/dimm_info interface to pass DIMM information to Linux. > Hardware vendors could add their hardware specific settings. I don't think it makes sense to put any of this in common code. > Page Flags: When a page is discarded, PG_reserved is set so that the > page is no longer used. A PG_hard_error flag could be added That is not quite how PG_reserved works... > to indicate the physical page has bad memory. > > Pseudo task switching: Some architectures signal memory errors via > non maskable interrupts, with unusual calling sequences into > the OS. It is often easier to process these non-maskable > errors on a stack that is separate from the normal kernel > stacks. This requires non-blocking scheduler interfaces > to obtain the current running task, to modify the pointer > to the current running task and to reset that pointer when > the memory error has been processed. A "non blocking interface to obtain the current task"? aka "current"? I sense some confusion here ;-) Doing all the rmap process lookup etc. needed for the advanced handling needs to take sleep locks. No way around that. What I did in my x86-64 prototype to handle this was to raise a "self interrupt" (kind of a IPI to the current CPU that would raise next time interrupts were enabled or immediately in user space etc.) and then in the self interrupt where you have a defined context queue work for a CPU workqueue. The workqueue would then take the mm locks and look up the processes mapping the page and kill them etc. Basically the trick is to keep the tricky fully lockless part of the MCE handler as small as possible and "bootstrap" yourself in multiple steps to a defined process context where you can use the rest of the kernel sanely. This implies the actual machine check is processed a bit later. That is fine because near all CPUs seem to cause machine checks asynchronously to the normal instruction stream anyways (so you are already "too late") and adding a bit more delay is not too different. Trying to complicate everything and processing the MCE immediately thus does not help too much. For the common case of the MCE happening in user space it will be always immediately after the exception anyways. -Andi