linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 00/22] HWPOISON: Intro (v5)
@ 2009-06-15  2:45 Wu Fengguang
  2009-06-15  2:45 ` [PATCH 01/22] HWPOISON: Add page flag for poisoned pages Wu Fengguang
                   ` (22 more replies)
  0 siblings, 23 replies; 79+ messages in thread
From: Wu Fengguang @ 2009-06-15  2:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, Ingo Molnar, Mel Gorman, Wu, Fengguang, Thomas Gleixner,
	H. Peter Anvin, Peter Zijlstra, Nick Piggin, Hugh Dickins,
	Andi Kleen, riel, chris.mason, linux-mm

Hi all,

Comments are warmly welcome on the newly introduced uevent code :)

I hope we can reach consensus in this round and then be able to post
a final version for .31 inclusion.

changes since v4:
- new feature: uevent report (and collect various states for that)
- early kill cleanups
- remove unnecessary code in __do_fault()
- fix compile error when feature not enabled
- fix kernel oops when invalid pfn is feed to corrupt-pfn
- make tasklist_lock/anon_vma locking order straight
- use the safer invalidate page for possible metadata pages

Patches from previous versions plus some discussed fixes:
 [PATCH 01/22] HWPOISON: Add page flag for poisoned pages
 [PATCH 02/22] HWPOISON: Export some rmap vma locking to outside world
 [PATCH 03/22] HWPOISON: Add support for poison swap entries v2
 [PATCH 04/22] HWPOISON: Add new SIGBUS error codes for hardware poison signals
 [PATCH 05/22] HWPOISON: Add basic support for poisoned pages in fault handler v3
 [PATCH 06/22] HWPOISON: x86: Add VM_FAULT_HWPOISON handling to x86 page fault handler v2
 [PATCH 07/22] HWPOISON: define VM_FAULT_HWPOISON to 0 when feature is disabled
 [PATCH 08/22] HWPOISON: Use bitmask/action code for try_to_unmap behaviour
 [PATCH 09/22] HWPOISON: Handle hardware poisoned pages in try_to_unmap
 [PATCH 10/22] HWPOISON: check and isolate corrupted free pages v2
 [PATCH 11/22] HWPOISON: Refactor truncate to allow direct truncating of page v3
 [PATCH 12/22] HWPOISON: The high level memory error handler in the VM v7
 [PATCH 13/22] HWPOISON: Add madvise() based injector for hardware poisoned pages v3
 [PATCH 14/22] HWPOISON: Add simple debugfs interface to inject hwpoison on arbitary PFNs

New additions that need reviews:
 [PATCH 15/22] HWPOISON: early kill cleanups and fixes
 [PATCH 16/22] mm: move page flag numbers for user space to page-flags.h
 [PATCH 17/22] HWPOISON: introduce struct hwpoison_control
 [PATCH 18/22] HWPOISON: use compound head page
 [PATCH 19/22] HWPOISON: detect free buddy pages explicitly
 [PATCH 20/22] HWPOISON: collect infos that reflect the impact of the memory corruption
 [PATCH 21/22] HWPOISON: send uevent to report memory corruption
 [PATCH 22/22] HWPOISON: FOR TESTING: Enable memory failure code unconditionally

---
Upcoming Intel CPUs have support for recovering from some memory errors
(``MCA recovery''). This requires the OS to declare a page "poisoned",
kill the processes associated with it and avoid using it in the future.

This patchkit implements the necessary infrastructure in the VM.

To quote the overview comment:

 * High level machine check handler. Handles pages reported by the
 * hardware as being corrupted usually due to a 2bit ECC memory or cache
 * failure.
 *
 * This focusses on pages detected as corrupted in the background.
 * When the current CPU tries to consume corruption the currently
 * running process can just be killed directly instead. This implies
 * that if the error cannot be handled for some reason it's safe to
 * just ignore it because no corruption has been consumed yet. Instead
 * when that happens another machine check will happen.
 *
 * Handles page cache pages in various states. The tricky part
 * here is that we can access any page asynchronous to other VM
 * users, because memory failures could happen anytime and anywhere,
 * possibly violating some of their assumptions. This is why this code
 * has to be extremely careful. Generally it tries to use normal locking
 * rules, as in get the standard locks, even if that means the
 * error handling takes potentially a long time.
 *
 * Some of the operations here are somewhat inefficient and have non
 * linear algorithmic complexity, because the data structures have not
 * been optimized for this case. This is in particular the case
 * for the mapping from a vma to a process. Since this case is expected
 * to be rare we hope we can get away with this.

The code consists of a the high level handler in mm/memory-failure.c,
a new page poison bit and various checks in the VM to handle poisoned
pages.

The main target right now is KVM guests, but it works for all kinds
of applications.

For the KVM use there was need for a new signal type so that
KVM can inject the machine check into the guest with the proper
address. This in theory allows other applications to handle
memory failures too. The expection is that near all applications
won't do that, but some very specialized ones might.

This is not fully complete yet, in particular there are still ways
to access poison through various ways (crash dump, /proc/kcore etc.)
that need to be plugged too.
Also undoubtedly the high level handler still has bugs and cases
it cannot recover from. For example nonlinear mappings deadlock right now
and a few other cases lose references. Huge pages are not supported
yet. Any additional testing, reviewing etc. welcome.

The patch series requires the earlier x86 MCE feature series for the x86
specific action optional part. The code can be tested without the x86 specific
part using the injector, this only requires to enable the Kconfig entry
manually in some Kconfig file (by default it is implicitely enabled
by the architecture)

v2: Lots of smaller changes in the series based on review feedback.
Rename Poison to HWPoison after akpm's request.
A new pfn based injector based on feedback.
A lot of improvements mostly from Fengguang Wu
See comments in the individual patches.
v3: Various updates, see changelogs in individual patches.
v4: Various updates, see changelogs in individual patches.
v5: add uevent notification and some cleanups/fixes.

Thanks,
Fengguang
-- 


^ permalink raw reply	[flat|nested] 79+ messages in thread

* [PATCH 01/22] HWPOISON: Add page flag for poisoned pages
  2009-06-15  2:45 [PATCH 00/22] HWPOISON: Intro (v5) Wu Fengguang
@ 2009-06-15  2:45 ` Wu Fengguang
  2009-06-15  2:45 ` [PATCH 02/22] HWPOISON: Export some rmap vma locking to outside world Wu Fengguang
                   ` (21 subsequent siblings)
  22 siblings, 0 replies; 79+ messages in thread
From: Wu Fengguang @ 2009-06-15  2:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, Christoph Lameter, Andi Kleen, Ingo Molnar, Mel Gorman, Wu,
	Fengguang, Thomas Gleixner, H. Peter Anvin, Peter Zijlstra,
	Nick Piggin, Hugh Dickins, Andi Kleen, riel, chris.mason,
	linux-mm

[-- Attachment #1: page-flag-poison --]
[-- Type: text/plain, Size: 2585 bytes --]

From: Andi Kleen <ak@linux.intel.com>

Hardware poisoned pages need special handling in the VM and shouldn't be
touched again. This requires a new page flag. Define it here.

The page flags wars seem to be over, so it shouldn't be a problem
to get a new one.

v2: Add TestSetHWPoison (suggested by Johannes Weiner)
v3: Define TestSetHWPoison on !CONFIG_MEMORY_FAILURE (Fengguang)

Acked-by: Christoph Lameter <cl@linux.com>
Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>

---
 include/linux/page-flags.h |   21 ++++++++++++++++++++-
 1 file changed, 20 insertions(+), 1 deletion(-)

--- sound-2.6.orig/include/linux/page-flags.h
+++ sound-2.6/include/linux/page-flags.h
@@ -51,6 +51,9 @@
  * PG_buddy is set to indicate that the page is free and in the buddy system
  * (see mm/page_alloc.c).
  *
+ * PG_hwpoison indicates that a page got corrupted in hardware and contains
+ * data with incorrect ECC bits that triggered a machine check. Accessing is
+ * not safe since it may cause another machine check. Don't touch!
  */
 
 /*
@@ -102,6 +105,9 @@ enum pageflags {
 #ifdef CONFIG_IA64_UNCACHED_ALLOCATOR
 	PG_uncached,		/* Page has been mapped as uncached */
 #endif
+#ifdef CONFIG_MEMORY_FAILURE
+	PG_hwpoison,		/* hardware poisoned page. Don't touch */
+#endif
 	__NR_PAGEFLAGS,
 
 	/* Filesystems */
@@ -182,6 +188,9 @@ static inline void ClearPage##uname(stru
 #define __CLEARPAGEFLAG_NOOP(uname)					\
 static inline void __ClearPage##uname(struct page *page) {  }
 
+#define TESTSETFLAG_FALSE(uname)					\
+static inline int TestSetPage##uname(struct page *page) { return 0; }
+
 #define TESTCLEARFLAG_FALSE(uname)					\
 static inline int TestClearPage##uname(struct page *page) { return 0; }
 
@@ -265,6 +274,16 @@ PAGEFLAG(Uncached, uncached)
 PAGEFLAG_FALSE(Uncached)
 #endif
 
+#ifdef CONFIG_MEMORY_FAILURE
+PAGEFLAG(HWPoison, hwpoison)
+TESTSETFLAG(HWPoison, hwpoison)
+#define __PG_HWPOISON (1UL << PG_hwpoison)
+#else
+PAGEFLAG_FALSE(HWPoison)
+TESTSETFLAG_FALSE(HWPoison)
+#define __PG_HWPOISON 0
+#endif
+
 static inline int PageUptodate(struct page *page)
 {
 	int ret = test_bit(PG_uptodate, &(page)->flags);
@@ -389,7 +408,7 @@ static inline void __ClearPageTail(struc
 	 1 << PG_private | 1 << PG_private_2 | \
 	 1 << PG_buddy	 | 1 << PG_writeback | 1 << PG_reserved | \
 	 1 << PG_slab	 | 1 << PG_swapcache | 1 << PG_active | \
-	 1 << PG_unevictable | __PG_MLOCKED)
+	 1 << PG_unevictable | __PG_MLOCKED | __PG_HWPOISON)
 
 /*
  * Flags checked when a page is prepped for return by the page allocator.

-- 


^ permalink raw reply	[flat|nested] 79+ messages in thread

* [PATCH 02/22] HWPOISON: Export some rmap vma locking to outside world
  2009-06-15  2:45 [PATCH 00/22] HWPOISON: Intro (v5) Wu Fengguang
  2009-06-15  2:45 ` [PATCH 01/22] HWPOISON: Add page flag for poisoned pages Wu Fengguang
@ 2009-06-15  2:45 ` Wu Fengguang
  2009-06-15  2:45 ` [PATCH 03/22] HWPOISON: Add support for poison swap entries v2 Wu Fengguang
                   ` (20 subsequent siblings)
  22 siblings, 0 replies; 79+ messages in thread
From: Wu Fengguang @ 2009-06-15  2:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, Andi Kleen, Ingo Molnar, Mel Gorman, Wu, Fengguang,
	Thomas Gleixner, H. Peter Anvin, Peter Zijlstra, Nick Piggin,
	Hugh Dickins, Andi Kleen, riel, chris.mason, linux-mm

[-- Attachment #1: export-rmap --]
[-- Type: text/plain, Size: 1526 bytes --]

Needed for later patch that walks rmap entries on its own.

This used to be very frowned upon, but memory-failure.c does
some rather specialized rmap walking and rmap has been stable
for quite some time, so I think it's ok now to export it.

Signed-off-by: Andi Kleen <ak@linux.intel.com>

---
 include/linux/rmap.h |    6 ++++++
 mm/rmap.c            |    4 ++--
 2 files changed, 8 insertions(+), 2 deletions(-)

--- sound-2.6.orig/include/linux/rmap.h
+++ sound-2.6/include/linux/rmap.h
@@ -116,6 +116,12 @@ int try_to_munlock(struct page *);
 int page_wrprotect(struct page *page, int *odirect_sync, int count_offset);
 #endif
 
+/*
+ * Called by memory-failure.c to kill processes.
+ */
+struct anon_vma *page_lock_anon_vma(struct page *page);
+void page_unlock_anon_vma(struct anon_vma *anon_vma);
+
 #else	/* !CONFIG_MMU */
 
 #define anon_vma_init()		do {} while (0)
--- sound-2.6.orig/mm/rmap.c
+++ sound-2.6/mm/rmap.c
@@ -191,7 +191,7 @@ void __init anon_vma_init(void)
  * Getting a lock on a stable anon_vma from a page off the LRU is
  * tricky: page_lock_anon_vma rely on RCU to guard against the races.
  */
-static struct anon_vma *page_lock_anon_vma(struct page *page)
+struct anon_vma *page_lock_anon_vma(struct page *page)
 {
 	struct anon_vma *anon_vma;
 	unsigned long anon_mapping;
@@ -211,7 +211,7 @@ out:
 	return NULL;
 }
 
-static void page_unlock_anon_vma(struct anon_vma *anon_vma)
+void page_unlock_anon_vma(struct anon_vma *anon_vma)
 {
 	spin_unlock(&anon_vma->lock);
 	rcu_read_unlock();

-- 


^ permalink raw reply	[flat|nested] 79+ messages in thread

* [PATCH 03/22] HWPOISON: Add support for poison swap entries v2
  2009-06-15  2:45 [PATCH 00/22] HWPOISON: Intro (v5) Wu Fengguang
  2009-06-15  2:45 ` [PATCH 01/22] HWPOISON: Add page flag for poisoned pages Wu Fengguang
  2009-06-15  2:45 ` [PATCH 02/22] HWPOISON: Export some rmap vma locking to outside world Wu Fengguang
@ 2009-06-15  2:45 ` Wu Fengguang
  2009-06-15  2:45 ` [PATCH 04/22] HWPOISON: Add new SIGBUS error codes for hardware poison signals Wu Fengguang
                   ` (19 subsequent siblings)
  22 siblings, 0 replies; 79+ messages in thread
From: Wu Fengguang @ 2009-06-15  2:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, Andi Kleen, Ingo Molnar, Mel Gorman, Wu, Fengguang,
	Thomas Gleixner, H. Peter Anvin, Peter Zijlstra, Nick Piggin,
	Hugh Dickins, Andi Kleen, riel, chris.mason, linux-mm

[-- Attachment #1: poison-swp-entry --]
[-- Type: text/plain, Size: 3701 bytes --]

From: Andi Kleen <ak@linux.intel.com>

Memory migration uses special swap entry types to trigger special actions on 
page faults. Extend this mechanism to also support poisoned swap entries, to 
trigger poison handling on page faults. This allows follow-on patches to 
prevent processes from faulting in poisoned pages again.

v2: Fix overflow in MAX_SWAPFILES (Fengguang Wu)
v3: Better overflow fix (Hidehiro Kawai)

Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>
Reviewed-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>

---
 include/linux/swap.h    |   34 ++++++++++++++++++++++++++++------
 include/linux/swapops.h |   38 ++++++++++++++++++++++++++++++++++++++
 mm/swapfile.c           |    4 ++--
 3 files changed, 68 insertions(+), 8 deletions(-)

--- sound-2.6.orig/include/linux/swap.h
+++ sound-2.6/include/linux/swap.h
@@ -34,16 +34,38 @@ static inline int current_is_kswapd(void
  * the type/offset into the pte as 5/27 as well.
  */
 #define MAX_SWAPFILES_SHIFT	5
-#ifndef CONFIG_MIGRATION
-#define MAX_SWAPFILES		(1 << MAX_SWAPFILES_SHIFT)
+
+/*
+ * Use some of the swap files numbers for other purposes. This
+ * is a convenient way to hook into the VM to trigger special
+ * actions on faults.
+ */
+
+/*
+ * NUMA node memory migration support
+ */
+#ifdef CONFIG_MIGRATION
+#define SWP_MIGRATION_NUM 2
+#define SWP_MIGRATION_READ	(MAX_SWAPFILES + SWP_HWPOISON_NUM)
+#define SWP_MIGRATION_WRITE	(MAX_SWAPFILES + SWP_HWPOISON_NUM + 1)
 #else
-/* Use last two entries for page migration swap entries */
-#define MAX_SWAPFILES		((1 << MAX_SWAPFILES_SHIFT)-2)
-#define SWP_MIGRATION_READ	MAX_SWAPFILES
-#define SWP_MIGRATION_WRITE	(MAX_SWAPFILES + 1)
+#define SWP_MIGRATION_NUM 0
 #endif
 
 /*
+ * Handling of hardware poisoned pages with memory corruption.
+ */
+#ifdef CONFIG_MEMORY_FAILURE
+#define SWP_HWPOISON_NUM 1
+#define SWP_HWPOISON		MAX_SWAPFILES
+#else
+#define SWP_HWPOISON_NUM 0
+#endif
+
+#define MAX_SWAPFILES \
+	((1 << MAX_SWAPFILES_SHIFT) - SWP_MIGRATION_NUM - SWP_HWPOISON_NUM)
+
+/*
  * Magic header for a swap area. The first part of the union is
  * what the swap magic looks like for the old (limited to 128MB)
  * swap area format, the second part of the union adds - in the
--- sound-2.6.orig/include/linux/swapops.h
+++ sound-2.6/include/linux/swapops.h
@@ -131,3 +131,41 @@ static inline int is_write_migration_ent
 
 #endif
 
+#ifdef CONFIG_MEMORY_FAILURE
+/*
+ * Support for hardware poisoned pages
+ */
+static inline swp_entry_t make_hwpoison_entry(struct page *page)
+{
+	BUG_ON(!PageLocked(page));
+	return swp_entry(SWP_HWPOISON, page_to_pfn(page));
+}
+
+static inline int is_hwpoison_entry(swp_entry_t entry)
+{
+	return swp_type(entry) == SWP_HWPOISON;
+}
+#else
+
+static inline swp_entry_t make_hwpoison_entry(struct page *page)
+{
+	return swp_entry(0, 0);
+}
+
+static inline int is_hwpoison_entry(swp_entry_t swp)
+{
+	return 0;
+}
+#endif
+
+#if defined(CONFIG_MEMORY_FAILURE) || defined(CONFIG_MIGRATION)
+static inline int non_swap_entry(swp_entry_t entry)
+{
+	return swp_type(entry) >= MAX_SWAPFILES;
+}
+#else
+static inline int non_swap_entry(swp_entry_t entry)
+{
+	return 0;
+}
+#endif
--- sound-2.6.orig/mm/swapfile.c
+++ sound-2.6/mm/swapfile.c
@@ -697,7 +697,7 @@ int free_swap_and_cache(swp_entry_t entr
 	struct swap_info_struct *p;
 	struct page *page = NULL;
 
-	if (is_migration_entry(entry))
+	if (non_swap_entry(entry))
 		return 1;
 
 	p = swap_info_get(entry);
@@ -2083,7 +2083,7 @@ static int __swap_duplicate(swp_entry_t 
 	int count;
 	bool has_cache;
 
-	if (is_migration_entry(entry))
+	if (non_swap_entry(entry))
 		return -EINVAL;
 
 	type = swp_type(entry);

-- 


^ permalink raw reply	[flat|nested] 79+ messages in thread

* [PATCH 04/22] HWPOISON: Add new SIGBUS error codes for hardware poison signals
  2009-06-15  2:45 [PATCH 00/22] HWPOISON: Intro (v5) Wu Fengguang
                   ` (2 preceding siblings ...)
  2009-06-15  2:45 ` [PATCH 03/22] HWPOISON: Add support for poison swap entries v2 Wu Fengguang
@ 2009-06-15  2:45 ` Wu Fengguang
  2009-06-15  2:45 ` [PATCH 05/22] HWPOISON: Add basic support for poisoned pages in fault handler v3 Wu Fengguang
                   ` (18 subsequent siblings)
  22 siblings, 0 replies; 79+ messages in thread
From: Wu Fengguang @ 2009-06-15  2:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, Andi Kleen, Ingo Molnar, Mel Gorman, Wu, Fengguang,
	Thomas Gleixner, H. Peter Anvin, Peter Zijlstra, Nick Piggin,
	Hugh Dickins, Andi Kleen, riel, chris.mason, linux-mm

[-- Attachment #1: poison-signal --]
[-- Type: text/plain, Size: 2615 bytes --]

Add new SIGBUS codes for reporting machine checks as signals. When 
the hardware detects an uncorrected ECC error it can trigger these
signals.

This is needed for telling KVM's qemu about machine checks that happen to
guests, so that it can inject them, but might be also useful for other programs.
I find it useful in my test programs.

This patch merely defines the new types.

- Define two new si_codes for SIGBUS.  BUS_MCEERR_AO and BUS_MCEERR_AR
* BUS_MCEERR_AO is for "Action Optional" machine checks, which means that some
corruption has been detected in the background, but nothing has been consumed
so far. The program can ignore those if it wants (but most programs would
already get killed)
* BUS_MCEERR_AR is for "Action Required" machine checks. This happens
when corrupted data is consumed or the application ran into an area
which has been known to be corrupted earlier. These require immediate
action and cannot just returned to. Most programs would kill themselves.
- They report the address of the corruption in the user address space
in si_addr.
- Define a new si_addr_lsb field that reports the extent of the corruption
to user space. That's currently always a (small) page. The user application
cannot tell where in this page the corruption happened.

AK: I plan to write a man page update before anyone asks.

Signed-off-by: Andi Kleen <ak@linux.intel.com>

---
 include/asm-generic/siginfo.h |    8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

--- sound-2.6.orig/include/asm-generic/siginfo.h
+++ sound-2.6/include/asm-generic/siginfo.h
@@ -82,6 +82,7 @@ typedef struct siginfo {
 #ifdef __ARCH_SI_TRAPNO
 			int _trapno;	/* TRAP # which caused the signal */
 #endif
+			short _addr_lsb; /* LSB of the reported address */
 		} _sigfault;
 
 		/* SIGPOLL */
@@ -112,6 +113,7 @@ typedef struct siginfo {
 #ifdef __ARCH_SI_TRAPNO
 #define si_trapno	_sifields._sigfault._trapno
 #endif
+#define si_addr_lsb	_sifields._sigfault._addr_lsb
 #define si_band		_sifields._sigpoll._band
 #define si_fd		_sifields._sigpoll._fd
 
@@ -192,7 +194,11 @@ typedef struct siginfo {
 #define BUS_ADRALN	(__SI_FAULT|1)	/* invalid address alignment */
 #define BUS_ADRERR	(__SI_FAULT|2)	/* non-existant physical address */
 #define BUS_OBJERR	(__SI_FAULT|3)	/* object specific hardware error */
-#define NSIGBUS		3
+/* hardware memory error consumed on a machine check: action required */
+#define BUS_MCEERR_AR	(__SI_FAULT|4)
+/* hardware memory error detected in process but not consumed: action optional*/
+#define BUS_MCEERR_AO	(__SI_FAULT|5)
+#define NSIGBUS		5
 
 /*
  * SIGTRAP si_codes

-- 


^ permalink raw reply	[flat|nested] 79+ messages in thread

* [PATCH 05/22] HWPOISON: Add basic support for poisoned pages in fault handler v3
  2009-06-15  2:45 [PATCH 00/22] HWPOISON: Intro (v5) Wu Fengguang
                   ` (3 preceding siblings ...)
  2009-06-15  2:45 ` [PATCH 04/22] HWPOISON: Add new SIGBUS error codes for hardware poison signals Wu Fengguang
@ 2009-06-15  2:45 ` Wu Fengguang
  2009-06-15  2:45 ` [PATCH 06/22] HWPOISON: x86: Add VM_FAULT_HWPOISON handling to x86 page fault handler v2 Wu Fengguang
                   ` (17 subsequent siblings)
  22 siblings, 0 replies; 79+ messages in thread
From: Wu Fengguang @ 2009-06-15  2:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, Andi Kleen, Ingo Molnar, Mel Gorman, Wu, Fengguang,
	Thomas Gleixner, H. Peter Anvin, Peter Zijlstra, Nick Piggin,
	Hugh Dickins, Andi Kleen, riel, chris.mason, linux-mm

[-- Attachment #1: vm-fault-poison --]
[-- Type: text/plain, Size: 2658 bytes --]

From: Andi Kleen <ak@linux.intel.com>

- Add a new VM_FAULT_HWPOISON error code to handle_mm_fault. Right now
architectures have to explicitely enable poison page support, so
this is forward compatible to all architectures. They only need
to add it when they enable poison page support.
- Add poison page handling in swap in fault code

v2: Add missing delayacct_clear_flag (Hidehiro Kawai)
v3: Really use delayacct_clear_flag (Hidehiro Kawai)

Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>
Reviewed-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>

---
 include/linux/mm.h |    3 ++-
 mm/memory.c        |   18 +++++++++++++++---
 2 files changed, 17 insertions(+), 4 deletions(-)

--- sound-2.6.orig/mm/memory.c
+++ sound-2.6/mm/memory.c
@@ -1315,7 +1315,8 @@ int __get_user_pages(struct task_struct 
 				if (ret & VM_FAULT_ERROR) {
 					if (ret & VM_FAULT_OOM)
 						return i ? i : -ENOMEM;
-					else if (ret & VM_FAULT_SIGBUS)
+					if (ret &
+					    (VM_FAULT_HWPOISON|VM_FAULT_SIGBUS))
 						return i ? i : -EFAULT;
 					BUG();
 				}
@@ -2595,8 +2596,15 @@ static int do_swap_page(struct mm_struct
 		goto out;
 
 	entry = pte_to_swp_entry(orig_pte);
-	if (is_migration_entry(entry)) {
-		migration_entry_wait(mm, pmd, address);
+	if (unlikely(non_swap_entry(entry))) {
+		if (is_migration_entry(entry)) {
+			migration_entry_wait(mm, pmd, address);
+		} else if (is_hwpoison_entry(entry)) {
+			ret = VM_FAULT_HWPOISON;
+		} else {
+			print_bad_pte(vma, address, pte, NULL);
+			ret = VM_FAULT_OOM;
+		}
 		goto out;
 	}
 	delayacct_set_flag(DELAYACCT_PF_SWAPIN);
@@ -2620,6 +2628,10 @@ static int do_swap_page(struct mm_struct
 		/* Had to read the page from swap area: Major fault */
 		ret = VM_FAULT_MAJOR;
 		count_vm_event(PGMAJFAULT);
+	} else if (PageHWPoison(page)) {
+		ret = VM_FAULT_HWPOISON;
+		delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
+		goto out;
 	}
 
 	lock_page(page);
--- sound-2.6.orig/include/linux/mm.h
+++ sound-2.6/include/linux/mm.h
@@ -700,11 +700,12 @@ static inline int page_mapped(struct pag
 #define VM_FAULT_SIGBUS	0x0002
 #define VM_FAULT_MAJOR	0x0004
 #define VM_FAULT_WRITE	0x0008	/* Special case for get_user_pages */
+#define VM_FAULT_HWPOISON 0x0010	/* Hit poisoned page */
 
 #define VM_FAULT_NOPAGE	0x0100	/* ->fault installed the pte, not return page */
 #define VM_FAULT_LOCKED	0x0200	/* ->fault locked the returned page */
 
-#define VM_FAULT_ERROR	(VM_FAULT_OOM | VM_FAULT_SIGBUS)
+#define VM_FAULT_ERROR	(VM_FAULT_OOM | VM_FAULT_SIGBUS | VM_FAULT_HWPOISON)
 
 /*
  * Can be called by the pagefault handler when it gets a VM_FAULT_OOM.

-- 


^ permalink raw reply	[flat|nested] 79+ messages in thread

* [PATCH 06/22] HWPOISON: x86: Add VM_FAULT_HWPOISON handling to x86 page fault handler v2
  2009-06-15  2:45 [PATCH 00/22] HWPOISON: Intro (v5) Wu Fengguang
                   ` (4 preceding siblings ...)
  2009-06-15  2:45 ` [PATCH 05/22] HWPOISON: Add basic support for poisoned pages in fault handler v3 Wu Fengguang
@ 2009-06-15  2:45 ` Wu Fengguang
  2009-06-15  2:45 ` [PATCH 07/22] HWPOISON: define VM_FAULT_HWPOISON to 0 when feature is disabled Wu Fengguang
                   ` (16 subsequent siblings)
  22 siblings, 0 replies; 79+ messages in thread
From: Wu Fengguang @ 2009-06-15  2:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, Andi Kleen, Ingo Molnar, Mel Gorman, Wu, Fengguang,
	Thomas Gleixner, H. Peter Anvin, Peter Zijlstra, Nick Piggin,
	Hugh Dickins, Andi Kleen, riel, chris.mason, linux-mm

[-- Attachment #1: x86-vmfault-poison --]
[-- Type: text/plain, Size: 2025 bytes --]

From: Andi Kleen <ak@linux.intel.com>

Add VM_FAULT_HWPOISON handling to the x86 page fault handler. This is 
very similar to VM_FAULT_OOM, the only difference is that a different
si_code is passed to user space and the new addr_lsb field is initialized.

v2: Make the printk more verbose/unique

Signed-off-by: Andi Kleen <ak@linux.intel.com>

---
 arch/x86/mm/fault.c |   19 +++++++++++++++----
 1 file changed, 15 insertions(+), 4 deletions(-)

--- sound-2.6.orig/arch/x86/mm/fault.c
+++ sound-2.6/arch/x86/mm/fault.c
@@ -167,6 +167,7 @@ force_sig_info_fault(int si_signo, int s
 	info.si_errno	= 0;
 	info.si_code	= si_code;
 	info.si_addr	= (void __user *)address;
+	info.si_addr_lsb = si_code == BUS_MCEERR_AR ? PAGE_SHIFT : 0;
 
 	force_sig_info(si_signo, &info, tsk);
 }
@@ -798,10 +799,12 @@ out_of_memory(struct pt_regs *regs, unsi
 }
 
 static void
-do_sigbus(struct pt_regs *regs, unsigned long error_code, unsigned long address)
+do_sigbus(struct pt_regs *regs, unsigned long error_code, unsigned long address,
+	  unsigned int fault)
 {
 	struct task_struct *tsk = current;
 	struct mm_struct *mm = tsk->mm;
+	int code = BUS_ADRERR;
 
 	up_read(&mm->mmap_sem);
 
@@ -817,7 +820,15 @@ do_sigbus(struct pt_regs *regs, unsigned
 	tsk->thread.error_code	= error_code;
 	tsk->thread.trap_no	= 14;
 
-	force_sig_info_fault(SIGBUS, BUS_ADRERR, address, tsk);
+#ifdef CONFIG_MEMORY_FAILURE
+	if (fault & VM_FAULT_HWPOISON) {
+		printk(KERN_ERR
+	"MCE: Killing %s:%d due to hardware memory corruption fault at %lx\n",
+			tsk->comm, tsk->pid, address);
+		code = BUS_MCEERR_AR;
+	}
+#endif
+	force_sig_info_fault(SIGBUS, code, address, tsk);
 }
 
 static noinline void
@@ -827,8 +838,8 @@ mm_fault_error(struct pt_regs *regs, uns
 	if (fault & VM_FAULT_OOM) {
 		out_of_memory(regs, error_code, address);
 	} else {
-		if (fault & VM_FAULT_SIGBUS)
-			do_sigbus(regs, error_code, address);
+		if (fault & (VM_FAULT_SIGBUS|VM_FAULT_HWPOISON))
+			do_sigbus(regs, error_code, address, fault);
 		else
 			BUG();
 	}

-- 


^ permalink raw reply	[flat|nested] 79+ messages in thread

* [PATCH 07/22] HWPOISON: define VM_FAULT_HWPOISON to 0 when feature is disabled
  2009-06-15  2:45 [PATCH 00/22] HWPOISON: Intro (v5) Wu Fengguang
                   ` (5 preceding siblings ...)
  2009-06-15  2:45 ` [PATCH 06/22] HWPOISON: x86: Add VM_FAULT_HWPOISON handling to x86 page fault handler v2 Wu Fengguang
@ 2009-06-15  2:45 ` Wu Fengguang
  2009-06-15  2:45 ` [PATCH 08/22] HWPOISON: Use bitmask/action code for try_to_unmap behaviour Wu Fengguang
                   ` (15 subsequent siblings)
  22 siblings, 0 replies; 79+ messages in thread
From: Wu Fengguang @ 2009-06-15  2:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, Nick Piggin, Wu Fengguang, Ingo Molnar, Mel Gorman,
	Thomas Gleixner, H. Peter Anvin, Peter Zijlstra, Hugh Dickins,
	Andi Kleen, riel, chris.mason, linux-mm

[-- Attachment #1: hwpoison-remove-ifdef.patch --]
[-- Type: text/plain, Size: 1531 bytes --]

From: Wu Fengguang <fengguang.wu@intel.com>

So as to eliminate one #ifdef in the c source.

Proposed by Nick Piggin.

Acked-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 arch/x86/mm/fault.c |    3 +--
 include/linux/mm.h  |    7 ++++++-
 2 files changed, 7 insertions(+), 3 deletions(-)

--- sound-2.6.orig/arch/x86/mm/fault.c
+++ sound-2.6/arch/x86/mm/fault.c
@@ -820,14 +820,13 @@ do_sigbus(struct pt_regs *regs, unsigned
 	tsk->thread.error_code	= error_code;
 	tsk->thread.trap_no	= 14;
 
-#ifdef CONFIG_MEMORY_FAILURE
 	if (fault & VM_FAULT_HWPOISON) {
 		printk(KERN_ERR
 	"MCE: Killing %s:%d due to hardware memory corruption fault at %lx\n",
 			tsk->comm, tsk->pid, address);
 		code = BUS_MCEERR_AR;
 	}
-#endif
+
 	force_sig_info_fault(SIGBUS, code, address, tsk);
 }
 
--- sound-2.6.orig/include/linux/mm.h
+++ sound-2.6/include/linux/mm.h
@@ -700,11 +700,16 @@ static inline int page_mapped(struct pag
 #define VM_FAULT_SIGBUS	0x0002
 #define VM_FAULT_MAJOR	0x0004
 #define VM_FAULT_WRITE	0x0008	/* Special case for get_user_pages */
-#define VM_FAULT_HWPOISON 0x0010	/* Hit poisoned page */
 
 #define VM_FAULT_NOPAGE	0x0100	/* ->fault installed the pte, not return page */
 #define VM_FAULT_LOCKED	0x0200	/* ->fault locked the returned page */
 
+#ifdef CONFIG_MEMORY_FAILURE
+#define VM_FAULT_HWPOISON 0x0010	/* Hit poisoned page */
+#else
+#define VM_FAULT_HWPOISON 0
+#endif
+
 #define VM_FAULT_ERROR	(VM_FAULT_OOM | VM_FAULT_SIGBUS | VM_FAULT_HWPOISON)
 
 /*

-- 


^ permalink raw reply	[flat|nested] 79+ messages in thread

* [PATCH 08/22] HWPOISON: Use bitmask/action code for try_to_unmap behaviour
  2009-06-15  2:45 [PATCH 00/22] HWPOISON: Intro (v5) Wu Fengguang
                   ` (6 preceding siblings ...)
  2009-06-15  2:45 ` [PATCH 07/22] HWPOISON: define VM_FAULT_HWPOISON to 0 when feature is disabled Wu Fengguang
@ 2009-06-15  2:45 ` Wu Fengguang
  2009-06-15  2:45 ` [PATCH 09/22] HWPOISON: Handle hardware poisoned pages in try_to_unmap Wu Fengguang
                   ` (14 subsequent siblings)
  22 siblings, 0 replies; 79+ messages in thread
From: Wu Fengguang @ 2009-06-15  2:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, Lee.Schermerhorn, npiggin, Andi Kleen, Ingo Molnar,
	Mel Gorman, Wu, Fengguang, Thomas Gleixner, H. Peter Anvin,
	Peter Zijlstra, Hugh Dickins, Andi Kleen, riel, chris.mason,
	linux-mm

[-- Attachment #1: try-to-unmap-flags --]
[-- Type: text/plain, Size: 8566 bytes --]

From: Andi Kleen <ak@linux.intel.com>

try_to_unmap currently has multiple modi (migration, munlock, normal unmap)
which are selected by magic flag variables. The logic is not very straight
forward, because each of these flag change multiple behaviours (e.g.
migration turns off aging, not only sets up migration ptes etc.)
Also the different flags interact in magic ways.

A later patch in this series adds another mode to try_to_unmap, so
this becomes quickly unmanageable.

Replace the different flags with a action code (migration, munlock, munmap)
and some additional flags as modifiers (ignore mlock, ignore aging).
This makes the logic more straight forward and allows easier extension
to new behaviours. Change all the caller to declare what they want to
do.

This patch is supposed to be a nop in behaviour. If anyone can prove
it is not that would be a bug.

Cc: Lee.Schermerhorn@hp.com
Cc: npiggin@suse.de
Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>

---
 include/linux/rmap.h |   14 +++++++++++++-
 mm/migrate.c         |    2 +-
 mm/rmap.c            |   40 ++++++++++++++++++++++------------------
 mm/vmscan.c          |    2 +-
 4 files changed, 37 insertions(+), 21 deletions(-)

--- sound-2.6.orig/include/linux/rmap.h
+++ sound-2.6/include/linux/rmap.h
@@ -85,7 +85,19 @@ static inline void page_dup_rmap(struct 
  */
 int page_referenced(struct page *, int is_locked,
 			struct mem_cgroup *cnt, unsigned long *vm_flags);
-int try_to_unmap(struct page *, int ignore_refs);
+
+enum ttu_flags {
+	TTU_UNMAP = 0,			/* unmap mode */
+	TTU_MIGRATION = 1,		/* migration mode */
+	TTU_MUNLOCK = 2,		/* munlock mode */
+	TTU_ACTION_MASK = 0xff,
+
+	TTU_IGNORE_MLOCK = (1 << 8),	/* ignore mlock */
+	TTU_IGNORE_ACCESS = (1 << 9),	/* don't age */
+};
+#define TTU_ACTION(x) ((x) & TTU_ACTION_MASK)
+
+int try_to_unmap(struct page *, enum ttu_flags flags);
 
 /*
  * Called from mm/filemap_xip.c to unmap empty zero page
--- sound-2.6.orig/mm/rmap.c
+++ sound-2.6/mm/rmap.c
@@ -912,7 +912,7 @@ void page_remove_rmap(struct page *page)
  * repeatedly from either try_to_unmap_anon or try_to_unmap_file.
  */
 static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
-				int migration)
+				enum ttu_flags flags)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	unsigned long address;
@@ -934,11 +934,13 @@ static int try_to_unmap_one(struct page 
 	 * If it's recently referenced (perhaps page_referenced
 	 * skipped over this mm) then we should reactivate it.
 	 */
-	if (!migration) {
+	if (!(flags & TTU_IGNORE_MLOCK)) {
 		if (vma->vm_flags & VM_LOCKED) {
 			ret = SWAP_MLOCK;
 			goto out_unmap;
 		}
+	}
+	if (!(flags & TTU_IGNORE_ACCESS)) {
 		if (ptep_clear_flush_young_notify(vma, address, pte)) {
 			ret = SWAP_FAIL;
 			goto out_unmap;
@@ -978,12 +980,12 @@ static int try_to_unmap_one(struct page 
 			 * pte. do_swap_page() will wait until the migration
 			 * pte is removed and then restart fault handling.
 			 */
-			BUG_ON(!migration);
+			BUG_ON(TTU_ACTION(flags) != TTU_MIGRATION);
 			entry = make_migration_entry(page, pte_write(pteval));
 		}
 		set_pte_at(mm, address, pte, swp_entry_to_pte(entry));
 		BUG_ON(pte_file(*pte));
-	} else if (PAGE_MIGRATION && migration) {
+	} else if (PAGE_MIGRATION && (TTU_ACTION(flags) == TTU_MIGRATION)) {
 		/* Establish migration entry for a file page */
 		swp_entry_t entry;
 		entry = make_migration_entry(page, pte_write(pteval));
@@ -1152,12 +1154,13 @@ static int try_to_mlock_page(struct page
  * vm_flags for that VMA.  That should be OK, because that vma shouldn't be
  * 'LOCKED.
  */
-static int try_to_unmap_anon(struct page *page, int unlock, int migration)
+static int try_to_unmap_anon(struct page *page, enum ttu_flags flags)
 {
 	struct anon_vma *anon_vma;
 	struct vm_area_struct *vma;
 	unsigned int mlocked = 0;
 	int ret = SWAP_AGAIN;
+	int unlock = TTU_ACTION(flags) == TTU_MUNLOCK;
 
 	if (MLOCK_PAGES && unlikely(unlock))
 		ret = SWAP_SUCCESS;	/* default for try_to_munlock() */
@@ -1173,7 +1176,7 @@ static int try_to_unmap_anon(struct page
 				continue;  /* must visit all unlocked vmas */
 			ret = SWAP_MLOCK;  /* saw at least one mlocked vma */
 		} else {
-			ret = try_to_unmap_one(page, vma, migration);
+			ret = try_to_unmap_one(page, vma, flags);
 			if (ret == SWAP_FAIL || !page_mapped(page))
 				break;
 		}
@@ -1197,8 +1200,7 @@ static int try_to_unmap_anon(struct page
 /**
  * try_to_unmap_file - unmap/unlock file page using the object-based rmap method
  * @page: the page to unmap/unlock
- * @unlock:  request for unlock rather than unmap [unlikely]
- * @migration:  unmapping for migration - ignored if @unlock
+ * @flags: action and flags
  *
  * Find all the mappings of a page using the mapping pointer and the vma chains
  * contained in the address_space struct it points to.
@@ -1210,7 +1212,7 @@ static int try_to_unmap_anon(struct page
  * vm_flags for that VMA.  That should be OK, because that vma shouldn't be
  * 'LOCKED.
  */
-static int try_to_unmap_file(struct page *page, int unlock, int migration)
+static int try_to_unmap_file(struct page *page, enum ttu_flags flags)
 {
 	struct address_space *mapping = page->mapping;
 	pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
@@ -1222,6 +1224,7 @@ static int try_to_unmap_file(struct page
 	unsigned long max_nl_size = 0;
 	unsigned int mapcount;
 	unsigned int mlocked = 0;
+	int unlock = TTU_ACTION(flags) == TTU_MUNLOCK;
 
 	if (MLOCK_PAGES && unlikely(unlock))
 		ret = SWAP_SUCCESS;	/* default for try_to_munlock() */
@@ -1234,7 +1237,7 @@ static int try_to_unmap_file(struct page
 				continue;	/* must visit all vmas */
 			ret = SWAP_MLOCK;
 		} else {
-			ret = try_to_unmap_one(page, vma, migration);
+			ret = try_to_unmap_one(page, vma, flags);
 			if (ret == SWAP_FAIL || !page_mapped(page))
 				goto out;
 		}
@@ -1259,7 +1262,8 @@ static int try_to_unmap_file(struct page
 			ret = SWAP_MLOCK;	/* leave mlocked == 0 */
 			goto out;		/* no need to look further */
 		}
-		if (!MLOCK_PAGES && !migration && (vma->vm_flags & VM_LOCKED))
+		if (!MLOCK_PAGES && !(flags & TTU_IGNORE_MLOCK) &&
+			(vma->vm_flags & VM_LOCKED))
 			continue;
 		cursor = (unsigned long) vma->vm_private_data;
 		if (cursor > max_nl_cursor)
@@ -1293,7 +1297,7 @@ static int try_to_unmap_file(struct page
 	do {
 		list_for_each_entry(vma, &mapping->i_mmap_nonlinear,
 						shared.vm_set.list) {
-			if (!MLOCK_PAGES && !migration &&
+			if (!MLOCK_PAGES && !(flags & TTU_IGNORE_MLOCK) &&
 			    (vma->vm_flags & VM_LOCKED))
 				continue;
 			cursor = (unsigned long) vma->vm_private_data;
@@ -1333,7 +1337,7 @@ out:
 /**
  * try_to_unmap - try to remove all page table mappings to a page
  * @page: the page to get unmapped
- * @migration: migration flag
+ * @flags: action and flags
  *
  * Tries to remove all the page table entries which are mapping this
  * page, used in the pageout path.  Caller must hold the page lock.
@@ -1344,16 +1348,16 @@ out:
  * SWAP_FAIL	- the page is unswappable
  * SWAP_MLOCK	- page is mlocked.
  */
-int try_to_unmap(struct page *page, int migration)
+int try_to_unmap(struct page *page, enum ttu_flags flags)
 {
 	int ret;
 
 	BUG_ON(!PageLocked(page));
 
 	if (PageAnon(page))
-		ret = try_to_unmap_anon(page, 0, migration);
+		ret = try_to_unmap_anon(page, flags);
 	else
-		ret = try_to_unmap_file(page, 0, migration);
+		ret = try_to_unmap_file(page, flags);
 	if (ret != SWAP_MLOCK && !page_mapped(page))
 		ret = SWAP_SUCCESS;
 	return ret;
@@ -1378,8 +1382,8 @@ int try_to_munlock(struct page *page)
 	VM_BUG_ON(!PageLocked(page) || PageLRU(page));
 
 	if (PageAnon(page))
-		return try_to_unmap_anon(page, 1, 0);
+		return try_to_unmap_anon(page, TTU_MUNLOCK);
 	else
-		return try_to_unmap_file(page, 1, 0);
+		return try_to_unmap_file(page, TTU_MUNLOCK);
 }
 
--- sound-2.6.orig/mm/vmscan.c
+++ sound-2.6/mm/vmscan.c
@@ -661,7 +661,7 @@ static unsigned long shrink_page_list(st
 		 * processes. Try to unmap it here.
 		 */
 		if (page_mapped(page) && mapping) {
-			switch (try_to_unmap(page, 0)) {
+			switch (try_to_unmap(page, TTU_UNMAP)) {
 			case SWAP_FAIL:
 				goto activate_locked;
 			case SWAP_AGAIN:
--- sound-2.6.orig/mm/migrate.c
+++ sound-2.6/mm/migrate.c
@@ -669,7 +669,7 @@ static int unmap_and_move(new_page_t get
 	}
 
 	/* Establish migration ptes or remove ptes */
-	try_to_unmap(page, 1);
+	try_to_unmap(page, TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS);
 
 	if (!page_mapped(page))
 		rc = move_to_new_page(newpage, page);

-- 


^ permalink raw reply	[flat|nested] 79+ messages in thread

* [PATCH 09/22] HWPOISON: Handle hardware poisoned pages in try_to_unmap
  2009-06-15  2:45 [PATCH 00/22] HWPOISON: Intro (v5) Wu Fengguang
                   ` (7 preceding siblings ...)
  2009-06-15  2:45 ` [PATCH 08/22] HWPOISON: Use bitmask/action code for try_to_unmap behaviour Wu Fengguang
@ 2009-06-15  2:45 ` Wu Fengguang
  2009-06-15 13:09   ` Minchan Kim
  2009-06-15  2:45 ` [PATCH 10/22] HWPOISON: check and isolate corrupted free pages v2 Wu Fengguang
                   ` (13 subsequent siblings)
  22 siblings, 1 reply; 79+ messages in thread
From: Wu Fengguang @ 2009-06-15  2:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, Andi Kleen, Ingo Molnar, Mel Gorman, Wu, Fengguang,
	Thomas Gleixner, H. Peter Anvin, Peter Zijlstra, Nick Piggin,
	Hugh Dickins, Andi Kleen, riel, chris.mason, linux-mm

[-- Attachment #1: try-to-unmap-poison --]
[-- Type: text/plain, Size: 1466 bytes --]

From: Andi Kleen <ak@linux.intel.com>

When a page has the poison bit set replace the PTE with a poison entry.
This causes the right error handling to be done later when a process runs
into it.

Also add a new flag to not do that (needed for the memory-failure handler
later)

Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>

---
 include/linux/rmap.h |    1 +
 mm/rmap.c            |    9 ++++++++-
 2 files changed, 9 insertions(+), 1 deletion(-)

--- sound-2.6.orig/mm/rmap.c
+++ sound-2.6/mm/rmap.c
@@ -958,7 +958,14 @@ static int try_to_unmap_one(struct page 
 	/* Update high watermark before we lower rss */
 	update_hiwater_rss(mm);
 
-	if (PageAnon(page)) {
+	if (PageHWPoison(page) && !(flags & TTU_IGNORE_HWPOISON)) {
+		if (PageAnon(page))
+			dec_mm_counter(mm, anon_rss);
+		else if (!is_migration_entry(pte_to_swp_entry(*pte)))
+			dec_mm_counter(mm, file_rss);
+		set_pte_at(mm, address, pte,
+				swp_entry_to_pte(make_hwpoison_entry(page)));
+	} else if (PageAnon(page)) {
 		swp_entry_t entry = { .val = page_private(page) };
 
 		if (PageSwapCache(page)) {
--- sound-2.6.orig/include/linux/rmap.h
+++ sound-2.6/include/linux/rmap.h
@@ -94,6 +94,7 @@ enum ttu_flags {
 
 	TTU_IGNORE_MLOCK = (1 << 8),	/* ignore mlock */
 	TTU_IGNORE_ACCESS = (1 << 9),	/* don't age */
+	TTU_IGNORE_HWPOISON = (1 << 10),/* corrupted page is recoverable */
 };
 #define TTU_ACTION(x) ((x) & TTU_ACTION_MASK)
 

-- 


^ permalink raw reply	[flat|nested] 79+ messages in thread

* [PATCH 10/22] HWPOISON: check and isolate corrupted free pages v2
  2009-06-15  2:45 [PATCH 00/22] HWPOISON: Intro (v5) Wu Fengguang
                   ` (8 preceding siblings ...)
  2009-06-15  2:45 ` [PATCH 09/22] HWPOISON: Handle hardware poisoned pages in try_to_unmap Wu Fengguang
@ 2009-06-15  2:45 ` Wu Fengguang
  2009-06-15  9:41   ` KAMEZAWA Hiroyuki
  2009-06-15  2:45 ` [PATCH 11/22] HWPOISON: Refactor truncate to allow direct truncating of page v3 Wu Fengguang
                   ` (12 subsequent siblings)
  22 siblings, 1 reply; 79+ messages in thread
From: Wu Fengguang @ 2009-06-15  2:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, Wu Fengguang, Andi Kleen, Ingo Molnar, Mel Gorman,
	Thomas Gleixner, H. Peter Anvin, Peter Zijlstra, Nick Piggin,
	Hugh Dickins, Andi Kleen, riel, chris.mason, linux-mm

[-- Attachment #1: free-pages-poison --]
[-- Type: text/plain, Size: 2335 bytes --]

From: Wu Fengguang <fengguang.wu@intel.com>

If memory corruption hits the free buddy pages, we can safely ignore them.
No one will access them until page allocation time, then prep_new_page()
will automatically check and isolate PG_hwpoison page for us (for 0-order
allocation).

This patch expands prep_new_page() to check every component page in a high
order page allocation, in order to completely stop PG_hwpoison pages from
being recirculated.

Note that the common case -- only allocating a single page, doesn't
do any more work than before. Allocating > order 0 does a bit more work,
but that's relatively uncommon.

This simple implementation may drop some innocent neighbor pages, hopefully
it is not a big problem because the event should be rare enough.

This patch adds some runtime costs to high order page users.

[AK: Improved description]

v2: Andi Kleen:
Port to -mm code
Move check into separate function.
Don't dump stack in bad_pages for hwpoisoned pages.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>

---
 mm/page_alloc.c |   20 +++++++++++++++++++-
 1 file changed, 19 insertions(+), 1 deletion(-)

--- sound-2.6.orig/mm/page_alloc.c
+++ sound-2.6/mm/page_alloc.c
@@ -233,6 +233,12 @@ static void bad_page(struct page *page)
 	static unsigned long nr_shown;
 	static unsigned long nr_unshown;
 
+	/* Don't complain about poisoned pages */
+	if (PageHWPoison(page)) {
+		__ClearPageBuddy(page);
+		return;
+	}
+
 	/*
 	 * Allow a burst of 60 reports, then keep quiet for that minute;
 	 * or allow a steady drip of one report per second.
@@ -646,7 +652,7 @@ static inline void expand(struct zone *z
 /*
  * This page is about to be returned from the page allocator
  */
-static int prep_new_page(struct page *page, int order, gfp_t gfp_flags)
+static inline int check_new_page(struct page *page)
 {
 	if (unlikely(page_mapcount(page) |
 		(page->mapping != NULL)  |
@@ -655,6 +661,18 @@ static int prep_new_page(struct page *pa
 		bad_page(page);
 		return 1;
 	}
+	return 0;
+}
+
+static int prep_new_page(struct page *page, int order, gfp_t gfp_flags)
+{
+	int i;
+
+	for (i = 0; i < (1 << order); i++) {
+		struct page *p = page + i;
+		if (unlikely(check_new_page(p)))
+			return 1;
+	}
 
 	set_page_private(page, 0);
 	set_page_refcounted(page);

-- 


^ permalink raw reply	[flat|nested] 79+ messages in thread

* [PATCH 11/22] HWPOISON: Refactor truncate to allow direct truncating of page v3
  2009-06-15  2:45 [PATCH 00/22] HWPOISON: Intro (v5) Wu Fengguang
                   ` (9 preceding siblings ...)
  2009-06-15  2:45 ` [PATCH 10/22] HWPOISON: check and isolate corrupted free pages v2 Wu Fengguang
@ 2009-06-15  2:45 ` Wu Fengguang
  2009-06-15  2:45 ` [PATCH 12/22] HWPOISON: The high level memory error handler in the VM v7 Wu Fengguang
                   ` (11 subsequent siblings)
  22 siblings, 0 replies; 79+ messages in thread
From: Wu Fengguang @ 2009-06-15  2:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, Nick Piggin, Andi Kleen, Ingo Molnar, Mel Gorman, Wu,
	Fengguang, Thomas Gleixner, H. Peter Anvin, Peter Zijlstra,
	Hugh Dickins, Andi Kleen, riel, chris.mason, linux-mm

[-- Attachment #1: truncate-refactor --]
[-- Type: text/plain, Size: 2825 bytes --]

From: Nick Piggin <npiggin@suse.de>

Extract out truncate_inode_page() out of the truncate path so that
it can be used by memory-failure.c

[AK: description, headers, fix typos]
v2: Some white space changes from Fengguang Wu
v3: add comments

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 include/linux/mm.h |    2 ++
 mm/truncate.c      |   34 ++++++++++++++++++++++------------
 2 files changed, 24 insertions(+), 12 deletions(-)

--- sound-2.6.orig/mm/truncate.c
+++ sound-2.6/mm/truncate.c
@@ -135,6 +135,26 @@ invalidate_complete_page(struct address_
 	return ret;
 }
 
+/*
+ * Remove one page from its pagecache mapping. The page must be locked.
+ * This does not truncate the file on disk, it performs the pagecache
+ * side of the truncate operation. Dirty data will be discarded, and
+ * concurrent page references are ignored.
+ *
+ * Generic mm/fs code cannot call this on filesystem metadata mappings
+ * because those can assume that a page reference is enough to pin the
+ * page to its mapping.
+ */
+void truncate_inode_page(struct address_space *mapping, struct page *page)
+{
+	if (page_mapped(page)) {
+		unmap_mapping_range(mapping,
+				   (loff_t)page->index << PAGE_CACHE_SHIFT,
+				   PAGE_CACHE_SIZE, 0);
+	}
+	truncate_complete_page(mapping, page);
+}
+
 /**
  * truncate_inode_pages - truncate range of pages specified by start & end byte offsets
  * @mapping: mapping to truncate
@@ -196,12 +216,7 @@ void truncate_inode_pages_range(struct a
 				unlock_page(page);
 				continue;
 			}
-			if (page_mapped(page)) {
-				unmap_mapping_range(mapping,
-				  (loff_t)page_index<<PAGE_CACHE_SHIFT,
-				  PAGE_CACHE_SIZE, 0);
-			}
-			truncate_complete_page(mapping, page);
+			truncate_inode_page(mapping, page);
 			unlock_page(page);
 		}
 		pagevec_release(&pvec);
@@ -238,15 +253,10 @@ void truncate_inode_pages_range(struct a
 				break;
 			lock_page(page);
 			wait_on_page_writeback(page);
-			if (page_mapped(page)) {
-				unmap_mapping_range(mapping,
-				  (loff_t)page->index<<PAGE_CACHE_SHIFT,
-				  PAGE_CACHE_SIZE, 0);
-			}
+			truncate_inode_page(mapping, page);
 			if (page->index > next)
 				next = page->index;
 			next++;
-			truncate_complete_page(mapping, page);
 			unlock_page(page);
 		}
 		pagevec_release(&pvec);
--- sound-2.6.orig/include/linux/mm.h
+++ sound-2.6/include/linux/mm.h
@@ -814,6 +814,8 @@ static inline void unmap_shared_mapping_
 extern int vmtruncate(struct inode * inode, loff_t offset);
 extern int vmtruncate_range(struct inode * inode, loff_t offset, loff_t end);
 
+void truncate_inode_page(struct address_space *mapping, struct page *page);
+
 #ifdef CONFIG_MMU
 extern int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 			unsigned long address, int write_access);

-- 


^ permalink raw reply	[flat|nested] 79+ messages in thread

* [PATCH 12/22] HWPOISON: The high level memory error handler in the VM v7
  2009-06-15  2:45 [PATCH 00/22] HWPOISON: Intro (v5) Wu Fengguang
                   ` (10 preceding siblings ...)
  2009-06-15  2:45 ` [PATCH 11/22] HWPOISON: Refactor truncate to allow direct truncating of page v3 Wu Fengguang
@ 2009-06-15  2:45 ` Wu Fengguang
  2009-06-15  2:45 ` [PATCH 13/22] HWPOISON: Add madvise() based injector for hardware poisoned pages v3 Wu Fengguang
                   ` (10 subsequent siblings)
  22 siblings, 0 replies; 79+ messages in thread
From: Wu Fengguang @ 2009-06-15  2:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, hugh.dickins, npiggin, chris.mason, Rik van Riel,
	Wu Fengguang, Andi Kleen, Ingo Molnar, Mel Gorman,
	Thomas Gleixner, H. Peter Anvin, Peter Zijlstra, Andi Kleen,
	linux-mm

[-- Attachment #1: high-level --]
[-- Type: text/plain, Size: 34017 bytes --]

From: Andi Kleen <ak@linux.intel.com>

Add the high level memory handler that poisons pages
that got corrupted by hardware (typically by a two bit flip in a DIMM
or a cache) on the Linux level. The goal is to prevent everyone
from accessing these pages in the future.

This done at the VM level by marking a page hwpoisoned
and doing the appropriate action based on the type of page
it is.

The code that does this is portable and lives in mm/memory-failure.c

To quote the overview comment:

 * High level machine check handler. Handles pages reported by the
 * hardware as being corrupted usually due to a 2bit ECC memory or cache
 * failure.
 *
 * This focuses on pages detected as corrupted in the background.
 * When the current CPU tries to consume corruption the currently
 * running process can just be killed directly instead. This implies
 * that if the error cannot be handled for some reason it's safe to
 * just ignore it because no corruption has been consumed yet. Instead
 * when that happens another machine check will happen.
 *
 * Handles page cache pages in various states. The tricky part
 * here is that we can access any page asynchronous to other VM
 * users, because memory failures could happen anytime and anywhere,
 * possibly violating some of their assumptions. This is why this code
 * has to be extremely careful. Generally it tries to use normal locking
 * rules, as in get the standard locks, even if that means the
 * error handling takes potentially a long time.
 *
 * Some of the operations here are somewhat inefficient and have non
 * linear algorithmic complexity, because the data structures have not
 * been optimized for this case. This is in particular the case
 * for the mapping from a vma to a process. Since this case is expected
 * to be rare we hope we can get away with this.

There are in principle two strategies to kill processes on poison:
- just unmap the data and wait for an actual reference before
killing
- kill as soon as corruption is detected.
Both have advantages and disadvantages and should be used
in different situations. Right now both are implemented and can
be switched with a new sysctl vm.memory_failure_early_kill
The default is early kill.

The patch does some rmap data structure walking on its own to collect
processes to kill. This is unusual because normally all rmap data structure
knowledge is in rmap.c only. I put it here for now to keep
everything together and rmap knowledge has been seeping out anyways

v2: Fix anon vma unlock crash (noticed by Johannes Weiner <hannes@cmpxchg.org>)
Handle pages on free list correctly (also noticed by Johannes)
Fix inverted try_to_release_page check (found by Chris Mason)
Add documentation for the new sysctl.
Various other cleanups/comment fixes.
v3: Use blockable signal for AO SIGBUS for better qemu handling.
Numerous fixes from Fengguang Wu:
New code layout for the table (redone by AK)
Move the hwpoison bit setting before the lock (Fengguang Wu)
Some code cleanups (Fengguang Wu, AK)
Add missing lru_drain (Fengguang Wu)
Do more checks for valid mappings (inspired by patch from Fengguang)
Handle free pages and fixes for clean pages (Fengguang)
Removed swap cache handling for now, needs more work
Better mapping checks to avoid races (Fengguang)
Fix swapcache (Fengguang)
Handle private2 pages too (Fengguang)
v4: Various fixes based on review comments from Nick Piggin
Document locking order.
Improved comments.
Slightly improved description
Remove bogus hunk.
Wait properly for writeback pages (Nick Piggin)
v5: Improve various comments
Handle page_address_in_vma() failure better by SIGKILL and also
	make message debugging only
Clean up printks
Remove redundant PageWriteback check (Nick Piggin)
Add missing clear_page_mlock
Reformat state table to be <80 columns again
Use truncate helper instead of manual truncate in me_pagecache_*
Check for metadata buffer pages and reject them.
A few cleanups.
v6:
Fix a printk broken in the last round of cleanups.
More minor cleanups and fixes based on comments from FengguangWu.
Rename /proc/meminfo Header to "HardwareCorrupted"
Add a printk for the failed mapping case (Fengguang Wu)
Better clean page check (Fengguang Wu)
v7:
Fix kernel oops by action_result(invalid pfn) (Fengguang Wu)
make tasklist_lock/anon_vma locking order straight (Nick Piggin)
use the safer invalidate page for possible metadata pages (Nick Piggin)

Cc: hugh.dickins@tiscali.co.uk
Cc: npiggin@suse.de
Cc: chris.mason@oracle.com
Acked-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>

---
 Documentation/sysctl/vm.txt |   28 +
 fs/proc/meminfo.c           |    9 
 include/linux/mm.h          |    6 
 kernel/sysctl.c             |   14 
 mm/Kconfig                  |    3 
 mm/Makefile                 |    1 
 mm/filemap.c                |    4 
 mm/memory-failure.c         |  771 ++++++++++++++++++++++++++++++++++
 mm/rmap.c                   |    4 
 mm/truncate.c               |    3 
 10 files changed, 839 insertions(+), 4 deletions(-)

--- sound-2.6.orig/mm/Makefile
+++ sound-2.6/mm/Makefile
@@ -42,5 +42,6 @@ obj-$(CONFIG_SMP) += allocpercpu.o
 endif
 obj-$(CONFIG_QUICKLIST) += quicklist.o
 obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o page_cgroup.o
+obj-$(CONFIG_MEMORY_FAILURE) += memory-failure.o
 obj-$(CONFIG_DEBUG_KMEMLEAK) += kmemleak.o
 obj-$(CONFIG_DEBUG_KMEMLEAK_TEST) += kmemleak-test.o
--- /dev/null
+++ sound-2.6/mm/memory-failure.c
@@ -0,0 +1,771 @@
+/*
+ * Copyright (C) 2008, 2009 Intel Corporation
+ * Authors: Andi Kleen, Fengguang Wu
+ *
+ * This software may be redistributed and/or modified under the terms of
+ * the GNU General Public License ("GPL") version 2 only as published by the
+ * Free Software Foundation.
+ *
+ * High level machine check handler. Handles pages reported by the
+ * hardware as being corrupted usually due to a 2bit ECC memory or cache
+ * failure.
+ *
+ * This focuses on pages detected as corrupted in the background.
+ * When the current CPU tries to consume corruption the currently
+ * running process can just be killed directly instead. This implies
+ * that if the error cannot be handled for some reason it's safe to
+ * just ignore it because no corruption has been consumed yet. Instead
+ * when that happens another machine check will happen.
+ *
+ * Handles page cache pages in various states.	The tricky part
+ * here is that we can access any page asynchronous to other VM
+ * users, because memory failures could happen anytime and anywhere,
+ * possibly violating some of their assumptions. This is why this code
+ * has to be extremely careful. Generally it tries to use normal locking
+ * rules, as in get the standard locks, even if that means the
+ * error handling takes potentially a long time.
+ *
+ * The operation to map back from RMAP chains to processes has to walk
+ * the complete process list and has non linear complexity with the number
+ * mappings. In short it can be quite slow. But since memory corruptions
+ * are rare we hope to get away with this.
+ */
+
+/*
+ * Notebook:
+ * - hugetlb needs more code
+ * - kcore/oldmem/vmcore/mem/kmem check for hwpoison pages
+ * - pass bad pages to kdump next kernel
+ */
+#define DEBUG 1
+#include <linux/kernel.h>
+#include <linux/mm.h>
+#include <linux/page-flags.h>
+#include <linux/sched.h>
+#include <linux/rmap.h>
+#include <linux/pagemap.h>
+#include <linux/swap.h>
+#include <linux/backing-dev.h>
+#include "internal.h"
+
+int sysctl_memory_failure_early_kill __read_mostly = 1;
+
+atomic_long_t mce_bad_pages __read_mostly = ATOMIC_LONG_INIT(0);
+
+/*
+ * Send all the processes who have the page mapped an ``action optional''
+ * signal.
+ */
+static int kill_proc_ao(struct task_struct *t, unsigned long addr, int trapno,
+			unsigned long pfn)
+{
+	struct siginfo si;
+	int ret;
+
+	printk(KERN_ERR
+	"MCE %#lx: Killing %s:%d early due to hardware memory corruption\n",
+		pfn, t->comm, t->pid);
+	si.si_signo = SIGBUS;
+	si.si_errno = 0;
+	si.si_code = BUS_MCEERR_AO;
+	si.si_addr = (void *)addr;
+#ifdef __ARCH_SI_TRAPNO
+	si.si_trapno = trapno;
+#endif
+	si.si_addr_lsb = PAGE_SHIFT;
+	/*
+	 * Don't use force here, it's convenient if the signal
+	 * can be temporarily blocked.
+	 * This could cause a loop when the user sets SIGBUS
+	 * to SIG_IGN, but hopefully noone will do that?
+	 */
+	ret = send_sig_info(SIGBUS, &si, t);  /* synchronous? */
+	if (ret < 0)
+		printk(KERN_INFO "MCE: Error sending signal to %s:%d: %d\n",
+		       t->comm, t->pid, ret);
+	return ret;
+}
+
+/*
+ * Kill all processes that have a poisoned page mapped and then isolate
+ * the page.
+ *
+ * General strategy:
+ * Find all processes having the page mapped and kill them.
+ * But we keep a page reference around so that the page is not
+ * actually freed yet.
+ * Then stash the page away
+ *
+ * There's no convenient way to get back to mapped processes
+ * from the VMAs. So do a brute-force search over all
+ * running processes.
+ *
+ * Remember that machine checks are not common (or rather
+ * if they are common you have other problems), so this shouldn't
+ * be a performance issue.
+ *
+ * Also there are some races possible while we get from the
+ * error detection to actually handle it.
+ */
+
+struct to_kill {
+	struct list_head nd;
+	struct task_struct *tsk;
+	unsigned long addr;
+	unsigned addr_valid:1;
+};
+
+/*
+ * Failure handling: if we can't find or can't kill a process there's
+ * not much we can do.	We just print a message and ignore otherwise.
+ */
+
+/*
+ * Schedule a process for later kill.
+ * Uses GFP_ATOMIC allocations to avoid potential recursions in the VM.
+ * TBD would GFP_NOIO be enough?
+ */
+static void add_to_kill(struct task_struct *tsk, struct page *p,
+			struct vm_area_struct *vma,
+			struct list_head *to_kill,
+			struct to_kill **tkc)
+{
+	struct to_kill *tk;
+
+	if (*tkc) {
+		tk = *tkc;
+		*tkc = NULL;
+	} else {
+		tk = kmalloc(sizeof(struct to_kill), GFP_ATOMIC);
+		if (!tk) {
+			printk(KERN_ERR
+		"MCE: Out of memory while machine check handling\n");
+			return;
+		}
+	}
+	tk->addr = page_address_in_vma(p, vma);
+	tk->addr_valid = 1;
+
+	/*
+	 * In theory we don't have to kill when the page was
+	 * munmaped. But it could be also a mremap. Since that's
+	 * likely very rare kill anyways just out of paranoia, but use
+	 * a SIGKILL because the error is not contained anymore.
+	 */
+	if (tk->addr == -EFAULT) {
+		pr_debug("MCE: Unable to find user space address %lx in %s\n",
+			page_to_pfn(p), tsk->comm);
+		tk->addr_valid = 0;
+	}
+	get_task_struct(tsk);
+	tk->tsk = tsk;
+	list_add_tail(&tk->nd, to_kill);
+}
+
+/*
+ * Kill the processes that have been collected earlier.
+ *
+ * Only do anything when DOIT is set, otherwise just free the list
+ * (this is used for clean pages which do not need killing)
+ * Also when FAIL is set do a force kill because something went
+ * wrong earlier.
+ */
+static void kill_procs_ao(struct list_head *to_kill, int doit, int trapno,
+			  int fail, unsigned long pfn)
+{
+	struct to_kill *tk, *next;
+
+	list_for_each_entry_safe (tk, next, to_kill, nd) {
+		if (doit) {
+			/*
+			 * In case something went wrong with munmaping
+			 * make sure the process doesn't catch the
+			 * signal and then access the memory. Just kill it.
+			 * the signal handlers
+			 */
+			if (fail || tk->addr_valid == 0) {
+				printk(KERN_ERR
+"MCE %#lx: forcibly killing %s:%d because of failure to unmap corrupted page\n",
+					pfn, tk->tsk->comm, tk->tsk->pid);
+				force_sig(SIGKILL, tk->tsk);
+			}
+
+			/*
+			 * In theory the process could have mapped
+			 * something else on the address in-between. We could
+			 * check for that, but we need to tell the
+			 * process anyways.
+			 */
+			else if (kill_proc_ao(tk->tsk, tk->addr, trapno,
+					      pfn) < 0)
+				printk(KERN_ERR
+	"MCE %#lx: Cannot send advisory machine check signal to %s:%d\n",
+					pfn, tk->tsk->comm, tk->tsk->pid);
+		}
+		put_task_struct(tk->tsk);
+		kfree(tk);
+	}
+}
+
+/*
+ * Collect processes when the error hit an anonymous page.
+ */
+static void collect_procs_anon(struct page *page, struct list_head *to_kill,
+			       struct to_kill **tkc)
+{
+	struct vm_area_struct *vma;
+	struct task_struct *tsk;
+	struct anon_vma *av;
+
+	read_lock(&tasklist_lock);
+
+	av = page_lock_anon_vma(page);
+	if (av == NULL)	/* Not actually mapped anymore */
+		goto out;
+
+	for_each_process (tsk) {
+		if (!tsk->mm)
+			continue;
+		list_for_each_entry (vma, &av->head, anon_vma_node) {
+			if (vma->vm_mm == tsk->mm)
+				add_to_kill(tsk, page, vma, to_kill, tkc);
+		}
+	}
+	page_unlock_anon_vma(av);
+out:
+	read_unlock(&tasklist_lock);
+}
+
+/*
+ * Collect processes when the error hit a file mapped page.
+ */
+static void collect_procs_file(struct page *page, struct list_head *to_kill,
+			       struct to_kill **tkc)
+{
+	struct vm_area_struct *vma;
+	struct task_struct *tsk;
+	struct prio_tree_iter iter;
+	struct address_space *mapping = page_mapping(page);
+
+	/*
+	 * A note on the locking order between the two locks.
+	 * We don't rely on this particular order.
+	 * If you have some other code that needs a different order
+	 * feel free to switch them around. Or add a reverse link
+	 * from mm_struct to task_struct, then this could be all
+	 * done without taking tasklist_lock and looping over all tasks.
+	 */
+
+	read_lock(&tasklist_lock);
+	spin_lock(&mapping->i_mmap_lock);
+	for_each_process(tsk) {
+		pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
+
+		if (!tsk->mm)
+			continue;
+
+		vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff,
+				      pgoff)
+			if (vma->vm_mm == tsk->mm)
+				add_to_kill(tsk, page, vma, to_kill, tkc);
+	}
+	spin_unlock(&mapping->i_mmap_lock);
+	read_unlock(&tasklist_lock);
+}
+
+/*
+ * Collect the processes who have the corrupted page mapped to kill.
+ * This is done in two steps for locking reasons.
+ * First preallocate one tokill structure outside the spin locks,
+ * so that we can kill at least one process reasonably reliable.
+ */
+static void collect_procs(struct page *page, struct list_head *tokill)
+{
+	struct to_kill *tk;
+
+	tk = kmalloc(sizeof(struct to_kill), GFP_KERNEL);
+	/* memory allocation failure is implicitly handled */
+	if (PageAnon(page))
+		collect_procs_anon(page, tokill, &tk);
+	else
+		collect_procs_file(page, tokill, &tk);
+	kfree(tk);
+}
+
+/*
+ * Error handlers for various types of pages.
+ */
+
+enum outcome {
+	FAILED,		/* Error handling failed */
+	DELAYED,	/* Will be handled later */
+	IGNORED,	/* Error safely ignored */
+	RECOVERED,	/* Successfully recovered */
+};
+
+static const char *action_name[] = {
+	[FAILED] = "Failed",
+	[DELAYED] = "Delayed",
+	[IGNORED] = "Ignored",
+	[RECOVERED] = "Recovered",
+};
+
+/*
+ * Error hit kernel page.
+ * Do nothing, try to be lucky and not touch this instead. For a few cases we
+ * could be more sophisticated.
+ */
+static int me_kernel(struct page *p, unsigned long pfn)
+{
+	return DELAYED;
+}
+
+/*
+ * Already poisoned page.
+ */
+static int me_ignore(struct page *p, unsigned long pfn)
+{
+	return IGNORED;
+}
+
+/*
+ * Page in unknown state. Do nothing.
+ */
+static int me_unknown(struct page *p, unsigned long pfn)
+{
+	printk(KERN_ERR "MCE %#lx: Unknown page state\n", pfn);
+	return FAILED;
+}
+
+/*
+ * Free memory
+ */
+static int me_free(struct page *p, unsigned long pfn)
+{
+	return DELAYED;
+}
+
+/*
+ * Clean (or cleaned) page cache page.
+ */
+static int me_pagecache_clean(struct page *p, unsigned long pfn)
+{
+	struct address_space *mapping;
+
+	if (!isolate_lru_page(p))
+		page_cache_release(p);
+
+	mapping = page_mapping(p);
+	if (mapping == NULL)
+		return RECOVERED;
+
+	/*
+	 * Now truncate the page in the page cache. This is really
+	 * more like a "temporary hole punch"
+	 * Don't do this for block devices when someone else
+	 * has a reference, because it could be file system metadata
+	 * and that's not safe to truncate.
+	 */
+	if (!S_ISREG(mapping->host->i_mode) &&
+	    !invalidate_complete_page(mapping, p)) {
+		printk(KERN_ERR
+		       "MCE %#lx: failed to invalidate metadata page\n",
+			pfn);
+		return FAILED;
+	}
+
+	truncate_inode_page(mapping, p);
+	if (page_has_private(p) && !try_to_release_page(p, GFP_NOIO)) {
+		pr_debug(KERN_ERR "MCE %#lx: failed to release buffers\n",
+			 pfn);
+		return FAILED;
+	}
+	return RECOVERED;
+}
+
+/*
+ * Dirty cache page page
+ * Issues: when the error hit a hole page the error is not properly
+ * propagated.
+ */
+static int me_pagecache_dirty(struct page *p, unsigned long pfn)
+{
+	struct address_space *mapping = page_mapping(p);
+
+	SetPageError(p);
+	/* TBD: print more information about the file. */
+	if (mapping) {
+		/*
+		 * IO error will be reported by write(), fsync(), etc.
+		 * who check the mapping.
+		 * This way the application knows that something went
+		 * wrong with its dirty file data.
+		 *
+		 * There's one open issue:
+		 *
+		 * The EIO will be only reported on the next IO
+		 * operation and then cleared through the IO map.
+		 * Normally Linux has two mechanisms to pass IO error
+		 * first through the AS_EIO flag in the address space
+		 * and then through the PageError flag in the page.
+		 * Since we drop pages on memory failure handling the
+		 * only mechanism open to use is through AS_AIO.
+		 *
+		 * This has the disadvantage that it gets cleared on
+		 * the first operation that returns an error, while
+		 * the PageError bit is more sticky and only cleared
+		 * when the page is reread or dropped.  If an
+		 * application assumes it will always get error on
+		 * fsync, but does other operations on the fd before
+		 * and the page is dropped inbetween then the error
+		 * will not be properly reported.
+		 *
+		 * This can already happen even without hwpoisoned
+		 * pages: first on metadata IO errors (which only
+		 * report through AS_EIO) or when the page is dropped
+		 * at the wrong time.
+		 *
+		 * So right now we assume that the application DTRT on
+		 * the first EIO, but we're not worse than other parts
+		 * of the kernel.
+		 */
+		mapping_set_error(mapping, EIO);
+	}
+
+	return me_pagecache_clean(p, pfn);
+}
+
+/*
+ * Clean and dirty swap cache.
+ *
+ * Dirty swap cache page is tricky to handle. The page could live both in page
+ * cache and swap cache(ie. page is freshly swapped in). So it could be
+ * referenced concurrently by 2 types of PTEs:
+ * normal PTEs and swap PTEs. We try to handle them consistently by calling u
+ * try_to_unmap(TTU_IGNORE_HWPOISON) to convert the normal PTEs to swap PTEs,
+ * and then
+ *      - clear dirty bit to prevent IO
+ *      - remove from LRU
+ *      - but keep in the swap cache, so that when we return to it on
+ *        a later page fault, we know the application is accessing
+ *        corrupted data and shall be killed (we installed simple
+ *        interception code in do_swap_page to catch it).
+ *
+ * Clean swap cache pages can be directly isolated. A later page fault will
+ * bring in the known good data from disk.
+ */
+static int me_swapcache_dirty(struct page *p, unsigned long pfn)
+{
+	ClearPageDirty(p);
+	/* Trigger EIO in shmem: */
+	ClearPageUptodate(p);
+
+	if (!isolate_lru_page(p))
+		page_cache_release(p);
+
+	return DELAYED;
+}
+
+static int me_swapcache_clean(struct page *p, unsigned long pfn)
+{
+	if (!isolate_lru_page(p))
+		page_cache_release(p);
+
+	delete_from_swap_cache(p);
+
+	return RECOVERED;
+}
+
+/*
+ * Huge pages. Needs work.
+ * Issues:
+ * No rmap support so we cannot find the original mapper. In theory could walk
+ * all MMs and look for the mappings, but that would be non atomic and racy.
+ * Need rmap for hugepages for this. Alternatively we could employ a heuristic,
+ * like just walking the current process and hoping it has it mapped (that
+ * should be usually true for the common "shared database cache" case)
+ * Should handle free huge pages and dequeue them too, but this needs to
+ * handle huge page accounting correctly.
+ */
+static int me_huge_page(struct page *p, unsigned long pfn)
+{
+	return FAILED;
+}
+
+/*
+ * Various page states we can handle.
+ *
+ * A page state is defined by its current page->flags bits.
+ * The table matches them in order and calls the right handler.
+ *
+ * This is quite tricky because we can access page at any time
+ * in its live cycle, so all accesses have to be extremly careful.
+ *
+ * This is not complete. More states could be added.
+ * For any missing state don't attempt recovery.
+ */
+
+#define dirty		(1UL << PG_dirty)
+#define sc		(1UL << PG_swapcache)
+#define unevict		(1UL << PG_unevictable)
+#define mlock		(1UL << PG_mlocked)
+#define writeback	(1UL << PG_writeback)
+#define lru		(1UL << PG_lru)
+#define swapbacked	(1UL << PG_swapbacked)
+#define head		(1UL << PG_head)
+#define tail		(1UL << PG_tail)
+#define compound	(1UL << PG_compound)
+#define slab		(1UL << PG_slab)
+#define buddy		(1UL << PG_buddy)
+#define reserved	(1UL << PG_reserved)
+
+static struct page_state {
+	unsigned long mask;
+	unsigned long res;
+	char *msg;
+	int (*action)(struct page *p, unsigned long pfn);
+} error_states[] = {
+	{ reserved,	reserved,	"reserved kernel",	me_ignore },
+	{ buddy,	buddy,		"free kernel",	me_free },
+
+	/*
+	 * Could in theory check if slab page is free or if we can drop
+	 * currently unused objects without touching them. But just
+	 * treat it as standard kernel for now.
+	 */
+	{ slab,		slab,		"kernel slab",	me_kernel },
+
+#ifdef CONFIG_PAGEFLAGS_EXTENDED
+	{ head,		head,		"huge",		me_huge_page },
+	{ tail,		tail,		"huge",		me_huge_page },
+#else
+	{ compound,	compound,	"huge",		me_huge_page },
+#endif
+
+	{ sc|dirty,	sc|dirty,	"swapcache",	me_swapcache_dirty },
+	{ sc|dirty,	sc,		"swapcache",	me_swapcache_clean },
+
+#ifdef CONFIG_UNEVICTABLE_LRU
+	{ unevict|dirty, unevict|dirty,	"unevictable LRU", me_pagecache_dirty},
+	{ unevict,	unevict,	"unevictable LRU", me_pagecache_clean},
+#endif
+
+#ifdef CONFIG_HAVE_MLOCKED_PAGE_BIT
+	{ mlock|dirty,	mlock|dirty,	"mlocked LRU",	me_pagecache_dirty },
+	{ mlock,	mlock,		"mlocked LRU",	me_pagecache_clean },
+#endif
+
+	{ lru|dirty,	lru|dirty,	"LRU",		me_pagecache_dirty },
+	{ lru|dirty,	lru,		"clean LRU",	me_pagecache_clean },
+	{ swapbacked,	swapbacked,	"anonymous",	me_pagecache_clean },
+
+	/*
+	 * Catchall entry: must be at end.
+	 */
+	{ 0,		0,		"unknown page state",	me_unknown },
+};
+
+static void action_result(unsigned long pfn, char *msg, int result)
+{
+	printk(KERN_ERR "MCE %#lx: %s%s page recovery: %s\n",
+		pfn, PageDirty(pfn_to_page(pfn)) ? "dirty " : "",
+		msg, action_name[result]);
+}
+
+static void page_action(struct page_state *ps, struct page *p,
+			unsigned long pfn)
+{
+	int result;
+
+	result = ps->action(p, pfn);
+	action_result(pfn, ps->msg, result);
+	if (page_count(p) != 1)
+		printk(KERN_ERR
+		       "MCE %#lx: %s page still referenced by %d users\n",
+		       pfn, ps->msg, page_count(p) - 1);
+
+	/* Could do more checks here if page looks ok */
+	atomic_long_add(1, &mce_bad_pages);
+
+	/*
+	 * Could adjust zone counters here to correct for the missing page.
+	 */
+}
+
+#define N_UNMAP_TRIES 5
+
+/*
+ * Do all that is necessary to remove user space mappings. Unmap
+ * the pages and send SIGBUS to the processes if the data was dirty.
+ */
+static void hwpoison_user_mappings(struct page *p, unsigned long pfn,
+				  int trapno)
+{
+	enum ttu_flags ttu = TTU_UNMAP | TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS;
+	int kill = sysctl_memory_failure_early_kill;
+	struct address_space *mapping;
+	LIST_HEAD(tokill);
+	int ret;
+	int i;
+
+	if (PageReserved(p) || PageCompound(p) || PageSlab(p))
+		return;
+
+	if (!PageLRU(p))
+		lru_add_drain();
+
+	/*
+	 * This check implies we don't kill processes if their pages
+	 * are in the swap cache early. Those are always late kills.
+	 */
+	if (!page_mapped(p))
+		return;
+
+	if (PageSwapCache(p)) {
+		printk(KERN_ERR
+		       "MCE %#lx: keeping poisoned page in swap cache\n", pfn);
+		ttu |= TTU_IGNORE_HWPOISON;
+	}
+
+	/*
+	 * Propagate the dirty bit from PTEs to struct page first, because we
+	 * need this to decide if we should kill or just drop the page.
+	 */
+	mapping = page_mapping(p);
+	if (!PageDirty(p) && mapping && mapping_cap_writeback_dirty(mapping)) {
+		if (page_mkclean(p))
+			SetPageDirty(p);
+		else {
+			kill = 0;
+			ttu |= TTU_IGNORE_HWPOISON;
+			printk(KERN_INFO
+	"MCE %#lx: corrupted page was clean: dropped without side effects\n",
+				pfn);
+		}
+	}
+
+	/*
+	 * First collect all the processes that have the page
+	 * mapped.  This has to be done before try_to_unmap,
+	 * because ttu takes the rmap data structures down.
+	 *
+	 * This also has the side effect to propagate the dirty
+	 * bit from PTEs into the struct page. This is needed
+	 * to actually decide if something needs to be killed
+	 * or errored, or if it's ok to just drop the page.
+	 *
+	 * Error handling: We ignore errors here because
+	 * there's nothing that can be done.
+	 */
+	if (kill)
+		collect_procs(p, &tokill);
+
+	/*
+	 * try_to_unmap can fail temporarily due to races.
+	 * Try a few times (RED-PEN better strategy?)
+	 */
+	for (i = 0; i < N_UNMAP_TRIES; i++) {
+		ret = try_to_unmap(p, ttu);
+		if (ret == SWAP_SUCCESS)
+			break;
+		pr_debug("MCE %#lx: try_to_unmap retry needed %d\n", pfn,  ret);
+	}
+
+	if (ret != SWAP_SUCCESS)
+		printk(KERN_ERR "MCE %#lx: failed to unmap page (mapcount=%d)\n",
+				pfn, page_mapcount(p));
+
+	/*
+	 * Now that the dirty bit has been propagated to the
+	 * struct page and all unmaps done we can decide if
+	 * killing is needed or not.  Only kill when the page
+	 * was dirty, otherwise the tokill list is merely
+	 * freed.  When there was a problem unmapping earlier
+	 * use a more force-full uncatchable kill to prevent
+	 * any accesses to the poisoned memory.
+	 */
+	kill_procs_ao(&tokill, !!PageDirty(p), trapno,
+		      ret != SWAP_SUCCESS, pfn);
+}
+
+/**
+ * memory_failure - Handle memory failure of a page.
+ * @pfn: Page Number of the corrupted page
+ * @trapno: Trap number reported in the signal to user space.
+ *
+ * This function is called by the low level machine check code
+ * of an architecture when it detects hardware memory corruption
+ * of a page. It tries its best to recover, which includes
+ * dropping pages, killing processes etc.
+ *
+ * The function is primarily of use for corruptions that
+ * happen outside the current execution context (e.g. when
+ * detected by a background scrubber)
+ *
+ * Must run in process context (e.g. a work queue) with interrupts
+ * enabled and no spinlocks hold.
+ */
+void memory_failure(unsigned long pfn, int trapno)
+{
+	struct page_state *ps;
+	struct page *p;
+
+	if (!pfn_valid(pfn)) {
+		printk(KERN_ERR
+		       "MCE %#lx: memory outside kernel control: Ignored\n",
+		       pfn);
+		return;
+	}
+
+	p = pfn_to_page(pfn);
+	if (TestSetPageHWPoison(p)) {
+		action_result(pfn, "already hardware poisoned", IGNORED);
+		return;
+	}
+
+	/*
+	 * We need/can do nothing about count=0 pages.
+	 * 1) it's a free page, and therefore in safe hand:
+	 *    prep_new_page() will be the gate keeper.
+	 * 2) it's part of a non-compound high order page.
+	 *    Implies some kernel user: cannot stop them from
+	 *    R/W the page; let's pray that the page has been
+	 *    used and will be freed some time later.
+	 * In fact it's dangerous to directly bump up page count from 0,
+	 * that may make page_freeze_refs()/page_unfreeze_refs() mismatch.
+	 */
+	if (!get_page_unless_zero(compound_head(p))) {
+		action_result(pfn, "free or high order kernel", IGNORED);
+		return;
+	}
+
+	/*
+	 * Lock the page and wait for writeback to finish.
+	 * It's very difficult to mess with pages currently under IO
+	 * and in many cases impossible, so we just avoid it here.
+	 */
+	lock_page_nosync(p);
+	wait_on_page_writeback(p);
+
+	/*
+	 * Now take care of user space mappings.
+	 */
+	hwpoison_user_mappings(p, pfn, trapno);
+
+	/*
+	 * Torn down by someone else?
+	 */
+	if (PageLRU(p) && !PageSwapCache(p) && p->mapping == NULL) {
+		action_result(pfn, "already truncated LRU", IGNORED);
+		goto out;
+	}
+
+	for (ps = error_states;; ps++) {
+		if ((p->flags & ps->mask) == ps->res) {
+			page_action(ps, p, pfn);
+			break;
+		}
+	}
+out:
+	unlock_page(p);
+}
--- sound-2.6.orig/include/linux/mm.h
+++ sound-2.6/include/linux/mm.h
@@ -815,6 +815,7 @@ extern int vmtruncate(struct inode * ino
 extern int vmtruncate_range(struct inode * inode, loff_t offset, loff_t end);
 
 void truncate_inode_page(struct address_space *mapping, struct page *page);
+int invalidate_complete_page(struct address_space *mapping, struct page *page);
 
 #ifdef CONFIG_MMU
 extern int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
@@ -1327,5 +1328,10 @@ void vmemmap_populate_print_last(void);
 extern int account_locked_memory(struct mm_struct *mm, struct rlimit *rlim,
 				 size_t size);
 extern void refund_locked_memory(struct mm_struct *mm, size_t size);
+
+extern void memory_failure(unsigned long pfn, int trapno);
+extern int sysctl_memory_failure_early_kill;
+extern atomic_long_t mce_bad_pages;
+
 #endif /* __KERNEL__ */
 #endif /* _LINUX_MM_H */
--- sound-2.6.orig/kernel/sysctl.c
+++ sound-2.6/kernel/sysctl.c
@@ -1344,6 +1344,20 @@ static struct ctl_table vm_table[] = {
 		.mode		= 0644,
 		.proc_handler	= &scan_unevictable_handler,
 	},
+#ifdef CONFIG_MEMORY_FAILURE
+	{
+		.ctl_name	= CTL_UNNUMBERED,
+		.procname	= "memory_failure_early_kill",
+		.data		= &sysctl_memory_failure_early_kill,
+		.maxlen		= sizeof(sysctl_memory_failure_early_kill),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec_minmax,
+		.strategy	= &sysctl_intvec,
+		.extra1		= &zero,
+		.extra2		= &one,
+	},
+#endif
+
 /*
  * NOTE: do not add new entries to this table unless you have read
  * Documentation/sysctl/ctl_unnumbered.txt
--- sound-2.6.orig/fs/proc/meminfo.c
+++ sound-2.6/fs/proc/meminfo.c
@@ -95,7 +95,11 @@ static int meminfo_proc_show(struct seq_
 		"Committed_AS:   %8lu kB\n"
 		"VmallocTotal:   %8lu kB\n"
 		"VmallocUsed:    %8lu kB\n"
-		"VmallocChunk:   %8lu kB\n",
+		"VmallocChunk:   %8lu kB\n"
+#ifdef CONFIG_MEMORY_FAILURE
+		"HardwareCorrupted: %8lu kB\n"
+#endif
+		,
 		K(i.totalram),
 		K(i.freeram),
 		K(i.bufferram),
@@ -140,6 +144,9 @@ static int meminfo_proc_show(struct seq_
 		(unsigned long)VMALLOC_TOTAL >> 10,
 		vmi.used >> 10,
 		vmi.largest_chunk >> 10
+#ifdef CONFIG_MEMORY_FAILURE
+		,atomic_long_read(&mce_bad_pages) << (PAGE_SHIFT - 10)
+#endif
 		);
 
 	hugetlb_report_meminfo(m);
--- sound-2.6.orig/mm/Kconfig
+++ sound-2.6/mm/Kconfig
@@ -239,6 +239,9 @@ config KSM
 	  Enable the KSM kernel module to allow page sharing of equal pages
 	  among different tasks.
 
+config MEMORY_FAILURE
+	bool
+
 config NOMMU_INITIAL_TRIM_EXCESS
 	int "Turn on mmap() excess space trimming before booting"
 	depends on !MMU
--- sound-2.6.orig/Documentation/sysctl/vm.txt
+++ sound-2.6/Documentation/sysctl/vm.txt
@@ -32,6 +32,7 @@ Currently, these files are in /proc/sys/
 - legacy_va_layout
 - lowmem_reserve_ratio
 - max_map_count
+- memory_failure_early_kill
 - min_free_kbytes
 - min_slab_ratio
 - min_unmapped_ratio
@@ -53,7 +54,6 @@ Currently, these files are in /proc/sys/
 - vfs_cache_pressure
 - zone_reclaim_mode
 
-
 ==============================================================
 
 block_dump
@@ -275,6 +275,32 @@ e.g., up to one or two maps per allocati
 
 The default value is 65536.
 
+=============================================================
+
+memory_failure_early_kill:
+
+Control how to kill processes when uncorrected memory error (typically
+a 2bit error in a memory module) is detected in the background by hardware
+that cannot be handled by the kernel. In some cases (like the page
+still having a valid copy on disk) the kernel will handle the failure
+transparently without affecting any applications. But if there is
+no other uptodate copy of the data it will kill to prevent any data
+corruptions from propagating.
+
+1: Kill all processes that have the corrupted and not reloadable page mapped
+as soon as the corruption is detected.  Note this is not supported
+for a few types of pages, like kernel internally allocated data or
+the swap cache, but works for the majority of user pages.
+
+0: Only unmap the corrupted page from all processes and only kill a process
+who tries to access it.
+
+The kill is done using a catchable SIGBUS with BUS_MCEERR_AO, so processes can
+handle this if they want to.
+
+This is only active on architectures/platforms with advanced machine
+check handling and depends on the hardware capabilities.
+
 ==============================================================
 
 min_free_kbytes:
--- sound-2.6.orig/mm/filemap.c
+++ sound-2.6/mm/filemap.c
@@ -105,6 +105,10 @@
  *
  *  ->task->proc_lock
  *    ->dcache_lock		(proc_pid_lookup)
+ *
+ *  (code doesn't rely on that order, so you could switch it around)
+ *  ->tasklist_lock             (memory_failure, collect_procs_ao)
+ *    ->i_mmap_lock
  */
 
 /*
--- sound-2.6.orig/mm/rmap.c
+++ sound-2.6/mm/rmap.c
@@ -36,6 +36,10 @@
  *                 mapping->tree_lock (widely used, in set_page_dirty,
  *                           in arch-dependent flush_dcache_mmap_lock,
  *                           within inode_lock in __sync_single_inode)
+ *
+ * (code doesn't rely on that order so it could be switched around)
+ * ->tasklist_lock
+ *   anon_vma->lock      (memory_failure, collect_procs_anon)
  */
 
 #include <linux/mm.h>
--- sound-2.6.orig/mm/truncate.c
+++ sound-2.6/mm/truncate.c
@@ -118,8 +118,7 @@ truncate_complete_page(struct address_sp
  *
  * Returns non-zero if the page was successfully invalidated.
  */
-static int
-invalidate_complete_page(struct address_space *mapping, struct page *page)
+int invalidate_complete_page(struct address_space *mapping, struct page *page)
 {
 	int ret;
 

-- 


^ permalink raw reply	[flat|nested] 79+ messages in thread

* [PATCH 13/22] HWPOISON: Add madvise() based injector for hardware poisoned pages v3
  2009-06-15  2:45 [PATCH 00/22] HWPOISON: Intro (v5) Wu Fengguang
                   ` (11 preceding siblings ...)
  2009-06-15  2:45 ` [PATCH 12/22] HWPOISON: The high level memory error handler in the VM v7 Wu Fengguang
@ 2009-06-15  2:45 ` Wu Fengguang
  2009-06-15  2:45 ` [PATCH 14/22] HWPOISON: Add simple debugfs interface to inject hwpoison on arbitary PFNs Wu Fengguang
                   ` (9 subsequent siblings)
  22 siblings, 0 replies; 79+ messages in thread
From: Wu Fengguang @ 2009-06-15  2:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, Andi Kleen, Ingo Molnar, Mel Gorman, Wu, Fengguang,
	Thomas Gleixner, H. Peter Anvin, Peter Zijlstra, Nick Piggin,
	Hugh Dickins, Andi Kleen, riel, chris.mason, linux-mm

[-- Attachment #1: mf-inject-madvise --]
[-- Type: text/plain, Size: 2854 bytes --]

From: Andi Kleen <ak@linux.intel.com>

Impact: optional, useful for debugging

Add a new madvice sub command to inject poison for some
pages in a process' address space.  This is useful for
testing the poison page handling.

Open issues:

- This patch allows root to tie up arbitary amounts of memory.
Should this be disabled inside containers?
- There's a small race window between getting the page and injecting.
The patch drops the ref count because otherwise memory_failure
complains about dangling references. In theory with a multi threaded
injector one could inject poison for a process foreign page this way.
Not a serious issue right now.

v2: Use write flag for get_user_pages to make sure to always get
a fresh page
v3: Don't request write mapping (Fengguang Wu)

Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>

---
 include/asm-generic/mman-common.h |    1 
 mm/madvise.c                      |   36 ++++++++++++++++++++++++++++
 2 files changed, 37 insertions(+)

--- sound-2.6.orig/mm/madvise.c
+++ sound-2.6/mm/madvise.c
@@ -207,6 +207,38 @@ static long madvise_remove(struct vm_are
 	return error;
 }
 
+#ifdef CONFIG_MEMORY_FAILURE
+/*
+ * Error injection support for memory error handling.
+ */
+static int madvise_hwpoison(unsigned long start, unsigned long end)
+{
+	/*
+	 * RED-PEN
+	 * This allows to tie up arbitary amounts of memory.
+	 * Might be a good idea to disable it inside containers even for root.
+	 */
+	if (!capable(CAP_SYS_ADMIN))
+		return -EPERM;
+	for (; start < end; start += PAGE_SIZE) {
+		struct page *p;
+		int ret = get_user_pages(current, current->mm, start, 1,
+						0, 0, &p, NULL);
+		if (ret != 1)
+			return ret;
+		put_page(p);
+		/*
+		 * RED-PEN page can be reused in a short window, but otherwise
+		 * we'll have to fight with the reference count.
+		 */
+		printk(KERN_INFO "Injecting memory failure for page %lx at %lx\n",
+		       page_to_pfn(p), start);
+		memory_failure(page_to_pfn(p), 0);
+	}
+	return 0;
+}
+#endif
+
 static long
 madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev,
 		unsigned long start, unsigned long end, int behavior)
@@ -307,6 +339,10 @@ SYSCALL_DEFINE3(madvise, unsigned long, 
 	int write;
 	size_t len;
 
+#ifdef CONFIG_MEMORY_FAILURE
+	if (behavior == MADV_HWPOISON)
+		return madvise_hwpoison(start, start+len_in);
+#endif
 	if (!madvise_behavior_valid(behavior))
 		return error;
 
--- sound-2.6.orig/include/asm-generic/mman-common.h
+++ sound-2.6/include/asm-generic/mman-common.h
@@ -34,6 +34,7 @@
 #define MADV_REMOVE	9		/* remove these pages & resources */
 #define MADV_DONTFORK	10		/* don't inherit across fork */
 #define MADV_DOFORK	11		/* do inherit across fork */
+#define MADV_HWPOISON	12		/* poison a page for testing */
 
 /* compatibility flags */
 #define MAP_FILE	0

-- 


^ permalink raw reply	[flat|nested] 79+ messages in thread

* [PATCH 14/22] HWPOISON: Add simple debugfs interface to inject hwpoison on arbitary PFNs
  2009-06-15  2:45 [PATCH 00/22] HWPOISON: Intro (v5) Wu Fengguang
                   ` (12 preceding siblings ...)
  2009-06-15  2:45 ` [PATCH 13/22] HWPOISON: Add madvise() based injector for hardware poisoned pages v3 Wu Fengguang
@ 2009-06-15  2:45 ` Wu Fengguang
  2009-06-15  2:45 ` [PATCH 15/22] HWPOISON: early kill cleanups and fixes Wu Fengguang
                   ` (8 subsequent siblings)
  22 siblings, 0 replies; 79+ messages in thread
From: Wu Fengguang @ 2009-06-15  2:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, Andi Kleen, Ingo Molnar, Mel Gorman, Wu, Fengguang,
	Thomas Gleixner, H. Peter Anvin, Peter Zijlstra, Nick Piggin,
	Hugh Dickins, Andi Kleen, riel, chris.mason, linux-mm

[-- Attachment #1: pfn-inject --]
[-- Type: text/plain, Size: 2344 bytes --]

From: Andi Kleen <ak@linux.intel.com>

Useful for some testing scenarios, although specific testing is often
done better through MADV_POISON

This can be done with the x86 level MCE injector too, but this interface
allows it to do independently from low level x86 changes.

Open issues: 
Should be disabled for cgroups.

Signed-off-by: Andi Kleen <ak@linux.intel.com>

---
 mm/Kconfig           |    4 ++++
 mm/Makefile          |    1 +
 mm/hwpoison-inject.c |   41 +++++++++++++++++++++++++++++++++++++++++
 3 files changed, 46 insertions(+)

--- /dev/null
+++ sound-2.6/mm/hwpoison-inject.c
@@ -0,0 +1,41 @@
+/* Inject a hwpoison memory failure on a arbitary pfn */
+#include <linux/module.h>
+#include <linux/debugfs.h>
+#include <linux/kernel.h>
+#include <linux/mm.h>
+
+static struct dentry *hwpoison_dir, *corrupt_pfn;
+
+static int hwpoison_inject(void *data, u64 val)
+{
+	if (!capable(CAP_SYS_ADMIN))
+		return -EPERM;
+	printk(KERN_INFO "Injecting memory failure at pfn %Lx\n", val);
+	memory_failure(val, 18);
+	return 0;
+}
+
+DEFINE_SIMPLE_ATTRIBUTE(hwpoison_fops, NULL, hwpoison_inject, "%lli\n");
+
+static void pfn_inject_exit(void)
+{
+	if (hwpoison_dir)
+		debugfs_remove_recursive(hwpoison_dir);
+}
+
+static int pfn_inject_init(void)
+{
+	hwpoison_dir = debugfs_create_dir("hwpoison", NULL);
+	if (hwpoison_dir == NULL)
+		return -ENOMEM;
+	corrupt_pfn = debugfs_create_file("corrupt-pfn", 0600, hwpoison_dir,
+					  NULL, &hwpoison_fops);
+	if (corrupt_pfn == NULL) {
+		pfn_inject_exit();
+		return -ENOMEM;
+	}
+	return 0;
+}
+
+module_init(pfn_inject_init);
+module_exit(pfn_inject_exit);
--- sound-2.6.orig/mm/Kconfig
+++ sound-2.6/mm/Kconfig
@@ -242,6 +242,10 @@ config KSM
 config MEMORY_FAILURE
 	bool
 
+config HWPOISON_INJECT
+	tristate "Poison pages injector"
+	depends on MEMORY_FAILURE && DEBUG_KERNEL
+
 config NOMMU_INITIAL_TRIM_EXCESS
 	int "Turn on mmap() excess space trimming before booting"
 	depends on !MMU
--- sound-2.6.orig/mm/Makefile
+++ sound-2.6/mm/Makefile
@@ -43,5 +43,6 @@ endif
 obj-$(CONFIG_QUICKLIST) += quicklist.o
 obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o page_cgroup.o
 obj-$(CONFIG_MEMORY_FAILURE) += memory-failure.o
+obj-$(CONFIG_HWPOISON_INJECT) += hwpoison-inject.o
 obj-$(CONFIG_DEBUG_KMEMLEAK) += kmemleak.o
 obj-$(CONFIG_DEBUG_KMEMLEAK_TEST) += kmemleak-test.o

-- 


^ permalink raw reply	[flat|nested] 79+ messages in thread

* [PATCH 15/22] HWPOISON: early kill cleanups and fixes
  2009-06-15  2:45 [PATCH 00/22] HWPOISON: Intro (v5) Wu Fengguang
                   ` (13 preceding siblings ...)
  2009-06-15  2:45 ` [PATCH 14/22] HWPOISON: Add simple debugfs interface to inject hwpoison on arbitary PFNs Wu Fengguang
@ 2009-06-15  2:45 ` Wu Fengguang
  2009-06-15  2:45 ` [PATCH 16/22] mm: move page flag numbers for user space to page-flags.h Wu Fengguang
                   ` (7 subsequent siblings)
  22 siblings, 0 replies; 79+ messages in thread
From: Wu Fengguang @ 2009-06-15  2:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, Wu Fengguang, Ingo Molnar, Mel Gorman, Thomas Gleixner,
	H. Peter Anvin, Peter Zijlstra, Nick Piggin, Hugh Dickins,
	Andi Kleen, riel, chris.mason, linux-mm

[-- Attachment #1: hwpoison-check-address.patch --]
[-- Type: text/plain, Size: 3217 bytes --]

- check for page_mapped_in_vma() on anon pages
- test and use page->mapping instead of page_mapping()
- cleanup some comments

If no objections, this patch will be folded into the big high-level patch.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 include/linux/rmap.h |    1 +
 mm/memory-failure.c  |   20 +++++++++++---------
 mm/rmap.c            |    2 +-
 3 files changed, 13 insertions(+), 10 deletions(-)

--- sound-2.6.orig/mm/memory-failure.c
+++ sound-2.6/mm/memory-failure.c
@@ -122,8 +122,6 @@ struct to_kill {
 
 /*
  * Schedule a process for later kill.
- * Uses GFP_ATOMIC allocations to avoid potential recursions in the VM.
- * TBD would GFP_NOIO be enough?
  */
 static void add_to_kill(struct task_struct *tsk, struct page *p,
 			struct vm_area_struct *vma,
@@ -227,6 +225,9 @@ static void collect_procs_anon(struct pa
 		if (!tsk->mm)
 			continue;
 		list_for_each_entry (vma, &av->head, anon_vma_node) {
+			if (!page_mapped_in_vma(page, vma))
+				continue;
+
 			if (vma->vm_mm == tsk->mm)
 				add_to_kill(tsk, page, vma, to_kill, tkc);
 		}
@@ -245,7 +246,7 @@ static void collect_procs_file(struct pa
 	struct vm_area_struct *vma;
 	struct task_struct *tsk;
 	struct prio_tree_iter iter;
-	struct address_space *mapping = page_mapping(page);
+	struct address_space *mapping = page->mapping;
 
 	/*
 	 * A note on the locking order between the two locks.
@@ -275,16 +276,17 @@ static void collect_procs_file(struct pa
 
 /*
  * Collect the processes who have the corrupted page mapped to kill.
- * This is done in two steps for locking reasons.
- * First preallocate one tokill structure outside the spin locks,
- * so that we can kill at least one process reasonably reliable.
  */
 static void collect_procs(struct page *page, struct list_head *tokill)
 {
 	struct to_kill *tk;
 
-	tk = kmalloc(sizeof(struct to_kill), GFP_KERNEL);
-	/* memory allocation failure is implicitly handled */
+	/*
+	 * First preallocate one to_kill structure outside the spin locks,
+	 * so that we can kill at least one process reasonably reliable.
+	 */
+	tk = kmalloc(sizeof(struct to_kill), GFP_NOIO);
+
 	if (PageAnon(page))
 		collect_procs_anon(page, tokill, &tk);
 	else
@@ -657,7 +659,7 @@ static void hwpoison_user_mappings(struc
 	 * Error handling: We ignore errors here because
 	 * there's nothing that can be done.
 	 */
-	if (kill)
+	if (kill && p->mapping)
 		collect_procs(p, &tokill);
 
 	/*
--- sound-2.6.orig/include/linux/rmap.h
+++ sound-2.6/include/linux/rmap.h
@@ -134,6 +134,7 @@ int page_wrprotect(struct page *page, in
  */
 struct anon_vma *page_lock_anon_vma(struct page *page);
 void page_unlock_anon_vma(struct anon_vma *anon_vma);
+int page_mapped_in_vma(struct page *page, struct vm_area_struct *vma);
 
 #else	/* !CONFIG_MMU */
 
--- sound-2.6.orig/mm/rmap.c
+++ sound-2.6/mm/rmap.c
@@ -315,7 +315,7 @@ pte_t *page_check_address(struct page *p
  * if the page is not mapped into the page tables of this VMA.  Only
  * valid for normal file or anonymous VMAs.
  */
-static int page_mapped_in_vma(struct page *page, struct vm_area_struct *vma)
+int page_mapped_in_vma(struct page *page, struct vm_area_struct *vma)
 {
 	unsigned long address;
 	pte_t *pte;

-- 


^ permalink raw reply	[flat|nested] 79+ messages in thread

* [PATCH 16/22] mm: move page flag numbers for user space to page-flags.h
  2009-06-15  2:45 [PATCH 00/22] HWPOISON: Intro (v5) Wu Fengguang
                   ` (14 preceding siblings ...)
  2009-06-15  2:45 ` [PATCH 15/22] HWPOISON: early kill cleanups and fixes Wu Fengguang
@ 2009-06-15  2:45 ` Wu Fengguang
  2009-06-15  2:45 ` [PATCH 17/22] HWPOISON: introduce struct hwpoison_control Wu Fengguang
                   ` (6 subsequent siblings)
  22 siblings, 0 replies; 79+ messages in thread
From: Wu Fengguang @ 2009-06-15  2:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, Wu Fengguang, Ingo Molnar, Mel Gorman, Thomas Gleixner,
	H. Peter Anvin, Peter Zijlstra, Nick Piggin, Hugh Dickins,
	Andi Kleen, riel, chris.mason, linux-mm

[-- Attachment #1: kpageflags-export-page_uflags.patch --]
[-- Type: text/plain, Size: 3212 bytes --]

We'll be exporting them in other places than /proc/kpageflags.
For example, in hwpoison uevents for describing the poisoned page.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/proc/page.c             |   40 +--------------------------------
 include/linux/page-flags.h |   42 +++++++++++++++++++++++++++++++++++
 2 files changed, 44 insertions(+), 38 deletions(-)

--- sound-2.6.orig/fs/proc/page.c
+++ sound-2.6/fs/proc/page.c
@@ -72,48 +72,12 @@ static const struct file_operations proc
 
 /* These macros are used to decouple internal flags from exported ones */
 
-#define KPF_LOCKED		0
-#define KPF_ERROR		1
-#define KPF_REFERENCED		2
-#define KPF_UPTODATE		3
-#define KPF_DIRTY		4
-#define KPF_LRU			5
-#define KPF_ACTIVE		6
-#define KPF_SLAB		7
-#define KPF_WRITEBACK		8
-#define KPF_RECLAIM		9
-#define KPF_BUDDY		10
-
-/* 11-20: new additions in 2.6.31 */
-#define KPF_MMAP		11
-#define KPF_ANON		12
-#define KPF_SWAPCACHE		13
-#define KPF_SWAPBACKED		14
-#define KPF_COMPOUND_HEAD	15
-#define KPF_COMPOUND_TAIL	16
-#define KPF_HUGE		17
-#define KPF_UNEVICTABLE		18
-#define KPF_HWPOISON		19
-#define KPF_NOPAGE		20
-
-/* kernel hacking assistances
- * WARNING: subject to change, never rely on them!
- */
-#define KPF_RESERVED		32
-#define KPF_MLOCKED		33
-#define KPF_MAPPEDTODISK	34
-#define KPF_PRIVATE		35
-#define KPF_PRIVATE_2		36
-#define KPF_OWNER_PRIVATE	37
-#define KPF_ARCH		38
-#define KPF_UNCACHED		39
-
 static inline u64 kpf_copy_bit(u64 kflags, int ubit, int kbit)
 {
 	return ((kflags >> kbit) & 1) << ubit;
 }
 
-static u64 get_uflags(struct page *page)
+u64 page_uflags(struct page *page)
 {
 	u64 k;
 	u64 u;
@@ -214,7 +178,7 @@ static ssize_t kpageflags_read(struct fi
 		else
 			ppage = NULL;
 
-		if (put_user(get_uflags(ppage), out)) {
+		if (put_user(page_uflags(ppage), out)) {
 			ret = -EFAULT;
 			break;
 		}
--- sound-2.6.orig/include/linux/page-flags.h
+++ sound-2.6/include/linux/page-flags.h
@@ -132,6 +132,46 @@ enum pageflags {
 	PG_slub_debug = PG_error,
 };
 
+/*
+ * stable flag numbers exported to user space
+ */
+
+#define KPF_LOCKED		0
+#define KPF_ERROR		1
+#define KPF_REFERENCED		2
+#define KPF_UPTODATE		3
+#define KPF_DIRTY		4
+#define KPF_LRU			5
+#define KPF_ACTIVE		6
+#define KPF_SLAB		7
+#define KPF_WRITEBACK		8
+#define KPF_RECLAIM		9
+#define KPF_BUDDY		10
+
+/* 11-20: new additions in 2.6.31 */
+#define KPF_MMAP		11
+#define KPF_ANON		12
+#define KPF_SWAPCACHE		13
+#define KPF_SWAPBACKED		14
+#define KPF_COMPOUND_HEAD	15
+#define KPF_COMPOUND_TAIL	16
+#define KPF_HUGE		17
+#define KPF_UNEVICTABLE		18
+#define KPF_HWPOISON		19
+#define KPF_NOPAGE		20
+
+/* kernel hacking assistances
+ * WARNING: subject to change, never rely on them!
+ */
+#define KPF_RESERVED		32
+#define KPF_MLOCKED		33
+#define KPF_MAPPEDTODISK	34
+#define KPF_PRIVATE		35
+#define KPF_PRIVATE_2		36
+#define KPF_OWNER_PRIVATE	37
+#define KPF_ARCH		38
+#define KPF_UNCACHED		39
+
 #ifndef __GENERATING_BOUNDS_H
 
 /*
@@ -284,6 +324,8 @@ TESTSETFLAG_FALSE(HWPoison)
 #define __PG_HWPOISON 0
 #endif
 
+u64 page_uflags(struct page *page);
+
 static inline int PageUptodate(struct page *page)
 {
 	int ret = test_bit(PG_uptodate, &(page)->flags);

-- 


^ permalink raw reply	[flat|nested] 79+ messages in thread

* [PATCH 17/22] HWPOISON: introduce struct hwpoison_control
  2009-06-15  2:45 [PATCH 00/22] HWPOISON: Intro (v5) Wu Fengguang
                   ` (15 preceding siblings ...)
  2009-06-15  2:45 ` [PATCH 16/22] mm: move page flag numbers for user space to page-flags.h Wu Fengguang
@ 2009-06-15  2:45 ` Wu Fengguang
  2009-06-15  2:45 ` [PATCH 18/22] HWPOISON: use compound head page Wu Fengguang
                   ` (5 subsequent siblings)
  22 siblings, 0 replies; 79+ messages in thread
From: Wu Fengguang @ 2009-06-15  2:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, Wu Fengguang, Ingo Molnar, Mel Gorman, Thomas Gleixner,
	H. Peter Anvin, Peter Zijlstra, Nick Piggin, Hugh Dickins,
	Andi Kleen, riel, chris.mason, linux-mm

[-- Attachment #1: hwpoison-control.patch --]
[-- Type: text/plain, Size: 9286 bytes --]

Code cleanups to allow passing around more parameters and states.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/memory-failure.c |   94 ++++++++++++++++++++++++------------------
 1 file changed, 54 insertions(+), 40 deletions(-)

--- sound-2.6.orig/mm/memory-failure.c
+++ sound-2.6/mm/memory-failure.c
@@ -298,26 +298,32 @@ static void collect_procs(struct page *p
  * Error handlers for various types of pages.
  */
 
-enum outcome {
+enum hwpoison_outcome {
 	FAILED,		/* Error handling failed */
 	DELAYED,	/* Will be handled later */
 	IGNORED,	/* Error safely ignored */
 	RECOVERED,	/* Successfully recovered */
 };
 
-static const char *action_name[] = {
+static const char *hwpoison_outcome_name[] = {
 	[FAILED] = "Failed",
 	[DELAYED] = "Delayed",
 	[IGNORED] = "Ignored",
 	[RECOVERED] = "Recovered",
 };
 
+struct hwpoison_control {
+	unsigned long pfn;
+	struct page *page;
+	int outcome;
+};
+
 /*
  * Error hit kernel page.
  * Do nothing, try to be lucky and not touch this instead. For a few cases we
  * could be more sophisticated.
  */
-static int me_kernel(struct page *p, unsigned long pfn)
+static int me_kernel(struct hwpoison_control *hpc)
 {
 	return DELAYED;
 }
@@ -325,7 +331,7 @@ static int me_kernel(struct page *p, uns
 /*
  * Already poisoned page.
  */
-static int me_ignore(struct page *p, unsigned long pfn)
+static int me_ignore(struct hwpoison_control *hpc)
 {
 	return IGNORED;
 }
@@ -333,16 +339,16 @@ static int me_ignore(struct page *p, uns
 /*
  * Page in unknown state. Do nothing.
  */
-static int me_unknown(struct page *p, unsigned long pfn)
+static int me_unknown(struct hwpoison_control *hpc)
 {
-	printk(KERN_ERR "MCE %#lx: Unknown page state\n", pfn);
+	printk(KERN_ERR "MCE %#lx: Unknown page state\n", hpc->pfn);
 	return FAILED;
 }
 
 /*
  * Free memory
  */
-static int me_free(struct page *p, unsigned long pfn)
+static int me_free(struct hwpoison_control *hpc)
 {
 	return DELAYED;
 }
@@ -350,9 +356,10 @@ static int me_free(struct page *p, unsig
 /*
  * Clean (or cleaned) page cache page.
  */
-static int me_pagecache_clean(struct page *p, unsigned long pfn)
+static int me_pagecache_clean(struct hwpoison_control *hpc)
 {
 	struct address_space *mapping;
+	struct page *p = hpc->page;
 
 	if (!isolate_lru_page(p))
 		page_cache_release(p);
@@ -372,14 +379,14 @@ static int me_pagecache_clean(struct pag
 	    !invalidate_complete_page(mapping, p)) {
 		printk(KERN_ERR
 		       "MCE %#lx: failed to invalidate metadata page\n",
-			pfn);
+			hpc->pfn);
 		return FAILED;
 	}
 
 	truncate_inode_page(mapping, p);
 	if (page_has_private(p) && !try_to_release_page(p, GFP_NOIO)) {
 		pr_debug(KERN_ERR "MCE %#lx: failed to release buffers\n",
-			 pfn);
+			 hpc->pfn);
 		return FAILED;
 	}
 	return RECOVERED;
@@ -390,11 +397,11 @@ static int me_pagecache_clean(struct pag
  * Issues: when the error hit a hole page the error is not properly
  * propagated.
  */
-static int me_pagecache_dirty(struct page *p, unsigned long pfn)
+static int me_pagecache_dirty(struct hwpoison_control *hpc)
 {
-	struct address_space *mapping = page_mapping(p);
+	struct address_space *mapping = page_mapping(hpc->page);
 
-	SetPageError(p);
+	SetPageError(hpc->page);
 	/* TBD: print more information about the file. */
 	if (mapping) {
 		/*
@@ -434,7 +441,7 @@ static int me_pagecache_dirty(struct pag
 		mapping_set_error(mapping, EIO);
 	}
 
-	return me_pagecache_clean(p, pfn);
+	return me_pagecache_clean(hpc);
 }
 
 /*
@@ -456,8 +463,10 @@ static int me_pagecache_dirty(struct pag
  * Clean swap cache pages can be directly isolated. A later page fault will
  * bring in the known good data from disk.
  */
-static int me_swapcache_dirty(struct page *p, unsigned long pfn)
+static int me_swapcache_dirty(struct hwpoison_control *hpc)
 {
+	struct page *p = hpc->page;
+
 	ClearPageDirty(p);
 	/* Trigger EIO in shmem: */
 	ClearPageUptodate(p);
@@ -468,8 +477,10 @@ static int me_swapcache_dirty(struct pag
 	return DELAYED;
 }
 
-static int me_swapcache_clean(struct page *p, unsigned long pfn)
+static int me_swapcache_clean(struct hwpoison_control *hpc)
 {
+	struct page *p = hpc->page;
+
 	if (!isolate_lru_page(p))
 		page_cache_release(p);
 
@@ -489,7 +500,7 @@ static int me_swapcache_clean(struct pag
  * Should handle free huge pages and dequeue them too, but this needs to
  * handle huge page accounting correctly.
  */
-static int me_huge_page(struct page *p, unsigned long pfn)
+static int me_huge_page(struct hwpoison_control *hpc)
 {
 	return FAILED;
 }
@@ -525,7 +536,7 @@ static struct page_state {
 	unsigned long mask;
 	unsigned long res;
 	char *msg;
-	int (*action)(struct page *p, unsigned long pfn);
+	int (*action)(struct hwpoison_control *hpc);
 } error_states[] = {
 	{ reserved,	reserved,	"reserved kernel",	me_ignore },
 	{ buddy,	buddy,		"free kernel",	me_free },
@@ -567,24 +578,22 @@ static struct page_state {
 	{ 0,		0,		"unknown page state",	me_unknown },
 };
 
-static void action_result(unsigned long pfn, char *msg, int result)
+static void action_result(struct hwpoison_control *hpc, char *msg, int result)
 {
+	hpc->outcome = result;
 	printk(KERN_ERR "MCE %#lx: %s%s page recovery: %s\n",
-		pfn, PageDirty(pfn_to_page(pfn)) ? "dirty " : "",
-		msg, action_name[result]);
+		hpc->pfn, PageDirty(hpc->page) ? "dirty " : "",
+		msg, hwpoison_outcome_name[result]);
 }
 
-static void page_action(struct page_state *ps, struct page *p,
-			unsigned long pfn)
+static void page_action(struct page_state *ps, struct hwpoison_control *hpc)
 {
-	int result;
+	action_result(hpc, ps->msg, ps->action(hpc));
 
-	result = ps->action(p, pfn);
-	action_result(pfn, ps->msg, result);
-	if (page_count(p) != 1)
+	if (page_count(hpc->page) != 1)
 		printk(KERN_ERR
 		       "MCE %#lx: %s page still referenced by %d users\n",
-		       pfn, ps->msg, page_count(p) - 1);
+		       hpc->pfn, ps->msg, page_count(hpc->page) - 1);
 
 	/* Could do more checks here if page looks ok */
 	atomic_long_add(1, &mce_bad_pages);
@@ -600,12 +609,12 @@ static void page_action(struct page_stat
  * Do all that is necessary to remove user space mappings. Unmap
  * the pages and send SIGBUS to the processes if the data was dirty.
  */
-static void hwpoison_user_mappings(struct page *p, unsigned long pfn,
-				  int trapno)
+static void hwpoison_user_mappings(struct hwpoison_control *hpc, int trapno)
 {
 	enum ttu_flags ttu = TTU_UNMAP | TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS;
 	int kill = sysctl_memory_failure_early_kill;
 	struct address_space *mapping;
+	struct page *p = hpc->page;
 	LIST_HEAD(tokill);
 	int ret;
 	int i;
@@ -625,7 +634,8 @@ static void hwpoison_user_mappings(struc
 
 	if (PageSwapCache(p)) {
 		printk(KERN_ERR
-		       "MCE %#lx: keeping poisoned page in swap cache\n", pfn);
+		       "MCE %#lx: keeping poisoned page in swap cache\n",
+		       hpc->pfn);
 		ttu |= TTU_IGNORE_HWPOISON;
 	}
 
@@ -642,7 +652,7 @@ static void hwpoison_user_mappings(struc
 			ttu |= TTU_IGNORE_HWPOISON;
 			printk(KERN_INFO
 	"MCE %#lx: corrupted page was clean: dropped without side effects\n",
-				pfn);
+				hpc->pfn);
 		}
 	}
 
@@ -670,12 +680,13 @@ static void hwpoison_user_mappings(struc
 		ret = try_to_unmap(p, ttu);
 		if (ret == SWAP_SUCCESS)
 			break;
-		pr_debug("MCE %#lx: try_to_unmap retry needed %d\n", pfn,  ret);
+		pr_debug("MCE %#lx: try_to_unmap retry needed %d\n",
+			 hpc->pfn, ret);
 	}
 
 	if (ret != SWAP_SUCCESS)
 		printk(KERN_ERR "MCE %#lx: failed to unmap page (mapcount=%d)\n",
-				pfn, page_mapcount(p));
+				hpc->pfn, page_mapcount(p));
 
 	/*
 	 * Now that the dirty bit has been propagated to the
@@ -687,7 +698,7 @@ static void hwpoison_user_mappings(struc
 	 * any accesses to the poisoned memory.
 	 */
 	kill_procs_ao(&tokill, !!PageDirty(p), trapno,
-		      ret != SWAP_SUCCESS, pfn);
+		      ret != SWAP_SUCCESS, hpc->pfn);
 }
 
 /**
@@ -711,6 +722,7 @@ void memory_failure(unsigned long pfn, i
 {
 	struct page_state *ps;
 	struct page *p;
+	struct hwpoison_control hpc;
 
 	if (!pfn_valid(pfn)) {
 		printk(KERN_ERR
@@ -720,8 +732,10 @@ void memory_failure(unsigned long pfn, i
 	}
 
 	p = pfn_to_page(pfn);
+	hpc.pfn = pfn;
+	hpc.page = p;
 	if (TestSetPageHWPoison(p)) {
-		action_result(pfn, "already hardware poisoned", IGNORED);
+		action_result(&hpc, "already hardware poisoned", IGNORED);
 		return;
 	}
 
@@ -737,7 +751,7 @@ void memory_failure(unsigned long pfn, i
 	 * that may make page_freeze_refs()/page_unfreeze_refs() mismatch.
 	 */
 	if (!get_page_unless_zero(compound_head(p))) {
-		action_result(pfn, "free or high order kernel", IGNORED);
+		action_result(&hpc, "free or high order kernel", IGNORED);
 		return;
 	}
 
@@ -752,19 +766,19 @@ void memory_failure(unsigned long pfn, i
 	/*
 	 * Now take care of user space mappings.
 	 */
-	hwpoison_user_mappings(p, pfn, trapno);
+	hwpoison_user_mappings(&hpc, trapno);
 
 	/*
 	 * Torn down by someone else?
 	 */
 	if (PageLRU(p) && !PageSwapCache(p) && p->mapping == NULL) {
-		action_result(pfn, "already truncated LRU", IGNORED);
+		action_result(&hpc, "already truncated LRU", IGNORED);
 		goto out;
 	}
 
 	for (ps = error_states;; ps++) {
 		if ((p->flags & ps->mask) == ps->res) {
-			page_action(ps, p, pfn);
+			page_action(ps, &hpc);
 			break;
 		}
 	}

-- 


^ permalink raw reply	[flat|nested] 79+ messages in thread

* [PATCH 18/22] HWPOISON: use compound head page
  2009-06-15  2:45 [PATCH 00/22] HWPOISON: Intro (v5) Wu Fengguang
                   ` (16 preceding siblings ...)
  2009-06-15  2:45 ` [PATCH 17/22] HWPOISON: introduce struct hwpoison_control Wu Fengguang
@ 2009-06-15  2:45 ` Wu Fengguang
  2009-06-15  2:45 ` [PATCH 19/22] HWPOISON: detect free buddy pages explicitly Wu Fengguang
                   ` (4 subsequent siblings)
  22 siblings, 0 replies; 79+ messages in thread
From: Wu Fengguang @ 2009-06-15  2:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, Wu Fengguang, Ingo Molnar, Mel Gorman, Thomas Gleixner,
	H. Peter Anvin, Peter Zijlstra, Nick Piggin, Hugh Dickins,
	Andi Kleen, riel, chris.mason, linux-mm

[-- Attachment #1: hwpoison-compound-page-head.patch --]
[-- Type: text/plain, Size: 1480 bytes --]

In most places we want to test/operate on the compound head page. 
The raw poisoned page is recorded in hwpoison_control.p for others.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/memory-failure.c |   15 ++++++++++-----
 1 file changed, 10 insertions(+), 5 deletions(-)

--- sound-2.6.orig/mm/memory-failure.c
+++ sound-2.6/mm/memory-failure.c
@@ -314,7 +314,8 @@ static const char *hwpoison_outcome_name
 
 struct hwpoison_control {
 	unsigned long pfn;
-	struct page *page;
+	struct page *p;		/* corrupted page */
+	struct page *page;	/* compound page head */
 	int outcome;
 };
 
@@ -732,13 +733,17 @@ void memory_failure(unsigned long pfn, i
 	}
 
 	p = pfn_to_page(pfn);
-	hpc.pfn = pfn;
-	hpc.page = p;
 	if (TestSetPageHWPoison(p)) {
-		action_result(&hpc, "already hardware poisoned", IGNORED);
+		printk(KERN_ERR
+		       "MCE %#lx: already hardware poisoned: Ignored\n",
+		       pfn);
 		return;
 	}
 
+	hpc.pfn  = pfn;
+	hpc.p    = p;
+	hpc.page = p = compound_head(p);
+
 	/*
 	 * We need/can do nothing about count=0 pages.
 	 * 1) it's a free page, and therefore in safe hand:
@@ -750,7 +755,7 @@ void memory_failure(unsigned long pfn, i
 	 * In fact it's dangerous to directly bump up page count from 0,
 	 * that may make page_freeze_refs()/page_unfreeze_refs() mismatch.
 	 */
-	if (!get_page_unless_zero(compound_head(p))) {
+	if (!get_page_unless_zero(p)) {
 		action_result(&hpc, "free or high order kernel", IGNORED);
 		return;
 	}

-- 


^ permalink raw reply	[flat|nested] 79+ messages in thread

* [PATCH 19/22] HWPOISON: detect free buddy pages explicitly
  2009-06-15  2:45 [PATCH 00/22] HWPOISON: Intro (v5) Wu Fengguang
                   ` (17 preceding siblings ...)
  2009-06-15  2:45 ` [PATCH 18/22] HWPOISON: use compound head page Wu Fengguang
@ 2009-06-15  2:45 ` Wu Fengguang
  2009-06-15  2:45 ` [PATCH 20/22] HWPOISON: collect infos that reflect the impact of the memory corruption Wu Fengguang
                   ` (3 subsequent siblings)
  22 siblings, 0 replies; 79+ messages in thread
From: Wu Fengguang @ 2009-06-15  2:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, Wu Fengguang, Ingo Molnar, Mel Gorman, Thomas Gleixner,
	H. Peter Anvin, Peter Zijlstra, Nick Piggin, Hugh Dickins,
	Andi Kleen, riel, chris.mason, linux-mm

[-- Attachment #1: hwpoison-is-free-page.patch --]
[-- Type: text/plain, Size: 3184 bytes --]

The free pages in the buddy system may well have no PG_buddy set.
Introduce is_free_buddy_page() for detecting them reliably.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/internal.h       |    3 +++
 mm/memory-failure.c |   24 +++++++++---------------
 mm/page_alloc.c     |   16 ++++++++++++++++
 3 files changed, 28 insertions(+), 15 deletions(-)

--- sound-2.6.orig/mm/memory-failure.c
+++ sound-2.6/mm/memory-failure.c
@@ -324,15 +324,15 @@ struct hwpoison_control {
  * Do nothing, try to be lucky and not touch this instead. For a few cases we
  * could be more sophisticated.
  */
-static int me_kernel(struct hwpoison_control *hpc)
+static int me_slab(struct hwpoison_control *hpc)
 {
 	return DELAYED;
 }
 
 /*
- * Already poisoned page.
+ * Reserved kernel page.
  */
-static int me_ignore(struct hwpoison_control *hpc)
+static int me_reserved(struct hwpoison_control *hpc)
 {
 	return IGNORED;
 }
@@ -347,14 +347,6 @@ static int me_unknown(struct hwpoison_co
 }
 
 /*
- * Free memory
- */
-static int me_free(struct hwpoison_control *hpc)
-{
-	return DELAYED;
-}
-
-/*
  * Clean (or cleaned) page cache page.
  */
 static int me_pagecache_clean(struct hwpoison_control *hpc)
@@ -539,15 +531,14 @@ static struct page_state {
 	char *msg;
 	int (*action)(struct hwpoison_control *hpc);
 } error_states[] = {
-	{ reserved,	reserved,	"reserved kernel",	me_ignore },
-	{ buddy,	buddy,		"free kernel",	me_free },
+	{ reserved,	reserved,	"reserved kernel", me_reserved },
 
 	/*
 	 * Could in theory check if slab page is free or if we can drop
 	 * currently unused objects without touching them. But just
 	 * treat it as standard kernel for now.
 	 */
-	{ slab,		slab,		"kernel slab",	me_kernel },
+	{ slab,		slab,		"kernel slab",	me_slab },
 
 #ifdef CONFIG_PAGEFLAGS_EXTENDED
 	{ head,		head,		"huge",		me_huge_page },
@@ -756,7 +747,10 @@ void memory_failure(unsigned long pfn, i
 	 * that may make page_freeze_refs()/page_unfreeze_refs() mismatch.
 	 */
 	if (!get_page_unless_zero(p)) {
-		action_result(&hpc, "free or high order kernel", IGNORED);
+		if (is_free_buddy_page(p))
+			action_result(&hpc, "free buddy", DELAYED);
+		else
+			action_result(&hpc, "high order kernel", IGNORED);
 		return;
 	}
 
--- sound-2.6.orig/mm/internal.h
+++ sound-2.6/mm/internal.h
@@ -49,6 +49,9 @@ extern void putback_lru_page(struct page
 extern unsigned long highest_memmap_pfn;
 extern void __free_pages_bootmem(struct page *page, unsigned int order);
 extern void prep_compound_page(struct page *page, unsigned long order);
+#ifdef CONFIG_MEMORY_FAILURE
+extern bool is_free_buddy_page(struct page *page);
+#endif
 
 
 /*
--- sound-2.6.orig/mm/page_alloc.c
+++ sound-2.6/mm/page_alloc.c
@@ -4966,3 +4966,19 @@ __offline_isolated_pages(unsigned long s
 	spin_unlock_irqrestore(&zone->lock, flags);
 }
 #endif
+
+#ifdef CONFIG_MEMORY_FAILURE
+bool is_free_buddy_page(struct page *page)
+{
+	int pfn = page_to_pfn(page);
+	int order;
+
+	for (order = 0; order < MAX_ORDER; order++) {
+		struct page *page_head = page - (pfn & ((1 << order) - 1));
+
+		if (PageBuddy(page_head) && page_order(page_head) >= order)
+			return true;
+	}
+	return false;
+}
+#endif

-- 


^ permalink raw reply	[flat|nested] 79+ messages in thread

* [PATCH 20/22] HWPOISON: collect infos that reflect the impact of the memory corruption
  2009-06-15  2:45 [PATCH 00/22] HWPOISON: Intro (v5) Wu Fengguang
                   ` (18 preceding siblings ...)
  2009-06-15  2:45 ` [PATCH 19/22] HWPOISON: detect free buddy pages explicitly Wu Fengguang
@ 2009-06-15  2:45 ` Wu Fengguang
  2009-06-15  2:45 ` [PATCH 21/22] HWPOISON: send uevent to report " Wu Fengguang
                   ` (2 subsequent siblings)
  22 siblings, 0 replies; 79+ messages in thread
From: Wu Fengguang @ 2009-06-15  2:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, Wu Fengguang, Ingo Molnar, Mel Gorman, Thomas Gleixner,
	H. Peter Anvin, Peter Zijlstra, Nick Piggin, Hugh Dickins,
	Andi Kleen, riel, chris.mason, linux-mm

[-- Attachment #1: hwpoison-safety-bits.patch --]
[-- Type: text/plain, Size: 4679 bytes --]

When a page corrupted, users may care about
- does it hit some important areas?
- can its data be recovered?
- can it be isolated to avoid a deadly future reference?
so that they can take proper actions like emergency sync/shutdown or
schedule reboot at some convenient time.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/memory-failure.c |   78 +++++++++++++++++++++++++++++++++++-------
 1 file changed, 66 insertions(+), 12 deletions(-)

--- sound-2.6.orig/mm/memory-failure.c
+++ sound-2.6/mm/memory-failure.c
@@ -312,11 +312,32 @@ static const char *hwpoison_outcome_name
 	[RECOVERED] = "Recovered",
 };
 
+enum hwpoison_page_type {
+	PAGE_IS_KERNEL,
+	PAGE_IS_FS_METADATA,
+	PAGE_IS_FILE_DATA,
+	PAGE_IS_ANON_DATA,
+	PAGE_IS_SWAP_CACHE,
+	PAGE_IS_FREE,
+};
+
+static const char *hwpoison_page_type_name[] = {
+	[ PAGE_IS_KERNEL ]	= "kernel",
+	[ PAGE_IS_FS_METADATA ]	= "fs_metadata",
+	[ PAGE_IS_FILE_DATA ]	= "file_data",
+	[ PAGE_IS_ANON_DATA ]	= "anon_data",
+	[ PAGE_IS_SWAP_CACHE ]	= "swap_cache",
+	[ PAGE_IS_FREE ]	= "free",
+};
+
 struct hwpoison_control {
 	unsigned long pfn;
 	struct page *p;		/* corrupted page */
 	struct page *page;	/* compound page head */
 	int outcome;
+	int page_type;
+	unsigned data_recoverable:1;
+	unsigned page_isolated:1;
 };
 
 /*
@@ -358,8 +379,14 @@ static int me_pagecache_clean(struct hwp
 		page_cache_release(p);
 
 	mapping = page_mapping(p);
-	if (mapping == NULL)
+	if (mapping == NULL) {
+		hpc->page_isolated = 1;
 		return RECOVERED;
+	}
+
+	/* clean file backed page is recoverable */
+	if (!PageDirty(p) && !PageSwapBacked(p))
+		hpc->data_recoverable = 1;
 
 	/*
 	 * Now truncate the page in the page cache. This is really
@@ -368,12 +395,14 @@ static int me_pagecache_clean(struct hwp
 	 * has a reference, because it could be file system metadata
 	 * and that's not safe to truncate.
 	 */
-	if (!S_ISREG(mapping->host->i_mode) &&
-	    !invalidate_complete_page(mapping, p)) {
-		printk(KERN_ERR
-		       "MCE %#lx: failed to invalidate metadata page\n",
-			hpc->pfn);
-		return FAILED;
+	if (!S_ISREG(mapping->host->i_mode)) {
+		hpc->page_type = PAGE_IS_FS_METADATA;
+		if (!invalidate_complete_page(mapping, p)) {
+			printk(KERN_ERR
+			       "MCE %#lx: failed to invalidate metadata page\n",
+			       hpc->pfn);
+			return FAILED;
+		}
 	}
 
 	truncate_inode_page(mapping, p);
@@ -382,6 +411,8 @@ static int me_pagecache_clean(struct hwp
 			 hpc->pfn);
 		return FAILED;
 	}
+
+	hpc->page_isolated = 1;
 	return RECOVERED;
 }
 
@@ -467,6 +498,7 @@ static int me_swapcache_dirty(struct hwp
 	if (!isolate_lru_page(p))
 		page_cache_release(p);
 
+	hpc->page_isolated = 1;
 	return DELAYED;
 }
 
@@ -478,6 +510,8 @@ static int me_swapcache_clean(struct hwp
 		page_cache_release(p);
 
 	delete_from_swap_cache(p);
+	hpc->data_recoverable = 1;
+	hpc->page_isolated = 1;
 
 	return RECOVERED;
 }
@@ -587,6 +621,10 @@ static void page_action(struct page_stat
 		       "MCE %#lx: %s page still referenced by %d users\n",
 		       hpc->pfn, ps->msg, page_count(hpc->page) - 1);
 
+	if (page_count(hpc->page) > 1 ||
+	    page_mapcount(hpc->page) > 0)
+		hpc->page_isolated = 0;
+
 	/* Could do more checks here if page looks ok */
 	atomic_long_add(1, &mce_bad_pages);
 
@@ -735,6 +773,10 @@ void memory_failure(unsigned long pfn, i
 	hpc.p    = p;
 	hpc.page = p = compound_head(p);
 
+	hpc.page_type = PAGE_IS_KERNEL;
+	hpc.data_recoverable = 0;
+	hpc.page_isolated = 0;
+
 	/*
 	 * We need/can do nothing about count=0 pages.
 	 * 1) it's a free page, and therefore in safe hand:
@@ -747,9 +789,12 @@ void memory_failure(unsigned long pfn, i
 	 * that may make page_freeze_refs()/page_unfreeze_refs() mismatch.
 	 */
 	if (!get_page_unless_zero(p)) {
-		if (is_free_buddy_page(p))
+		if (is_free_buddy_page(p)) {
+			hpc.page_type = PAGE_IS_FREE;
+			hpc.data_recoverable = 1;
+			hpc.page_isolated = 1;
 			action_result(&hpc, "free buddy", DELAYED);
-		else
+		} else
 			action_result(&hpc, "high order kernel", IGNORED);
 		return;
 	}
@@ -770,9 +815,18 @@ void memory_failure(unsigned long pfn, i
 	/*
 	 * Torn down by someone else?
 	 */
-	if (PageLRU(p) && !PageSwapCache(p) && p->mapping == NULL) {
-		action_result(&hpc, "already truncated LRU", IGNORED);
-		goto out;
+	if (PageLRU(p)) {
+		if (PageSwapCache(p))
+			hpc.page_type = PAGE_IS_SWAP_CACHE;
+		else if (PageAnon(p))
+			hpc.page_type = PAGE_IS_ANON_DATA;
+		else
+			hpc.page_type = PAGE_IS_FILE_DATA;
+		if (!PageSwapCache(p) && p->mapping == NULL) {
+			action_result(&hpc, "already truncated LRU", IGNORED);
+			hpc.page_type = PAGE_IS_FREE;
+			goto out;
+		}
 	}
 
 	for (ps = error_states;; ps++) {

-- 


^ permalink raw reply	[flat|nested] 79+ messages in thread

* [PATCH 21/22] HWPOISON: send uevent to report memory corruption
  2009-06-15  2:45 [PATCH 00/22] HWPOISON: Intro (v5) Wu Fengguang
                   ` (19 preceding siblings ...)
  2009-06-15  2:45 ` [PATCH 20/22] HWPOISON: collect infos that reflect the impact of the memory corruption Wu Fengguang
@ 2009-06-15  2:45 ` Wu Fengguang
  2009-06-15  6:29   ` Andi Kleen
  2009-06-16  0:35   ` Greg KH
  2009-06-15  2:45 ` [PATCH 22/22] HWPOISON: FOR TESTING: Enable memory failure code unconditionally Wu Fengguang
  2009-06-15  3:18 ` [PATCH 00/22] HWPOISON: Intro (v5) Balbir Singh
  22 siblings, 2 replies; 79+ messages in thread
From: Wu Fengguang @ 2009-06-15  2:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, Wu Fengguang, Ingo Molnar, Mel Gorman, Thomas Gleixner,
	H. Peter Anvin, Peter Zijlstra, Nick Piggin, Hugh Dickins,
	Andi Kleen, riel, chris.mason, linux-mm

[-- Attachment #1: hwpoison-uevent.patch --]
[-- Type: text/plain, Size: 7009 bytes --]

This allows the user space to do some flexible policies.
For example, it may either do emergency sync/shutdown
or to schedule reboot at some convenient time, depending
on the severeness of the corruption.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 Documentation/vm/memory-failure |   68 ++++++++++++++++++
 mm/memory-failure.c             |  110 +++++++++++++++++++++++++++++-
 2 files changed, 175 insertions(+), 3 deletions(-)

--- /dev/null
+++ sound-2.6/Documentation/vm/memory-failure
@@ -0,0 +1,68 @@
+Memory failure and hardware poison events
+
+Memory may have soft errors and the more memory you have the more errors.
+Normally hardware hides that from you by correcting it, but in some cases you
+can get multi-bit errors which lead to uncorrected errors the hardware cannot
+hide.
+
+This does not necessarily mean that the hardware is broken; for example it can
+be caused by cosmic particles hitting a unlucky transistor. So it can really
+happen in normal operation.
+
+Some hardwares (eg. Nehalem-EX) support background memory scrubbing in order to
+report the memory corruption before they are consumed. The kernel will then try
+to isolate the corrupted memory page, restore data, and finally send a uevent
+to the user space.
+
+A memory poison uevent will be
+
+  # udevadm monitor --environment --kernel
+  KERNEL[1245030313.702625] change   /kernel/mm/hwpoison/hwpoison (hwpoison)
+  UDEV_LOG=3
+  ACTION=change
+  DEVPATH=/kernel/mm/hwpoison/hwpoison
+  SUBSYSTEM=hwpoison
+  EVENT=poison
+  PHYS_ADDR=0x19e1c000
+  PAGE_FLAGS=0x80008083c
+  PAGE_COUNT=3
+  PAGE_MAPCOUNT=1
+  PAGE_DEV=8:2
+  PAGE_INODE=56169
+  PAGE_INDEX=9
+  PAGE_TYPE=file_data
+  PAGE_ISOLATED=1
+  DATA_RECOVERABLE=0
+  SEQNUM=2109
+
+where
+
+  PHYS_ADDR	the physical page address
+  PAGE_FLAGS	the kpageflags bits defined at Documentation/vm/pagemap.txt
+  PAGE_COUNT	the original page reference count
+  PAGE_MAPCOUNT	the original page map count
+
+  PAGE_TYPE	where the error lands, can be one of
+    "kernel"      - a kernel page that may contain some critical data structure
+    "fs_metadata" - a filesystem metadata page
+    "file_data"   - a file data page
+    "anon_data"   - a page belong to some process(es)
+    "swap_cache"  - it's in the swap cache; the kernel cannot tell if it was an
+                    anon_data page or a tmpfs' file_data page
+    "free"        - a free page; not used by anyone
+
+For "file_data" pages, the following three vars are available:
+
+  PAGE_DEV	the file's MAJOR:MINOR device numbers in decimal
+  PAGE_INODE	the file's inode number in decimal
+  PAGE_INDEX	the file offset in page size
+
+  PAGE_ISOLATED if 1, we are sure that the page won't be consumed in the future.
+                if 0, the error page is still referenced by someone, and may be
+		consumed at anytime, which will be detected/stopped by hardware,
+		and trigger instant machine reboot.
+
+  DATA_RECOVERABLE if 1, no data are lost. For example, it's a free page, or a
+                   clean page whose data can be reloaded from disk. In these
+		   cases, the user space will not see the error at all.
+
--- sound-2.6.orig/mm/memory-failure.c
+++ sound-2.6/mm/memory-failure.c
@@ -330,7 +330,11 @@ static const char *hwpoison_page_type_na
 	[ PAGE_IS_FREE ]	= "free",
 };
 
+static struct kset *hwpoison_kset;
+static struct kobject hwpoison_kobj;
+
 struct hwpoison_control {
+	struct kobj_uevent_env *env;
 	unsigned long pfn;
 	struct page *p;		/* corrupted page */
 	struct page *page;	/* compound page head */
@@ -340,6 +344,51 @@ struct hwpoison_control {
 	unsigned page_isolated:1;
 };
 
+static void hwpoison_uevent_page(struct hwpoison_control *hpc)
+{
+	struct page *p = hpc->page;
+
+	if (hpc->env == NULL)
+		return;
+
+	add_uevent_var(hpc->env, "EVENT=poison");
+	add_uevent_var(hpc->env, "PHYS_ADDR=%#lx", hpc->pfn << PAGE_SHIFT);
+	add_uevent_var(hpc->env, "PAGE_FLAGS=%#Lx", page_uflags(p));
+	add_uevent_var(hpc->env, "PAGE_COUNT=%d", page_count(p));
+	add_uevent_var(hpc->env, "PAGE_MAPCOUNT=%d", page_mapcount(p));
+}
+
+static void hwpoison_uevent_file(struct hwpoison_control *hpc)
+{
+	struct address_space *mapping = page_mapping(hpc->page);
+
+	if (hpc->env == NULL)
+		return;
+
+	if (!mapping || !mapping->host)
+		return;
+
+	add_uevent_var(hpc->env, "PAGE_DEV=%d:%d",
+		       MAJOR(mapping->host->i_sb->s_dev),
+		       MINOR(mapping->host->i_sb->s_dev));
+	add_uevent_var(hpc->env, "PAGE_INODE=%lu", mapping->host->i_ino);
+	add_uevent_var(hpc->env, "PAGE_INDEX=%lu", hpc->page->index);
+}
+
+static void hwpoison_uevent_send(struct hwpoison_control *hpc)
+{
+	if (hpc->env == NULL)
+		return;
+
+	add_uevent_var(hpc->env, "PAGE_TYPE=%s",
+		       hwpoison_page_type_name[hpc->page_type]);
+	add_uevent_var(hpc->env, "PAGE_ISOLATED=%d",
+		       hpc->page_isolated);
+	add_uevent_var(hpc->env, "DATA_RECOVERABLE=%d",
+		       hpc->data_recoverable);
+	kobject_uevent_env(&hwpoison_kobj, KOBJ_CHANGE, hpc->env->envp);
+}
+
 /*
  * Error hit kernel page.
  * Do nothing, try to be lucky and not touch this instead. For a few cases we
@@ -769,10 +818,19 @@ void memory_failure(unsigned long pfn, i
 		return;
 	}
 
+	hpc.env = kzalloc(sizeof(struct kobj_uevent_env), GFP_NOIO);
+	if (!hpc.env) {
+		printk(KERN_ERR
+		       "MCE %#lx: cannot allocate memory for uevent\n",
+		       pfn);
+	}
+
 	hpc.pfn  = pfn;
 	hpc.p    = p;
 	hpc.page = p = compound_head(p);
 
+	hwpoison_uevent_page(&hpc);
+
 	hpc.page_type = PAGE_IS_KERNEL;
 	hpc.data_recoverable = 0;
 	hpc.page_isolated = 0;
@@ -796,7 +854,7 @@ void memory_failure(unsigned long pfn, i
 			action_result(&hpc, "free buddy", DELAYED);
 		} else
 			action_result(&hpc, "high order kernel", IGNORED);
-		return;
+		goto out;
 	}
 
 	/*
@@ -825,16 +883,62 @@ void memory_failure(unsigned long pfn, i
 		if (!PageSwapCache(p) && p->mapping == NULL) {
 			action_result(&hpc, "already truncated LRU", IGNORED);
 			hpc.page_type = PAGE_IS_FREE;
-			goto out;
+			goto out_unlock;
 		}
 	}
 
+	hwpoison_uevent_file(&hpc);
+
 	for (ps = error_states;; ps++) {
 		if ((p->flags & ps->mask) == ps->res) {
 			page_action(ps, &hpc);
 			break;
 		}
 	}
-out:
+out_unlock:
 	unlock_page(p);
+out:
+	hwpoison_uevent_send(&hpc);
+}
+
+static void hwpoison_release(struct kobject *kobj)
+{
+}
+
+static struct kobj_type hwpoison_ktype = {
+	.release = hwpoison_release,
+};
+
+static int hwpoison_kobj_init(void)
+{
+	int err;
+
+	hwpoison_kset = kset_create_and_add("hwpoison", NULL, mm_kobj);
+	if (!hwpoison_kset)
+		return -ENOMEM;
+
+	hwpoison_kobj.kset = hwpoison_kset;
+
+	err = kobject_init_and_add(&hwpoison_kobj, &hwpoison_ktype, NULL,
+				   "hwpoison");
+	if (err)
+		return -ENOMEM;
+
+	kobject_uevent(&hwpoison_kobj, KOBJ_ADD);
+
+	return 0;
 }
+
+
+static int __init hwpoison_init(void)
+{
+	return hwpoison_kobj_init();
+}
+
+static void __exit hwpoison_exit(void)
+{
+	kset_unregister(hwpoison_kset);
+}
+
+module_init(hwpoison_init);
+module_exit(hwpoison_exit);

-- 


^ permalink raw reply	[flat|nested] 79+ messages in thread

* [PATCH 22/22] HWPOISON: FOR TESTING: Enable memory failure code unconditionally
  2009-06-15  2:45 [PATCH 00/22] HWPOISON: Intro (v5) Wu Fengguang
                   ` (20 preceding siblings ...)
  2009-06-15  2:45 ` [PATCH 21/22] HWPOISON: send uevent to report " Wu Fengguang
@ 2009-06-15  2:45 ` Wu Fengguang
  2009-06-15  3:18 ` [PATCH 00/22] HWPOISON: Intro (v5) Balbir Singh
  22 siblings, 0 replies; 79+ messages in thread
From: Wu Fengguang @ 2009-06-15  2:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, Andi Kleen, Ingo Molnar, Mel Gorman, Wu, Fengguang,
	Thomas Gleixner, H. Peter Anvin, Peter Zijlstra, Nick Piggin,
	Hugh Dickins, Andi Kleen, riel, chris.mason, linux-mm

[-- Attachment #1: hwpoison-enable --]
[-- Type: text/plain, Size: 557 bytes --]

From: Andi Kleen <ak@linux.intel.com>

Normally the memory-failure.c code is enabled by the architecture, but
for easier testing independent of architecture changes enable it unconditionally.

This should not be merged into mainline.

Signed-off-by: Andi Kleen <ak@linux.intel.com>

---
 mm/Kconfig |    2 ++
 1 file changed, 2 insertions(+)

--- sound-2.6.orig/mm/Kconfig
+++ sound-2.6/mm/Kconfig
@@ -241,6 +241,8 @@ config KSM
 
 config MEMORY_FAILURE
 	bool
+	default y
+	depends on MMU
 
 config HWPOISON_INJECT
 	tristate "Poison pages injector"

-- 


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 00/22] HWPOISON: Intro (v5)
  2009-06-15  2:45 [PATCH 00/22] HWPOISON: Intro (v5) Wu Fengguang
                   ` (21 preceding siblings ...)
  2009-06-15  2:45 ` [PATCH 22/22] HWPOISON: FOR TESTING: Enable memory failure code unconditionally Wu Fengguang
@ 2009-06-15  3:18 ` Balbir Singh
  2009-06-15  4:27   ` Wu Fengguang
  22 siblings, 1 reply; 79+ messages in thread
From: Balbir Singh @ 2009-06-15  3:18 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, LKML, Ingo Molnar, Mel Gorman, Thomas Gleixner,
	H. Peter Anvin, Peter Zijlstra, Nick Piggin, Hugh Dickins,
	Andi Kleen, riel, chris.mason, linux-mm

Wu Fengguang wrote:
> Hi all,
> 
> Comments are warmly welcome on the newly introduced uevent code :)
> 
> I hope we can reach consensus in this round and then be able to post
> a final version for .31 inclusion.

Isn't that too aggressive? .31 is already in the merge window.

-- 
	Balbir

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 00/22] HWPOISON: Intro (v5)
  2009-06-15  3:18 ` [PATCH 00/22] HWPOISON: Intro (v5) Balbir Singh
@ 2009-06-15  4:27   ` Wu Fengguang
  2009-06-15  6:44     ` Nick Piggin
  2009-06-15 12:51     ` Hugh Dickins
  0 siblings, 2 replies; 79+ messages in thread
From: Wu Fengguang @ 2009-06-15  4:27 UTC (permalink / raw)
  To: Balbir Singh
  Cc: Andrew Morton, LKML, Ingo Molnar, Mel Gorman, Thomas Gleixner,
	H. Peter Anvin, Peter Zijlstra, Nick Piggin, Hugh Dickins,
	Andi Kleen, riel, chris.mason, linux-mm

On Mon, Jun 15, 2009 at 11:18:18AM +0800, Balbir Singh wrote:
> Wu Fengguang wrote:
> > Hi all,
> > 
> > Comments are warmly welcome on the newly introduced uevent code :)
> > 
> > I hope we can reach consensus in this round and then be able to post
> > a final version for .31 inclusion.
> 
> Isn't that too aggressive? .31 is already in the merge window.

Yes, a bit aggressive. This is a new feature that involves complex logics.
However it is basically a no-op when there are no memory errors,
and when memory corruption does occur, it's better to (possibly) panic
in this code than to panic unconditionally in the absence of this
feature (as said by Rik).

So IMHO it's OK for .31 as long as we agree on the user interfaces,
ie. /proc/sys/vm/memory_failure_early_kill and the hwpoison uevent.

It comes a long way through numerous reviews, and I believe all the
important issues and concerns have been addressed. Nick, Rik, Hugh,
Ingo, ... what are your opinions? Is the uevent good enough to meet
your request to "die hard" or "die gracefully" or whatever on memory
failure events?

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 21/22] HWPOISON: send uevent to report memory corruption
  2009-06-15  2:45 ` [PATCH 21/22] HWPOISON: send uevent to report " Wu Fengguang
@ 2009-06-15  6:29   ` Andi Kleen
  2009-06-15  9:56     ` Wu Fengguang
  2009-06-16  0:35   ` Greg KH
  1 sibling, 1 reply; 79+ messages in thread
From: Andi Kleen @ 2009-06-15  6:29 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, LKML, Ingo Molnar, Mel Gorman, Thomas Gleixner,
	H. Peter Anvin, Peter Zijlstra, Nick Piggin, Hugh Dickins,
	Andi Kleen, riel, chris.mason, linux-mm

On Mon, Jun 15, 2009 at 10:45:41AM +0800, Wu Fengguang wrote:
> This allows the user space to do some flexible policies.
> For example, it may either do emergency sync/shutdown
> or to schedule reboot at some convenient time, depending
> on the severeness of the corruption.
> 

I don't think it's a good idea to export that much detailed information.
That would become a stable ABI, but might not be possible to keep
all these details stable. e.g. map count or reference count are
internal implementation details that shouldn't be exposed.
And what is an user space application to do with the inode? Run
find -inum? 

Also we already report the event using low level logging mechanism.
in a relatively stable form.

It's also unclear to me what an application would do with that much
detail.

I would suggest to drop this part and the earlier flags move.

Please only bug fixes are this stage.

-Andi

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 00/22] HWPOISON: Intro (v5)
  2009-06-15  4:27   ` Wu Fengguang
@ 2009-06-15  6:44     ` Nick Piggin
  2009-06-15  7:09       ` Andi Kleen
  2009-06-15  8:14       ` [PATCH 00/22] HWPOISON: Intro (v5) Nick Piggin
  2009-06-15 12:51     ` Hugh Dickins
  1 sibling, 2 replies; 79+ messages in thread
From: Nick Piggin @ 2009-06-15  6:44 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Balbir Singh, Andrew Morton, LKML, Ingo Molnar, Mel Gorman,
	Thomas Gleixner, H. Peter Anvin, Peter Zijlstra, Hugh Dickins,
	Andi Kleen, riel, chris.mason, linux-mm

On Mon, Jun 15, 2009 at 12:27:53PM +0800, Wu Fengguang wrote:
> On Mon, Jun 15, 2009 at 11:18:18AM +0800, Balbir Singh wrote:
> > Wu Fengguang wrote:
> > > Hi all,
> > > 
> > > Comments are warmly welcome on the newly introduced uevent code :)
> > > 
> > > I hope we can reach consensus in this round and then be able to post
> > > a final version for .31 inclusion.
> > 
> > Isn't that too aggressive? .31 is already in the merge window.
> 
> Yes, a bit aggressive. This is a new feature that involves complex logics.
> However it is basically a no-op when there are no memory errors,
> and when memory corruption does occur, it's better to (possibly) panic
> in this code than to panic unconditionally in the absence of this
> feature (as said by Rik).
> 
> So IMHO it's OK for .31 as long as we agree on the user interfaces,
> ie. /proc/sys/vm/memory_failure_early_kill and the hwpoison uevent.
> 
> It comes a long way through numerous reviews, and I believe all the
> important issues and concerns have been addressed. Nick, Rik, Hugh,
> Ingo, ... what are your opinions? Is the uevent good enough to meet
> your request to "die hard" or "die gracefully" or whatever on memory
> failure events?

Uevent? As in, send a message to userspace? I don't think this
would be ideal for a fail-stop/failover situation.

I can't see a good reason to rush to merge it.

IMO the userspace-visible changes have maybe not been considered
too thoroughly, which is what I'd be most worried about. I probably
missed seeing documentation of exact semantics and situations
where admins should tune things one way or the other.

Did we verify with filesystem maintainers (eg. btrfs) that the
!ISREG test will be enough to prevent oopses?

I hope it is going to be merged with an easy-to-use fault injector,
because that is the only way Joe kernel developer is ever going to
test it.



^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 00/22] HWPOISON: Intro (v5)
  2009-06-15  6:44     ` Nick Piggin
@ 2009-06-15  7:09       ` Andi Kleen
  2009-06-15  7:19         ` Nick Piggin
  2009-06-15  8:14       ` [PATCH 00/22] HWPOISON: Intro (v5) Nick Piggin
  1 sibling, 1 reply; 79+ messages in thread
From: Andi Kleen @ 2009-06-15  7:09 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Wu Fengguang, Balbir Singh, Andrew Morton, LKML, Ingo Molnar,
	Mel Gorman, Thomas Gleixner, H. Peter Anvin, Peter Zijlstra,
	Hugh Dickins, Andi Kleen, riel, chris.mason, linux-mm

On Mon, Jun 15, 2009 at 08:44:47AM +0200, Nick Piggin wrote:
> > 
> > So IMHO it's OK for .31 as long as we agree on the user interfaces,
> > ie. /proc/sys/vm/memory_failure_early_kill and the hwpoison uevent.
> > 
> > It comes a long way through numerous reviews, and I believe all the
> > important issues and concerns have been addressed. Nick, Rik, Hugh,
> > Ingo, ... what are your opinions? Is the uevent good enough to meet
> > your request to "die hard" or "die gracefully" or whatever on memory
> > failure events?
> 
> Uevent? As in, send a message to userspace? I don't think this
> would be ideal for a fail-stop/failover situation.

Agreed.

For failover you typically want a application level heartbeat anyways
to guard against user space software problems and if there's a kill then it
would catch it. Also again in you want to check against all corruptions you
have to do it in the low level handler or better watch corrected
events too to predict failures (but the later is quite hard to do generally). 
To some extent the first is already implemented on x86, e.g. set
the tolerance level to 0 will give more aggressive panics.

> I can't see a good reason to rush to merge it.

The low level x86 code for MCA recovery is in, just this high level
part is missing to kill the correct process. I think it would be good to merge 
a core now.  The basic code seems to be also as well tested as we can do it 
right now and exposing it to more users would be good. It's undoubtedly not 
perfect yet, but that's not a requirement for merge.

There's a lot of fancy stuff that could be done in addition,
but that's not really needed right now and for a lot of the fancy
ideas (I have enough on my own :) it's dubious they are actually
worth it.

> IMO the userspace-visible changes have maybe not been considered
> too thoroughly, which is what I'd be most worried about. I probably
> missed seeing documentation of exact semantics and situations
> where admins should tune things one way or the other.

There's only a single tunable anyways, early kill vs late kill.

For KVM you need early kill, for the others it remains to be seen.

> I hope it is going to be merged with an easy-to-use fault injector,
> because that is the only way Joe kernel developer is ever going to
> test it.

See patches 13 and 14. In addition there's another low level x86
injector too.

There's also a test suite available (mce-test on kernel.org git)

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 00/22] HWPOISON: Intro (v5)
  2009-06-15  7:09       ` Andi Kleen
@ 2009-06-15  7:19         ` Nick Piggin
  2009-06-15 12:10           ` Wu Fengguang
  0 siblings, 1 reply; 79+ messages in thread
From: Nick Piggin @ 2009-06-15  7:19 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Wu Fengguang, Balbir Singh, Andrew Morton, LKML, Ingo Molnar,
	Mel Gorman, Thomas Gleixner, H. Peter Anvin, Peter Zijlstra,
	Hugh Dickins, riel, chris.mason, linux-mm

On Mon, Jun 15, 2009 at 09:09:14AM +0200, Andi Kleen wrote:
> On Mon, Jun 15, 2009 at 08:44:47AM +0200, Nick Piggin wrote:
> > > 
> > > So IMHO it's OK for .31 as long as we agree on the user interfaces,
> > > ie. /proc/sys/vm/memory_failure_early_kill and the hwpoison uevent.
> > > 
> > > It comes a long way through numerous reviews, and I believe all the
> > > important issues and concerns have been addressed. Nick, Rik, Hugh,
> > > Ingo, ... what are your opinions? Is the uevent good enough to meet
> > > your request to "die hard" or "die gracefully" or whatever on memory
> > > failure events?
> > 
> > Uevent? As in, send a message to userspace? I don't think this
> > would be ideal for a fail-stop/failover situation.
> 
> Agreed.
> 
> For failover you typically want a application level heartbeat anyways
> to guard against user space software problems and if there's a kill then it
> would catch it. Also again in you want to check against all corruptions you
> have to do it in the low level handler or better watch corrected
> events too to predict failures (but the later is quite hard to do generally). 
> To some extent the first is already implemented on x86, e.g. set
> the tolerance level to 0 will give more aggressive panics.
> 
> > I can't see a good reason to rush to merge it.
> 
> The low level x86 code for MCA recovery is in, just this high level
> part is missing to kill the correct process. I think it would be good to merge 
> a core now.  The basic code seems to be also as well tested as we can do it 
> right now and exposing it to more users would be good. It's undoubtedly not 
> perfect yet, but that's not a requirement for merge.
> 
> There's a lot of fancy stuff that could be done in addition,
> but that's not really needed right now and for a lot of the fancy
> ideas (I have enough on my own :) it's dubious they are actually
> worth it.

Just my opinion. Normally it takes a lot longer for VM patches
like this to go through, but it's not really up to me anyway.
If Andrew or Linus has it in their head to merge it in 2.6.31,
it's going to get merged ;) 

 
> > IMO the userspace-visible changes have maybe not been considered
> > too thoroughly, which is what I'd be most worried about. I probably
> > missed seeing documentation of exact semantics and situations
> > where admins should tune things one way or the other.
> 
> There's only a single tunable anyways, early kill vs late kill.
> 
> For KVM you need early kill, for the others it remains to be seen.

Right. It's almost like you need to do a per-process thing, and
those that can handle things (such as the new SIGBUS or the new
EIO) could get those, and others could be killed.

Early-kill for KVM does seem like reasonable justification on the
surface, but when I think more about it, I wonder does the guest
actually stand any better chance to correct the error if it is
reported at time T rather than T+delta? (who knows what the page
will be used for at any given time).


> > I hope it is going to be merged with an easy-to-use fault injector,
> > because that is the only way Joe kernel developer is ever going to
> > test it.
> 
> See patches 13 and 14. In addition there's another low level x86
> injector too.
> 
> There's also a test suite available (mce-test on kernel.org git)

OK, I probably saw some "FOR TETSING" things and wrongly assumed the
injector was not supposed to be merged.

Thanks,
Nick

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 00/22] HWPOISON: Intro (v5)
  2009-06-15  6:44     ` Nick Piggin
  2009-06-15  7:09       ` Andi Kleen
@ 2009-06-15  8:14       ` Nick Piggin
  2009-06-15 10:09         ` Wu Fengguang
  1 sibling, 1 reply; 79+ messages in thread
From: Nick Piggin @ 2009-06-15  8:14 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Balbir Singh, Andrew Morton, LKML, Ingo Molnar, Mel Gorman,
	Thomas Gleixner, H. Peter Anvin, Peter Zijlstra, Hugh Dickins,
	Andi Kleen, riel, chris.mason, linux-mm, linux-fsdevel

On Mon, Jun 15, 2009 at 08:44:47AM +0200, Nick Piggin wrote:
> Did we verify with filesystem maintainers (eg. btrfs) that the
> !ISREG test will be enough to prevent oopses?

BTW. this is quite a significant change I think and not
really documented well enough. Previously a filesystem
will know exactly when and why pagecache in a mapping
under its control will be truncated (as opposed to
invalidated).

They even have opportunity to hold locks such as i_mutex.

And depending on what they do, they could do interesting
things even with ISREG files.

So, I really think this needs review by filesystem
maintainers and it would be far safer to use invalidate
until it is known to be safe.


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 10/22] HWPOISON: check and isolate corrupted free pages v2
  2009-06-15  2:45 ` [PATCH 10/22] HWPOISON: check and isolate corrupted free pages v2 Wu Fengguang
@ 2009-06-15  9:41   ` KAMEZAWA Hiroyuki
  2009-06-15 10:16     ` Wu Fengguang
  0 siblings, 1 reply; 79+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-06-15  9:41 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, LKML, Andi Kleen, Ingo Molnar, Mel Gorman,
	Thomas Gleixner, H. Peter Anvin, Peter Zijlstra, Nick Piggin,
	Hugh Dickins, Andi Kleen, riel, chris.mason, linux-mm

On Mon, 15 Jun 2009 10:45:30 +0800
Wu Fengguang <fengguang.wu@intel.com> wrote:

> From: Wu Fengguang <fengguang.wu@intel.com>
> 
> If memory corruption hits the free buddy pages, we can safely ignore them.
> No one will access them until page allocation time, then prep_new_page()
> will automatically check and isolate PG_hwpoison page for us (for 0-order
> allocation).
> 
> This patch expands prep_new_page() to check every component page in a high
> order page allocation, in order to completely stop PG_hwpoison pages from
> being recirculated.
> 
> Note that the common case -- only allocating a single page, doesn't
> do any more work than before. Allocating > order 0 does a bit more work,
> but that's relatively uncommon.
> 
> This simple implementation may drop some innocent neighbor pages, hopefully
> it is not a big problem because the event should be rare enough.
> 
> This patch adds some runtime costs to high order page users.
> 
> [AK: Improved description]
> 
> v2: Andi Kleen:
> Port to -mm code
> Move check into separate function.
> Don't dump stack in bad_pages for hwpoisoned pages.
> 
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> Signed-off-by: Andi Kleen <ak@linux.intel.com>
> 
> ---
>  mm/page_alloc.c |   20 +++++++++++++++++++-
>  1 file changed, 19 insertions(+), 1 deletion(-)
> 
> --- sound-2.6.orig/mm/page_alloc.c
> +++ sound-2.6/mm/page_alloc.c
> @@ -233,6 +233,12 @@ static void bad_page(struct page *page)
>  	static unsigned long nr_shown;
>  	static unsigned long nr_unshown;
>  
> +	/* Don't complain about poisoned pages */
> +	if (PageHWPoison(page)) {
> +		__ClearPageBuddy(page);
> +		return;
> +	}

Hmm ? why __ClearPageBuddy() is necessary ?

Thanks,
-Kame


> +
>  	/*
>  	 * Allow a burst of 60 reports, then keep quiet for that minute;
>  	 * or allow a steady drip of one report per second.
> @@ -646,7 +652,7 @@ static inline void expand(struct zone *z
>  /*
>   * This page is about to be returned from the page allocator
>   */
> -static int prep_new_page(struct page *page, int order, gfp_t gfp_flags)
> +static inline int check_new_page(struct page *page)
>  {
>  	if (unlikely(page_mapcount(page) |
>  		(page->mapping != NULL)  |
> @@ -655,6 +661,18 @@ static int prep_new_page(struct page *pa
>  		bad_page(page);
>  		return 1;
>  	}
> +	return 0;
> +}
> +
> +static int prep_new_page(struct page *page, int order, gfp_t gfp_flags)
> +{
> +	int i;
> +
> +	for (i = 0; i < (1 << order); i++) {
> +		struct page *p = page + i;
> +		if (unlikely(check_new_page(p)))
> +			return 1;
> +	}
>  
>  	set_page_private(page, 0);
>  	set_page_refcounted(page);
> 
> -- 
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
> 


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 21/22] HWPOISON: send uevent to report memory corruption
  2009-06-15  6:29   ` Andi Kleen
@ 2009-06-15  9:56     ` Wu Fengguang
  0 siblings, 0 replies; 79+ messages in thread
From: Wu Fengguang @ 2009-06-15  9:56 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Andrew Morton, LKML, Ingo Molnar, Mel Gorman, Thomas Gleixner,
	H. Peter Anvin, Peter Zijlstra, Nick Piggin, Hugh Dickins, riel,
	chris.mason, linux-mm

On Mon, Jun 15, 2009 at 02:29:34PM +0800, Andi Kleen wrote:
> On Mon, Jun 15, 2009 at 10:45:41AM +0800, Wu Fengguang wrote:
> > This allows the user space to do some flexible policies.
> > For example, it may either do emergency sync/shutdown
> > or to schedule reboot at some convenient time, depending
> > on the severeness of the corruption.
> > 
> 
> I don't think it's a good idea to export that much detailed information.
> That would become a stable ABI, but might not be possible to keep
> all these details stable. e.g. map count or reference count are
> internal implementation details that shouldn't be exposed.
> And what is an user space application to do with the inode? Run
> find -inum? 

I had plan to export file path too, if available. But you are right,
we should really think twice before exporting them.

> Also we already report the event using low level logging mechanism.
> in a relatively stable form.
> 
> It's also unclear to me what an application would do with that much
> detail.
> 
> I would suggest to drop this part and the earlier flags move.
> 
> Please only bug fixes are this stage.

OK, I'll get rid of the uevent patches numbered 16-21 in the next release.

Thanks,
Fengguang


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 00/22] HWPOISON: Intro (v5)
  2009-06-15  8:14       ` [PATCH 00/22] HWPOISON: Intro (v5) Nick Piggin
@ 2009-06-15 10:09         ` Wu Fengguang
  2009-06-15 10:36           ` Nick Piggin
  0 siblings, 1 reply; 79+ messages in thread
From: Wu Fengguang @ 2009-06-15 10:09 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Balbir Singh, Andrew Morton, LKML, Ingo Molnar, Mel Gorman,
	Thomas Gleixner, H. Peter Anvin, Peter Zijlstra, Hugh Dickins,
	Andi Kleen, riel, chris.mason, linux-mm, linux-fsdevel

On Mon, Jun 15, 2009 at 04:14:53PM +0800, Nick Piggin wrote:
> On Mon, Jun 15, 2009 at 08:44:47AM +0200, Nick Piggin wrote:
> > Did we verify with filesystem maintainers (eg. btrfs) that the
> > !ISREG test will be enough to prevent oopses?
> 
> BTW. this is quite a significant change I think and not
> really documented well enough. Previously a filesystem
> will know exactly when and why pagecache in a mapping
> under its control will be truncated (as opposed to
> invalidated).
> 
> They even have opportunity to hold locks such as i_mutex.
> 
> And depending on what they do, they could do interesting
> things even with ISREG files.
> 
> So, I really think this needs review by filesystem
> maintainers and it would be far safer to use invalidate
> until it is known to be safe.

Nick, we are doing invalidate_complete_page() for !S_ISREG inodes now.
Do you mean to do invalidate_complete_page() for all inodes for now?
That's a good suggestion, it shall be able to do the job for most
pages indeed.

To make things look clear, here is the complete memory-failure.c
without the uevent bits.

Thanks,
Fengguang
---

/*
 * Copyright (C) 2008, 2009 Intel Corporation
 * Authors: Andi Kleen, Fengguang Wu
 *
 * This software may be redistributed and/or modified under the terms of
 * the GNU General Public License ("GPL") version 2 only as published by the
 * Free Software Foundation.
 *
 * High level machine check handler. Handles pages reported by the
 * hardware as being corrupted usually due to a 2bit ECC memory or cache
 * failure.
 *
 * This focuses on pages detected as corrupted in the background.
 * When the current CPU tries to consume corruption the currently
 * running process can just be killed directly instead. This implies
 * that if the error cannot be handled for some reason it's safe to
 * just ignore it because no corruption has been consumed yet. Instead
 * when that happens another machine check will happen.
 *
 * Handles page cache pages in various states.	The tricky part
 * here is that we can access any page asynchronous to other VM
 * users, because memory failures could happen anytime and anywhere,
 * possibly violating some of their assumptions. This is why this code
 * has to be extremely careful. Generally it tries to use normal locking
 * rules, as in get the standard locks, even if that means the
 * error handling takes potentially a long time.
 *
 * The operation to map back from RMAP chains to processes has to walk
 * the complete process list and has non linear complexity with the number
 * mappings. In short it can be quite slow. But since memory corruptions
 * are rare we hope to get away with this.
 */

/*
 * Notebook:
 * - hugetlb needs more code
 * - kcore/oldmem/vmcore/mem/kmem check for hwpoison pages
 * - pass bad pages to kdump next kernel
 */
#define DEBUG 1
#include <linux/kernel.h>
#include <linux/mm.h>
#include <linux/page-flags.h>
#include <linux/sched.h>
#include <linux/rmap.h>
#include <linux/pagemap.h>
#include <linux/swap.h>
#include <linux/backing-dev.h>
#include "internal.h"

int sysctl_memory_failure_early_kill __read_mostly = 1;

atomic_long_t mce_bad_pages __read_mostly = ATOMIC_LONG_INIT(0);

/*
 * Send all the processes who have the page mapped an ``action optional''
 * signal.
 */
static int kill_proc_ao(struct task_struct *t, unsigned long addr, int trapno,
			unsigned long pfn)
{
	struct siginfo si;
	int ret;

	printk(KERN_ERR
	"MCE %#lx: Killing %s:%d early due to hardware memory corruption\n",
		pfn, t->comm, t->pid);
	si.si_signo = SIGBUS;
	si.si_errno = 0;
	si.si_code = BUS_MCEERR_AO;
	si.si_addr = (void *)addr;
#ifdef __ARCH_SI_TRAPNO
	si.si_trapno = trapno;
#endif
	si.si_addr_lsb = PAGE_SHIFT;
	/*
	 * Don't use force here, it's convenient if the signal
	 * can be temporarily blocked.
	 * This could cause a loop when the user sets SIGBUS
	 * to SIG_IGN, but hopefully noone will do that?
	 */
	ret = send_sig_info(SIGBUS, &si, t);  /* synchronous? */
	if (ret < 0)
		printk(KERN_INFO "MCE: Error sending signal to %s:%d: %d\n",
		       t->comm, t->pid, ret);
	return ret;
}

/*
 * Kill all processes that have a poisoned page mapped and then isolate
 * the page.
 *
 * General strategy:
 * Find all processes having the page mapped and kill them.
 * But we keep a page reference around so that the page is not
 * actually freed yet.
 * Then stash the page away
 *
 * There's no convenient way to get back to mapped processes
 * from the VMAs. So do a brute-force search over all
 * running processes.
 *
 * Remember that machine checks are not common (or rather
 * if they are common you have other problems), so this shouldn't
 * be a performance issue.
 *
 * Also there are some races possible while we get from the
 * error detection to actually handle it.
 */

struct to_kill {
	struct list_head nd;
	struct task_struct *tsk;
	unsigned long addr;
	unsigned addr_valid:1;
};

/*
 * Failure handling: if we can't find or can't kill a process there's
 * not much we can do.	We just print a message and ignore otherwise.
 */

/*
 * Schedule a process for later kill.
 */
static void add_to_kill(struct task_struct *tsk, struct page *p,
			struct vm_area_struct *vma,
			struct list_head *to_kill,
			struct to_kill **tkc)
{
	struct to_kill *tk;

	if (*tkc) {
		tk = *tkc;
		*tkc = NULL;
	} else {
		tk = kmalloc(sizeof(struct to_kill), GFP_ATOMIC);
		if (!tk) {
			printk(KERN_ERR
		"MCE: Out of memory while machine check handling\n");
			return;
		}
	}
	tk->addr = page_address_in_vma(p, vma);
	tk->addr_valid = 1;

	/*
	 * In theory we don't have to kill when the page was
	 * munmaped. But it could be also a mremap. Since that's
	 * likely very rare kill anyways just out of paranoia, but use
	 * a SIGKILL because the error is not contained anymore.
	 */
	if (tk->addr == -EFAULT) {
		pr_debug("MCE: Unable to find user space address %lx in %s\n",
			page_to_pfn(p), tsk->comm);
		tk->addr_valid = 0;
	}
	get_task_struct(tsk);
	tk->tsk = tsk;
	list_add_tail(&tk->nd, to_kill);
}

/*
 * Kill the processes that have been collected earlier.
 *
 * Only do anything when DOIT is set, otherwise just free the list
 * (this is used for clean pages which do not need killing)
 * Also when FAIL is set do a force kill because something went
 * wrong earlier.
 */
static void kill_procs_ao(struct list_head *to_kill, int doit, int trapno,
			  int fail, unsigned long pfn)
{
	struct to_kill *tk, *next;

	list_for_each_entry_safe (tk, next, to_kill, nd) {
		if (doit) {
			/*
			 * In case something went wrong with munmaping
			 * make sure the process doesn't catch the
			 * signal and then access the memory. Just kill it.
			 * the signal handlers
			 */
			if (fail || tk->addr_valid == 0) {
				printk(KERN_ERR
"MCE %#lx: forcibly killing %s:%d because of failure to unmap corrupted page\n",
					pfn, tk->tsk->comm, tk->tsk->pid);
				force_sig(SIGKILL, tk->tsk);
			}

			/*
			 * In theory the process could have mapped
			 * something else on the address in-between. We could
			 * check for that, but we need to tell the
			 * process anyways.
			 */
			else if (kill_proc_ao(tk->tsk, tk->addr, trapno,
					      pfn) < 0)
				printk(KERN_ERR
	"MCE %#lx: Cannot send advisory machine check signal to %s:%d\n",
					pfn, tk->tsk->comm, tk->tsk->pid);
		}
		put_task_struct(tk->tsk);
		kfree(tk);
	}
}

/*
 * Collect processes when the error hit an anonymous page.
 */
static void collect_procs_anon(struct page *page, struct list_head *to_kill,
			       struct to_kill **tkc)
{
	struct vm_area_struct *vma;
	struct task_struct *tsk;
	struct anon_vma *av;

	read_lock(&tasklist_lock);

	av = page_lock_anon_vma(page);
	if (av == NULL)	/* Not actually mapped anymore */
		goto out;

	for_each_process (tsk) {
		if (!tsk->mm)
			continue;
		list_for_each_entry (vma, &av->head, anon_vma_node) {
			if (!page_mapped_in_vma(page, vma))
				continue;

			if (vma->vm_mm == tsk->mm)
				add_to_kill(tsk, page, vma, to_kill, tkc);
		}
	}
	page_unlock_anon_vma(av);
out:
	read_unlock(&tasklist_lock);
}

/*
 * Collect processes when the error hit a file mapped page.
 */
static void collect_procs_file(struct page *page, struct list_head *to_kill,
			       struct to_kill **tkc)
{
	struct vm_area_struct *vma;
	struct task_struct *tsk;
	struct prio_tree_iter iter;
	struct address_space *mapping = page->mapping;

	/*
	 * A note on the locking order between the two locks.
	 * We don't rely on this particular order.
	 * If you have some other code that needs a different order
	 * feel free to switch them around. Or add a reverse link
	 * from mm_struct to task_struct, then this could be all
	 * done without taking tasklist_lock and looping over all tasks.
	 */

	read_lock(&tasklist_lock);
	spin_lock(&mapping->i_mmap_lock);
	for_each_process(tsk) {
		pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);

		if (!tsk->mm)
			continue;

		vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff,
				      pgoff)
			if (vma->vm_mm == tsk->mm)
				add_to_kill(tsk, page, vma, to_kill, tkc);
	}
	spin_unlock(&mapping->i_mmap_lock);
	read_unlock(&tasklist_lock);
}

/*
 * Collect the processes who have the corrupted page mapped to kill.
 */
static void collect_procs(struct page *page, struct list_head *tokill)
{
	struct to_kill *tk;

	/*
	 * First preallocate one to_kill structure outside the spin locks,
	 * so that we can kill at least one process reasonably reliable.
	 */
	tk = kmalloc(sizeof(struct to_kill), GFP_NOIO);

	if (PageAnon(page))
		collect_procs_anon(page, tokill, &tk);
	else
		collect_procs_file(page, tokill, &tk);
	kfree(tk);
}

/*
 * Error handlers for various types of pages.
 */

enum outcome {
	FAILED,		/* Error handling failed */
	DELAYED,	/* Will be handled later */
	IGNORED,	/* Error safely ignored */
	RECOVERED,	/* Successfully recovered */
};

static const char *action_name[] = {
	[FAILED] = "Failed",
	[DELAYED] = "Delayed",
	[IGNORED] = "Ignored",
	[RECOVERED] = "Recovered",
};

/*
 * Error hit kernel page.
 * Do nothing, try to be lucky and not touch this instead. For a few cases we
 * could be more sophisticated.
 */
static int me_kernel(struct page *p, unsigned long pfn)
{
	return DELAYED;
}

/*
 * Already poisoned page.
 */
static int me_ignore(struct page *p, unsigned long pfn)
{
	return IGNORED;
}

/*
 * Page in unknown state. Do nothing.
 */
static int me_unknown(struct page *p, unsigned long pfn)
{
	printk(KERN_ERR "MCE %#lx: Unknown page state\n", pfn);
	return FAILED;
}

/*
 * Free memory
 */
static int me_free(struct page *p, unsigned long pfn)
{
	return DELAYED;
}

/*
 * Clean (or cleaned) page cache page.
 */
static int me_pagecache_clean(struct page *p, unsigned long pfn)
{
	struct address_space *mapping;

	if (!isolate_lru_page(p))
		page_cache_release(p);

	mapping = page_mapping(p);
	if (mapping == NULL)
		return RECOVERED;

	/*
	 * Now truncate the page in the page cache. This is really
	 * more like a "temporary hole punch"
	 * Don't do this for block devices when someone else
	 * has a reference, because it could be file system metadata
	 * and that's not safe to truncate.
	 */
	if (!S_ISREG(mapping->host->i_mode) &&
	    !invalidate_complete_page(mapping, p)) {
		printk(KERN_ERR
		       "MCE %#lx: failed to invalidate metadata page\n",
			pfn);
		return FAILED;
	}

	truncate_inode_page(mapping, p);
	if (page_has_private(p) && !try_to_release_page(p, GFP_NOIO)) {
		pr_debug(KERN_ERR "MCE %#lx: failed to release buffers\n",
			 pfn);
		return FAILED;
	}
	return RECOVERED;
}

/*
 * Dirty cache page page
 * Issues: when the error hit a hole page the error is not properly
 * propagated.
 */
static int me_pagecache_dirty(struct page *p, unsigned long pfn)
{
	struct address_space *mapping = page_mapping(p);

	SetPageError(p);
	/* TBD: print more information about the file. */
	if (mapping) {
		/*
		 * IO error will be reported by write(), fsync(), etc.
		 * who check the mapping.
		 * This way the application knows that something went
		 * wrong with its dirty file data.
		 *
		 * There's one open issue:
		 *
		 * The EIO will be only reported on the next IO
		 * operation and then cleared through the IO map.
		 * Normally Linux has two mechanisms to pass IO error
		 * first through the AS_EIO flag in the address space
		 * and then through the PageError flag in the page.
		 * Since we drop pages on memory failure handling the
		 * only mechanism open to use is through AS_AIO.
		 *
		 * This has the disadvantage that it gets cleared on
		 * the first operation that returns an error, while
		 * the PageError bit is more sticky and only cleared
		 * when the page is reread or dropped.  If an
		 * application assumes it will always get error on
		 * fsync, but does other operations on the fd before
		 * and the page is dropped inbetween then the error
		 * will not be properly reported.
		 *
		 * This can already happen even without hwpoisoned
		 * pages: first on metadata IO errors (which only
		 * report through AS_EIO) or when the page is dropped
		 * at the wrong time.
		 *
		 * So right now we assume that the application DTRT on
		 * the first EIO, but we're not worse than other parts
		 * of the kernel.
		 */
		mapping_set_error(mapping, EIO);
	}

	return me_pagecache_clean(p, pfn);
}

/*
 * Clean and dirty swap cache.
 *
 * Dirty swap cache page is tricky to handle. The page could live both in page
 * cache and swap cache(ie. page is freshly swapped in). So it could be
 * referenced concurrently by 2 types of PTEs:
 * normal PTEs and swap PTEs. We try to handle them consistently by calling u
 * try_to_unmap(TTU_IGNORE_HWPOISON) to convert the normal PTEs to swap PTEs,
 * and then
 *      - clear dirty bit to prevent IO
 *      - remove from LRU
 *      - but keep in the swap cache, so that when we return to it on
 *        a later page fault, we know the application is accessing
 *        corrupted data and shall be killed (we installed simple
 *        interception code in do_swap_page to catch it).
 *
 * Clean swap cache pages can be directly isolated. A later page fault will
 * bring in the known good data from disk.
 */
static int me_swapcache_dirty(struct page *p, unsigned long pfn)
{
	ClearPageDirty(p);
	/* Trigger EIO in shmem: */
	ClearPageUptodate(p);

	if (!isolate_lru_page(p))
		page_cache_release(p);

	return DELAYED;
}

static int me_swapcache_clean(struct page *p, unsigned long pfn)
{
	if (!isolate_lru_page(p))
		page_cache_release(p);

	delete_from_swap_cache(p);

	return RECOVERED;
}

/*
 * Huge pages. Needs work.
 * Issues:
 * No rmap support so we cannot find the original mapper. In theory could walk
 * all MMs and look for the mappings, but that would be non atomic and racy.
 * Need rmap for hugepages for this. Alternatively we could employ a heuristic,
 * like just walking the current process and hoping it has it mapped (that
 * should be usually true for the common "shared database cache" case)
 * Should handle free huge pages and dequeue them too, but this needs to
 * handle huge page accounting correctly.
 */
static int me_huge_page(struct page *p, unsigned long pfn)
{
	return FAILED;
}

/*
 * Various page states we can handle.
 *
 * A page state is defined by its current page->flags bits.
 * The table matches them in order and calls the right handler.
 *
 * This is quite tricky because we can access page at any time
 * in its live cycle, so all accesses have to be extremly careful.
 *
 * This is not complete. More states could be added.
 * For any missing state don't attempt recovery.
 */

#define dirty		(1UL << PG_dirty)
#define sc		(1UL << PG_swapcache)
#define unevict		(1UL << PG_unevictable)
#define mlock		(1UL << PG_mlocked)
#define writeback	(1UL << PG_writeback)
#define lru		(1UL << PG_lru)
#define swapbacked	(1UL << PG_swapbacked)
#define head		(1UL << PG_head)
#define tail		(1UL << PG_tail)
#define compound	(1UL << PG_compound)
#define slab		(1UL << PG_slab)
#define buddy		(1UL << PG_buddy)
#define reserved	(1UL << PG_reserved)

static struct page_state {
	unsigned long mask;
	unsigned long res;
	char *msg;
	int (*action)(struct page *p, unsigned long pfn);
} error_states[] = {
	{ reserved,	reserved,	"reserved kernel",	me_ignore },
	{ buddy,	buddy,		"free kernel",	me_free },

	/*
	 * Could in theory check if slab page is free or if we can drop
	 * currently unused objects without touching them. But just
	 * treat it as standard kernel for now.
	 */
	{ slab,		slab,		"kernel slab",	me_kernel },

#ifdef CONFIG_PAGEFLAGS_EXTENDED
	{ head,		head,		"huge",		me_huge_page },
	{ tail,		tail,		"huge",		me_huge_page },
#else
	{ compound,	compound,	"huge",		me_huge_page },
#endif

	{ sc|dirty,	sc|dirty,	"swapcache",	me_swapcache_dirty },
	{ sc|dirty,	sc,		"swapcache",	me_swapcache_clean },

#ifdef CONFIG_UNEVICTABLE_LRU
	{ unevict|dirty, unevict|dirty,	"unevictable LRU", me_pagecache_dirty},
	{ unevict,	unevict,	"unevictable LRU", me_pagecache_clean},
#endif

#ifdef CONFIG_HAVE_MLOCKED_PAGE_BIT
	{ mlock|dirty,	mlock|dirty,	"mlocked LRU",	me_pagecache_dirty },
	{ mlock,	mlock,		"mlocked LRU",	me_pagecache_clean },
#endif

	{ lru|dirty,	lru|dirty,	"LRU",		me_pagecache_dirty },
	{ lru|dirty,	lru,		"clean LRU",	me_pagecache_clean },
	{ swapbacked,	swapbacked,	"anonymous",	me_pagecache_clean },

	/*
	 * Catchall entry: must be at end.
	 */
	{ 0,		0,		"unknown page state",	me_unknown },
};

static void action_result(unsigned long pfn, char *msg, int result)
{
	printk(KERN_ERR "MCE %#lx: %s%s page recovery: %s\n",
		pfn, PageDirty(pfn_to_page(pfn)) ? "dirty " : "",
		msg, action_name[result]);
}

static void page_action(struct page_state *ps, struct page *p,
			unsigned long pfn)
{
	int result;

	result = ps->action(p, pfn);
	action_result(pfn, ps->msg, result);
	if (page_count(p) != 1)
		printk(KERN_ERR
		       "MCE %#lx: %s page still referenced by %d users\n",
		       pfn, ps->msg, page_count(p) - 1);

	/* Could do more checks here if page looks ok */
	atomic_long_add(1, &mce_bad_pages);

	/*
	 * Could adjust zone counters here to correct for the missing page.
	 */
}

#define N_UNMAP_TRIES 5

/*
 * Do all that is necessary to remove user space mappings. Unmap
 * the pages and send SIGBUS to the processes if the data was dirty.
 */
static void hwpoison_user_mappings(struct page *p, unsigned long pfn,
				  int trapno)
{
	enum ttu_flags ttu = TTU_UNMAP | TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS;
	int kill = sysctl_memory_failure_early_kill;
	struct address_space *mapping;
	LIST_HEAD(tokill);
	int ret;
	int i;

	if (PageReserved(p) || PageCompound(p) || PageSlab(p))
		return;

	if (!PageLRU(p))
		lru_add_drain();

	/*
	 * This check implies we don't kill processes if their pages
	 * are in the swap cache early. Those are always late kills.
	 */
	if (!page_mapped(p))
		return;

	if (PageSwapCache(p)) {
		printk(KERN_ERR
		       "MCE %#lx: keeping poisoned page in swap cache\n", pfn);
		ttu |= TTU_IGNORE_HWPOISON;
	}

	/*
	 * Propagate the dirty bit from PTEs to struct page first, because we
	 * need this to decide if we should kill or just drop the page.
	 */
	mapping = page_mapping(p);
	if (!PageDirty(p) && mapping && mapping_cap_writeback_dirty(mapping)) {
		if (page_mkclean(p))
			SetPageDirty(p);
		else {
			kill = 0;
			ttu |= TTU_IGNORE_HWPOISON;
			printk(KERN_INFO
	"MCE %#lx: corrupted page was clean: dropped without side effects\n",
				pfn);
		}
	}

	/*
	 * First collect all the processes that have the page
	 * mapped.  This has to be done before try_to_unmap,
	 * because ttu takes the rmap data structures down.
	 *
	 * This also has the side effect to propagate the dirty
	 * bit from PTEs into the struct page. This is needed
	 * to actually decide if something needs to be killed
	 * or errored, or if it's ok to just drop the page.
	 *
	 * Error handling: We ignore errors here because
	 * there's nothing that can be done.
	 */
	if (kill && p->mapping)
		collect_procs(p, &tokill);

	/*
	 * try_to_unmap can fail temporarily due to races.
	 * Try a few times (RED-PEN better strategy?)
	 */
	for (i = 0; i < N_UNMAP_TRIES; i++) {
		ret = try_to_unmap(p, ttu);
		if (ret == SWAP_SUCCESS)
			break;
		pr_debug("MCE %#lx: try_to_unmap retry needed %d\n", pfn,  ret);
	}

	if (ret != SWAP_SUCCESS)
		printk(KERN_ERR "MCE %#lx: failed to unmap page (mapcount=%d)\n",
				pfn, page_mapcount(p));

	/*
	 * Now that the dirty bit has been propagated to the
	 * struct page and all unmaps done we can decide if
	 * killing is needed or not.  Only kill when the page
	 * was dirty, otherwise the tokill list is merely
	 * freed.  When there was a problem unmapping earlier
	 * use a more force-full uncatchable kill to prevent
	 * any accesses to the poisoned memory.
	 */
	kill_procs_ao(&tokill, !!PageDirty(p), trapno,
		      ret != SWAP_SUCCESS, pfn);
}

/**
 * memory_failure - Handle memory failure of a page.
 * @pfn: Page Number of the corrupted page
 * @trapno: Trap number reported in the signal to user space.
 *
 * This function is called by the low level machine check code
 * of an architecture when it detects hardware memory corruption
 * of a page. It tries its best to recover, which includes
 * dropping pages, killing processes etc.
 *
 * The function is primarily of use for corruptions that
 * happen outside the current execution context (e.g. when
 * detected by a background scrubber)
 *
 * Must run in process context (e.g. a work queue) with interrupts
 * enabled and no spinlocks hold.
 */
void memory_failure(unsigned long pfn, int trapno)
{
	struct page_state *ps;
	struct page *p;

	if (!pfn_valid(pfn)) {
		printk(KERN_ERR
		       "MCE %#lx: memory outside kernel control: Ignored\n",
		       pfn);
		return;
	}

	p = pfn_to_page(pfn);
	if (TestSetPageHWPoison(p)) {
		action_result(pfn, "already hardware poisoned", IGNORED);
		return;
	}

	/*
	 * We need/can do nothing about count=0 pages.
	 * 1) it's a free page, and therefore in safe hand:
	 *    prep_new_page() will be the gate keeper.
	 * 2) it's part of a non-compound high order page.
	 *    Implies some kernel user: cannot stop them from
	 *    R/W the page; let's pray that the page has been
	 *    used and will be freed some time later.
	 * In fact it's dangerous to directly bump up page count from 0,
	 * that may make page_freeze_refs()/page_unfreeze_refs() mismatch.
	 */
	if (!get_page_unless_zero(compound_head(p))) {
		action_result(pfn, "free or high order kernel", IGNORED);
		return;
	}

	/*
	 * Lock the page and wait for writeback to finish.
	 * It's very difficult to mess with pages currently under IO
	 * and in many cases impossible, so we just avoid it here.
	 */
	lock_page_nosync(p);
	wait_on_page_writeback(p);

	/*
	 * Now take care of user space mappings.
	 */
	hwpoison_user_mappings(p, pfn, trapno);

	/*
	 * Torn down by someone else?
	 */
	if (PageLRU(p) && !PageSwapCache(p) && p->mapping == NULL) {
		action_result(pfn, "already truncated LRU", IGNORED);
		goto out;
	}

	for (ps = error_states;; ps++) {
		if ((p->flags & ps->mask) == ps->res) {
			page_action(ps, p, pfn);
			break;
		}
	}
out:
	unlock_page(p);
}

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 10/22] HWPOISON: check and isolate corrupted free pages v2
  2009-06-15  9:41   ` KAMEZAWA Hiroyuki
@ 2009-06-15 10:16     ` Wu Fengguang
  2009-06-15 23:52       ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 79+ messages in thread
From: Wu Fengguang @ 2009-06-15 10:16 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Andrew Morton, LKML, Andi Kleen, Ingo Molnar, Mel Gorman,
	Thomas Gleixner, H. Peter Anvin, Peter Zijlstra, Nick Piggin,
	Hugh Dickins, Andi Kleen, riel, chris.mason, linux-mm

On Mon, Jun 15, 2009 at 05:41:12PM +0800, KAMEZAWA Hiroyuki wrote:
> On Mon, 15 Jun 2009 10:45:30 +0800
> Wu Fengguang <fengguang.wu@intel.com> wrote:
> 
> > From: Wu Fengguang <fengguang.wu@intel.com>
> > 
> > If memory corruption hits the free buddy pages, we can safely ignore them.
> > No one will access them until page allocation time, then prep_new_page()
> > will automatically check and isolate PG_hwpoison page for us (for 0-order
> > allocation).
> > 
> > This patch expands prep_new_page() to check every component page in a high
> > order page allocation, in order to completely stop PG_hwpoison pages from
> > being recirculated.
> > 
> > Note that the common case -- only allocating a single page, doesn't
> > do any more work than before. Allocating > order 0 does a bit more work,
> > but that's relatively uncommon.
> > 
> > This simple implementation may drop some innocent neighbor pages, hopefully
> > it is not a big problem because the event should be rare enough.
> > 
> > This patch adds some runtime costs to high order page users.
> > 
> > [AK: Improved description]
> > 
> > v2: Andi Kleen:
> > Port to -mm code
> > Move check into separate function.
> > Don't dump stack in bad_pages for hwpoisoned pages.
> > 
> > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> > Signed-off-by: Andi Kleen <ak@linux.intel.com>
> > 
> > ---
> >  mm/page_alloc.c |   20 +++++++++++++++++++-
> >  1 file changed, 19 insertions(+), 1 deletion(-)
> > 
> > --- sound-2.6.orig/mm/page_alloc.c
> > +++ sound-2.6/mm/page_alloc.c
> > @@ -233,6 +233,12 @@ static void bad_page(struct page *page)
> >  	static unsigned long nr_shown;
> >  	static unsigned long nr_unshown;
> >  
> > +	/* Don't complain about poisoned pages */
> > +	if (PageHWPoison(page)) {
> > +		__ClearPageBuddy(page);
> > +		return;
> > +	}
> 
> Hmm ? why __ClearPageBuddy() is necessary ?

Because this page is considered to be "allocated" out of the buddy
system, even though we fail the allocation here.

The page is now owned by no one, especially not owned by the buddy
allocator.

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 00/22] HWPOISON: Intro (v5)
  2009-06-15 10:09         ` Wu Fengguang
@ 2009-06-15 10:36           ` Nick Piggin
  2009-06-15 11:41             ` Wu Fengguang
  0 siblings, 1 reply; 79+ messages in thread
From: Nick Piggin @ 2009-06-15 10:36 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Balbir Singh, Andrew Morton, LKML, Ingo Molnar, Mel Gorman,
	Thomas Gleixner, H. Peter Anvin, Peter Zijlstra, Hugh Dickins,
	Andi Kleen, riel, chris.mason, linux-mm, linux-fsdevel

On Mon, Jun 15, 2009 at 06:09:54PM +0800, Wu Fengguang wrote:
> On Mon, Jun 15, 2009 at 04:14:53PM +0800, Nick Piggin wrote:
> > On Mon, Jun 15, 2009 at 08:44:47AM +0200, Nick Piggin wrote:
> > > Did we verify with filesystem maintainers (eg. btrfs) that the
> > > !ISREG test will be enough to prevent oopses?
> > 
> > BTW. this is quite a significant change I think and not
> > really documented well enough. Previously a filesystem
> > will know exactly when and why pagecache in a mapping
> > under its control will be truncated (as opposed to
> > invalidated).
> > 
> > They even have opportunity to hold locks such as i_mutex.
> > 
> > And depending on what they do, they could do interesting
> > things even with ISREG files.
> > 
> > So, I really think this needs review by filesystem
> > maintainers and it would be far safer to use invalidate
> > until it is known to be safe.
> 
> Nick, we are doing invalidate_complete_page() for !S_ISREG inodes now.
> Do you mean to do invalidate_complete_page() for all inodes for now?

That would make me a lot happier. It is obviously correct because
that is basically what page reclaim and inode reclaim and drop
caches etc does.

Note that I still don't like exporting invalidate_complete_page
fro the same reasons I don't like exporting truncate_complete_page,
so I will ask if you can do an invalidate_inode_page function
along the same lines of the truncate_inode_page one please.


> That's a good suggestion, it shall be able to do the job for most
> pages indeed.

Yes I think it will be far far safer while only introducing
another small class of pages which cannot be recovered (probably
a much smaller set most of the time than the size of the existing
set of pages which cannot be recovered).



^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 00/22] HWPOISON: Intro (v5)
  2009-06-15 10:36           ` Nick Piggin
@ 2009-06-15 11:41             ` Wu Fengguang
  0 siblings, 0 replies; 79+ messages in thread
From: Wu Fengguang @ 2009-06-15 11:41 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Balbir Singh, Andrew Morton, LKML, Ingo Molnar, Mel Gorman,
	Thomas Gleixner, H. Peter Anvin, Peter Zijlstra, Hugh Dickins,
	Andi Kleen, riel, chris.mason, linux-mm, linux-fsdevel

On Mon, Jun 15, 2009 at 06:36:19PM +0800, Nick Piggin wrote:
> On Mon, Jun 15, 2009 at 06:09:54PM +0800, Wu Fengguang wrote:
> > On Mon, Jun 15, 2009 at 04:14:53PM +0800, Nick Piggin wrote:
> > > On Mon, Jun 15, 2009 at 08:44:47AM +0200, Nick Piggin wrote:
> > > > Did we verify with filesystem maintainers (eg. btrfs) that the
> > > > !ISREG test will be enough to prevent oopses?
> > > 
> > > BTW. this is quite a significant change I think and not
> > > really documented well enough. Previously a filesystem
> > > will know exactly when and why pagecache in a mapping
> > > under its control will be truncated (as opposed to
> > > invalidated).
> > > 
> > > They even have opportunity to hold locks such as i_mutex.
> > > 
> > > And depending on what they do, they could do interesting
> > > things even with ISREG files.
> > > 
> > > So, I really think this needs review by filesystem
> > > maintainers and it would be far safer to use invalidate
> > > until it is known to be safe.
> > 
> > Nick, we are doing invalidate_complete_page() for !S_ISREG inodes now.
> > Do you mean to do invalidate_complete_page() for all inodes for now?
> 
> That would make me a lot happier. It is obviously correct because
> that is basically what page reclaim and inode reclaim and drop
> caches etc does.
> 
> Note that I still don't like exporting invalidate_complete_page
> fro the same reasons I don't like exporting truncate_complete_page,
> so I will ask if you can do an invalidate_inode_page function
> along the same lines of the truncate_inode_page one please.

Sure. I did something radical - don't try to isolate dirty/writeback
pages, to match the exact invalidate_mapping_pages() behavior.

Let's mess with the dirty/writeback pages some time later.

+/*
+ * Clean (or cleaned) page cache page.
+ */
+static int me_pagecache_clean(struct page *p, unsigned long pfn)
+{
+       struct address_space *mapping;
+
+       if (!isolate_lru_page(p))
+               page_cache_release(p);
+
+       mapping = page_mapping(p);
+       if (mapping == NULL)
+               return RECOVERED;
+
+       /*
+        * Now remove it from page cache.
+        * Currently we only remove clean, unused page for the sake of safety.
+        */
+       if (!invalidate_inode_page(mapping, p)) {
+               printk(KERN_ERR
+                      "MCE %#lx: failed to remove from page cache\n", pfn);
+               return FAILED;
+       }
+       return RECOVERED;
+}


--- sound-2.6.orig/mm/truncate.c
+++ sound-2.6/mm/truncate.c
@@ -135,6 +135,21 @@ invalidate_complete_page(struct address_
        return ret;
 }

+/*
+ * Safely invalidate one page from its pagecache mapping.
+ * It only drops clean, unused pages. The page must be locked.
+ *
+ * Returns 1 if the page is successfully invalidated, otherwise 0.
+ */
+int invalidate_inode_page(struct address_space *mapping, struct page *page)
+{
+       if (PageDirty(page) || PageWriteback(page))
+               return 0;
+       if (page_mapped(page))
+               return 0;
+       return invalidate_complete_page(mapping, page);
+}
+
 /**
  * truncate_inode_pages - truncate range of pages specified by start & end byte offsets
  * @mapping: mapping to truncate
@@ -311,12 +326,8 @@ unsigned long invalidate_mapping_pages(s
                        if (lock_failed)
                                continue;

-                       if (PageDirty(page) || PageWriteback(page))
-                               goto unlock;
-                       if (page_mapped(page))
-                               goto unlock;
-                       ret += invalidate_complete_page(mapping, page);
-unlock:
+                       ret += invalidate_inode_page(mapping, page);
+                       
                        unlock_page(page);
                        if (next > end)
                                break;

> 
> > That's a good suggestion, it shall be able to do the job for most
> > pages indeed.
> 
> Yes I think it will be far far safer while only introducing
> another small class of pages which cannot be recovered (probably
> a much smaller set most of the time than the size of the existing
> set of pages which cannot be recovered).

Anyway, dirty pages are limited to 15% of the total memory by default. 

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 00/22] HWPOISON: Intro (v5)
  2009-06-15  7:19         ` Nick Piggin
@ 2009-06-15 12:10           ` Wu Fengguang
  2009-06-15 12:25             ` Nick Piggin
  0 siblings, 1 reply; 79+ messages in thread
From: Wu Fengguang @ 2009-06-15 12:10 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andi Kleen, Balbir Singh, Andrew Morton, LKML, Ingo Molnar,
	Mel Gorman, Thomas Gleixner, H. Peter Anvin, Peter Zijlstra,
	Hugh Dickins, riel, chris.mason, linux-mm

On Mon, Jun 15, 2009 at 03:19:07PM +0800, Nick Piggin wrote:
> On Mon, Jun 15, 2009 at 09:09:14AM +0200, Andi Kleen wrote:
> > On Mon, Jun 15, 2009 at 08:44:47AM +0200, Nick Piggin wrote:
> > > > 
> > > > So IMHO it's OK for .31 as long as we agree on the user interfaces,
> > > > ie. /proc/sys/vm/memory_failure_early_kill and the hwpoison uevent.
> > > > 
> > > > It comes a long way through numerous reviews, and I believe all the
> > > > important issues and concerns have been addressed. Nick, Rik, Hugh,
> > > > Ingo, ... what are your opinions? Is the uevent good enough to meet
> > > > your request to "die hard" or "die gracefully" or whatever on memory
> > > > failure events?
> > > 
> > > Uevent? As in, send a message to userspace? I don't think this
> > > would be ideal for a fail-stop/failover situation.
> > 
> > Agreed.
> > 
> > For failover you typically want a application level heartbeat anyways
> > to guard against user space software problems and if there's a kill then it
> > would catch it. Also again in you want to check against all corruptions you
> > have to do it in the low level handler or better watch corrected
> > events too to predict failures (but the later is quite hard to do generally). 
> > To some extent the first is already implemented on x86, e.g. set
> > the tolerance level to 0 will give more aggressive panics.
> > 
> > > I can't see a good reason to rush to merge it.
> > 
> > The low level x86 code for MCA recovery is in, just this high level
> > part is missing to kill the correct process. I think it would be good to merge 
> > a core now.  The basic code seems to be also as well tested as we can do it 
> > right now and exposing it to more users would be good. It's undoubtedly not 
> > perfect yet, but that's not a requirement for merge.
> > 
> > There's a lot of fancy stuff that could be done in addition,
> > but that's not really needed right now and for a lot of the fancy
> > ideas (I have enough on my own :) it's dubious they are actually
> > worth it.
> 
> Just my opinion. Normally it takes a lot longer for VM patches
> like this to go through, but it's not really up to me anyway.
> If Andrew or Linus has it in their head to merge it in 2.6.31,
> it's going to get merged ;) 
> 
>  
> > > IMO the userspace-visible changes have maybe not been considered
> > > too thoroughly, which is what I'd be most worried about. I probably
> > > missed seeing documentation of exact semantics and situations
> > > where admins should tune things one way or the other.
> > 
> > There's only a single tunable anyways, early kill vs late kill.
> > 
> > For KVM you need early kill, for the others it remains to be seen.
> 
> Right. It's almost like you need to do a per-process thing, and
> those that can handle things (such as the new SIGBUS or the new
> EIO) could get those, and others could be killed.

To send early SIGBUS kills to processes who has called
sigaction(SIGBUS, ...)?  KVM will sure do that. For other apps we
don't mind they can understand that signal at all.

> Early-kill for KVM does seem like reasonable justification on the
> surface, but when I think more about it, I wonder does the guest
> actually stand any better chance to correct the error if it is
> reported at time T rather than T+delta? (who knows what the page
> will be used for at any given time).

Early kill makes a lot difference for KVM.  Think about the vast
amount of clean page cache pages. With early kill the page can be
trivially isolated. With late kill the whole virtual machine dies
hard.

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 00/22] HWPOISON: Intro (v5)
  2009-06-15 12:10           ` Wu Fengguang
@ 2009-06-15 12:25             ` Nick Piggin
  2009-06-15 14:22               ` Wu Fengguang
  0 siblings, 1 reply; 79+ messages in thread
From: Nick Piggin @ 2009-06-15 12:25 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andi Kleen, Balbir Singh, Andrew Morton, LKML, Ingo Molnar,
	Mel Gorman, Thomas Gleixner, H. Peter Anvin, Peter Zijlstra,
	Hugh Dickins, riel, chris.mason, linux-mm

On Mon, Jun 15, 2009 at 08:10:01PM +0800, Wu Fengguang wrote:
> On Mon, Jun 15, 2009 at 03:19:07PM +0800, Nick Piggin wrote:
> > > For KVM you need early kill, for the others it remains to be seen.
> > 
> > Right. It's almost like you need to do a per-process thing, and
> > those that can handle things (such as the new SIGBUS or the new
> > EIO) could get those, and others could be killed.
> 
> To send early SIGBUS kills to processes who has called
> sigaction(SIGBUS, ...)?  KVM will sure do that. For other apps we
> don't mind they can understand that signal at all.

For apps that hook into SIGBUS for some other means and
do not understand the new type of SIGBUS signal? What about
those?

 
> > Early-kill for KVM does seem like reasonable justification on the
> > surface, but when I think more about it, I wonder does the guest
> > actually stand any better chance to correct the error if it is
> > reported at time T rather than T+delta? (who knows what the page
> > will be used for at any given time).
> 
> Early kill makes a lot difference for KVM.  Think about the vast
> amount of clean page cache pages. With early kill the page can be
> trivially isolated. With late kill the whole virtual machine dies
> hard.

Why? In both cases it will enter the exception handler and
attempt to do something about it... in both cases I would
have thought there is some chance that the page error is not
recoverable and some chance it is recoverable. Or am I
missing something?

Anyway, I would like to see a basic analysis of those probabilities
to justify early kill. Not saying there is no justification, but
it would be helpful to see why.

Thanks,
Nick



^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 00/22] HWPOISON: Intro (v5)
  2009-06-15  4:27   ` Wu Fengguang
  2009-06-15  6:44     ` Nick Piggin
@ 2009-06-15 12:51     ` Hugh Dickins
  2009-06-15 13:00       ` Alan Cox
  1 sibling, 1 reply; 79+ messages in thread
From: Hugh Dickins @ 2009-06-15 12:51 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Balbir Singh, Andrew Morton, LKML, Ingo Molnar, Mel Gorman,
	Thomas Gleixner, H. Peter Anvin, Peter Zijlstra, Nick Piggin,
	Andi Kleen, riel, chris.mason, linux-mm

On Mon, 15 Jun 2009, Wu Fengguang wrote:
> On Mon, Jun 15, 2009 at 11:18:18AM +0800, Balbir Singh wrote:
> > Wu Fengguang wrote:
> > > 
> > > I hope we can reach consensus in this round and then be able to post
> > > a final version for .31 inclusion.
> > 
> > Isn't that too aggressive? .31 is already in the merge window.
> 
> Yes, a bit aggressive. This is a new feature that involves complex logics.
> However it is basically a no-op when there are no memory errors,
> and when memory corruption does occur, it's better to (possibly) panic
> in this code than to panic unconditionally in the absence of this
> feature (as said by Rik).
> 
> So IMHO it's OK for .31 as long as we agree on the user interfaces,
> ie. /proc/sys/vm/memory_failure_early_kill and the hwpoison uevent.
> 
> It comes a long way through numerous reviews, and I believe all the
> important issues and concerns have been addressed. Nick, Rik, Hugh,
> Ingo, ... what are your opinions?

And for how long has this work been in linux-next or mmotm?

My opinion is that it's way too late for .31 - your only chance
is that Linus sometimes gets bored with playing safe, and decides
to break his rules and try something out regardless - but I'd hope
the bootmem business already sated his appetite for danger this time.

Hugh

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 00/22] HWPOISON: Intro (v5)
  2009-06-15 12:51     ` Hugh Dickins
@ 2009-06-15 13:00       ` Alan Cox
  2009-06-15 13:29         ` Andi Kleen
  0 siblings, 1 reply; 79+ messages in thread
From: Alan Cox @ 2009-06-15 13:00 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Wu Fengguang, Balbir Singh, Andrew Morton, LKML, Ingo Molnar,
	Mel Gorman, Thomas Gleixner, H. Peter Anvin, Peter Zijlstra,
	Nick Piggin, Andi Kleen, riel, chris.mason, linux-mm

> My opinion is that it's way too late for .31 - your only chance
> is that Linus sometimes gets bored with playing safe, and decides
> to break his rules and try something out regardless - but I'd hope
> the bootmem business already sated his appetite for danger this time.

I see no consensus on it being worth merging, no testing, no upstream
integration shakedown, no builds on non-x86 boxes, no work with other
arch maintainers who have similar abilities and needs.

It belongs in next for a release or two while people work on it, while
things like PPC64 can plumb into it if they wish and while people work
out if its actually useful given that for most users it reduces the
reliability of the services they are providing rather than improving it.


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 09/22] HWPOISON: Handle hardware poisoned pages in  try_to_unmap
  2009-06-15  2:45 ` [PATCH 09/22] HWPOISON: Handle hardware poisoned pages in try_to_unmap Wu Fengguang
@ 2009-06-15 13:09   ` Minchan Kim
  2009-06-15 15:26     ` Wu Fengguang
  0 siblings, 1 reply; 79+ messages in thread
From: Minchan Kim @ 2009-06-15 13:09 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, LKML, Andi Kleen, Ingo Molnar, Mel Gorman,
	Thomas Gleixner, H. Peter Anvin, Peter Zijlstra, Nick Piggin,
	Hugh Dickins, Andi Kleen, riel, chris.mason, linux-mm

On Mon, Jun 15, 2009 at 11:45 AM, Wu Fengguang<fengguang.wu@intel.com> wrote:
> From: Andi Kleen <ak@linux.intel.com>
>
> When a page has the poison bit set replace the PTE with a poison entry.
> This causes the right error handling to be done later when a process runs
> into it.
>
> Also add a new flag to not do that (needed for the memory-failure handler
> later)
>
> Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>
> Signed-off-by: Andi Kleen <ak@linux.intel.com>
>
> ---
>  include/linux/rmap.h |    1 +
>  mm/rmap.c            |    9 ++++++++-
>  2 files changed, 9 insertions(+), 1 deletion(-)
>
> --- sound-2.6.orig/mm/rmap.c
> +++ sound-2.6/mm/rmap.c
> @@ -958,7 +958,14 @@ static int try_to_unmap_one(struct page
>        /* Update high watermark before we lower rss */
>        update_hiwater_rss(mm);
>
> -       if (PageAnon(page)) {
> +       if (PageHWPoison(page) && !(flags & TTU_IGNORE_HWPOISON)) {
> +               if (PageAnon(page))
> +                       dec_mm_counter(mm, anon_rss);
> +               else if (!is_migration_entry(pte_to_swp_entry(*pte)))

Isn't it straightforward to use !is_hwpoison_entry ?


-- 
Kinds regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 00/22] HWPOISON: Intro (v5)
  2009-06-15 13:29         ` Andi Kleen
@ 2009-06-15 13:28           ` H. Peter Anvin
  2009-06-15 14:48           ` Alan Cox
  2009-06-16 19:44           ` Russ Anderson
  2 siblings, 0 replies; 79+ messages in thread
From: H. Peter Anvin @ 2009-06-15 13:28 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Alan Cox, Hugh Dickins, Wu Fengguang, Balbir Singh,
	Andrew Morton, LKML, Ingo Molnar, Mel Gorman, Thomas Gleixner,
	Peter Zijlstra, Nick Piggin, riel, chris.mason, linux-mm

Andi Kleen wrote:
> 
> That said the real tests of this can be only done with test suites
> really, these errors tend to not happen quickly. 
> 

What's far more important, quite frankly, is harmlessness regression
testing.

	-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 00/22] HWPOISON: Intro (v5)
  2009-06-15 13:00       ` Alan Cox
@ 2009-06-15 13:29         ` Andi Kleen
  2009-06-15 13:28           ` H. Peter Anvin
                             ` (2 more replies)
  0 siblings, 3 replies; 79+ messages in thread
From: Andi Kleen @ 2009-06-15 13:29 UTC (permalink / raw)
  To: Alan Cox
  Cc: Hugh Dickins, Wu Fengguang, Balbir Singh, Andrew Morton, LKML,
	Ingo Molnar, Mel Gorman, Thomas Gleixner, H. Peter Anvin,
	Peter Zijlstra, Nick Piggin, Andi Kleen, riel, chris.mason,
	linux-mm


I think you're wrong about killing processes decreasing
reliability. Traditionally we always tried to keep things running if possible
instead of panicing. That is why ext3 or block does not default to panic
on each IO error for example. Or oops does not panic by default like
on BSDs. Your argumentation would be good for a traditional early Unix
which likes to panic instead of handling errors, but that's not the
Linux way as I know it.

Also BTW in many cases (e.g. a lot of file cache pages) there 
is actually no process kill involved; just a reload of the page from
disk.

Then for example in a cluster you typically have a application level
heartbeat on the application and restarting the app is faster
if you don't need to reboot the box too. If you don't have a cluster
with failover then gracefull degradation is the best. In general
panic is a very nasty failure mode and should be avoided.

That said you can configure it anyways to panic if you want,
but it would be a very bad default.

See also Linus' or hpa's statement on the topic.

> no testing, 

There's an extensive test suite in mce-test.git 

We did a lot of testing with these separate test suites and also
some other tests. For much more it needs actual users pounding on it, and that 
can be only adequately done in mainline.

That said the real tests of this can be only done with test suites
really, these errors tend to not happen quickly. 

> integration shakedown, no builds on non-x86 boxes, no work with other
> arch maintainers who have similar abilities and needs.

We did build tests on ia64 and power and it was reviewed by Tony for IA64.
The ia64 specific code is not quite ready yet, but will come at some point.

I don't think it's a requirement for merging to have PPC64 support.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 00/22] HWPOISON: Intro (v5)
  2009-06-15 12:25             ` Nick Piggin
@ 2009-06-15 14:22               ` Wu Fengguang
  2009-06-17  6:37                 ` [RFC][PATCH] HWPOISON: only early kill processes who installed SIGBUS handler Wu Fengguang
  0 siblings, 1 reply; 79+ messages in thread
From: Wu Fengguang @ 2009-06-15 14:22 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andi Kleen, Balbir Singh, Andrew Morton, LKML, Ingo Molnar,
	Mel Gorman, Thomas Gleixner, H. Peter Anvin, Peter Zijlstra,
	Hugh Dickins, riel, chris.mason, linux-mm

On Mon, Jun 15, 2009 at 08:25:28PM +0800, Nick Piggin wrote:
> On Mon, Jun 15, 2009 at 08:10:01PM +0800, Wu Fengguang wrote:
> > On Mon, Jun 15, 2009 at 03:19:07PM +0800, Nick Piggin wrote:
> > > > For KVM you need early kill, for the others it remains to be seen.
> > > 
> > > Right. It's almost like you need to do a per-process thing, and
> > > those that can handle things (such as the new SIGBUS or the new
> > > EIO) could get those, and others could be killed.
> > 
> > To send early SIGBUS kills to processes who has called
> > sigaction(SIGBUS, ...)?  KVM will sure do that. For other apps we
> > don't mind they can understand that signal at all.
> 
> For apps that hook into SIGBUS for some other means and

Yes I was referring to the sigaction(SIGBUS) apps, others will
be late killed anyway.

> do not understand the new type of SIGBUS signal? What about
> those?

We introduced two new SIGBUS codes:
        BUS_MCEERR_AO=5         for early kill
        BUS_MCEERR_AR=4         for late  kill
I'd assume a legacy application will handle them in the same way (both
are unexpected code to the application).

We don't care whether the application can be killed by BUS_MCEERR_AO
or BUS_MCEERR_AR depending on its SIGBUS handler implementation.
But (in the rare case) if the handler
- refused to die on BUS_MCEERR_AR, it may create a busy loop and
  flooding of SIGBUS signals, which is a bug of the application.
  BUS_MCEERR_AO is one time and won't lead to busy loops.
- does something that hurts itself (ie. data safety) on BUS_MCEERR_AO,
  it may well hurt the same way on BUS_MCEERR_AR. The latter one is
  unavoidable, so the application must be fixed anyway.

>  
> > > Early-kill for KVM does seem like reasonable justification on the
> > > surface, but when I think more about it, I wonder does the guest
> > > actually stand any better chance to correct the error if it is
> > > reported at time T rather than T+delta? (who knows what the page
> > > will be used for at any given time).
> > 
> > Early kill makes a lot difference for KVM.  Think about the vast
> > amount of clean page cache pages. With early kill the page can be
> > trivially isolated. With late kill the whole virtual machine dies
> > hard.
> 
> Why? In both cases it will enter the exception handler and
> attempt to do something about it... in both cases I would
> have thought there is some chance that the page error is not
> recoverable and some chance it is recoverable. Or am I
> missing something?

The early kill / late kill to KVM from the POV of host kernel matches
the MCE AO/AR events inside the KVM guest kernel. The key difference
between AO/AR is, whether the page is _being_ consumed.

It's a lot harder (if possible) to try to stop an active consumer.
For example, the clean cache pages can be consumed in many ways:
- be accessed by read()/write() or mapped read/write
- be reclaimed and then allocated for whatever new usage, for example,
  be zeroed by __GFP_ZERO, or be insert into another file and start
  read/write IO and be accessed by disk driver via DMA, or even be
  allocated for kernel slabs..
Frankly speaking I don't know how to stop all the above consumers.
We now simply die on AR events.

> Anyway, I would like to see a basic analysis of those probabilities
> to justify early kill. Not saying there is no justification, but
> it would be helpful to see why.

That's fine. I'd be glad if the above explanation paves way to
solutions for AR events :)

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 00/22] HWPOISON: Intro (v5)
  2009-06-15 13:29         ` Andi Kleen
  2009-06-15 13:28           ` H. Peter Anvin
@ 2009-06-15 14:48           ` Alan Cox
  2009-06-15 15:24             ` Andi Kleen
  2009-06-16 19:44           ` Russ Anderson
  2 siblings, 1 reply; 79+ messages in thread
From: Alan Cox @ 2009-06-15 14:48 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Hugh Dickins, Wu Fengguang, Balbir Singh, Andrew Morton, LKML,
	Ingo Molnar, Mel Gorman, Thomas Gleixner, H. Peter Anvin,
	Peter Zijlstra, Nick Piggin, Andi Kleen, riel, chris.mason,
	linux-mm

On Mon, 15 Jun 2009 15:29:34 +0200
Andi Kleen <andi@firstfloor.org> wrote:

> 
> I think you're wrong about killing processes decreasing
> reliability. Traditionally we always tried to keep things running if possible
> instead of panicing. That is why ext3 or block does not default to panic
> on each IO error for example. Or oops does not panic by default like
> on BSDs. Your argumentation would be good for a traditional early Unix
> which likes to panic instead of handling errors, but that's not the
> Linux way as I know it.

Everyone I knew in the business end of deploying Linux turned on panics
for I/O errors, reboot on panic and all the rest of those.

Why ? because they don't want a system where the web server is running
but not logging transactions, or to find out the database is up but that
some other "must not fail" layer killed or stalled the backup server for
it last week ...

The I/O ones can really blow up on you in a reliable environment because
often the process still exists but isn't working so fools much of the
monitoring software.

> That said you can configure it anyways to panic if you want,
> but it would be a very bad default.

That depends for whom

> See also Linus' or hpa's statement on the topic.

Linus doesn't run big server systems. Its a really bad default for
developers. Its probably a bad default for desktop users.

> We did a lot of testing with these separate test suites and also
> some other tests. For much more it needs actual users pounding on it, and that 
> can be only adequately done in mainline.

Thats why we have -next and -mm

> We did build tests on ia64 and power and it was reviewed by Tony for IA64.
> The ia64 specific code is not quite ready yet, but will come at some point.
> 
> I don't think it's a requirement for merging to have PPC64 support.

Really - so if your design is wrong for the way PPC wants to work what
are we going to do ? It's not a requirement that PPC64 support is there
but it is most certainly a requirement that its been in -next a while and
other arch maintainers have at least had time to say "works for me",
"irrelevant to my platform" or "Arghhh noooo.. ECC errors work like
[this] so we need ..."

I'd guess that zSeries has some rather different views on how ECC
failures propogate through the hypervisors for example, including the
fact that a failed page can be unfailed which you don't seem to allow for.

(You can unfail pages on x86 as well it appears by scrubbing them via DMA
- yes ?)


Alan

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 00/22] HWPOISON: Intro (v5)
  2009-06-15 14:48           ` Alan Cox
@ 2009-06-15 15:24             ` Andi Kleen
  2009-06-15 15:28               ` Alan Cox
  0 siblings, 1 reply; 79+ messages in thread
From: Andi Kleen @ 2009-06-15 15:24 UTC (permalink / raw)
  To: Alan Cox
  Cc: Andi Kleen, Hugh Dickins, Wu Fengguang, Balbir Singh,
	Andrew Morton, LKML, Ingo Molnar, Mel Gorman, Thomas Gleixner,
	H. Peter Anvin, Peter Zijlstra, Nick Piggin, riel, chris.mason,
	linux-mm

> Everyone I knew in the business end of deploying Linux turned on panics
> for I/O errors, reboot on panic and all the rest of those.

oops=panic already implies panic on all machine check exceptions, so they will
be fine then (assuming this is the best strategy for availability 
for them, which I personally find quite doubtful, but we can discuss this some 
other time)

> Really - so if your design is wrong for the way PPC wants to work what
> are we going to do ? It's not a requirement that PPC64 support is there

Then we change the code. Or if it's too difficult don't support their stuff.
After all it's not cast in stone. That said I doubt the PPC requirements will 
be much different than what we have.

> I'd guess that zSeries has some rather different views on how ECC
> failures propogate through the hypervisors for example, including the
> fact that a failed page can be unfailed which you don't seem to allow for.

That's correct.

That's because unpoisioning is quite hard -- you need some kind
of synchronization point for all the error handling and that's
the poisoned page and if it unposions itself then you need
some very heavy weight synchronization to avoid handling errors
multiple time. I looked at it, but it's quite messy.

Also it's of somewhat dubious value.

> 
> (You can unfail pages on x86 as well it appears by scrubbing them via DMA
> - yes ?)

Not architectually. Also the other problem is not just unpoisoning them,
but finding out if the page is permenantly bad or just temporarily.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 09/22] HWPOISON: Handle hardware poisoned pages in try_to_unmap
  2009-06-15 13:09   ` Minchan Kim
@ 2009-06-15 15:26     ` Wu Fengguang
  2009-06-16  0:03       ` Minchan Kim
  0 siblings, 1 reply; 79+ messages in thread
From: Wu Fengguang @ 2009-06-15 15:26 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, LKML, Andi Kleen, Ingo Molnar, Mel Gorman,
	Thomas Gleixner, H. Peter Anvin, Peter Zijlstra, Nick Piggin,
	Hugh Dickins, Andi Kleen, riel, chris.mason, linux-mm

On Mon, Jun 15, 2009 at 09:09:03PM +0800, Minchan Kim wrote:
> On Mon, Jun 15, 2009 at 11:45 AM, Wu Fengguang<fengguang.wu@intel.com> wrote:
> > From: Andi Kleen <ak@linux.intel.com>
> >
> > When a page has the poison bit set replace the PTE with a poison entry.
> > This causes the right error handling to be done later when a process runs
> > into it.
> >
> > Also add a new flag to not do that (needed for the memory-failure handler
> > later)
> >
> > Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>
> > Signed-off-by: Andi Kleen <ak@linux.intel.com>
> >
> > ---
> >  include/linux/rmap.h |    1 +
> >  mm/rmap.c            |    9 ++++++++-
> >  2 files changed, 9 insertions(+), 1 deletion(-)
> >
> > --- sound-2.6.orig/mm/rmap.c
> > +++ sound-2.6/mm/rmap.c
> > @@ -958,7 +958,14 @@ static int try_to_unmap_one(struct page
> >        /* Update high watermark before we lower rss */
> >        update_hiwater_rss(mm);
> >
> > -       if (PageAnon(page)) {
> > +       if (PageHWPoison(page) && !(flags & TTU_IGNORE_HWPOISON)) {
> > +               if (PageAnon(page))
> > +                       dec_mm_counter(mm, anon_rss);
> > +               else if (!is_migration_entry(pte_to_swp_entry(*pte)))
> 
> Isn't it straightforward to use !is_hwpoison_entry ?

Good catch!  It looks like a redundant check: the
page_check_address() at the beginning of the function guarantees that 
!is_migration_entry() or !is_migration_entry() tests will all be TRUE.

So let's do this?

-               else if (!is_migration_entry(pte_to_swp_entry(*pte)))
+               else


Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 00/22] HWPOISON: Intro (v5)
  2009-06-15 15:24             ` Andi Kleen
@ 2009-06-15 15:28               ` Alan Cox
  2009-06-15 16:19                 ` Andi Kleen
  0 siblings, 1 reply; 79+ messages in thread
From: Alan Cox @ 2009-06-15 15:28 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Andi Kleen, Hugh Dickins, Wu Fengguang, Balbir Singh,
	Andrew Morton, LKML, Ingo Molnar, Mel Gorman, Thomas Gleixner,
	H. Peter Anvin, Peter Zijlstra, Nick Piggin, riel, chris.mason,
	linux-mm

> oops=panic already implies panic on all machine check exceptions, so they will
> be fine then (assuming this is the best strategy for availability 
> for them, which I personally find quite doubtful, but we can discuss this some 
> other time)

You can have the argument with all the people who deploy large systems.
Providing their boxes can be persuaded to panic they don't care about the
masses.

> That's because unpoisioning is quite hard -- you need some kind
> of synchronization point for all the error handling and that's
> the poisoned page and if it unposions itself then you need
> some very heavy weight synchronization to avoid handling errors
> multiple time. I looked at it, but it's quite messy.
> 
> Also it's of somewhat dubious value.

On a system running under a hypervisor or with hot swappable memory its
of rather higher value. In the hypervisor case the guest system can
acquire a new virtual page to replace the faulty one. In fact the
hypervisor case is even more complex as the guest may get migrated at
which point knowledge of "poisoned" memory is not at all connected to
information on hardware failings.

> > 
> > (You can unfail pages on x86 as well it appears by scrubbing them via DMA
> > - yes ?)
> 
> Not architectually. Also the other problem is not just unpoisoning them,
> but finding out if the page is permenantly bad or just temporarily.

Small detail you are overlooking: Hot swap mirrorable memory.

Second detail you are overlooking

	curse a lot
	suspend to disk
	remove dirt from fans, clean/replace RAM
	resume from disk

The very act of making the ECC error not take out the box creates the
environment whereby the underlying hardware error (if there was one) can
be cured.

In all these cases the fact you've got to shoot stuff because a page has
been lost becomes totally disconnected from the idea that the page is
somehow not recoverable and "contaminated" forever.

Which to me says your model is wrong.

Alan

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 00/22] HWPOISON: Intro (v5)
  2009-06-15 15:28               ` Alan Cox
@ 2009-06-15 16:19                 ` Andi Kleen
  2009-06-15 16:28                   ` Alan Cox
  0 siblings, 1 reply; 79+ messages in thread
From: Andi Kleen @ 2009-06-15 16:19 UTC (permalink / raw)
  To: Alan Cox
  Cc: Andi Kleen, Hugh Dickins, Wu Fengguang, Balbir Singh,
	Andrew Morton, LKML, Ingo Molnar, Mel Gorman, Thomas Gleixner,
	H. Peter Anvin, Peter Zijlstra, Nick Piggin, riel, chris.mason,
	linux-mm

On Mon, Jun 15, 2009 at 04:28:04PM +0100, Alan "zSeries" Cox wrote:

> 	curse a lot
> 	suspend to disk
> 	remove dirt from fans, clean/replace RAM
> 	resume from disk
> 
> The very act of making the ECC error not take out the box creates the

Ok so at least you agree now that handling these errors without 
panic is the right thing to do. That's at least some progress.

> environment whereby the underlying hardware error (if there was one) can
> be cured.

These ECC errors are still somewhat rare (or rather if they become
common you should definitely service the system). That is why
losing a single page of memory for them isn't a big issue normally.

Sure you could spend effort making unpoisioning work,
but it would seem very dubious to me. After all it's just another
4K of memory for each error. 

The only reasonably good use case I heard for unpoisoning was
if you have a lot of huge pages (you can't use a full huge page with one bad 
small page), but that's also still relatively exotic.

-Andi

[1] mostly you need a new special form of RCU I think

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 00/22] HWPOISON: Intro (v5)
  2009-06-15 16:19                 ` Andi Kleen
@ 2009-06-15 16:28                   ` Alan Cox
  2009-06-15 17:07                     ` Andi Kleen
  0 siblings, 1 reply; 79+ messages in thread
From: Alan Cox @ 2009-06-15 16:28 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Andi Kleen, Hugh Dickins, Wu Fengguang, Balbir Singh,
	Andrew Morton, LKML, Ingo Molnar, Mel Gorman, Thomas Gleixner,
	H. Peter Anvin, Peter Zijlstra, Nick Piggin, riel, chris.mason,
	linux-mm

> > 	curse a lot
> > 	suspend to disk
> > 	remove dirt from fans, clean/replace RAM
> > 	resume from disk
> > 
> > The very act of making the ECC error not take out the box creates the
> 
> Ok so at least you agree now that handling these errors without 
> panic is the right thing to do. That's at least some progress.

There are some situations it may be useful - possibly. But then if you
can't sort the resulting mess out because your patches are too limited
its not useful yet is it.

Even then this isn't 2.6.31 stuff - its 2.6.31-mm or -next stuff maybe
for 2.6.32

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 00/22] HWPOISON: Intro (v5)
  2009-06-15 16:28                   ` Alan Cox
@ 2009-06-15 17:07                     ` Andi Kleen
  0 siblings, 0 replies; 79+ messages in thread
From: Andi Kleen @ 2009-06-15 17:07 UTC (permalink / raw)
  To: Alan Cox
  Cc: Andi Kleen, Hugh Dickins, Wu Fengguang, Balbir Singh,
	Andrew Morton, LKML, Ingo Molnar, Mel Gorman, Thomas Gleixner,
	H. Peter Anvin, Peter Zijlstra, Nick Piggin, riel, chris.mason,
	linux-mm

> But then if you
> can't sort the resulting mess out because your patches are too limited
> its not useful yet is it.

With "too limited" you refer to unpoisioning?

Again very slowly:

- If you have a lot of errors you die eventually anyways.
- If you have a very low rate of errors (which is the normal case) you don't 
need unpoisioning because the memory lost for each error is miniscule.
- In the case of a hypervisor it's actually not memory lost, but only
guest physical address space, which is plenty on a 64bit system. You can 
eventually replace it by readding memory to a guest, but that's unlikely
to be needed.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 10/22] HWPOISON: check and isolate corrupted free pages v2
  2009-06-15 10:16     ` Wu Fengguang
@ 2009-06-15 23:52       ` KAMEZAWA Hiroyuki
  2009-06-16  0:34         ` Wu Fengguang
  0 siblings, 1 reply; 79+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-06-15 23:52 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, LKML, Andi Kleen, Ingo Molnar, Mel Gorman,
	Thomas Gleixner, H. Peter Anvin, Peter Zijlstra, Nick Piggin,
	Hugh Dickins, Andi Kleen, riel, chris.mason, linux-mm

On Mon, 15 Jun 2009 18:16:20 +0800
Wu Fengguang <fengguang.wu@intel.com> wrote:

> On Mon, Jun 15, 2009 at 05:41:12PM +0800, KAMEZAWA Hiroyuki wrote:
> > On Mon, 15 Jun 2009 10:45:30 +0800
> > Wu Fengguang <fengguang.wu@intel.com> wrote:
> > 
> > > From: Wu Fengguang <fengguang.wu@intel.com>
> > > 
> > > If memory corruption hits the free buddy pages, we can safely ignore them.
> > > No one will access them until page allocation time, then prep_new_page()
> > > will automatically check and isolate PG_hwpoison page for us (for 0-order
> > > allocation).
> > > 
> > > This patch expands prep_new_page() to check every component page in a high
> > > order page allocation, in order to completely stop PG_hwpoison pages from
> > > being recirculated.
> > > 
> > > Note that the common case -- only allocating a single page, doesn't
> > > do any more work than before. Allocating > order 0 does a bit more work,
> > > but that's relatively uncommon.
> > > 
> > > This simple implementation may drop some innocent neighbor pages, hopefully
> > > it is not a big problem because the event should be rare enough.
> > > 
> > > This patch adds some runtime costs to high order page users.
> > > 
> > > [AK: Improved description]
> > > 
> > > v2: Andi Kleen:
> > > Port to -mm code
> > > Move check into separate function.
> > > Don't dump stack in bad_pages for hwpoisoned pages.
> > > 
> > > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> > > Signed-off-by: Andi Kleen <ak@linux.intel.com>
> > > 
> > > ---
> > >  mm/page_alloc.c |   20 +++++++++++++++++++-
> > >  1 file changed, 19 insertions(+), 1 deletion(-)
> > > 
> > > --- sound-2.6.orig/mm/page_alloc.c
> > > +++ sound-2.6/mm/page_alloc.c
> > > @@ -233,6 +233,12 @@ static void bad_page(struct page *page)
> > >  	static unsigned long nr_shown;
> > >  	static unsigned long nr_unshown;
> > >  
> > > +	/* Don't complain about poisoned pages */
> > > +	if (PageHWPoison(page)) {
> > > +		__ClearPageBuddy(page);
> > > +		return;
> > > +	}
> > 
> > Hmm ? why __ClearPageBuddy() is necessary ?
> 
> Because this page is considered to be "allocated" out of the buddy
> system, even though we fail the allocation here.
> 
> The page is now owned by no one, especially not owned by the buddy
> allocator.
> 
I just wonder "why __ClearPageBuddy() is necessary."

When bad_page() is called,  a page is removed from buddy allocator and no
PG_buddy flag at all....I'm sorry if you added bad_page() caller in buddy allocator.

Buddy Allocator I call here is just 2 functions.
 - __free_one_page()
 - expand()


Bye,
-Kame



> Thanks,
> Fengguang
> 


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 09/22] HWPOISON: Handle hardware poisoned pages in try_to_unmap
  2009-06-15 15:26     ` Wu Fengguang
@ 2009-06-16  0:03       ` Minchan Kim
  2009-06-16 13:49         ` Wu Fengguang
  0 siblings, 1 reply; 79+ messages in thread
From: Minchan Kim @ 2009-06-16  0:03 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Minchan Kim, Andrew Morton, LKML, Andi Kleen, Ingo Molnar,
	Mel Gorman, Thomas Gleixner, H. Peter Anvin, Peter Zijlstra,
	Nick Piggin, Hugh Dickins, Andi Kleen, riel, chris.mason,
	linux-mm

On Mon, 15 Jun 2009 23:26:12 +0800
Wu Fengguang <fengguang.wu@intel.com> wrote:

> On Mon, Jun 15, 2009 at 09:09:03PM +0800, Minchan Kim wrote:
> > On Mon, Jun 15, 2009 at 11:45 AM, Wu Fengguang<fengguang.wu@intel.com> wrote:
> > > From: Andi Kleen <ak@linux.intel.com>
> > >
> > > When a page has the poison bit set replace the PTE with a poison entry.
> > > This causes the right error handling to be done later when a process runs
> > > into it.
> > >
> > > Also add a new flag to not do that (needed for the memory-failure handler
> > > later)
> > >
> > > Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>
> > > Signed-off-by: Andi Kleen <ak@linux.intel.com>
> > >
> > > ---
> > >  include/linux/rmap.h |    1 +
> > >  mm/rmap.c            |    9 ++++++++-
> > >  2 files changed, 9 insertions(+), 1 deletion(-)
> > >
> > > --- sound-2.6.orig/mm/rmap.c
> > > +++ sound-2.6/mm/rmap.c
> > > @@ -958,7 +958,14 @@ static int try_to_unmap_one(struct page
> > >        /* Update high watermark before we lower rss */
> > >        update_hiwater_rss(mm);
> > >
> > > -       if (PageAnon(page)) {
> > > +       if (PageHWPoison(page) && !(flags & TTU_IGNORE_HWPOISON)) {
> > > +               if (PageAnon(page))
> > > +                       dec_mm_counter(mm, anon_rss);
> > > +               else if (!is_migration_entry(pte_to_swp_entry(*pte)))
> > 
> > Isn't it straightforward to use !is_hwpoison_entry ?
> 
> Good catch!  It looks like a redundant check: the
> page_check_address() at the beginning of the function guarantees that 
> !is_migration_entry() or !is_migration_entry() tests will all be TRUE.
> So let's do this?
It seems you expand my sight :)

I don't know migration well.
How page_check_address guarantee it's not migration entry ? 

In addtion, If the page is poison while we are going to migration((PAGE_MIGRATION && migration) == TRUE), we should decrease file_rss ?

> 
> -               else if (!is_migration_entry(pte_to_swp_entry(*pte)))
> +               else
> 
> 
> Thanks,
> Fengguang


-- 
Kinds Regards
Minchan Kim

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 10/22] HWPOISON: check and isolate corrupted free pages v2
  2009-06-15 23:52       ` KAMEZAWA Hiroyuki
@ 2009-06-16  0:34         ` Wu Fengguang
  2009-06-16 11:29           ` Hugh Dickins
  0 siblings, 1 reply; 79+ messages in thread
From: Wu Fengguang @ 2009-06-16  0:34 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Andrew Morton, LKML, Andi Kleen, Ingo Molnar, Mel Gorman,
	Thomas Gleixner, H. Peter Anvin, Peter Zijlstra, Nick Piggin,
	Hugh Dickins, Andi Kleen, riel, chris.mason, linux-mm

On Tue, Jun 16, 2009 at 07:52:22AM +0800, KAMEZAWA Hiroyuki wrote:
> On Mon, 15 Jun 2009 18:16:20 +0800
> Wu Fengguang <fengguang.wu@intel.com> wrote:
> 
> > On Mon, Jun 15, 2009 at 05:41:12PM +0800, KAMEZAWA Hiroyuki wrote:
> > > On Mon, 15 Jun 2009 10:45:30 +0800
> > > Wu Fengguang <fengguang.wu@intel.com> wrote:
> > > 
> > > > From: Wu Fengguang <fengguang.wu@intel.com>
> > > > 
> > > > If memory corruption hits the free buddy pages, we can safely ignore them.
> > > > No one will access them until page allocation time, then prep_new_page()
> > > > will automatically check and isolate PG_hwpoison page for us (for 0-order
> > > > allocation).
> > > > 
> > > > This patch expands prep_new_page() to check every component page in a high
> > > > order page allocation, in order to completely stop PG_hwpoison pages from
> > > > being recirculated.
> > > > 
> > > > Note that the common case -- only allocating a single page, doesn't
> > > > do any more work than before. Allocating > order 0 does a bit more work,
> > > > but that's relatively uncommon.
> > > > 
> > > > This simple implementation may drop some innocent neighbor pages, hopefully
> > > > it is not a big problem because the event should be rare enough.
> > > > 
> > > > This patch adds some runtime costs to high order page users.
> > > > 
> > > > [AK: Improved description]
> > > > 
> > > > v2: Andi Kleen:
> > > > Port to -mm code
> > > > Move check into separate function.
> > > > Don't dump stack in bad_pages for hwpoisoned pages.
> > > > 
> > > > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> > > > Signed-off-by: Andi Kleen <ak@linux.intel.com>
> > > > 
> > > > ---
> > > >  mm/page_alloc.c |   20 +++++++++++++++++++-
> > > >  1 file changed, 19 insertions(+), 1 deletion(-)
> > > > 
> > > > --- sound-2.6.orig/mm/page_alloc.c
> > > > +++ sound-2.6/mm/page_alloc.c
> > > > @@ -233,6 +233,12 @@ static void bad_page(struct page *page)
> > > >  	static unsigned long nr_shown;
> > > >  	static unsigned long nr_unshown;
> > > >  
> > > > +	/* Don't complain about poisoned pages */
> > > > +	if (PageHWPoison(page)) {
> > > > +		__ClearPageBuddy(page);
> > > > +		return;
> > > > +	}
> > > 
> > > Hmm ? why __ClearPageBuddy() is necessary ?
> > 
> > Because this page is considered to be "allocated" out of the buddy
> > system, even though we fail the allocation here.
> > 
> > The page is now owned by no one, especially not owned by the buddy
> > allocator.
> > 
> I just wonder "why __ClearPageBuddy() is necessary."
> 
> When bad_page() is called,  a page is removed from buddy allocator and no
> PG_buddy flag at all....I'm sorry if you added bad_page() caller in buddy allocator.

You are right. But I didn't add bad_page() callers either :)

> Buddy Allocator I call here is just 2 functions.
>  - __free_one_page()
>  - expand()

Right.  Then the original __ClearPageBuddy() call in bad_page() is
questionable, I guess this line was there just for the sake of safety
(ie. the buddy allocator itself goes wrong):

sound-2.6/mm/page_alloc.c

        @@ -269,7 +269,6 @@ static void bad_page(struct page *page)
                dump_stack();
         out:
                /* Leave bad fields for debug, except PageBuddy could make trouble */
===>            __ClearPageBuddy(page);
                add_taint(TAINT_BAD_PAGE);
         }


Thanks,
Fengguang


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 21/22] HWPOISON: send uevent to report memory corruption
  2009-06-15  2:45 ` [PATCH 21/22] HWPOISON: send uevent to report " Wu Fengguang
  2009-06-15  6:29   ` Andi Kleen
@ 2009-06-16  0:35   ` Greg KH
  1 sibling, 0 replies; 79+ messages in thread
From: Greg KH @ 2009-06-16  0:35 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, LKML, Ingo Molnar, Mel Gorman, Thomas Gleixner,
	H. Peter Anvin, Peter Zijlstra, Nick Piggin, Hugh Dickins,
	Andi Kleen, riel, chris.mason, linux-mm

On Mon, Jun 15, 2009 at 10:45:41AM +0800, Wu Fengguang wrote:
> +static void hwpoison_release(struct kobject *kobj)
> +{
> +}
> +
> +static struct kobj_type hwpoison_ktype = {
> +	.release = hwpoison_release,
> +};

{sigh}

Why did you include an empty release function?  Was it because the
kernel complained when you had no release function?  So, why would you
think that the acceptable solution to that warning would be an empty
release function instead?

Hint, this is totally wrong, provide a release function that ACTUALLY
DOES SOMETHING!!!  Read the kobject documentation for details as to what
you need to do here.

ugh, I'm so tired of repeating this year after year after year, I feel
like a broken record...

This is broken, please fix it.

greg k-h

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 10/22] HWPOISON: check and isolate corrupted free pages v2
  2009-06-16  0:34         ` Wu Fengguang
@ 2009-06-16 11:29           ` Hugh Dickins
  2009-06-16 11:40             ` Wu Fengguang
  0 siblings, 1 reply; 79+ messages in thread
From: Hugh Dickins @ 2009-06-16 11:29 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: KAMEZAWA Hiroyuki, Andrew Morton, LKML, Andi Kleen, Ingo Molnar,
	Mel Gorman, Thomas Gleixner, H. Peter Anvin, Peter Zijlstra,
	Nick Piggin, Andi Kleen, riel, chris.mason, linux-mm

On Tue, 16 Jun 2009, Wu Fengguang wrote:
> 
> Right.  Then the original __ClearPageBuddy() call in bad_page() is
> questionable, I guess this line was there just for the sake of safety
> (ie. the buddy allocator itself goes wrong):
> 
> sound-2.6/mm/page_alloc.c
> 
>         @@ -269,7 +269,6 @@ static void bad_page(struct page *page)
>                 dump_stack();
>          out:
>                 /* Leave bad fields for debug, except PageBuddy could make trouble */
> ===>            __ClearPageBuddy(page);
>                 add_taint(TAINT_BAD_PAGE);
>          }

I didn't put that in for the case of the buddy allocator going wrong
(not sure if there could be such a case - I don't mean that the buddy
allocator is provably perfect! but how would it get here if it were
wrong?).  No, I put that in for the case when the flag bits in struct
page have themselves got corrupted somehow, and hence we arrive at
bad_page(): most of the bits are best left as they are, to provide
maximum debug info; but leaving PageBuddy set there might conceivably
allow this corrupted struct page to get paired up with its buddy later,
and so freed for reuse, when we're trying to make sure it's never reused.

Hugh

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 10/22] HWPOISON: check and isolate corrupted free pages v2
  2009-06-16 11:29           ` Hugh Dickins
@ 2009-06-16 11:40             ` Wu Fengguang
  0 siblings, 0 replies; 79+ messages in thread
From: Wu Fengguang @ 2009-06-16 11:40 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: KAMEZAWA Hiroyuki, Andrew Morton, LKML, Andi Kleen, Ingo Molnar,
	Mel Gorman, Thomas Gleixner, H. Peter Anvin, Peter Zijlstra,
	Nick Piggin, Andi Kleen, riel, chris.mason, linux-mm

On Tue, Jun 16, 2009 at 07:29:45PM +0800, Hugh Dickins wrote:
> On Tue, 16 Jun 2009, Wu Fengguang wrote:
> > 
> > Right.  Then the original __ClearPageBuddy() call in bad_page() is
> > questionable, I guess this line was there just for the sake of safety
> > (ie. the buddy allocator itself goes wrong):
> > 
> > sound-2.6/mm/page_alloc.c
> > 
> >         @@ -269,7 +269,6 @@ static void bad_page(struct page *page)
> >                 dump_stack();
> >          out:
> >                 /* Leave bad fields for debug, except PageBuddy could make trouble */
> > ===>            __ClearPageBuddy(page);
> >                 add_taint(TAINT_BAD_PAGE);
> >          }
> 
> I didn't put that in for the case of the buddy allocator going wrong
> (not sure if there could be such a case - I don't mean that the buddy
> allocator is provably perfect! but how would it get here if it were
> wrong?).  No, I put that in for the case when the flag bits in struct
> page have themselves got corrupted somehow, and hence we arrive at
> bad_page(): most of the bits are best left as they are, to provide
> maximum debug info; but leaving PageBuddy set there might conceivably
> allow this corrupted struct page to get paired up with its buddy later,
> and so freed for reuse, when we're trying to make sure it's never reused.

Hugh, thank you for the detailed explanations!  You are always informative :)

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 09/22] HWPOISON: Handle hardware poisoned pages in try_to_unmap
  2009-06-16  0:03       ` Minchan Kim
@ 2009-06-16 13:49         ` Wu Fengguang
  2009-06-17  0:28           ` Minchan Kim
  0 siblings, 1 reply; 79+ messages in thread
From: Wu Fengguang @ 2009-06-16 13:49 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, LKML, Andi Kleen, Ingo Molnar, Mel Gorman,
	Thomas Gleixner, H. Peter Anvin, Peter Zijlstra, Nick Piggin,
	Hugh Dickins, Andi Kleen, riel, chris.mason, linux-mm

On Tue, Jun 16, 2009 at 08:03:08AM +0800, Minchan Kim wrote:
> On Mon, 15 Jun 2009 23:26:12 +0800
> Wu Fengguang <fengguang.wu@intel.com> wrote:
> 
> > On Mon, Jun 15, 2009 at 09:09:03PM +0800, Minchan Kim wrote:
> > > On Mon, Jun 15, 2009 at 11:45 AM, Wu Fengguang<fengguang.wu@intel.com> wrote:
> > > > From: Andi Kleen <ak@linux.intel.com>
> > > >
> > > > When a page has the poison bit set replace the PTE with a poison entry.
> > > > This causes the right error handling to be done later when a process runs
> > > > into it.
> > > >
> > > > Also add a new flag to not do that (needed for the memory-failure handler
> > > > later)
> > > >
> > > > Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>
> > > > Signed-off-by: Andi Kleen <ak@linux.intel.com>
> > > >
> > > > ---
> > > >  include/linux/rmap.h |    1 +
> > > >  mm/rmap.c            |    9 ++++++++-
> > > >  2 files changed, 9 insertions(+), 1 deletion(-)
> > > >
> > > > --- sound-2.6.orig/mm/rmap.c
> > > > +++ sound-2.6/mm/rmap.c
> > > > @@ -958,7 +958,14 @@ static int try_to_unmap_one(struct page
> > > >        /* Update high watermark before we lower rss */
> > > >        update_hiwater_rss(mm);
> > > >
> > > > -       if (PageAnon(page)) {
> > > > +       if (PageHWPoison(page) && !(flags & TTU_IGNORE_HWPOISON)) {
> > > > +               if (PageAnon(page))
> > > > +                       dec_mm_counter(mm, anon_rss);
> > > > +               else if (!is_migration_entry(pte_to_swp_entry(*pte)))
> > > 
> > > Isn't it straightforward to use !is_hwpoison_entry ?
> > 
> > Good catch!  It looks like a redundant check: the
> > page_check_address() at the beginning of the function guarantees that 
> > !is_migration_entry() or !is_migration_entry() tests will all be TRUE.
> > So let's do this?
> It seems you expand my sight :)
> 
> I don't know migration well.
> How page_check_address guarantee it's not migration entry ? 

page_check_address() calls pte_present() which returns the
(_PAGE_PRESENT | _PAGE_PROTNONE) bits. While x86-64 defines

#define __swp_entry(type, offset)       ((swp_entry_t) { \
                                         ((type) << (_PAGE_BIT_PRESENT + 1)) \
                                         | ((offset) << SWP_OFFSET_SHIFT) })

where SWP_OFFSET_SHIFT is defined to the bigger one of
max(_PAGE_BIT_PROTNONE + 1, _PAGE_BIT_FILE + 1) = max(8+1, 6+1) = 9.

So __swp_entry(type, offset) := (type << 1) | (offset << 9)

We know that the swap type is 5 bits. So the bit 0 _PAGE_PRESENT and bit 8
_PAGE_PROTNONE will all be zero for swap entries.
 

> In addtion, If the page is poison while we are going to
> migration((PAGE_MIGRATION && migration) == TRUE), we should decrease
> file_rss ?

It will die on trying to migrate the poisoned page so we don't care
the accounting. But normally the poisoned page shall already be
isolated so we don't care that die either.

Thanks,
Fengguang

> > 
> > -               else if (!is_migration_entry(pte_to_swp_entry(*pte)))
> > +               else
> > 
> > 
> > Thanks,
> > Fengguang
> 
> 
> -- 
> Kinds Regards
> Minchan Kim

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 00/22] HWPOISON: Intro (v5)
  2009-06-15 13:29         ` Andi Kleen
  2009-06-15 13:28           ` H. Peter Anvin
  2009-06-15 14:48           ` Alan Cox
@ 2009-06-16 19:44           ` Russ Anderson
  2009-06-16 20:28             ` H. Peter Anvin
  2 siblings, 1 reply; 79+ messages in thread
From: Russ Anderson @ 2009-06-16 19:44 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Alan Cox, Hugh Dickins, Wu Fengguang, Balbir Singh,
	Andrew Morton, LKML, Ingo Molnar, Mel Gorman, Thomas Gleixner,
	H. Peter Anvin, Peter Zijlstra, Nick Piggin, riel, chris.mason,
	linux-mm, rja

On Mon, Jun 15, 2009 at 03:29:34PM +0200, Andi Kleen wrote:
> 
> I think you're wrong about killing processes decreasing
> reliability. Traditionally we always tried to keep things running if possible
> instead of panicing.

Customers love the ia64 feature of killing a user process instead of
panicing the system when a user process hits a memory uncorrectable
error.  Avoiding a system panic is a very good thing.


-- 
Russ Anderson, OS RAS/Partitioning Project Lead  
SGI - Silicon Graphics Inc          rja@sgi.com

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 00/22] HWPOISON: Intro (v5)
  2009-06-16 19:44           ` Russ Anderson
@ 2009-06-16 20:28             ` H. Peter Anvin
  2009-06-16 20:54               ` Russ Anderson
  0 siblings, 1 reply; 79+ messages in thread
From: H. Peter Anvin @ 2009-06-16 20:28 UTC (permalink / raw)
  To: Russ Anderson
  Cc: Andi Kleen, Alan Cox, Hugh Dickins, Wu Fengguang, Balbir Singh,
	Andrew Morton, LKML, Ingo Molnar, Mel Gorman, Thomas Gleixner,
	Peter Zijlstra, Nick Piggin, riel, chris.mason, linux-mm

Russ Anderson wrote:
> On Mon, Jun 15, 2009 at 03:29:34PM +0200, Andi Kleen wrote:
>> I think you're wrong about killing processes decreasing
>> reliability. Traditionally we always tried to keep things running if possible
>> instead of panicing.
> 
> Customers love the ia64 feature of killing a user process instead of
> panicing the system when a user process hits a memory uncorrectable
> error.  Avoiding a system panic is a very good thing.

Sometimes (sometimes it's a very bad thing.)

However, the more fundamental thing is that it is always trivial to
promote an error to a higher severity; the opposite is not true.  As
such, it becomes an administrator-set policy, which is what it needs to be.

	-hpa

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 00/22] HWPOISON: Intro (v5)
  2009-06-16 20:28             ` H. Peter Anvin
@ 2009-06-16 20:54               ` Russ Anderson
  2009-06-16 20:58                 ` H. Peter Anvin
  0 siblings, 1 reply; 79+ messages in thread
From: Russ Anderson @ 2009-06-16 20:54 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Andi Kleen, Alan Cox, Hugh Dickins, Wu Fengguang, Balbir Singh,
	Andrew Morton, LKML, Ingo Molnar, Mel Gorman, Thomas Gleixner,
	Peter Zijlstra, Nick Piggin, riel, chris.mason, linux-mm, rja

On Tue, Jun 16, 2009 at 01:28:54PM -0700, H. Peter Anvin wrote:
> Russ Anderson wrote:
> > On Mon, Jun 15, 2009 at 03:29:34PM +0200, Andi Kleen wrote:
> >> I think you're wrong about killing processes decreasing
> >> reliability. Traditionally we always tried to keep things running if possible
> >> instead of panicing.
> > 
> > Customers love the ia64 feature of killing a user process instead of
> > panicing the system when a user process hits a memory uncorrectable
> > error.  Avoiding a system panic is a very good thing.
> 
> Sometimes (sometimes it's a very bad thing.)
> 
> However, the more fundamental thing is that it is always trivial to
> promote an error to a higher severity; the opposite is not true.  As
> such, it becomes an administrator-set policy, which is what it needs to be.

Good point.  On ia64 the recovery code is implemented as a kernel
loadable module.  Installing the module turns on the feature.

That is handy for customer demos.  Install the module, inject a
memory error, have an application read the bad data and get killed.
Repeat a few times.  Then uninstall the module, inject a
memory error, have an application read the bad data and watch
the system panic.

Then it is the customer's choice to have it on or off.

-- 
Russ Anderson, OS RAS/Partitioning Project Lead  
SGI - Silicon Graphics Inc          rja@sgi.com

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 00/22] HWPOISON: Intro (v5)
  2009-06-16 20:54               ` Russ Anderson
@ 2009-06-16 20:58                 ` H. Peter Anvin
  0 siblings, 0 replies; 79+ messages in thread
From: H. Peter Anvin @ 2009-06-16 20:58 UTC (permalink / raw)
  To: Russ Anderson
  Cc: Andi Kleen, Alan Cox, Hugh Dickins, Wu Fengguang, Balbir Singh,
	Andrew Morton, LKML, Ingo Molnar, Mel Gorman, Thomas Gleixner,
	Peter Zijlstra, Nick Piggin, riel, chris.mason, linux-mm

Russ Anderson wrote:
>>
>> However, the more fundamental thing is that it is always trivial to
>> promote an error to a higher severity; the opposite is not true.  As
>> such, it becomes an administrator-set policy, which is what it needs to be.
> 
> Good point.  On ia64 the recovery code is implemented as a kernel
> loadable module.  Installing the module turns on the feature.
> 
> That is handy for customer demos.  Install the module, inject a
> memory error, have an application read the bad data and get killed.
> Repeat a few times.  Then uninstall the module, inject a
> memory error, have an application read the bad data and watch
> the system panic.
> 
> Then it is the customer's choice to have it on or off.
> 

There are a number of ways to set escalation policy.  Modules isn't
necessarily the best, but it doesn't really matter what the exact
mechanism is.

	-hpa

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 09/22] HWPOISON: Handle hardware poisoned pages in try_to_unmap
  2009-06-16 13:49         ` Wu Fengguang
@ 2009-06-17  0:28           ` Minchan Kim
  2009-06-17  7:23             ` Wu Fengguang
  0 siblings, 1 reply; 79+ messages in thread
From: Minchan Kim @ 2009-06-17  0:28 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Minchan Kim, Andrew Morton, LKML, Andi Kleen, Ingo Molnar,
	Mel Gorman, Thomas Gleixner, H. Peter Anvin, Peter Zijlstra,
	Nick Piggin, Hugh Dickins, Andi Kleen, riel, chris.mason,
	linux-mm

On Tue, 16 Jun 2009 21:49:44 +0800
Wu Fengguang <fengguang.wu@intel.com> wrote:

> On Tue, Jun 16, 2009 at 08:03:08AM +0800, Minchan Kim wrote:
> > On Mon, 15 Jun 2009 23:26:12 +0800
> > Wu Fengguang <fengguang.wu@intel.com> wrote:
> > 
> > > On Mon, Jun 15, 2009 at 09:09:03PM +0800, Minchan Kim wrote:
> > > > On Mon, Jun 15, 2009 at 11:45 AM, Wu Fengguang<fengguang.wu@intel.com> wrote:
> > > > > From: Andi Kleen <ak@linux.intel.com>
> > > > >
> > > > > When a page has the poison bit set replace the PTE with a poison entry.
> > > > > This causes the right error handling to be done later when a process runs
> > > > > into it.
> > > > >
> > > > > Also add a new flag to not do that (needed for the memory-failure handler
> > > > > later)
> > > > >
> > > > > Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>
> > > > > Signed-off-by: Andi Kleen <ak@linux.intel.com>
> > > > >
> > > > > ---
> > > > >  include/linux/rmap.h |    1 +
> > > > >  mm/rmap.c            |    9 ++++++++-
> > > > >  2 files changed, 9 insertions(+), 1 deletion(-)
> > > > >
> > > > > --- sound-2.6.orig/mm/rmap.c
> > > > > +++ sound-2.6/mm/rmap.c
> > > > > @@ -958,7 +958,14 @@ static int try_to_unmap_one(struct page
> > > > >        /* Update high watermark before we lower rss */
> > > > >        update_hiwater_rss(mm);
> > > > >
> > > > > -       if (PageAnon(page)) {
> > > > > +       if (PageHWPoison(page) && !(flags & TTU_IGNORE_HWPOISON)) {
> > > > > +               if (PageAnon(page))
> > > > > +                       dec_mm_counter(mm, anon_rss);
> > > > > +               else if (!is_migration_entry(pte_to_swp_entry(*pte)))
> > > > 
> > > > Isn't it straightforward to use !is_hwpoison_entry ?
> > > 
> > > Good catch!  It looks like a redundant check: the
> > > page_check_address() at the beginning of the function guarantees that 
> > > !is_migration_entry() or !is_migration_entry() tests will all be TRUE.
> > > So let's do this?
> > It seems you expand my sight :)
> > 
> > I don't know migration well.
> > How page_check_address guarantee it's not migration entry ? 
> 
> page_check_address() calls pte_present() which returns the
> (_PAGE_PRESENT | _PAGE_PROTNONE) bits. While x86-64 defines
> 
> #define __swp_entry(type, offset)       ((swp_entry_t) { \
>                                          ((type) << (_PAGE_BIT_PRESENT + 1)) \
>                                          | ((offset) << SWP_OFFSET_SHIFT) })
> 
> where SWP_OFFSET_SHIFT is defined to the bigger one of
> max(_PAGE_BIT_PROTNONE + 1, _PAGE_BIT_FILE + 1) = max(8+1, 6+1) = 9.
> 
> So __swp_entry(type, offset) := (type << 1) | (offset << 9)
> 
> We know that the swap type is 5 bits. So the bit 0 _PAGE_PRESENT and bit 8
> _PAGE_PROTNONE will all be zero for swap entries.
>  

Thanks for kind explanation :)

> 
> > In addtion, If the page is poison while we are going to
> > migration((PAGE_MIGRATION && migration) == TRUE), we should decrease
> > file_rss ?
> 
> It will die on trying to migrate the poisoned page so we don't care
> the accounting. But normally the poisoned page shall already be


Okay. then, how about this ?
We should not increase file_rss on trying to migrate the poisoned page

-               else if (!is_migration_entry(pte_to_swp_entry(*pte)))
+               else if (!(PAGE_MIGRATION && migration))

> isolated so we don't care that die either.
> Thanks,
> Fengguang
> 
> > > 
> > > -               else if (!is_migration_entry(pte_to_swp_entry(*pte)))
> > > +               else
> > > 
> > > 
> > > Thanks,
> > > Fengguang
> > 
> > 
> > -- 
> > Kinds Regards
> > Minchan Kim


-- 
Kinds Regards
Minchan Kim

^ permalink raw reply	[flat|nested] 79+ messages in thread

* [RFC][PATCH] HWPOISON: only early kill processes who installed SIGBUS handler
  2009-06-15 14:22               ` Wu Fengguang
@ 2009-06-17  6:37                 ` Wu Fengguang
  2009-06-17  8:04                   ` Nick Piggin
  0 siblings, 1 reply; 79+ messages in thread
From: Wu Fengguang @ 2009-06-17  6:37 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andi Kleen, Balbir Singh, Andrew Morton, LKML, Ingo Molnar,
	Mel Gorman, Thomas Gleixner, H. Peter Anvin, Peter Zijlstra,
	Hugh Dickins, riel, chris.mason, linux-mm

On Mon, Jun 15, 2009 at 10:22:25PM +0800, Wu Fengguang wrote:
> On Mon, Jun 15, 2009 at 08:25:28PM +0800, Nick Piggin wrote:
> > On Mon, Jun 15, 2009 at 08:10:01PM +0800, Wu Fengguang wrote:
> > > On Mon, Jun 15, 2009 at 03:19:07PM +0800, Nick Piggin wrote:
> > > > > For KVM you need early kill, for the others it remains to be seen.
> > > > 
> > > > Right. It's almost like you need to do a per-process thing, and
> > > > those that can handle things (such as the new SIGBUS or the new
> > > > EIO) could get those, and others could be killed.
> > > 
> > > To send early SIGBUS kills to processes who has called
> > > sigaction(SIGBUS, ...)?  KVM will sure do that. For other apps we
> > > don't mind they can understand that signal at all.
> > 
> > For apps that hook into SIGBUS for some other means and
> 
> Yes I was referring to the sigaction(SIGBUS) apps, others will
> be late killed anyway.
> 
> > do not understand the new type of SIGBUS signal? What about
> > those?
> 
> We introduced two new SIGBUS codes:
>         BUS_MCEERR_AO=5         for early kill
>         BUS_MCEERR_AR=4         for late  kill
> I'd assume a legacy application will handle them in the same way (both
> are unexpected code to the application).
> 
> We don't care whether the application can be killed by BUS_MCEERR_AO
> or BUS_MCEERR_AR depending on its SIGBUS handler implementation.
> But (in the rare case) if the handler
> - refused to die on BUS_MCEERR_AR, it may create a busy loop and
>   flooding of SIGBUS signals, which is a bug of the application.
>   BUS_MCEERR_AO is one time and won't lead to busy loops.
> - does something that hurts itself (ie. data safety) on BUS_MCEERR_AO,
>   it may well hurt the same way on BUS_MCEERR_AR. The latter one is
>   unavoidable, so the application must be fixed anyway.

This patch materializes the automatically early kill idea.
It aims to remove the vm.memory_failure_ealy_kill sysctl parameter.

This is mainly a policy change, please comment.

Thanks,
Fengguang

---
HWPOISON: only early kill processes who installed SIGBUS handler

We want to send SIGBUS.BUS_MCEERR_AO signals to KVM ASAP, so that
it is able to take actions to isolate the corrupted page. In fact,
any applications that does extensive internal caching (KVM, Oracle,
etc.) is advised to install a SIGBUS handler to get early notifications
of corrupted memory, so that it has good possibility to find and remove
the page from its cache. If don't do so, they will later receive the
SIGBUS.BUS_MCEERR_AR signal on accessing the corrupted memory, which
can be deadly (too hard to rescue).

For applications that don't care the signal, let them continue to run
until they try to consume the corrupted data.

For applications that used to catch the SIGBUS handler but don't understand
the new BUS_MCEERR_AO/BUS_MCEERR_AR codes, they may
- refused to die on BUS_MCEERR_AR, creating a busy loop and
  flooding of SIGBUS signals, which is a bug of the application.
  BUS_MCEERR_AO is an one shot event and won't lead to busy loops.
- does something that hurts itself (ie. data safety) on BUS_MCEERR_AO,
  it may well hurt the same way on BUS_MCEERR_AR. The latter one is
  unavoidable, so the application must be fixed anyway.


CC: Nick Piggin <npiggin@suse.de>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/memory-failure.c |   18 ++++++++++++++++--
 1 file changed, 16 insertions(+), 2 deletions(-)

--- sound-2.6.orig/mm/memory-failure.c
+++ sound-2.6/mm/memory-failure.c
@@ -205,6 +205,20 @@ static void kill_procs_ao(struct list_he
 	}
 }
 
+static bool task_early_kill_elegible(struct task_struct *tsk)
+{
+	__sighandler_t handler;
+
+	if (!tsk->mm)
+		return false;
+
+	handler = tsk->sighand->action[SIGBUS-1].sa.sa_handler;
+	if (handler == SIG_DFL || handler == SIG_IGN)
+		return false;
+
+	return true;
+}
+
 /*
  * Collect processes when the error hit an anonymous page.
  */
@@ -222,7 +236,7 @@ static void collect_procs_anon(struct pa
 		goto out;
 
 	for_each_process (tsk) {
-		if (!tsk->mm)
+		if (!task_early_kill_elegible(tsk))
 			continue;
 		list_for_each_entry (vma, &av->head, anon_vma_node) {
 			if (!page_mapped_in_vma(page, vma))
@@ -262,7 +276,7 @@ static void collect_procs_file(struct pa
 	for_each_process(tsk) {
 		pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
 
-		if (!tsk->mm)
+		if (!task_early_kill_elegible(tsk))
 			continue;
 
 		vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff,

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 09/22] HWPOISON: Handle hardware poisoned pages in try_to_unmap
  2009-06-17  0:28           ` Minchan Kim
@ 2009-06-17  7:23             ` Wu Fengguang
  2009-06-17 13:27               ` Minchan Kim
       [not found]               ` <28c262360906170644w65c08a8y2d2805fb08045804@mail.gmail.com>
  0 siblings, 2 replies; 79+ messages in thread
From: Wu Fengguang @ 2009-06-17  7:23 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, LKML, Andi Kleen, Ingo Molnar, Mel Gorman,
	Thomas Gleixner, H. Peter Anvin, Peter Zijlstra, Nick Piggin,
	Hugh Dickins, Andi Kleen, riel, chris.mason, linux-mm

On Wed, Jun 17, 2009 at 08:28:26AM +0800, Minchan Kim wrote:
> On Tue, 16 Jun 2009 21:49:44 +0800
> Wu Fengguang <fengguang.wu@intel.com> wrote:
> 
> > On Tue, Jun 16, 2009 at 08:03:08AM +0800, Minchan Kim wrote:
> > > On Mon, 15 Jun 2009 23:26:12 +0800
> > > Wu Fengguang <fengguang.wu@intel.com> wrote:
> > > 
> > > > On Mon, Jun 15, 2009 at 09:09:03PM +0800, Minchan Kim wrote:
> > > > > On Mon, Jun 15, 2009 at 11:45 AM, Wu Fengguang<fengguang.wu@intel.com> wrote:
> > > > > > From: Andi Kleen <ak@linux.intel.com>
> > > > > >
> > > > > > When a page has the poison bit set replace the PTE with a poison entry.
> > > > > > This causes the right error handling to be done later when a process runs
> > > > > > into it.
> > > > > >
> > > > > > Also add a new flag to not do that (needed for the memory-failure handler
> > > > > > later)
> > > > > >
> > > > > > Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>
> > > > > > Signed-off-by: Andi Kleen <ak@linux.intel.com>
> > > > > >
> > > > > > ---
> > > > > >  include/linux/rmap.h |    1 +
> > > > > >  mm/rmap.c            |    9 ++++++++-
> > > > > >  2 files changed, 9 insertions(+), 1 deletion(-)
> > > > > >
> > > > > > --- sound-2.6.orig/mm/rmap.c
> > > > > > +++ sound-2.6/mm/rmap.c
> > > > > > @@ -958,7 +958,14 @@ static int try_to_unmap_one(struct page
> > > > > >        /* Update high watermark before we lower rss */
> > > > > >        update_hiwater_rss(mm);
> > > > > >
> > > > > > -       if (PageAnon(page)) {
> > > > > > +       if (PageHWPoison(page) && !(flags & TTU_IGNORE_HWPOISON)) {
> > > > > > +               if (PageAnon(page))
> > > > > > +                       dec_mm_counter(mm, anon_rss);
> > > > > > +               else if (!is_migration_entry(pte_to_swp_entry(*pte)))
> > > > > 
> > > > > Isn't it straightforward to use !is_hwpoison_entry ?
> > > > 
> > > > Good catch!  It looks like a redundant check: the
> > > > page_check_address() at the beginning of the function guarantees that 
> > > > !is_migration_entry() or !is_migration_entry() tests will all be TRUE.
> > > > So let's do this?
> > > It seems you expand my sight :)
> > > 
> > > I don't know migration well.
> > > How page_check_address guarantee it's not migration entry ? 
> > 
> > page_check_address() calls pte_present() which returns the
> > (_PAGE_PRESENT | _PAGE_PROTNONE) bits. While x86-64 defines
> > 
> > #define __swp_entry(type, offset)       ((swp_entry_t) { \
> >                                          ((type) << (_PAGE_BIT_PRESENT + 1)) \
> >                                          | ((offset) << SWP_OFFSET_SHIFT) })
> > 
> > where SWP_OFFSET_SHIFT is defined to the bigger one of
> > max(_PAGE_BIT_PROTNONE + 1, _PAGE_BIT_FILE + 1) = max(8+1, 6+1) = 9.
> > 
> > So __swp_entry(type, offset) := (type << 1) | (offset << 9)
> > 
> > We know that the swap type is 5 bits. So the bit 0 _PAGE_PRESENT and bit 8
> > _PAGE_PROTNONE will all be zero for swap entries.
> >  
> 
> Thanks for kind explanation :)

You are welcome~

> > 
> > > In addtion, If the page is poison while we are going to
> > > migration((PAGE_MIGRATION && migration) == TRUE), we should decrease
> > > file_rss ?
> > 
> > It will die on trying to migrate the poisoned page so we don't care
> > the accounting. But normally the poisoned page shall already be
> 
> 
> Okay. then, how about this ?
> We should not increase file_rss on trying to migrate the poisoned page
> 
> -               else if (!is_migration_entry(pte_to_swp_entry(*pte)))
> +               else if (!(PAGE_MIGRATION && migration))

This is good if we are going to stop the hwpoison page from being
consumed by move_to_new_page(), but I highly doubt we'll ever add
PageHWPoison() checks into the migration code.

Because this race window is small enough:

        TestSetPageHWPoison(p);
                                   lock_page(page);
                                   try_to_unmap(page, TTU_MIGRATION|...);
        lock_page_nosync(p);

such small race windows can be found all over the kernel, it's just
insane to try to fix any of them.

For example, if the newly allocated page get corrupted, this kind of code who
assumes it is the only user of the page (but memory_failure() comes in between
like a ghost) will go BUG():

        /*
         * Block others from accessing the page when we get around to
         * establishing additional references. We are the only one
         * holding a reference to the new page at this point.
         */
        if (!trylock_page(newpage))
                BUG();

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [RFC][PATCH] HWPOISON: only early kill processes who installed SIGBUS handler
  2009-06-17  6:37                 ` [RFC][PATCH] HWPOISON: only early kill processes who installed SIGBUS handler Wu Fengguang
@ 2009-06-17  8:04                   ` Nick Piggin
  2009-06-17  9:55                     ` Wu Fengguang
  2009-06-18  9:56                     ` Wu Fengguang
  0 siblings, 2 replies; 79+ messages in thread
From: Nick Piggin @ 2009-06-17  8:04 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andi Kleen, Balbir Singh, Andrew Morton, LKML, Ingo Molnar,
	Mel Gorman, Thomas Gleixner, H. Peter Anvin, Peter Zijlstra,
	Hugh Dickins, riel, chris.mason, linux-mm

On Wed, Jun 17, 2009 at 02:37:02PM +0800, Wu Fengguang wrote:
> On Mon, Jun 15, 2009 at 10:22:25PM +0800, Wu Fengguang wrote:
> > On Mon, Jun 15, 2009 at 08:25:28PM +0800, Nick Piggin wrote:
> > > On Mon, Jun 15, 2009 at 08:10:01PM +0800, Wu Fengguang wrote:
> > > > On Mon, Jun 15, 2009 at 03:19:07PM +0800, Nick Piggin wrote:
> > > > > > For KVM you need early kill, for the others it remains to be seen.
> > > > > 
> > > > > Right. It's almost like you need to do a per-process thing, and
> > > > > those that can handle things (such as the new SIGBUS or the new
> > > > > EIO) could get those, and others could be killed.
> > > > 
> > > > To send early SIGBUS kills to processes who has called
> > > > sigaction(SIGBUS, ...)?  KVM will sure do that. For other apps we
> > > > don't mind they can understand that signal at all.
> > > 
> > > For apps that hook into SIGBUS for some other means and
> > 
> > Yes I was referring to the sigaction(SIGBUS) apps, others will
> > be late killed anyway.
> > 
> > > do not understand the new type of SIGBUS signal? What about
> > > those?
> > 
> > We introduced two new SIGBUS codes:
> >         BUS_MCEERR_AO=5         for early kill
> >         BUS_MCEERR_AR=4         for late  kill
> > I'd assume a legacy application will handle them in the same way (both
> > are unexpected code to the application).
> > 
> > We don't care whether the application can be killed by BUS_MCEERR_AO
> > or BUS_MCEERR_AR depending on its SIGBUS handler implementation.
> > But (in the rare case) if the handler
> > - refused to die on BUS_MCEERR_AR, it may create a busy loop and
> >   flooding of SIGBUS signals, which is a bug of the application.
> >   BUS_MCEERR_AO is one time and won't lead to busy loops.
> > - does something that hurts itself (ie. data safety) on BUS_MCEERR_AO,
> >   it may well hurt the same way on BUS_MCEERR_AR. The latter one is
> >   unavoidable, so the application must be fixed anyway.
> 
> This patch materializes the automatically early kill idea.
> It aims to remove the vm.memory_failure_ealy_kill sysctl parameter.
> 
> This is mainly a policy change, please comment.

Well then you can still early-kill random apps that did not
want it, and you may still cause problems if its sigbus
handler does something nontrivial.

Can you use a prctl or something so it can expclitly
register interest in this?


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [RFC][PATCH] HWPOISON: only early kill processes who installed SIGBUS handler
  2009-06-17  8:04                   ` Nick Piggin
@ 2009-06-17  9:55                     ` Wu Fengguang
  2009-06-17 10:00                       ` Nick Piggin
  2009-06-18  9:56                     ` Wu Fengguang
  1 sibling, 1 reply; 79+ messages in thread
From: Wu Fengguang @ 2009-06-17  9:55 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andi Kleen, Balbir Singh, Andrew Morton, LKML, Ingo Molnar,
	Mel Gorman, Thomas Gleixner, H. Peter Anvin, Peter Zijlstra,
	Hugh Dickins, riel, chris.mason, linux-mm

On Wed, Jun 17, 2009 at 04:04:04PM +0800, Nick Piggin wrote:
> On Wed, Jun 17, 2009 at 02:37:02PM +0800, Wu Fengguang wrote:
> > On Mon, Jun 15, 2009 at 10:22:25PM +0800, Wu Fengguang wrote:
> > > On Mon, Jun 15, 2009 at 08:25:28PM +0800, Nick Piggin wrote:
> > > > On Mon, Jun 15, 2009 at 08:10:01PM +0800, Wu Fengguang wrote:
> > > > > On Mon, Jun 15, 2009 at 03:19:07PM +0800, Nick Piggin wrote:
> > > > > > > For KVM you need early kill, for the others it remains to be seen.
> > > > > > 
> > > > > > Right. It's almost like you need to do a per-process thing, and
> > > > > > those that can handle things (such as the new SIGBUS or the new
> > > > > > EIO) could get those, and others could be killed.
> > > > > 
> > > > > To send early SIGBUS kills to processes who has called
> > > > > sigaction(SIGBUS, ...)?  KVM will sure do that. For other apps we
> > > > > don't mind they can understand that signal at all.
> > > > 
> > > > For apps that hook into SIGBUS for some other means and
> > > 
> > > Yes I was referring to the sigaction(SIGBUS) apps, others will
> > > be late killed anyway.
> > > 
> > > > do not understand the new type of SIGBUS signal? What about
> > > > those?
> > > 
> > > We introduced two new SIGBUS codes:
> > >         BUS_MCEERR_AO=5         for early kill
> > >         BUS_MCEERR_AR=4         for late  kill
> > > I'd assume a legacy application will handle them in the same way (both
> > > are unexpected code to the application).
> > > 
> > > We don't care whether the application can be killed by BUS_MCEERR_AO
> > > or BUS_MCEERR_AR depending on its SIGBUS handler implementation.
> > > But (in the rare case) if the handler
> > > - refused to die on BUS_MCEERR_AR, it may create a busy loop and
> > >   flooding of SIGBUS signals, which is a bug of the application.
> > >   BUS_MCEERR_AO is one time and won't lead to busy loops.
> > > - does something that hurts itself (ie. data safety) on BUS_MCEERR_AO,
> > >   it may well hurt the same way on BUS_MCEERR_AR. The latter one is
> > >   unavoidable, so the application must be fixed anyway.
> > 
> > This patch materializes the automatically early kill idea.
> > It aims to remove the vm.memory_failure_ealy_kill sysctl parameter.
> > 
> > This is mainly a policy change, please comment.
> 
> Well then you can still early-kill random apps that did not
> want it, and you may still cause problems if its sigbus
> handler does something nontrivial.
> 
> Can you use a prctl or something so it can expclitly
> register interest in this?

No I don't think prctl would be much better.

- if an application want early/late kill, it can do so with a proper
  written SIGBUS handler: the prctl call is redundant.
- if an admin want to control early/late kill for an unmodified app,
  prctl is as unhelpful as this patch(*).
- prctl does can help legacy apps whose SIGBUS handler has trouble
  with the new SIGBUS codes, however such application should be rare
  and the application should be fixed(why shall it do something wrong
  on newly introduced code at all? Shall we stop introducing new codes
  just because some random buggy app cannot handle new codes?)

So I still prefer this patch, until we come up with some solution that
allows both app and admin to change the setting.

(*) Or if prctl options can be inherited across exec(), we may write a
wrapper tool to set early kill preference for some legacy app:

    # this_app_can_be_early_killed  some_legacy_app --legacy-options

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [RFC][PATCH] HWPOISON: only early kill processes who installed SIGBUS handler
  2009-06-17  9:55                     ` Wu Fengguang
@ 2009-06-17 10:00                       ` Nick Piggin
  2009-06-17 11:56                         ` Wu Fengguang
  0 siblings, 1 reply; 79+ messages in thread
From: Nick Piggin @ 2009-06-17 10:00 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andi Kleen, Balbir Singh, Andrew Morton, LKML, Ingo Molnar,
	Mel Gorman, Thomas Gleixner, H. Peter Anvin, Peter Zijlstra,
	Hugh Dickins, riel, chris.mason, linux-mm

On Wed, Jun 17, 2009 at 05:55:32PM +0800, Wu Fengguang wrote:
> On Wed, Jun 17, 2009 at 04:04:04PM +0800, Nick Piggin wrote:
> > Well then you can still early-kill random apps that did not
> > want it, and you may still cause problems if its sigbus
> > handler does something nontrivial.
> > 
> > Can you use a prctl or something so it can expclitly
> > register interest in this?
> 
> No I don't think prctl would be much better.
> 
> - if an application want early/late kill, it can do so with a proper
>   written SIGBUS handler: the prctl call is redundant.

s/proper written/is switched to new semantics based on the existance
of a/

> - if an admin want to control early/late kill for an unmodified app,
>   prctl is as unhelpful as this patch(*).

Clearly you can execute a process with a given prctl.


> - prctl does can help legacy apps whose SIGBUS handler has trouble
>   with the new SIGBUS codes, however such application should be rare
>   and the application should be fixed(why shall it do something wrong
>   on newly introduced code at all? Shall we stop introducing new codes
>   just because some random buggy app cannot handle new codes?)

Backwards compatibility? Kind of important.


> So I still prefer this patch, until we come up with some solution that
> allows both app and admin to change the setting.

Not only does it allow that, but it also provides backwards
compatibility. Your patch does not allow admin to change
anything nor does it guarantee 100% back compat so I can't
see how you think it is better.

Also it does not allow for an app with a SIGBUS handler to
use late kill. If late kill is useful to anyone, why would
it not be useful to some app with a SIGBUS handler (that is
not KVM)?


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [RFC][PATCH] HWPOISON: only early kill processes who installed SIGBUS handler
  2009-06-17 10:00                       ` Nick Piggin
@ 2009-06-17 11:56                         ` Wu Fengguang
  0 siblings, 0 replies; 79+ messages in thread
From: Wu Fengguang @ 2009-06-17 11:56 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andi Kleen, Balbir Singh, Andrew Morton, LKML, Ingo Molnar,
	Mel Gorman, Thomas Gleixner, H. Peter Anvin, Peter Zijlstra,
	Hugh Dickins, riel, chris.mason, linux-mm

On Wed, Jun 17, 2009 at 06:00:06PM +0800, Nick Piggin wrote:
> On Wed, Jun 17, 2009 at 05:55:32PM +0800, Wu Fengguang wrote:
> > On Wed, Jun 17, 2009 at 04:04:04PM +0800, Nick Piggin wrote:
> > > Well then you can still early-kill random apps that did not
> > > want it, and you may still cause problems if its sigbus
> > > handler does something nontrivial.
> > > 
> > > Can you use a prctl or something so it can expclitly
> > > register interest in this?
> > 
> > No I don't think prctl would be much better.
> > 
> > - if an application want early/late kill, it can do so with a proper
> >   written SIGBUS handler: the prctl call is redundant.
> 
> s/proper written/is switched to new semantics based on the existance
> of a/

Not necessarily so. If an application
- did not has a SIGBUS handler, and want to be
  - early killed: must install a handler, this is not a big problem
    because it may well want to rescue something on the event.
  - late killed: just do nothing.
(here kill = 'notification')
- had a SIGBUS hander, and want to
  - early die: call exit(0) in the handler.
  - late die: intercept and ignore the signal.
So if source code modification is viable, prctl is not necessary at all.

> > - if an admin want to control early/late kill for an unmodified app,
> >   prctl is as unhelpful as this patch(*).
> 
> Clearly you can execute a process with a given prctl.

OK, right.

> > - prctl does can help legacy apps whose SIGBUS handler has trouble
> >   with the new SIGBUS codes, however such application should be rare
> >   and the application should be fixed(why shall it do something wrong
> >   on newly introduced code at all? Shall we stop introducing new codes
> >   just because some random buggy app cannot handle new codes?)
> 
> Backwards compatibility? Kind of important.

Maybe.

> > So I still prefer this patch, until we come up with some solution that
> > allows both app and admin to change the setting.
> 
> Not only does it allow that, but it also provides backwards
> compatibility. Your patch does not allow admin to change
> anything nor does it guarantee 100% back compat so I can't
> see how you think it is better.

I didn't say it is better, but clearly mean that prctl is not better
enough to warrant a new user interface, if(!adm_friendly). Now it's
obvious that adm_friendly=1, so I agree prctl is a good interface :)

> Also it does not allow for an app with a SIGBUS handler to
> use late kill. If late kill is useful to anyone, why would
> it not be useful to some app with a SIGBUS handler (that is
> not KVM)?

Late kill will always be sent. Ignore the early kill signal in the
SIGBUS handler does the trick (see above analyzes).

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 09/22] HWPOISON: Handle hardware poisoned pages in  try_to_unmap
  2009-06-17  7:23             ` Wu Fengguang
@ 2009-06-17 13:27               ` Minchan Kim
  2009-06-17 13:37                 ` Wu Fengguang
       [not found]               ` <28c262360906170644w65c08a8y2d2805fb08045804@mail.gmail.com>
  1 sibling, 1 reply; 79+ messages in thread
From: Minchan Kim @ 2009-06-17 13:27 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, LKML, Andi Kleen, Ingo Molnar, Mel Gorman,
	Thomas Gleixner, H. Peter Anvin, Peter Zijlstra, Nick Piggin,
	Hugh Dickins, Andi Kleen, riel, chris.mason, linux-mm

On Wed, Jun 17, 2009 at 4:23 PM, Wu Fengguang<fengguang.wu@intel.com> wrote:
> On Wed, Jun 17, 2009 at 08:28:26AM +0800, Minchan Kim wrote:
>> On Tue, 16 Jun 2009 21:49:44 +0800
>> Wu Fengguang <fengguang.wu@intel.com> wrote:
>>
>> > On Tue, Jun 16, 2009 at 08:03:08AM +0800, Minchan Kim wrote:
>> > > On Mon, 15 Jun 2009 23:26:12 +0800
>> > > Wu Fengguang <fengguang.wu@intel.com> wrote:
>> > >
>> > > > On Mon, Jun 15, 2009 at 09:09:03PM +0800, Minchan Kim wrote:
>> > > > > On Mon, Jun 15, 2009 at 11:45 AM, Wu Fengguang<fengguang.wu@intel.com> wrote:
>> > > > > > From: Andi Kleen <ak@linux.intel.com>
>> > > > > >
>> > > > > > When a page has the poison bit set replace the PTE with a poison entry.
>> > > > > > This causes the right error handling to be done later when a process runs
>> > > > > > into it.
>> > > > > >
>> > > > > > Also add a new flag to not do that (needed for the memory-failure handler
>> > > > > > later)
>> > > > > >
>> > > > > > Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>
>> > > > > > Signed-off-by: Andi Kleen <ak@linux.intel.com>
>> > > > > >
>> > > > > > ---
>> > > > > >  include/linux/rmap.h |    1 +
>> > > > > >  mm/rmap.c            |    9 ++++++++-
>> > > > > >  2 files changed, 9 insertions(+), 1 deletion(-)
>> > > > > >
>> > > > > > --- sound-2.6.orig/mm/rmap.c
>> > > > > > +++ sound-2.6/mm/rmap.c
>> > > > > > @@ -958,7 +958,14 @@ static int try_to_unmap_one(struct page
>> > > > > >        /* Update high watermark before we lower rss */
>> > > > > >        update_hiwater_rss(mm);
>> > > > > >
>> > > > > > -       if (PageAnon(page)) {
>> > > > > > +       if (PageHWPoison(page) && !(flags & TTU_IGNORE_HWPOISON)) {
>> > > > > > +               if (PageAnon(page))
>> > > > > > +                       dec_mm_counter(mm, anon_rss);
>> > > > > > +               else if (!is_migration_entry(pte_to_swp_entry(*pte)))
>> > > > >
>> > > > > Isn't it straightforward to use !is_hwpoison_entry ?
>> > > >
>> > > > Good catch!  It looks like a redundant check: the
>> > > > page_check_address() at the beginning of the function guarantees that
>> > > > !is_migration_entry() or !is_migration_entry() tests will all be TRUE.
>> > > > So let's do this?
>> > > It seems you expand my sight :)
>> > >
>> > > I don't know migration well.
>> > > How page_check_address guarantee it's not migration entry ?
>> >
>> > page_check_address() calls pte_present() which returns the
>> > (_PAGE_PRESENT | _PAGE_PROTNONE) bits. While x86-64 defines
>> >
>> > #define __swp_entry(type, offset)       ((swp_entry_t) { \
>> >                                          ((type) << (_PAGE_BIT_PRESENT + 1)) \
>> >                                          | ((offset) << SWP_OFFSET_SHIFT) })
>> >
>> > where SWP_OFFSET_SHIFT is defined to the bigger one of
>> > max(_PAGE_BIT_PROTNONE + 1, _PAGE_BIT_FILE + 1) = max(8+1, 6+1) = 9.
>> >
>> > So __swp_entry(type, offset) := (type << 1) | (offset << 9)
>> >
>> > We know that the swap type is 5 bits. So the bit 0 _PAGE_PRESENT and bit 8
>> > _PAGE_PROTNONE will all be zero for swap entries.
>> >
>>
>> Thanks for kind explanation :)
>
> You are welcome~
>
>> >
>> > > In addtion, If the page is poison while we are going to
>> > > migration((PAGE_MIGRATION && migration) == TRUE), we should decrease
>> > > file_rss ?
>> >
>> > It will die on trying to migrate the poisoned page so we don't care
>> > the accounting. But normally the poisoned page shall already be
>>
>>
>> Okay. then, how about this ?
>> We should not increase file_rss on trying to migrate the poisoned page
>>
>> -               else if (!is_migration_entry(pte_to_swp_entry(*pte)))
>> +               else if (!(PAGE_MIGRATION && migration))
>
> This is good if we are going to stop the hwpoison page from being
> consumed by move_to_new_page(), but I highly doubt we'll ever add
> PageHWPoison() checks into the migration code.
>
> Because this race window is small enough:
>
>        TestSetPageHWPoison(p);
>                                   lock_page(page);
>                                   try_to_unmap(page, TTU_MIGRATION|...);
>        lock_page_nosync(p);
>
> such small race windows can be found all over the kernel, it's just
> insane to try to fix any of them.

Sorry for too late response.

I see your point.
My opinion is that at least we must be notified when such situation happen.
So I think it would be better to add some warning to fix up it when it
happen even thought  it is small race window.

-- 
Kinds regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 09/22] HWPOISON: Handle hardware poisoned pages in try_to_unmap
  2009-06-17 13:27               ` Minchan Kim
@ 2009-06-17 13:37                 ` Wu Fengguang
  2009-06-17 13:43                   ` Minchan Kim
  0 siblings, 1 reply; 79+ messages in thread
From: Wu Fengguang @ 2009-06-17 13:37 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, LKML, Andi Kleen, Ingo Molnar, Mel Gorman,
	Thomas Gleixner, H. Peter Anvin, Peter Zijlstra, Nick Piggin,
	Hugh Dickins, Andi Kleen, riel, chris.mason, linux-mm

On Wed, Jun 17, 2009 at 09:27:36PM +0800, Minchan Kim wrote:
> On Wed, Jun 17, 2009 at 4:23 PM, Wu Fengguang<fengguang.wu@intel.com> wrote:
> > On Wed, Jun 17, 2009 at 08:28:26AM +0800, Minchan Kim wrote:
> >> On Tue, 16 Jun 2009 21:49:44 +0800
> >> Wu Fengguang <fengguang.wu@intel.com> wrote:
> >>
> >> > On Tue, Jun 16, 2009 at 08:03:08AM +0800, Minchan Kim wrote:
> >> > > On Mon, 15 Jun 2009 23:26:12 +0800
> >> > > Wu Fengguang <fengguang.wu@intel.com> wrote:
> >> > >
> >> > > > On Mon, Jun 15, 2009 at 09:09:03PM +0800, Minchan Kim wrote:
> >> > > > > On Mon, Jun 15, 2009 at 11:45 AM, Wu Fengguang<fengguang.wu@intel.com> wrote:
> >> > > > > > From: Andi Kleen <ak@linux.intel.com>
> >> > > > > >
> >> > > > > > When a page has the poison bit set replace the PTE with a poison entry.
> >> > > > > > This causes the right error handling to be done later when a process runs
> >> > > > > > into it.
> >> > > > > >
> >> > > > > > Also add a new flag to not do that (needed for the memory-failure handler
> >> > > > > > later)
> >> > > > > >
> >> > > > > > Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>
> >> > > > > > Signed-off-by: Andi Kleen <ak@linux.intel.com>
> >> > > > > >
> >> > > > > > ---
> >> > > > > >  include/linux/rmap.h |    1 +
> >> > > > > >  mm/rmap.c            |    9 ++++++++-
> >> > > > > >  2 files changed, 9 insertions(+), 1 deletion(-)
> >> > > > > >
> >> > > > > > --- sound-2.6.orig/mm/rmap.c
> >> > > > > > +++ sound-2.6/mm/rmap.c
> >> > > > > > @@ -958,7 +958,14 @@ static int try_to_unmap_one(struct page
> >> > > > > >        /* Update high watermark before we lower rss */
> >> > > > > >        update_hiwater_rss(mm);
> >> > > > > >
> >> > > > > > -       if (PageAnon(page)) {
> >> > > > > > +       if (PageHWPoison(page) && !(flags & TTU_IGNORE_HWPOISON)) {
> >> > > > > > +               if (PageAnon(page))
> >> > > > > > +                       dec_mm_counter(mm, anon_rss);
> >> > > > > > +               else if (!is_migration_entry(pte_to_swp_entry(*pte)))
> >> > > > >
> >> > > > > Isn't it straightforward to use !is_hwpoison_entry ?
> >> > > >
> >> > > > Good catch!  It looks like a redundant check: the
> >> > > > page_check_address() at the beginning of the function guarantees that
> >> > > > !is_migration_entry() or !is_migration_entry() tests will all be TRUE.
> >> > > > So let's do this?
> >> > > It seems you expand my sight :)
> >> > >
> >> > > I don't know migration well.
> >> > > How page_check_address guarantee it's not migration entry ?
> >> >
> >> > page_check_address() calls pte_present() which returns the
> >> > (_PAGE_PRESENT | _PAGE_PROTNONE) bits. While x86-64 defines
> >> >
> >> > #define __swp_entry(type, offset)       ((swp_entry_t) { \
> >> >                                          ((type) << (_PAGE_BIT_PRESENT + 1)) \
> >> >                                          | ((offset) << SWP_OFFSET_SHIFT) })
> >> >
> >> > where SWP_OFFSET_SHIFT is defined to the bigger one of
> >> > max(_PAGE_BIT_PROTNONE + 1, _PAGE_BIT_FILE + 1) = max(8+1, 6+1) = 9.
> >> >
> >> > So __swp_entry(type, offset) := (type << 1) | (offset << 9)
> >> >
> >> > We know that the swap type is 5 bits. So the bit 0 _PAGE_PRESENT and bit 8
> >> > _PAGE_PROTNONE will all be zero for swap entries.
> >> >
> >>
> >> Thanks for kind explanation :)
> >
> > You are welcome~
> >
> >> >
> >> > > In addtion, If the page is poison while we are going to
> >> > > migration((PAGE_MIGRATION && migration) == TRUE), we should decrease
> >> > > file_rss ?
> >> >
> >> > It will die on trying to migrate the poisoned page so we don't care
> >> > the accounting. But normally the poisoned page shall already be
> >>
> >>
> >> Okay. then, how about this ?
> >> We should not increase file_rss on trying to migrate the poisoned page
> >>
> >> -               else if (!is_migration_entry(pte_to_swp_entry(*pte)))
> >> +               else if (!(PAGE_MIGRATION && migration))
> >
> > This is good if we are going to stop the hwpoison page from being
> > consumed by move_to_new_page(), but I highly doubt we'll ever add
> > PageHWPoison() checks into the migration code.
> >
> > Because this race window is small enough:
> >
> >        TestSetPageHWPoison(p);
> >                                   lock_page(page);
> >                                   try_to_unmap(page, TTU_MIGRATION|...);
> >        lock_page_nosync(p);
> >
> > such small race windows can be found all over the kernel, it's just
> > insane to try to fix any of them.
> 
> Sorry for too late response.
> 
> I see your point.
> My opinion is that at least we must be notified when such situation happen.
> So I think it would be better to add some warning to fix up it when it
> happen even thought  it is small race window.

Notification is also pointless here: we'll die hard on
accessing/consuming the poisoned page anyway :(

Thanks,
Fengguang


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 09/22] HWPOISON: Handle hardware poisoned pages in  try_to_unmap
  2009-06-17 13:37                 ` Wu Fengguang
@ 2009-06-17 13:43                   ` Minchan Kim
  2009-06-17 14:03                     ` Wu Fengguang
  0 siblings, 1 reply; 79+ messages in thread
From: Minchan Kim @ 2009-06-17 13:43 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, LKML, Andi Kleen, Ingo Molnar, Mel Gorman,
	Thomas Gleixner, H. Peter Anvin, Peter Zijlstra, Nick Piggin,
	Hugh Dickins, Andi Kleen, riel, chris.mason, linux-mm

On Wed, Jun 17, 2009 at 10:37 PM, Wu Fengguang<fengguang.wu@intel.com> wrote:
> On Wed, Jun 17, 2009 at 09:27:36PM +0800, Minchan Kim wrote:
>> On Wed, Jun 17, 2009 at 4:23 PM, Wu Fengguang<fengguang.wu@intel.com> wrote:
>> > On Wed, Jun 17, 2009 at 08:28:26AM +0800, Minchan Kim wrote:
>> >> On Tue, 16 Jun 2009 21:49:44 +0800
>> >> Wu Fengguang <fengguang.wu@intel.com> wrote:
>> >>
>> >> > On Tue, Jun 16, 2009 at 08:03:08AM +0800, Minchan Kim wrote:
>> >> > > On Mon, 15 Jun 2009 23:26:12 +0800
>> >> > > Wu Fengguang <fengguang.wu@intel.com> wrote:
>> >> > >
>> >> > > > On Mon, Jun 15, 2009 at 09:09:03PM +0800, Minchan Kim wrote:
>> >> > > > > On Mon, Jun 15, 2009 at 11:45 AM, Wu Fengguang<fengguang.wu@intel.com> wrote:
>> >> > > > > > From: Andi Kleen <ak@linux.intel.com>
>> >> > > > > >
>> >> > > > > > When a page has the poison bit set replace the PTE with a poison entry.
>> >> > > > > > This causes the right error handling to be done later when a process runs
>> >> > > > > > into it.
>> >> > > > > >
>> >> > > > > > Also add a new flag to not do that (needed for the memory-failure handler
>> >> > > > > > later)
>> >> > > > > >
>> >> > > > > > Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>
>> >> > > > > > Signed-off-by: Andi Kleen <ak@linux.intel.com>
>> >> > > > > >
>> >> > > > > > ---
>> >> > > > > >  include/linux/rmap.h |    1 +
>> >> > > > > >  mm/rmap.c            |    9 ++++++++-
>> >> > > > > >  2 files changed, 9 insertions(+), 1 deletion(-)
>> >> > > > > >
>> >> > > > > > --- sound-2.6.orig/mm/rmap.c
>> >> > > > > > +++ sound-2.6/mm/rmap.c
>> >> > > > > > @@ -958,7 +958,14 @@ static int try_to_unmap_one(struct page
>> >> > > > > >        /* Update high watermark before we lower rss */
>> >> > > > > >        update_hiwater_rss(mm);
>> >> > > > > >
>> >> > > > > > -       if (PageAnon(page)) {
>> >> > > > > > +       if (PageHWPoison(page) && !(flags & TTU_IGNORE_HWPOISON)) {
>> >> > > > > > +               if (PageAnon(page))
>> >> > > > > > +                       dec_mm_counter(mm, anon_rss);
>> >> > > > > > +               else if (!is_migration_entry(pte_to_swp_entry(*pte)))
>> >> > > > >
>> >> > > > > Isn't it straightforward to use !is_hwpoison_entry ?
>> >> > > >
>> >> > > > Good catch!  It looks like a redundant check: the
>> >> > > > page_check_address() at the beginning of the function guarantees that
>> >> > > > !is_migration_entry() or !is_migration_entry() tests will all be TRUE.
>> >> > > > So let's do this?
>> >> > > It seems you expand my sight :)
>> >> > >
>> >> > > I don't know migration well.
>> >> > > How page_check_address guarantee it's not migration entry ?
>> >> >
>> >> > page_check_address() calls pte_present() which returns the
>> >> > (_PAGE_PRESENT | _PAGE_PROTNONE) bits. While x86-64 defines
>> >> >
>> >> > #define __swp_entry(type, offset)       ((swp_entry_t) { \
>> >> >                                          ((type) << (_PAGE_BIT_PRESENT + 1)) \
>> >> >                                          | ((offset) << SWP_OFFSET_SHIFT) })
>> >> >
>> >> > where SWP_OFFSET_SHIFT is defined to the bigger one of
>> >> > max(_PAGE_BIT_PROTNONE + 1, _PAGE_BIT_FILE + 1) = max(8+1, 6+1) = 9.
>> >> >
>> >> > So __swp_entry(type, offset) := (type << 1) | (offset << 9)
>> >> >
>> >> > We know that the swap type is 5 bits. So the bit 0 _PAGE_PRESENT and bit 8
>> >> > _PAGE_PROTNONE will all be zero for swap entries.
>> >> >
>> >>
>> >> Thanks for kind explanation :)
>> >
>> > You are welcome~
>> >
>> >> >
>> >> > > In addtion, If the page is poison while we are going to
>> >> > > migration((PAGE_MIGRATION && migration) == TRUE), we should decrease
>> >> > > file_rss ?
>> >> >
>> >> > It will die on trying to migrate the poisoned page so we don't care
>> >> > the accounting. But normally the poisoned page shall already be
>> >>
>> >>
>> >> Okay. then, how about this ?
>> >> We should not increase file_rss on trying to migrate the poisoned page
>> >>
>> >> -               else if (!is_migration_entry(pte_to_swp_entry(*pte)))
>> >> +               else if (!(PAGE_MIGRATION && migration))
>> >
>> > This is good if we are going to stop the hwpoison page from being
>> > consumed by move_to_new_page(), but I highly doubt we'll ever add
>> > PageHWPoison() checks into the migration code.
>> >
>> > Because this race window is small enough:
>> >
>> >        TestSetPageHWPoison(p);
>> >                                   lock_page(page);
>> >                                   try_to_unmap(page, TTU_MIGRATION|...);
>> >        lock_page_nosync(p);
>> >
>> > such small race windows can be found all over the kernel, it's just
>> > insane to try to fix any of them.
>>
>> Sorry for too late response.
>>
>> I see your point.
>> My opinion is that at least we must be notified when such situation happen.
>> So I think it would be better to add some warning to fix up it when it
>> happen even thought  it is small race window.
>
> Notification is also pointless here: we'll die hard on
> accessing/consuming the poisoned page anyway :(

My intention wasn't to recover it.

It just add something like WARN_ON.
You said it is small window enough. but I think it can happen more
hight probability in migration-workload.(At a moment, I don't know
what kinds of app)
For such case, If we can hear reporting of warning, at that time we
can consider migration handling for HWPoison.

> Thanks,
> Fengguang
>
>



-- 
Kinds regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 09/22] HWPOISON: Handle hardware poisoned pages in try_to_unmap
  2009-06-17 13:43                   ` Minchan Kim
@ 2009-06-17 14:03                     ` Wu Fengguang
  2009-06-17 14:08                       ` Minchan Kim
  0 siblings, 1 reply; 79+ messages in thread
From: Wu Fengguang @ 2009-06-17 14:03 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, LKML, Andi Kleen, Ingo Molnar, Mel Gorman,
	Thomas Gleixner, H. Peter Anvin, Peter Zijlstra, Nick Piggin,
	Hugh Dickins, Andi Kleen, riel, chris.mason, linux-mm

On Wed, Jun 17, 2009 at 09:43:29PM +0800, Minchan Kim wrote:
> On Wed, Jun 17, 2009 at 10:37 PM, Wu Fengguang<fengguang.wu@intel.com> wrote:
> > On Wed, Jun 17, 2009 at 09:27:36PM +0800, Minchan Kim wrote:
> >> On Wed, Jun 17, 2009 at 4:23 PM, Wu Fengguang<fengguang.wu@intel.com> wrote:
> >> > On Wed, Jun 17, 2009 at 08:28:26AM +0800, Minchan Kim wrote:
> >> >> On Tue, 16 Jun 2009 21:49:44 +0800
> >> >> Wu Fengguang <fengguang.wu@intel.com> wrote:
> >> >>
> >> >> > On Tue, Jun 16, 2009 at 08:03:08AM +0800, Minchan Kim wrote:
> >> >> > > On Mon, 15 Jun 2009 23:26:12 +0800
> >> >> > > Wu Fengguang <fengguang.wu@intel.com> wrote:
> >> >> > >
> >> >> > > > On Mon, Jun 15, 2009 at 09:09:03PM +0800, Minchan Kim wrote:
> >> >> > > > > On Mon, Jun 15, 2009 at 11:45 AM, Wu Fengguang<fengguang.wu@intel.com> wrote:
> >> >> > > > > > From: Andi Kleen <ak@linux.intel.com>
> >> >> > > > > >
> >> >> > > > > > When a page has the poison bit set replace the PTE with a poison entry.
> >> >> > > > > > This causes the right error handling to be done later when a process runs
> >> >> > > > > > into it.
> >> >> > > > > >
> >> >> > > > > > Also add a new flag to not do that (needed for the memory-failure handler
> >> >> > > > > > later)
> >> >> > > > > >
> >> >> > > > > > Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>
> >> >> > > > > > Signed-off-by: Andi Kleen <ak@linux.intel.com>
> >> >> > > > > >
> >> >> > > > > > ---
> >> >> > > > > >  include/linux/rmap.h |    1 +
> >> >> > > > > >  mm/rmap.c            |    9 ++++++++-
> >> >> > > > > >  2 files changed, 9 insertions(+), 1 deletion(-)
> >> >> > > > > >
> >> >> > > > > > --- sound-2.6.orig/mm/rmap.c
> >> >> > > > > > +++ sound-2.6/mm/rmap.c
> >> >> > > > > > @@ -958,7 +958,14 @@ static int try_to_unmap_one(struct page
> >> >> > > > > >        /* Update high watermark before we lower rss */
> >> >> > > > > >        update_hiwater_rss(mm);
> >> >> > > > > >
> >> >> > > > > > -       if (PageAnon(page)) {
> >> >> > > > > > +       if (PageHWPoison(page) && !(flags & TTU_IGNORE_HWPOISON)) {
> >> >> > > > > > +               if (PageAnon(page))
> >> >> > > > > > +                       dec_mm_counter(mm, anon_rss);
> >> >> > > > > > +               else if (!is_migration_entry(pte_to_swp_entry(*pte)))
> >> >> > > > >
> >> >> > > > > Isn't it straightforward to use !is_hwpoison_entry ?
> >> >> > > >
> >> >> > > > Good catch!  It looks like a redundant check: the
> >> >> > > > page_check_address() at the beginning of the function guarantees that
> >> >> > > > !is_migration_entry() or !is_migration_entry() tests will all be TRUE.
> >> >> > > > So let's do this?
> >> >> > > It seems you expand my sight :)
> >> >> > >
> >> >> > > I don't know migration well.
> >> >> > > How page_check_address guarantee it's not migration entry ?
> >> >> >
> >> >> > page_check_address() calls pte_present() which returns the
> >> >> > (_PAGE_PRESENT | _PAGE_PROTNONE) bits. While x86-64 defines
> >> >> >
> >> >> > #define __swp_entry(type, offset)       ((swp_entry_t) { \
> >> >> >                                          ((type) << (_PAGE_BIT_PRESENT + 1)) \
> >> >> >                                          | ((offset) << SWP_OFFSET_SHIFT) })
> >> >> >
> >> >> > where SWP_OFFSET_SHIFT is defined to the bigger one of
> >> >> > max(_PAGE_BIT_PROTNONE + 1, _PAGE_BIT_FILE + 1) = max(8+1, 6+1) = 9.
> >> >> >
> >> >> > So __swp_entry(type, offset) := (type << 1) | (offset << 9)
> >> >> >
> >> >> > We know that the swap type is 5 bits. So the bit 0 _PAGE_PRESENT and bit 8
> >> >> > _PAGE_PROTNONE will all be zero for swap entries.
> >> >> >
> >> >>
> >> >> Thanks for kind explanation :)
> >> >
> >> > You are welcome~
> >> >
> >> >> >
> >> >> > > In addtion, If the page is poison while we are going to
> >> >> > > migration((PAGE_MIGRATION && migration) == TRUE), we should decrease
> >> >> > > file_rss ?
> >> >> >
> >> >> > It will die on trying to migrate the poisoned page so we don't care
> >> >> > the accounting. But normally the poisoned page shall already be
> >> >>
> >> >>
> >> >> Okay. then, how about this ?
> >> >> We should not increase file_rss on trying to migrate the poisoned page
> >> >>
> >> >> -               else if (!is_migration_entry(pte_to_swp_entry(*pte)))
> >> >> +               else if (!(PAGE_MIGRATION && migration))
> >> >
> >> > This is good if we are going to stop the hwpoison page from being
> >> > consumed by move_to_new_page(), but I highly doubt we'll ever add
> >> > PageHWPoison() checks into the migration code.
> >> >
> >> > Because this race window is small enough:
> >> >
> >> >        TestSetPageHWPoison(p);
> >> >                                   lock_page(page);
> >> >                                   try_to_unmap(page, TTU_MIGRATION|...);
> >> >        lock_page_nosync(p);
> >> >
> >> > such small race windows can be found all over the kernel, it's just
> >> > insane to try to fix any of them.
> >>
> >> Sorry for too late response.
> >>
> >> I see your point.
> >> My opinion is that at least we must be notified when such situation happen.
> >> So I think it would be better to add some warning to fix up it when it
> >> happen even thought  it is small race window.
> >
> > Notification is also pointless here: we'll die hard on
> > accessing/consuming the poisoned page anyway :(
> 
> My intention wasn't to recover it.

Yes, that's not the point.

> It just add something like WARN_ON.
> You said it is small window enough. but I think it can happen more
> hight probability in migration-workload.(At a moment, I don't know
> what kinds of app)
> For such case, If we can hear reporting of warning, at that time we
> can consider migration handling for HWPoison.

The point is, any page can go corrupted any time. We don't need to add
1000 PageHWPoison() tests in the kernel like this. We don't aim for
100% protection, that's impossible. I'd be very contented if ever it
can reach 80% coverage :)

Thanks,
Fengguang


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 09/22] HWPOISON: Handle hardware poisoned pages in  try_to_unmap
  2009-06-17 14:03                     ` Wu Fengguang
@ 2009-06-17 14:08                       ` Minchan Kim
  2009-06-17 14:12                         ` Wu Fengguang
  0 siblings, 1 reply; 79+ messages in thread
From: Minchan Kim @ 2009-06-17 14:08 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, LKML, Andi Kleen, Ingo Molnar, Mel Gorman,
	Thomas Gleixner, H. Peter Anvin, Peter Zijlstra, Nick Piggin,
	Hugh Dickins, Andi Kleen, riel, chris.mason, linux-mm

On Wed, Jun 17, 2009 at 11:03 PM, Wu Fengguang<fengguang.wu@intel.com> wrote:
> On Wed, Jun 17, 2009 at 09:43:29PM +0800, Minchan Kim wrote:
>> On Wed, Jun 17, 2009 at 10:37 PM, Wu Fengguang<fengguang.wu@intel.com> wrote:
>> > On Wed, Jun 17, 2009 at 09:27:36PM +0800, Minchan Kim wrote:
>> >> On Wed, Jun 17, 2009 at 4:23 PM, Wu Fengguang<fengguang.wu@intel.com> wrote:
>> >> > On Wed, Jun 17, 2009 at 08:28:26AM +0800, Minchan Kim wrote:
>> >> >> On Tue, 16 Jun 2009 21:49:44 +0800
>> >> >> Wu Fengguang <fengguang.wu@intel.com> wrote:
>> >> >>
>> >> >> > On Tue, Jun 16, 2009 at 08:03:08AM +0800, Minchan Kim wrote:
>> >> >> > > On Mon, 15 Jun 2009 23:26:12 +0800
>> >> >> > > Wu Fengguang <fengguang.wu@intel.com> wrote:
>> >> >> > >
>> >> >> > > > On Mon, Jun 15, 2009 at 09:09:03PM +0800, Minchan Kim wrote:
>> >> >> > > > > On Mon, Jun 15, 2009 at 11:45 AM, Wu Fengguang<fengguang.wu@intel.com> wrote:
>> >> >> > > > > > From: Andi Kleen <ak@linux.intel.com>
>> >> >> > > > > >
>> >> >> > > > > > When a page has the poison bit set replace the PTE with a poison entry.
>> >> >> > > > > > This causes the right error handling to be done later when a process runs
>> >> >> > > > > > into it.
>> >> >> > > > > >
>> >> >> > > > > > Also add a new flag to not do that (needed for the memory-failure handler
>> >> >> > > > > > later)
>> >> >> > > > > >
>> >> >> > > > > > Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>
>> >> >> > > > > > Signed-off-by: Andi Kleen <ak@linux.intel.com>
>> >> >> > > > > >
>> >> >> > > > > > ---
>> >> >> > > > > >  include/linux/rmap.h |    1 +
>> >> >> > > > > >  mm/rmap.c            |    9 ++++++++-
>> >> >> > > > > >  2 files changed, 9 insertions(+), 1 deletion(-)
>> >> >> > > > > >
>> >> >> > > > > > --- sound-2.6.orig/mm/rmap.c
>> >> >> > > > > > +++ sound-2.6/mm/rmap.c
>> >> >> > > > > > @@ -958,7 +958,14 @@ static int try_to_unmap_one(struct page
>> >> >> > > > > >        /* Update high watermark before we lower rss */
>> >> >> > > > > >        update_hiwater_rss(mm);
>> >> >> > > > > >
>> >> >> > > > > > -       if (PageAnon(page)) {
>> >> >> > > > > > +       if (PageHWPoison(page) && !(flags & TTU_IGNORE_HWPOISON)) {
>> >> >> > > > > > +               if (PageAnon(page))
>> >> >> > > > > > +                       dec_mm_counter(mm, anon_rss);
>> >> >> > > > > > +               else if (!is_migration_entry(pte_to_swp_entry(*pte)))
>> >> >> > > > >
>> >> >> > > > > Isn't it straightforward to use !is_hwpoison_entry ?
>> >> >> > > >
>> >> >> > > > Good catch!  It looks like a redundant check: the
>> >> >> > > > page_check_address() at the beginning of the function guarantees that
>> >> >> > > > !is_migration_entry() or !is_migration_entry() tests will all be TRUE.
>> >> >> > > > So let's do this?
>> >> >> > > It seems you expand my sight :)
>> >> >> > >
>> >> >> > > I don't know migration well.
>> >> >> > > How page_check_address guarantee it's not migration entry ?
>> >> >> >
>> >> >> > page_check_address() calls pte_present() which returns the
>> >> >> > (_PAGE_PRESENT | _PAGE_PROTNONE) bits. While x86-64 defines
>> >> >> >
>> >> >> > #define __swp_entry(type, offset)       ((swp_entry_t) { \
>> >> >> >                                          ((type) << (_PAGE_BIT_PRESENT + 1)) \
>> >> >> >                                          | ((offset) << SWP_OFFSET_SHIFT) })
>> >> >> >
>> >> >> > where SWP_OFFSET_SHIFT is defined to the bigger one of
>> >> >> > max(_PAGE_BIT_PROTNONE + 1, _PAGE_BIT_FILE + 1) = max(8+1, 6+1) = 9.
>> >> >> >
>> >> >> > So __swp_entry(type, offset) := (type << 1) | (offset << 9)
>> >> >> >
>> >> >> > We know that the swap type is 5 bits. So the bit 0 _PAGE_PRESENT and bit 8
>> >> >> > _PAGE_PROTNONE will all be zero for swap entries.
>> >> >> >
>> >> >>
>> >> >> Thanks for kind explanation :)
>> >> >
>> >> > You are welcome~
>> >> >
>> >> >> >
>> >> >> > > In addtion, If the page is poison while we are going to
>> >> >> > > migration((PAGE_MIGRATION && migration) == TRUE), we should decrease
>> >> >> > > file_rss ?
>> >> >> >
>> >> >> > It will die on trying to migrate the poisoned page so we don't care
>> >> >> > the accounting. But normally the poisoned page shall already be
>> >> >>
>> >> >>
>> >> >> Okay. then, how about this ?
>> >> >> We should not increase file_rss on trying to migrate the poisoned page
>> >> >>
>> >> >> -               else if (!is_migration_entry(pte_to_swp_entry(*pte)))
>> >> >> +               else if (!(PAGE_MIGRATION && migration))
>> >> >
>> >> > This is good if we are going to stop the hwpoison page from being
>> >> > consumed by move_to_new_page(), but I highly doubt we'll ever add
>> >> > PageHWPoison() checks into the migration code.
>> >> >
>> >> > Because this race window is small enough:
>> >> >
>> >> >        TestSetPageHWPoison(p);
>> >> >                                   lock_page(page);
>> >> >                                   try_to_unmap(page, TTU_MIGRATION|...);
>> >> >        lock_page_nosync(p);
>> >> >
>> >> > such small race windows can be found all over the kernel, it's just
>> >> > insane to try to fix any of them.
>> >>
>> >> Sorry for too late response.
>> >>
>> >> I see your point.
>> >> My opinion is that at least we must be notified when such situation happen.
>> >> So I think it would be better to add some warning to fix up it when it
>> >> happen even thought  it is small race window.
>> >
>> > Notification is also pointless here: we'll die hard on
>> > accessing/consuming the poisoned page anyway :(
>>
>> My intention wasn't to recover it.
>
> Yes, that's not the point.
>
>> It just add something like WARN_ON.
>> You said it is small window enough. but I think it can happen more
>> hight probability in migration-workload.(At a moment, I don't know
>> what kinds of app)
>> For such case, If we can hear reporting of warning, at that time we
>> can consider migration handling for HWPoison.
>
> The point is, any page can go corrupted any time. We don't need to add
> 1000 PageHWPoison() tests in the kernel like this. We don't aim for
> 100% protection, that's impossible. I'd be very contented if ever it
> can reach 80% coverage :)

Okay.
If it is your goal, I also think migration portion of all is very small.
Thanks for kind reply for my boring discussion.

Reviewed-by : Minchan Kim <minchan.kim@gmail.com>

> Thanks,
> Fengguang
>
>



-- 
Kinds regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 09/22] HWPOISON: Handle hardware poisoned pages in try_to_unmap
  2009-06-17 14:08                       ` Minchan Kim
@ 2009-06-17 14:12                         ` Wu Fengguang
  0 siblings, 0 replies; 79+ messages in thread
From: Wu Fengguang @ 2009-06-17 14:12 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, LKML, Andi Kleen, Ingo Molnar, Mel Gorman,
	Thomas Gleixner, H. Peter Anvin, Peter Zijlstra, Nick Piggin,
	Hugh Dickins, Andi Kleen, riel, chris.mason, linux-mm

On Wed, Jun 17, 2009 at 10:08:13PM +0800, Minchan Kim wrote:
> On Wed, Jun 17, 2009 at 11:03 PM, Wu Fengguang<fengguang.wu@intel.com> wrote:
> > On Wed, Jun 17, 2009 at 09:43:29PM +0800, Minchan Kim wrote:
> >> On Wed, Jun 17, 2009 at 10:37 PM, Wu Fengguang<fengguang.wu@intel.com> wrote:
> >> > On Wed, Jun 17, 2009 at 09:27:36PM +0800, Minchan Kim wrote:
> >> >> On Wed, Jun 17, 2009 at 4:23 PM, Wu Fengguang<fengguang.wu@intel.com> wrote:
> >> >> > On Wed, Jun 17, 2009 at 08:28:26AM +0800, Minchan Kim wrote:
> >> >> >> On Tue, 16 Jun 2009 21:49:44 +0800
> >> >> >> Wu Fengguang <fengguang.wu@intel.com> wrote:
> >> >> >>
> >> >> >> > On Tue, Jun 16, 2009 at 08:03:08AM +0800, Minchan Kim wrote:
> >> >> >> > > On Mon, 15 Jun 2009 23:26:12 +0800
> >> >> >> > > Wu Fengguang <fengguang.wu@intel.com> wrote:
> >> >> >> > >
> >> >> >> > > > On Mon, Jun 15, 2009 at 09:09:03PM +0800, Minchan Kim wrote:
> >> >> >> > > > > On Mon, Jun 15, 2009 at 11:45 AM, Wu Fengguang<fengguang.wu@intel.com> wrote:
> >> >> >> > > > > > From: Andi Kleen <ak@linux.intel.com>
> >> >> >> > > > > >
> >> >> >> > > > > > When a page has the poison bit set replace the PTE with a poison entry.
> >> >> >> > > > > > This causes the right error handling to be done later when a process runs
> >> >> >> > > > > > into it.
> >> >> >> > > > > >
> >> >> >> > > > > > Also add a new flag to not do that (needed for the memory-failure handler
> >> >> >> > > > > > later)
> >> >> >> > > > > >
> >> >> >> > > > > > Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>
> >> >> >> > > > > > Signed-off-by: Andi Kleen <ak@linux.intel.com>
> >> >> >> > > > > >
> >> >> >> > > > > > ---
> >> >> >> > > > > >  include/linux/rmap.h |    1 +
> >> >> >> > > > > >  mm/rmap.c            |    9 ++++++++-
> >> >> >> > > > > >  2 files changed, 9 insertions(+), 1 deletion(-)
> >> >> >> > > > > >
> >> >> >> > > > > > --- sound-2.6.orig/mm/rmap.c
> >> >> >> > > > > > +++ sound-2.6/mm/rmap.c
> >> >> >> > > > > > @@ -958,7 +958,14 @@ static int try_to_unmap_one(struct page
> >> >> >> > > > > >        /* Update high watermark before we lower rss */
> >> >> >> > > > > >        update_hiwater_rss(mm);
> >> >> >> > > > > >
> >> >> >> > > > > > -       if (PageAnon(page)) {
> >> >> >> > > > > > +       if (PageHWPoison(page) && !(flags & TTU_IGNORE_HWPOISON)) {
> >> >> >> > > > > > +               if (PageAnon(page))
> >> >> >> > > > > > +                       dec_mm_counter(mm, anon_rss);
> >> >> >> > > > > > +               else if (!is_migration_entry(pte_to_swp_entry(*pte)))
> >> >> >> > > > >
> >> >> >> > > > > Isn't it straightforward to use !is_hwpoison_entry ?
> >> >> >> > > >
> >> >> >> > > > Good catch!  It looks like a redundant check: the
> >> >> >> > > > page_check_address() at the beginning of the function guarantees that
> >> >> >> > > > !is_migration_entry() or !is_migration_entry() tests will all be TRUE.
> >> >> >> > > > So let's do this?
> >> >> >> > > It seems you expand my sight :)
> >> >> >> > >
> >> >> >> > > I don't know migration well.
> >> >> >> > > How page_check_address guarantee it's not migration entry ?
> >> >> >> >
> >> >> >> > page_check_address() calls pte_present() which returns the
> >> >> >> > (_PAGE_PRESENT | _PAGE_PROTNONE) bits. While x86-64 defines
> >> >> >> >
> >> >> >> > #define __swp_entry(type, offset)       ((swp_entry_t) { \
> >> >> >> >                                          ((type) << (_PAGE_BIT_PRESENT + 1)) \
> >> >> >> >                                          | ((offset) << SWP_OFFSET_SHIFT) })
> >> >> >> >
> >> >> >> > where SWP_OFFSET_SHIFT is defined to the bigger one of
> >> >> >> > max(_PAGE_BIT_PROTNONE + 1, _PAGE_BIT_FILE + 1) = max(8+1, 6+1) = 9.
> >> >> >> >
> >> >> >> > So __swp_entry(type, offset) := (type << 1) | (offset << 9)
> >> >> >> >
> >> >> >> > We know that the swap type is 5 bits. So the bit 0 _PAGE_PRESENT and bit 8
> >> >> >> > _PAGE_PROTNONE will all be zero for swap entries.
> >> >> >> >
> >> >> >>
> >> >> >> Thanks for kind explanation :)
> >> >> >
> >> >> > You are welcome~
> >> >> >
> >> >> >> >
> >> >> >> > > In addtion, If the page is poison while we are going to
> >> >> >> > > migration((PAGE_MIGRATION && migration) == TRUE), we should decrease
> >> >> >> > > file_rss ?
> >> >> >> >
> >> >> >> > It will die on trying to migrate the poisoned page so we don't care
> >> >> >> > the accounting. But normally the poisoned page shall already be
> >> >> >>
> >> >> >>
> >> >> >> Okay. then, how about this ?
> >> >> >> We should not increase file_rss on trying to migrate the poisoned page
> >> >> >>
> >> >> >> -               else if (!is_migration_entry(pte_to_swp_entry(*pte)))
> >> >> >> +               else if (!(PAGE_MIGRATION && migration))
> >> >> >
> >> >> > This is good if we are going to stop the hwpoison page from being
> >> >> > consumed by move_to_new_page(), but I highly doubt we'll ever add
> >> >> > PageHWPoison() checks into the migration code.
> >> >> >
> >> >> > Because this race window is small enough:
> >> >> >
> >> >> >        TestSetPageHWPoison(p);
> >> >> >                                   lock_page(page);
> >> >> >                                   try_to_unmap(page, TTU_MIGRATION|...);
> >> >> >        lock_page_nosync(p);
> >> >> >
> >> >> > such small race windows can be found all over the kernel, it's just
> >> >> > insane to try to fix any of them.
> >> >>
> >> >> Sorry for too late response.
> >> >>
> >> >> I see your point.
> >> >> My opinion is that at least we must be notified when such situation happen.
> >> >> So I think it would be better to add some warning to fix up it when it
> >> >> happen even thought  it is small race window.
> >> >
> >> > Notification is also pointless here: we'll die hard on
> >> > accessing/consuming the poisoned page anyway :(
> >>
> >> My intention wasn't to recover it.
> >
> > Yes, that's not the point.
> >
> >> It just add something like WARN_ON.
> >> You said it is small window enough. but I think it can happen more
> >> hight probability in migration-workload.(At a moment, I don't know
> >> what kinds of app)
> >> For such case, If we can hear reporting of warning, at that time we
> >> can consider migration handling for HWPoison.
> >
> > The point is, any page can go corrupted any time. We don't need to add
> > 1000 PageHWPoison() tests in the kernel like this. We don't aim for
> > 100% protection, that's impossible. I'd be very contented if ever it
> > can reach 80% coverage :)
> 
> Okay.
> If it is your goal, I also think migration portion of all is very small.
> Thanks for kind reply for my boring discussion.
> 
> Reviewed-by : Minchan Kim <minchan.kim@gmail.com>

Thank you, I'll add comments to clearly state that goal and its rational :)


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [RFC][PATCH] HWPOISON: only early kill processes who installed SIGBUS handler
  2009-06-17  8:04                   ` Nick Piggin
  2009-06-17  9:55                     ` Wu Fengguang
@ 2009-06-18  9:56                     ` Wu Fengguang
  1 sibling, 0 replies; 79+ messages in thread
From: Wu Fengguang @ 2009-06-18  9:56 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andi Kleen, Balbir Singh, Andrew Morton, LKML, Ingo Molnar,
	Mel Gorman, Thomas Gleixner, H. Peter Anvin, Peter Zijlstra,
	Hugh Dickins, riel, chris.mason, linux-mm

On Wed, Jun 17, 2009 at 04:04:04PM +0800, Nick Piggin wrote:
> On Wed, Jun 17, 2009 at 02:37:02PM +0800, Wu Fengguang wrote:
> > On Mon, Jun 15, 2009 at 10:22:25PM +0800, Wu Fengguang wrote:
> > > On Mon, Jun 15, 2009 at 08:25:28PM +0800, Nick Piggin wrote:
> > > > On Mon, Jun 15, 2009 at 08:10:01PM +0800, Wu Fengguang wrote:
> > > > > On Mon, Jun 15, 2009 at 03:19:07PM +0800, Nick Piggin wrote:
> > > > > > > For KVM you need early kill, for the others it remains to be seen.
> > > > > > 
> > > > > > Right. It's almost like you need to do a per-process thing, and
> > > > > > those that can handle things (such as the new SIGBUS or the new
> > > > > > EIO) could get those, and others could be killed.
> > > > > 
> > > > > To send early SIGBUS kills to processes who has called
> > > > > sigaction(SIGBUS, ...)?  KVM will sure do that. For other apps we
> > > > > don't mind they can understand that signal at all.
> > > > 
> > > > For apps that hook into SIGBUS for some other means and
> > > 
> > > Yes I was referring to the sigaction(SIGBUS) apps, others will
> > > be late killed anyway.
> > > 
> > > > do not understand the new type of SIGBUS signal? What about
> > > > those?
> > > 
> > > We introduced two new SIGBUS codes:
> > >         BUS_MCEERR_AO=5         for early kill
> > >         BUS_MCEERR_AR=4         for late  kill
> > > I'd assume a legacy application will handle them in the same way (both
> > > are unexpected code to the application).
> > > 
> > > We don't care whether the application can be killed by BUS_MCEERR_AO
> > > or BUS_MCEERR_AR depending on its SIGBUS handler implementation.
> > > But (in the rare case) if the handler
> > > - refused to die on BUS_MCEERR_AR, it may create a busy loop and
> > >   flooding of SIGBUS signals, which is a bug of the application.
> > >   BUS_MCEERR_AO is one time and won't lead to busy loops.
> > > - does something that hurts itself (ie. data safety) on BUS_MCEERR_AO,
> > >   it may well hurt the same way on BUS_MCEERR_AR. The latter one is
> > >   unavoidable, so the application must be fixed anyway.
> > 
> > This patch materializes the automatically early kill idea.
> > It aims to remove the vm.memory_failure_ealy_kill sysctl parameter.
> > 
> > This is mainly a policy change, please comment.
> 
> Well then you can still early-kill random apps that did not
> want it, and you may still cause problems if its sigbus
> handler does something nontrivial.
> 
> Can you use a prctl or something so it can expclitly
> register interest in this?

OK, this patch allows one to request early kill by calling
prctl(PR_MEMORY_FAILURE_EARLY_KILL, 1, ...).

Now either app or admin can choose to enable/disable early kill on
a per-process basis. But still, an admin won't be able to change the
behavior of an application who calls prctl() to set the option by itself.

Thanks,
Fengguang

---
 include/linux/prctl.h |    6 ++++++
 include/linux/sched.h |    1 +
 kernel/sys.c          |    6 ++++++
 mm/memory-failure.c   |   12 ++++++++++--
 4 files changed, 23 insertions(+), 2 deletions(-)

--- sound-2.6.orig/include/linux/prctl.h
+++ sound-2.6/include/linux/prctl.h
@@ -88,4 +88,10 @@
 #define PR_TASK_PERF_COUNTERS_DISABLE		31
 #define PR_TASK_PERF_COUNTERS_ENABLE		32
 
+/*
+ * Send early SIGBUS.BUS_MCEERR_AO notification on memory corruption?
+ * Useful for KVM and mission critical apps.
+ */
+#define PR_MEMORY_FAILURE_EARLY_KILL		33
+
 #endif /* _LINUX_PRCTL_H */
--- sound-2.6.orig/include/linux/sched.h
+++ sound-2.6/include/linux/sched.h
@@ -1666,6 +1666,7 @@ extern cputime_t task_gtime(struct task_
 #define PF_MEMALLOC	0x00000800	/* Allocating memory */
 #define PF_FLUSHER	0x00001000	/* responsible for disk writeback */
 #define PF_USED_MATH	0x00002000	/* if unset the fpu must be initialized before use */
+#define PF_EARLY_KILL	0x00004000	/* early kill me on memory failure */
 #define PF_NOFREEZE	0x00008000	/* this thread should not be frozen */
 #define PF_FROZEN	0x00010000	/* frozen for system suspend */
 #define PF_FSTRANS	0x00020000	/* inside a filesystem transaction */
--- sound-2.6.orig/kernel/sys.c
+++ sound-2.6/kernel/sys.c
@@ -1545,6 +1545,12 @@ SYSCALL_DEFINE5(prctl, int, option, unsi
 				current->timer_slack_ns = arg2;
 			error = 0;
 			break;
+		case PR_MEMORY_FAILURE_EARLY_KILL:
+			if (arg2)
+				me->flags |= PF_EARLY_KILL;
+			else
+				me->flags &= ~PF_EARLY_KILL;
+			break;
 		default:
 			error = -EINVAL;
 			break;
--- sound-2.6.orig/mm/memory-failure.c
+++ sound-2.6/mm/memory-failure.c
@@ -205,6 +205,14 @@ static void kill_procs_ao(struct list_he
 	}
 }
 
+static bool task_early_kill_elegible(struct task_struct *tsk)
+{
+	if (!tsk->mm)
+		return false;
+
+	return tsk->flags & PF_EARLY_KILL;
+}
+
 /*
  * Collect processes when the error hit an anonymous page.
  */
@@ -222,7 +230,7 @@ static void collect_procs_anon(struct pa
 		goto out;
 
 	for_each_process (tsk) {
-		if (!tsk->mm)
+		if (!task_early_kill_elegible(tsk))
 			continue;
 		list_for_each_entry (vma, &av->head, anon_vma_node) {
 			if (!page_mapped_in_vma(page, vma))
@@ -262,7 +270,7 @@ static void collect_procs_file(struct pa
 	for_each_process(tsk) {
 		pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
 
-		if (!tsk->mm)
+		if (!task_early_kill_elegible(tsk))
 			continue;
 
 		vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff,

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 09/22] HWPOISON: Handle hardware poisoned pages in try_to_unmap
       [not found]                   ` <28c262360906170703h3363b68dp74471358f647921e@mail.gmail.com>
@ 2009-06-18 12:14                     ` Wu Fengguang
  2009-06-18 13:31                       ` Minchan Kim
  0 siblings, 1 reply; 79+ messages in thread
From: Wu Fengguang @ 2009-06-18 12:14 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, LKML, Andi Kleen, Ingo Molnar, Mel Gorman,
	Thomas Gleixner, H. Peter Anvin, Peter Zijlstra, Nick Piggin,
	Hugh Dickins, Andi Kleen, riel, chris.mason, linux-mm

On Wed, Jun 17, 2009 at 10:03:37PM +0800, Minchan Kim wrote:
> On Wed, Jun 17, 2009 at 10:55 PM, Wu Fengguang<fengguang.wu@intel.com> wrote:
> > On Wed, Jun 17, 2009 at 09:44:39PM +0800, Minchan Kim wrote:
> >> It is private mail for my question.
> >> I don't want to make noise in LKML.
> >> And I don't want to disturb your progress to merge HWPoison.
> >>
> >> > Because this race window is small enough:
> >> >
> >> >        TestSetPageHWPoison(p);
> >> >                                   lock_page(page);
> >> >                                   try_to_unmap(page, TTU_MIGRATION|...);
> >> >        lock_page_nosync(p);
> >> >
> >> > such small race windows can be found all over the kernel, it's just
> >> > insane to try to fix any of them.
> >>
> >> I don't know there are intentional small race windows in kernel until you said.
> >> I thought kernel code is perfect so it wouldn't allow race window
> >> although it is very small. But you pointed out. Until now, My thought
> >> is wrong.
> >>
> >> Do you know else small race windows by intention ?
> >> If you know it, tell me, please. It can expand my sight. :)
> >
> > The memory failure code does not aim to rescue 100% page corruptions.
> > That's unreasonable goal - the kernel pages, slab pages (including the
> > big dcache/icache) are almost impossible to isolate.
> >
> > Comparing to the big slab pools, the migration and other race windows are
> > really too small to care about :)
> 
> Also, If you will mention this contents as annotation, I will add my
> review sign.

Good suggestion. Here is a patch for comment updates.

> Thanks for kind reply for my boring discussion.

Boring? Not at all :)

Thanks,
Fengguang

---
 mm/memory-failure.c |   76 +++++++++++++++++++++++++-----------------
 1 file changed, 47 insertions(+), 29 deletions(-)

--- sound-2.6.orig/mm/memory-failure.c
+++ sound-2.6/mm/memory-failure.c
@@ -1,4 +1,8 @@
 /*
+ * linux/mm/memory-failure.c
+ *
+ * High level machine check handler.
+ *
  * Copyright (C) 2008, 2009 Intel Corporation
  * Authors: Andi Kleen, Fengguang Wu
  *
@@ -6,29 +10,36 @@
  * the GNU General Public License ("GPL") version 2 only as published by the
  * Free Software Foundation.
  *
- * High level machine check handler. Handles pages reported by the
- * hardware as being corrupted usually due to a 2bit ECC memory or cache
- * failure.
- *
- * This focuses on pages detected as corrupted in the background.
- * When the current CPU tries to consume corruption the currently
- * running process can just be killed directly instead. This implies
- * that if the error cannot be handled for some reason it's safe to
- * just ignore it because no corruption has been consumed yet. Instead
- * when that happens another machine check will happen.
- *
- * Handles page cache pages in various states.	The tricky part
- * here is that we can access any page asynchronous to other VM
- * users, because memory failures could happen anytime and anywhere,
- * possibly violating some of their assumptions. This is why this code
- * has to be extremely careful. Generally it tries to use normal locking
- * rules, as in get the standard locks, even if that means the
- * error handling takes potentially a long time.
- *
- * The operation to map back from RMAP chains to processes has to walk
- * the complete process list and has non linear complexity with the number
- * mappings. In short it can be quite slow. But since memory corruptions
- * are rare we hope to get away with this.
+ * Pages are reported by the hardware as being corrupted usually due to a
+ * 2bit ECC memory or cache failure. Machine check can either be raised when
+ * corruption is found in background memory scrubbing, or when someone tries to
+ * consume the corruption. This code focuses on the former case.  If it cannot
+ * handle the error for some reason it's safe to just ignore it because no
+ * corruption has been consumed yet. Instead when that happens another (deadly)
+ * machine check will happen.
+ *
+ * The tricky part here is that we can access any page asynchronous to other VM
+ * users, because memory failures could happen anytime and anywhere, possibly
+ * violating some of their assumptions. This is why this code has to be
+ * extremely careful. Generally it tries to use normal locking rules, as in get
+ * the standard locks, even if that means the error handling takes potentially
+ * a long time.
+ *
+ * We don't aim to rescue 100% corruptions. That's unreasonable goal - the
+ * kernel text and slab pages (including the big dcache/icache) are almost
+ * impossible to isolate. We also try to keep the code clean by ignoring the
+ * other thousands of small corruption windows.
+ *
+ * When the corrupted page data is not recoverable, the tasks mapped the page
+ * have to be killed. We offer two kill options:
+ * - early kill with SIGBUS.BUS_MCEERR_AO (optional)
+ * - late  kill with SIGBUS.BUS_MCEERR_AR (mandatory)
+ * A task will be early killed as soon as corruption is found in its virtual
+ * address space, if it has called prctl(PR_MEMORY_FAILURE_EARLY_KILL, 1, ...);
+ * Any task will be late killed when it tries to access its corrupted virtual
+ * address. The early kill option offers KVM or other apps with large caches an
+ * opportunity to isolate the corrupted page from its internal cache, so as to
+ * avoid being late killed.
  */
 
 /*
@@ -275,6 +286,12 @@ static void collect_procs_file(struct pa
 
 		vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff,
 				      pgoff)
+			/*
+			 * Send early kill signal to tasks whose vma covers
+			 * the page but not necessarily mapped it in its pte.
+			 * Applications who requested early kill normally want
+			 * to be informed of such data corruptions.
+			 */
 			if (vma->vm_mm == tsk->mm)
 				add_to_kill(tsk, page, vma, to_kill, tkc);
 	}
@@ -284,6 +301,12 @@ static void collect_procs_file(struct pa
 
 /*
  * Collect the processes who have the corrupted page mapped to kill.
+ *
+ * The operation to map back from RMAP chains to processes has to walk
+ * the complete process list and has non linear complexity with the number
+ * mappings. In short it can be quite slow. But since memory corruptions
+ * are rare and only tasks flagged PF_EARLY_KILL will be searched, we hope to
+ * get away with this.
  */
 static void collect_procs(struct page *page, struct list_head *tokill)
 {
@@ -439,7 +462,7 @@ static int me_pagecache_dirty(struct pag
  * Dirty swap cache page is tricky to handle. The page could live both in page
  * cache and swap cache(ie. page is freshly swapped in). So it could be
  * referenced concurrently by 2 types of PTEs:
- * normal PTEs and swap PTEs. We try to handle them consistently by calling u
+ * normal PTEs and swap PTEs. We try to handle them consistently by calling
  * try_to_unmap(TTU_IGNORE_HWPOISON) to convert the normal PTEs to swap PTEs,
  * and then
  *      - clear dirty bit to prevent IO
@@ -647,11 +670,6 @@ static void hwpoison_user_mappings(struc
 	 * mapped.  This has to be done before try_to_unmap,
 	 * because ttu takes the rmap data structures down.
 	 *
-	 * This also has the side effect to propagate the dirty
-	 * bit from PTEs into the struct page. This is needed
-	 * to actually decide if something needs to be killed
-	 * or errored, or if it's ok to just drop the page.
-	 *
 	 * Error handling: We ignore errors here because
 	 * there's nothing that can be done.
 	 */

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 09/22] HWPOISON: Handle hardware poisoned pages in  try_to_unmap
  2009-06-18 12:14                     ` Wu Fengguang
@ 2009-06-18 13:31                       ` Minchan Kim
  2009-06-19  1:58                         ` Wu Fengguang
  0 siblings, 1 reply; 79+ messages in thread
From: Minchan Kim @ 2009-06-18 13:31 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, LKML, Andi Kleen, Ingo Molnar, Mel Gorman,
	Thomas Gleixner, H. Peter Anvin, Peter Zijlstra, Nick Piggin,
	Hugh Dickins, Andi Kleen, riel, chris.mason, linux-mm

On Thu, Jun 18, 2009 at 9:14 PM, Wu Fengguang<fengguang.wu@intel.com> wrote:
> On Wed, Jun 17, 2009 at 10:03:37PM +0800, Minchan Kim wrote:
>> On Wed, Jun 17, 2009 at 10:55 PM, Wu Fengguang<fengguang.wu@intel.com> wrote:
>> > On Wed, Jun 17, 2009 at 09:44:39PM +0800, Minchan Kim wrote:
>> >> It is private mail for my question.
>> >> I don't want to make noise in LKML.
>> >> And I don't want to disturb your progress to merge HWPoison.
>> >>
>> >> > Because this race window is small enough:
>> >> >
>> >> >        TestSetPageHWPoison(p);
>> >> >                                   lock_page(page);
>> >> >                                   try_to_unmap(page, TTU_MIGRATION|...);
>> >> >        lock_page_nosync(p);
>> >> >
>> >> > such small race windows can be found all over the kernel, it's just
>> >> > insane to try to fix any of them.
>> >>
>> >> I don't know there are intentional small race windows in kernel until you said.
>> >> I thought kernel code is perfect so it wouldn't allow race window
>> >> although it is very small. But you pointed out. Until now, My thought
>> >> is wrong.
>> >>
>> >> Do you know else small race windows by intention ?
>> >> If you know it, tell me, please. It can expand my sight. :)
>> >
>> > The memory failure code does not aim to rescue 100% page corruptions.
>> > That's unreasonable goal - the kernel pages, slab pages (including the
>> > big dcache/icache) are almost impossible to isolate.
>> >
>> > Comparing to the big slab pools, the migration and other race windows are
>> > really too small to care about :)
>>
>> Also, If you will mention this contents as annotation, I will add my
>> review sign.
>
> Good suggestion. Here is a patch for comment updates.
>
>> Thanks for kind reply for my boring discussion.
>
> Boring? Not at all :)
>
> Thanks,
> Fengguang
>
> ---
>  mm/memory-failure.c |   76 +++++++++++++++++++++++++-----------------
>  1 file changed, 47 insertions(+), 29 deletions(-)
>
> --- sound-2.6.orig/mm/memory-failure.c
> +++ sound-2.6/mm/memory-failure.c
> @@ -1,4 +1,8 @@
>  /*
> + * linux/mm/memory-failure.c
> + *
> + * High level machine check handler.
> + *
>  * Copyright (C) 2008, 2009 Intel Corporation
>  * Authors: Andi Kleen, Fengguang Wu
>  *
> @@ -6,29 +10,36 @@
>  * the GNU General Public License ("GPL") version 2 only as published by the
>  * Free Software Foundation.
>  *
> - * High level machine check handler. Handles pages reported by the
> - * hardware as being corrupted usually due to a 2bit ECC memory or cache
> - * failure.
> - *
> - * This focuses on pages detected as corrupted in the background.
> - * When the current CPU tries to consume corruption the currently
> - * running process can just be killed directly instead. This implies
> - * that if the error cannot be handled for some reason it's safe to
> - * just ignore it because no corruption has been consumed yet. Instead
> - * when that happens another machine check will happen.
> - *
> - * Handles page cache pages in various states. The tricky part
> - * here is that we can access any page asynchronous to other VM
> - * users, because memory failures could happen anytime and anywhere,
> - * possibly violating some of their assumptions. This is why this code
> - * has to be extremely careful. Generally it tries to use normal locking
> - * rules, as in get the standard locks, even if that means the
> - * error handling takes potentially a long time.
> - *
> - * The operation to map back from RMAP chains to processes has to walk
> - * the complete process list and has non linear complexity with the number
> - * mappings. In short it can be quite slow. But since memory corruptions
> - * are rare we hope to get away with this.
> + * Pages are reported by the hardware as being corrupted usually due to a
> + * 2bit ECC memory or cache failure. Machine check can either be raised when
> + * corruption is found in background memory scrubbing, or when someone tries to
> + * consume the corruption. This code focuses on the former case.  If it cannot
> + * handle the error for some reason it's safe to just ignore it because no
> + * corruption has been consumed yet. Instead when that happens another (deadly)
> + * machine check will happen.
> + *
> + * The tricky part here is that we can access any page asynchronous to other VM
> + * users, because memory failures could happen anytime and anywhere, possibly
> + * violating some of their assumptions. This is why this code has to be
> + * extremely careful. Generally it tries to use normal locking rules, as in get
> + * the standard locks, even if that means the error handling takes potentially
> + * a long time.
> + *
> + * We don't aim to rescue 100% corruptions. That's unreasonable goal - the
> + * kernel text and slab pages (including the big dcache/icache) are almost
> + * impossible to isolate. We also try to keep the code clean by ignoring the
> + * other thousands of small corruption windows.

other thousands of small corruption windows(ex, migration, ...)
As far as you know , please write down them.

Anyway, I already added my sign.
Thanks for your effort never get exhausted. :)


-- 
Kinds regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 09/22] HWPOISON: Handle hardware poisoned pages in try_to_unmap
  2009-06-18 13:31                       ` Minchan Kim
@ 2009-06-19  1:58                         ` Wu Fengguang
  0 siblings, 0 replies; 79+ messages in thread
From: Wu Fengguang @ 2009-06-19  1:58 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, LKML, Andi Kleen, Ingo Molnar, Mel Gorman,
	Thomas Gleixner, H. Peter Anvin, Peter Zijlstra, Nick Piggin,
	Hugh Dickins, Andi Kleen, riel, chris.mason, linux-mm

On Thu, Jun 18, 2009 at 09:31:52PM +0800, Minchan Kim wrote:
> On Thu, Jun 18, 2009 at 9:14 PM, Wu Fengguang<fengguang.wu@intel.com> wrote:
> > On Wed, Jun 17, 2009 at 10:03:37PM +0800, Minchan Kim wrote:
> >> On Wed, Jun 17, 2009 at 10:55 PM, Wu Fengguang<fengguang.wu@intel.com> wrote:
> >> > On Wed, Jun 17, 2009 at 09:44:39PM +0800, Minchan Kim wrote:
> >> >> It is private mail for my question.
> >> >> I don't want to make noise in LKML.
> >> >> And I don't want to disturb your progress to merge HWPoison.
> >> >>
> >> >> > Because this race window is small enough:
> >> >> >
> >> >> >        TestSetPageHWPoison(p);
> >> >> >                                   lock_page(page);
> >> >> >                                   try_to_unmap(page, TTU_MIGRATION|...);
> >> >> >        lock_page_nosync(p);
> >> >> >
> >> >> > such small race windows can be found all over the kernel, it's just
> >> >> > insane to try to fix any of them.
> >> >>
> >> >> I don't know there are intentional small race windows in kernel until you said.
> >> >> I thought kernel code is perfect so it wouldn't allow race window
> >> >> although it is very small. But you pointed out. Until now, My thought
> >> >> is wrong.
> >> >>
> >> >> Do you know else small race windows by intention ?
> >> >> If you know it, tell me, please. It can expand my sight. :)
> >> >
> >> > The memory failure code does not aim to rescue 100% page corruptions.
> >> > That's unreasonable goal - the kernel pages, slab pages (including the
> >> > big dcache/icache) are almost impossible to isolate.
> >> >
> >> > Comparing to the big slab pools, the migration and other race windows are
> >> > really too small to care about :)
> >>
> >> Also, If you will mention this contents as annotation, I will add my
> >> review sign.
> >
> > Good suggestion. Here is a patch for comment updates.
> >
> >> Thanks for kind reply for my boring discussion.
> >
> > Boring? Not at all :)
> >
> > Thanks,
> > Fengguang
> >
> > ---
> >  mm/memory-failure.c |   76 +++++++++++++++++++++++++-----------------
> >  1 file changed, 47 insertions(+), 29 deletions(-)
> >
> > --- sound-2.6.orig/mm/memory-failure.c
> > +++ sound-2.6/mm/memory-failure.c
> > @@ -1,4 +1,8 @@
> >  /*
> > + * linux/mm/memory-failure.c
> > + *
> > + * High level machine check handler.
> > + *
> >  * Copyright (C) 2008, 2009 Intel Corporation
> >  * Authors: Andi Kleen, Fengguang Wu
> >  *
> > @@ -6,29 +10,36 @@
> >  * the GNU General Public License ("GPL") version 2 only as published by the
> >  * Free Software Foundation.
> >  *
> > - * High level machine check handler. Handles pages reported by the
> > - * hardware as being corrupted usually due to a 2bit ECC memory or cache
> > - * failure.
> > - *
> > - * This focuses on pages detected as corrupted in the background.
> > - * When the current CPU tries to consume corruption the currently
> > - * running process can just be killed directly instead. This implies
> > - * that if the error cannot be handled for some reason it's safe to
> > - * just ignore it because no corruption has been consumed yet. Instead
> > - * when that happens another machine check will happen.
> > - *
> > - * Handles page cache pages in various states. The tricky part
> > - * here is that we can access any page asynchronous to other VM
> > - * users, because memory failures could happen anytime and anywhere,
> > - * possibly violating some of their assumptions. This is why this code
> > - * has to be extremely careful. Generally it tries to use normal locking
> > - * rules, as in get the standard locks, even if that means the
> > - * error handling takes potentially a long time.
> > - *
> > - * The operation to map back from RMAP chains to processes has to walk
> > - * the complete process list and has non linear complexity with the number
> > - * mappings. In short it can be quite slow. But since memory corruptions
> > - * are rare we hope to get away with this.
> > + * Pages are reported by the hardware as being corrupted usually due to a
> > + * 2bit ECC memory or cache failure. Machine check can either be raised when
> > + * corruption is found in background memory scrubbing, or when someone tries to
> > + * consume the corruption. This code focuses on the former case.  If it cannot
> > + * handle the error for some reason it's safe to just ignore it because no
> > + * corruption has been consumed yet. Instead when that happens another (deadly)
> > + * machine check will happen.
> > + *
> > + * The tricky part here is that we can access any page asynchronous to other VM
> > + * users, because memory failures could happen anytime and anywhere, possibly
> > + * violating some of their assumptions. This is why this code has to be
> > + * extremely careful. Generally it tries to use normal locking rules, as in get
> > + * the standard locks, even if that means the error handling takes potentially
> > + * a long time.
> > + *
> > + * We don't aim to rescue 100% corruptions. That's unreasonable goal - the
> > + * kernel text and slab pages (including the big dcache/icache) are almost
> > + * impossible to isolate. We also try to keep the code clean by ignoring the
> > + * other thousands of small corruption windows.
> 
> other thousands of small corruption windows(ex, migration, ...)
> As far as you know , please write down them.

Like this:

        new_page = alloc_page();
        <small corruption window>
        write to new_page
        <small corruption window>
        read from new_page

> Anyway, I already added my sign.
> Thanks for your effort never get exhausted. :)

You are welcome :)

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 79+ messages in thread

end of thread, other threads:[~2009-06-19  2:44 UTC | newest]

Thread overview: 79+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-06-15  2:45 [PATCH 00/22] HWPOISON: Intro (v5) Wu Fengguang
2009-06-15  2:45 ` [PATCH 01/22] HWPOISON: Add page flag for poisoned pages Wu Fengguang
2009-06-15  2:45 ` [PATCH 02/22] HWPOISON: Export some rmap vma locking to outside world Wu Fengguang
2009-06-15  2:45 ` [PATCH 03/22] HWPOISON: Add support for poison swap entries v2 Wu Fengguang
2009-06-15  2:45 ` [PATCH 04/22] HWPOISON: Add new SIGBUS error codes for hardware poison signals Wu Fengguang
2009-06-15  2:45 ` [PATCH 05/22] HWPOISON: Add basic support for poisoned pages in fault handler v3 Wu Fengguang
2009-06-15  2:45 ` [PATCH 06/22] HWPOISON: x86: Add VM_FAULT_HWPOISON handling to x86 page fault handler v2 Wu Fengguang
2009-06-15  2:45 ` [PATCH 07/22] HWPOISON: define VM_FAULT_HWPOISON to 0 when feature is disabled Wu Fengguang
2009-06-15  2:45 ` [PATCH 08/22] HWPOISON: Use bitmask/action code for try_to_unmap behaviour Wu Fengguang
2009-06-15  2:45 ` [PATCH 09/22] HWPOISON: Handle hardware poisoned pages in try_to_unmap Wu Fengguang
2009-06-15 13:09   ` Minchan Kim
2009-06-15 15:26     ` Wu Fengguang
2009-06-16  0:03       ` Minchan Kim
2009-06-16 13:49         ` Wu Fengguang
2009-06-17  0:28           ` Minchan Kim
2009-06-17  7:23             ` Wu Fengguang
2009-06-17 13:27               ` Minchan Kim
2009-06-17 13:37                 ` Wu Fengguang
2009-06-17 13:43                   ` Minchan Kim
2009-06-17 14:03                     ` Wu Fengguang
2009-06-17 14:08                       ` Minchan Kim
2009-06-17 14:12                         ` Wu Fengguang
     [not found]               ` <28c262360906170644w65c08a8y2d2805fb08045804@mail.gmail.com>
     [not found]                 ` <20090617135543.GA8079@localhost>
     [not found]                   ` <28c262360906170703h3363b68dp74471358f647921e@mail.gmail.com>
2009-06-18 12:14                     ` Wu Fengguang
2009-06-18 13:31                       ` Minchan Kim
2009-06-19  1:58                         ` Wu Fengguang
2009-06-15  2:45 ` [PATCH 10/22] HWPOISON: check and isolate corrupted free pages v2 Wu Fengguang
2009-06-15  9:41   ` KAMEZAWA Hiroyuki
2009-06-15 10:16     ` Wu Fengguang
2009-06-15 23:52       ` KAMEZAWA Hiroyuki
2009-06-16  0:34         ` Wu Fengguang
2009-06-16 11:29           ` Hugh Dickins
2009-06-16 11:40             ` Wu Fengguang
2009-06-15  2:45 ` [PATCH 11/22] HWPOISON: Refactor truncate to allow direct truncating of page v3 Wu Fengguang
2009-06-15  2:45 ` [PATCH 12/22] HWPOISON: The high level memory error handler in the VM v7 Wu Fengguang
2009-06-15  2:45 ` [PATCH 13/22] HWPOISON: Add madvise() based injector for hardware poisoned pages v3 Wu Fengguang
2009-06-15  2:45 ` [PATCH 14/22] HWPOISON: Add simple debugfs interface to inject hwpoison on arbitary PFNs Wu Fengguang
2009-06-15  2:45 ` [PATCH 15/22] HWPOISON: early kill cleanups and fixes Wu Fengguang
2009-06-15  2:45 ` [PATCH 16/22] mm: move page flag numbers for user space to page-flags.h Wu Fengguang
2009-06-15  2:45 ` [PATCH 17/22] HWPOISON: introduce struct hwpoison_control Wu Fengguang
2009-06-15  2:45 ` [PATCH 18/22] HWPOISON: use compound head page Wu Fengguang
2009-06-15  2:45 ` [PATCH 19/22] HWPOISON: detect free buddy pages explicitly Wu Fengguang
2009-06-15  2:45 ` [PATCH 20/22] HWPOISON: collect infos that reflect the impact of the memory corruption Wu Fengguang
2009-06-15  2:45 ` [PATCH 21/22] HWPOISON: send uevent to report " Wu Fengguang
2009-06-15  6:29   ` Andi Kleen
2009-06-15  9:56     ` Wu Fengguang
2009-06-16  0:35   ` Greg KH
2009-06-15  2:45 ` [PATCH 22/22] HWPOISON: FOR TESTING: Enable memory failure code unconditionally Wu Fengguang
2009-06-15  3:18 ` [PATCH 00/22] HWPOISON: Intro (v5) Balbir Singh
2009-06-15  4:27   ` Wu Fengguang
2009-06-15  6:44     ` Nick Piggin
2009-06-15  7:09       ` Andi Kleen
2009-06-15  7:19         ` Nick Piggin
2009-06-15 12:10           ` Wu Fengguang
2009-06-15 12:25             ` Nick Piggin
2009-06-15 14:22               ` Wu Fengguang
2009-06-17  6:37                 ` [RFC][PATCH] HWPOISON: only early kill processes who installed SIGBUS handler Wu Fengguang
2009-06-17  8:04                   ` Nick Piggin
2009-06-17  9:55                     ` Wu Fengguang
2009-06-17 10:00                       ` Nick Piggin
2009-06-17 11:56                         ` Wu Fengguang
2009-06-18  9:56                     ` Wu Fengguang
2009-06-15  8:14       ` [PATCH 00/22] HWPOISON: Intro (v5) Nick Piggin
2009-06-15 10:09         ` Wu Fengguang
2009-06-15 10:36           ` Nick Piggin
2009-06-15 11:41             ` Wu Fengguang
2009-06-15 12:51     ` Hugh Dickins
2009-06-15 13:00       ` Alan Cox
2009-06-15 13:29         ` Andi Kleen
2009-06-15 13:28           ` H. Peter Anvin
2009-06-15 14:48           ` Alan Cox
2009-06-15 15:24             ` Andi Kleen
2009-06-15 15:28               ` Alan Cox
2009-06-15 16:19                 ` Andi Kleen
2009-06-15 16:28                   ` Alan Cox
2009-06-15 17:07                     ` Andi Kleen
2009-06-16 19:44           ` Russ Anderson
2009-06-16 20:28             ` H. Peter Anvin
2009-06-16 20:54               ` Russ Anderson
2009-06-16 20:58                 ` H. Peter Anvin

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).