* [PATCH 00/15] HWPOISON: Intro (v6)
@ 2009-06-20  3:16 ` Wu Fengguang
  0 siblings, 0 replies; 44+ messages in thread
From: Wu Fengguang @ 2009-06-20  3:16 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, Ingo Molnar, Minchan Kim, Mel Gorman, Wu, Fengguang,
	Thomas Gleixner, H. Peter Anvin, Peter Zijlstra, Nick Piggin,
	Hugh Dickins, Andi Kleen, riel, chris.mason, linux-mm

Hi Andrew,

Now we default to the less intrusive late kill, together with a per-process
tunable option for early kill as proposed by Nick and Hugh. It also
invalidates the mapping page in a safe manner and stops messing with
dirty/writeback pages for now.
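
For illustration, here is a minimal sketch of how a process could opt into
early kill with the per-process interface added in patch 12. The prctl name
is the one proposed there; the boolean second argument is my assumption and
is not quoted from the patch:

	#include <sys/prctl.h>

	/* ask for early kill: deliver SIGBUS/BUS_MCEERR_AO as soon as a
	 * mapped page of this process is found corrupted, instead of
	 * waiting until the page is actually accessed */
	if (prctl(PR_MEMORY_FAILURE_EARLY_KILL, 1, 0, 0, 0) < 0)
		perror("prctl");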

Comments are welcome!

changes since v5:
- interface: remove the global vm.memory_failure_early_kill;
  introduce prctl(PR_MEMORY_FAILURE_EARLY_KILL)
- early kill: check for page_mapped_in_vma() on anon pages
- early kill: test and use page->mapping instead of page_mapping()
- be safe: introduce invalidate_inode_page() and don't remove dirty/writeback
  pages from page cache
- account all poisoned pages in mce_bad_pages
- taint kernel on hwpoison pages in bad_page()
- comment updates
- don't do uevent for now
- don't mess with dirty/writeback pages for now

changes since v4:
- new feature: uevent report (and collect various states for that)
- early kill cleanups
- remove unnecessary code in __do_fault()
- fix compile error when feature not enabled
- fix kernel oops when an invalid pfn is fed to corrupt-pfn
- make tasklist_lock/anon_vma locking order straight
- use the safer page invalidation for possible metadata pages

  [PATCH 01/15] HWPOISON: Add page flag for poisoned pages
  [PATCH 02/15] HWPOISON: Export some rmap vma locking to outside world
  [PATCH 03/15] HWPOISON: Add support for poison swap entries v2
  [PATCH 04/15] HWPOISON: Add new SIGBUS error codes for hardware poison signals
  [PATCH 05/15] HWPOISON: Add basic support for poisoned pages in fault handler v3
  [PATCH 06/15] HWPOISON: x86: Add VM_FAULT_HWPOISON handling to x86 page fault handler v2
  [PATCH 07/15] HWPOISON: define VM_FAULT_HWPOISON to 0 when feature is disabled
  [PATCH 08/15] HWPOISON: Use bitmask/action code for try_to_unmap behaviour
  [PATCH 09/15] HWPOISON: Handle hardware poisoned pages in try_to_unmap
  [PATCH 10/15] HWPOISON: check and isolate corrupted free pages v3
  [PATCH 11/15] HWPOISON: The high level memory error handler in the VM v8
  [PATCH 12/15] HWPOISON: per process early kill option prctl(PR_MEMORY_FAILURE_EARLY_KILL)
  [PATCH 13/15] HWPOISON: Add madvise() based injector for hardware poisoned pages v3
  [PATCH 14/15] HWPOISON: Add simple debugfs interface to inject hwpoison on arbitrary PFNs
  [PATCH 15/15] HWPOISON: FOR TESTING: Enable memory failure code unconditionally

---
Upcoming Intel CPUs have support for recovering from some memory errors
(``MCA recovery''). This requires the OS to declare a page "poisoned",
kill the processes associated with it and avoid using it in the future.

This patchkit implements the necessary infrastructure in the VM.

To quote the overview comment:

 * High level machine check handler. Handles pages reported by the
 * hardware as being corrupted usually due to a 2bit ECC memory or cache
 * failure.
 *
 * This focusses on pages detected as corrupted in the background.
 * When the current CPU tries to consume corruption the currently
 * running process can just be killed directly instead. This implies
 * that if the error cannot be handled for some reason it's safe to
 * just ignore it because no corruption has been consumed yet. Instead
 * when that happens another machine check will happen.
 *
 * Handles page cache pages in various states. The tricky part
 * here is that we can access any page asynchronous to other VM
 * users, because memory failures could happen anytime and anywhere,
 * possibly violating some of their assumptions. This is why this code
 * has to be extremely careful. Generally it tries to use normal locking
 * rules, as in get the standard locks, even if that means the
 * error handling takes potentially a long time.
 *
 * Some of the operations here are somewhat inefficient and have non
 * linear algorithmic complexity, because the data structures have not
 * been optimized for this case. This is in particular the case
 * for the mapping from a vma to a process. Since this case is expected
 * to be rare we hope we can get away with this.

The code consists of the high level handler in mm/memory-failure.c,
a new page poison bit and various checks in the VM to handle poisoned
pages.

The main target right now is KVM guests, but it works for all kinds
of applications.

For the KVM use case there was a need for a new signal type so that
KVM can inject the machine check into the guest with the proper
address. This in theory allows other applications to handle
memory failures too. The expectation is that nearly all applications
won't do that, but some very specialized ones might.

This is not fully complete yet; in particular there are still several ways
to access poisoned pages (crash dump, /proc/kcore etc.)
that need to be plugged too.
Also undoubtedly the high level handler still has bugs and cases
it cannot recover from. For example nonlinear mappings deadlock right now
and a few other cases lose references. Huge pages are not supported
yet. Any additional testing, reviewing etc. welcome.

The patch series requires the earlier x86 MCE feature series for the x86
specific action-optional part. The code can be tested without the x86 specific
part using the injector; this only requires enabling the Kconfig entry
manually in some Kconfig file (by default it is implicitly enabled
by the architecture).
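
As a rough testing illustration: the madvise()-based injector from patch 13
lets an ordinary test program poison one of its own pages. The advice
constant is not quoted in this mail; it is MADV_HWPOISON in the mainline
version of this work, and the call needs CAP_SYS_ADMIN:

	#include <stdio.h>
	#include <unistd.h>
	#include <sys/mman.h>

	int main(void)
	{
		long psz = sysconf(_SC_PAGESIZE);
		char *p = mmap(NULL, psz, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

		if (p == MAP_FAILED)
			return 1;
		p[0] = 1;		/* make sure the page is instantiated */
		if (madvise(p, psz, MADV_HWPOISON) < 0)	/* inject "hardware" poison */
			perror("madvise");
		return 0;
	}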

v2: Lots of smaller changes in the series based on review feedback.
Rename Poison to HWPoison at akpm's request.
A new pfn-based injector based on feedback.
A lot of improvements, mostly from Fengguang Wu.
See comments in the individual patches.
v3: Various updates, see changelogs in individual patches.
v4: Various updates, see changelogs in individual patches.

Thanks,
Fengguang
-- 


^ permalink raw reply	[flat|nested] 44+ messages in thread

* [PATCH 01/15] HWPOISON: Add page flag for poisoned pages
  2009-06-20  3:16 ` Wu Fengguang
@ 2009-06-20  3:16   ` Wu Fengguang
  -1 siblings, 0 replies; 44+ messages in thread
From: Wu Fengguang @ 2009-06-20  3:16 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, Christoph Lameter, Andi Kleen, Ingo Molnar, Minchan Kim,
	Mel Gorman, Wu, Fengguang, Thomas Gleixner, H. Peter Anvin,
	Peter Zijlstra, Nick Piggin, Hugh Dickins, Andi Kleen, riel,
	chris.mason, linux-mm

[-- Attachment #1: page-flag-poison --]
[-- Type: text/plain, Size: 2585 bytes --]

From: Andi Kleen <ak@linux.intel.com>

Hardware poisoned pages need special handling in the VM and shouldn't be
touched again. This requires a new page flag. Define it here.

The page flags wars seem to be over, so it shouldn't be a problem
to get a new one.

v2: Add TestSetHWPoison (suggested by Johannes Weiner)
v3: Define TestSetHWPoison on !CONFIG_MEMORY_FAILURE (Fengguang)
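
As a small sketch of how VM code can use the accessors this patch generates
(the caller below is hypothetical): TestSetPageHWPoison() atomically sets the
bit and returns its old value, and with !CONFIG_MEMORY_FAILURE all of these
reduce to constant 0 / no-ops.

	/* hypothetical caller in the error handling path */
	if (TestSetPageHWPoison(p))
		return;		/* page was already marked poisoned */
	/* first time we see this page: isolate it, kill mappers, ... */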

Acked-by: Christoph Lameter <cl@linux.com>
Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>

---
 include/linux/page-flags.h |   21 ++++++++++++++++++++-
 1 file changed, 20 insertions(+), 1 deletion(-)

--- sound-2.6.orig/include/linux/page-flags.h
+++ sound-2.6/include/linux/page-flags.h
@@ -51,6 +51,9 @@
  * PG_buddy is set to indicate that the page is free and in the buddy system
  * (see mm/page_alloc.c).
  *
+ * PG_hwpoison indicates that a page got corrupted in hardware and contains
+ * data with incorrect ECC bits that triggered a machine check. Accessing is
+ * not safe since it may cause another machine check. Don't touch!
  */
 
 /*
@@ -102,6 +105,9 @@ enum pageflags {
 #ifdef CONFIG_IA64_UNCACHED_ALLOCATOR
 	PG_uncached,		/* Page has been mapped as uncached */
 #endif
+#ifdef CONFIG_MEMORY_FAILURE
+	PG_hwpoison,		/* hardware poisoned page. Don't touch */
+#endif
 	__NR_PAGEFLAGS,
 
 	/* Filesystems */
@@ -182,6 +188,9 @@ static inline void ClearPage##uname(stru
 #define __CLEARPAGEFLAG_NOOP(uname)					\
 static inline void __ClearPage##uname(struct page *page) {  }
 
+#define TESTSETFLAG_FALSE(uname)					\
+static inline int TestSetPage##uname(struct page *page) { return 0; }
+
 #define TESTCLEARFLAG_FALSE(uname)					\
 static inline int TestClearPage##uname(struct page *page) { return 0; }
 
@@ -265,6 +274,16 @@ PAGEFLAG(Uncached, uncached)
 PAGEFLAG_FALSE(Uncached)
 #endif
 
+#ifdef CONFIG_MEMORY_FAILURE
+PAGEFLAG(HWPoison, hwpoison)
+TESTSETFLAG(HWPoison, hwpoison)
+#define __PG_HWPOISON (1UL << PG_hwpoison)
+#else
+PAGEFLAG_FALSE(HWPoison)
+TESTSETFLAG_FALSE(HWPoison)
+#define __PG_HWPOISON 0
+#endif
+
 static inline int PageUptodate(struct page *page)
 {
 	int ret = test_bit(PG_uptodate, &(page)->flags);
@@ -389,7 +408,7 @@ static inline void __ClearPageTail(struc
 	 1 << PG_private | 1 << PG_private_2 | \
 	 1 << PG_buddy	 | 1 << PG_writeback | 1 << PG_reserved | \
 	 1 << PG_slab	 | 1 << PG_swapcache | 1 << PG_active | \
-	 1 << PG_unevictable | __PG_MLOCKED)
+	 1 << PG_unevictable | __PG_MLOCKED | __PG_HWPOISON)
 
 /*
  * Flags checked when a page is prepped for return by the page allocator.

-- 


^ permalink raw reply	[flat|nested] 44+ messages in thread

* [PATCH 02/15] HWPOISON: Export some rmap vma locking to outside world
  2009-06-20  3:16 ` Wu Fengguang
@ 2009-06-20  3:16   ` Wu Fengguang
  -1 siblings, 0 replies; 44+ messages in thread
From: Wu Fengguang @ 2009-06-20  3:16 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, Andi Kleen, Ingo Molnar, Minchan Kim, Mel Gorman, Wu,
	Fengguang, Thomas Gleixner, H. Peter Anvin, Peter Zijlstra,
	Nick Piggin, Hugh Dickins, Andi Kleen, riel, chris.mason,
	linux-mm

[-- Attachment #1: export-rmap --]
[-- Type: text/plain, Size: 1565 bytes --]

From: Andi Kleen <ak@linux.intel.com>

Needed for a later patch that walks rmap entries on its own.

This used to be very frowned upon, but memory-failure.c does
some rather specialized rmap walking and rmap has been stable
for quite some time, so I think it's ok now to export it.
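
For reference, a rough sketch of the calling pattern memory-failure.c needs
from the two newly exported helpers (the actual rmap walk is omitted here):

	struct anon_vma *av = page_lock_anon_vma(page);

	if (av) {
		/* walk the vmas attached to av, collect the tasks mapping
		 * this page and queue SIGBUS for them */
		page_unlock_anon_vma(av);
	}
	/* NULL means the page is not (or no longer) mapped anonymously */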

Signed-off-by: Andi Kleen <ak@linux.intel.com>

---
 include/linux/rmap.h |    6 ++++++
 mm/rmap.c            |    4 ++--
 2 files changed, 8 insertions(+), 2 deletions(-)

--- sound-2.6.orig/include/linux/rmap.h
+++ sound-2.6/include/linux/rmap.h
@@ -116,6 +116,12 @@ int try_to_munlock(struct page *);
 int page_wrprotect(struct page *page, int *odirect_sync, int count_offset);
 #endif
 
+/*
+ * Called by memory-failure.c to kill processes.
+ */
+struct anon_vma *page_lock_anon_vma(struct page *page);
+void page_unlock_anon_vma(struct anon_vma *anon_vma);
+
 #else	/* !CONFIG_MMU */
 
 #define anon_vma_init()		do {} while (0)
--- sound-2.6.orig/mm/rmap.c
+++ sound-2.6/mm/rmap.c
@@ -191,7 +191,7 @@ void __init anon_vma_init(void)
  * Getting a lock on a stable anon_vma from a page off the LRU is
  * tricky: page_lock_anon_vma rely on RCU to guard against the races.
  */
-static struct anon_vma *page_lock_anon_vma(struct page *page)
+struct anon_vma *page_lock_anon_vma(struct page *page)
 {
 	struct anon_vma *anon_vma;
 	unsigned long anon_mapping;
@@ -211,7 +211,7 @@ out:
 	return NULL;
 }
 
-static void page_unlock_anon_vma(struct anon_vma *anon_vma)
+void page_unlock_anon_vma(struct anon_vma *anon_vma)
 {
 	spin_unlock(&anon_vma->lock);
 	rcu_read_unlock();

-- 


^ permalink raw reply	[flat|nested] 44+ messages in thread

* [PATCH 03/15] HWPOISON: Add support for poison swap entries v2
  2009-06-20  3:16 ` Wu Fengguang
@ 2009-06-20  3:16   ` Wu Fengguang
  -1 siblings, 0 replies; 44+ messages in thread
From: Wu Fengguang @ 2009-06-20  3:16 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, Andi Kleen, Ingo Molnar, Minchan Kim, Mel Gorman, Wu,
	Fengguang, Thomas Gleixner, H. Peter Anvin, Peter Zijlstra,
	Nick Piggin, Hugh Dickins, Andi Kleen, riel, chris.mason,
	linux-mm

[-- Attachment #1: poison-swp-entry --]
[-- Type: text/plain, Size: 3701 bytes --]

From: Andi Kleen <ak@linux.intel.com>

Memory migration uses special swap entry types to trigger special actions on 
page faults. Extend this mechanism to also support poisoned swap entries, to 
trigger poison handling on page faults. This allows follow-on patches to 
prevent processes from faulting in poisoned pages again.

v2: Fix overflow in MAX_SWAPFILES (Fengguang Wu)
v3: Better overflow fix (Hidehiro Kawai)
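
To spell out the resulting arithmetic: MAX_SWAPFILES_SHIFT is 5, so there are
32 encodable swap types. CONFIG_MIGRATION reserves two of them and
CONFIG_MEMORY_FAILURE one more, so with both enabled MAX_SWAPFILES becomes
(1 << 5) - 2 - 1 = 29; SWP_HWPOISON is type 29, the migration entries are
types 30 and 31, and non_swap_entry() treats any type >= MAX_SWAPFILES as one
of these special entries.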

Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>
Reviewed-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>

---
 include/linux/swap.h    |   34 ++++++++++++++++++++++++++++------
 include/linux/swapops.h |   38 ++++++++++++++++++++++++++++++++++++++
 mm/swapfile.c           |    4 ++--
 3 files changed, 68 insertions(+), 8 deletions(-)

--- sound-2.6.orig/include/linux/swap.h
+++ sound-2.6/include/linux/swap.h
@@ -34,16 +34,38 @@ static inline int current_is_kswapd(void
  * the type/offset into the pte as 5/27 as well.
  */
 #define MAX_SWAPFILES_SHIFT	5
-#ifndef CONFIG_MIGRATION
-#define MAX_SWAPFILES		(1 << MAX_SWAPFILES_SHIFT)
+
+/*
+ * Use some of the swap files numbers for other purposes. This
+ * is a convenient way to hook into the VM to trigger special
+ * actions on faults.
+ */
+
+/*
+ * NUMA node memory migration support
+ */
+#ifdef CONFIG_MIGRATION
+#define SWP_MIGRATION_NUM 2
+#define SWP_MIGRATION_READ	(MAX_SWAPFILES + SWP_HWPOISON_NUM)
+#define SWP_MIGRATION_WRITE	(MAX_SWAPFILES + SWP_HWPOISON_NUM + 1)
 #else
-/* Use last two entries for page migration swap entries */
-#define MAX_SWAPFILES		((1 << MAX_SWAPFILES_SHIFT)-2)
-#define SWP_MIGRATION_READ	MAX_SWAPFILES
-#define SWP_MIGRATION_WRITE	(MAX_SWAPFILES + 1)
+#define SWP_MIGRATION_NUM 0
 #endif
 
 /*
+ * Handling of hardware poisoned pages with memory corruption.
+ */
+#ifdef CONFIG_MEMORY_FAILURE
+#define SWP_HWPOISON_NUM 1
+#define SWP_HWPOISON		MAX_SWAPFILES
+#else
+#define SWP_HWPOISON_NUM 0
+#endif
+
+#define MAX_SWAPFILES \
+	((1 << MAX_SWAPFILES_SHIFT) - SWP_MIGRATION_NUM - SWP_HWPOISON_NUM)
+
+/*
  * Magic header for a swap area. The first part of the union is
  * what the swap magic looks like for the old (limited to 128MB)
  * swap area format, the second part of the union adds - in the
--- sound-2.6.orig/include/linux/swapops.h
+++ sound-2.6/include/linux/swapops.h
@@ -131,3 +131,41 @@ static inline int is_write_migration_ent
 
 #endif
 
+#ifdef CONFIG_MEMORY_FAILURE
+/*
+ * Support for hardware poisoned pages
+ */
+static inline swp_entry_t make_hwpoison_entry(struct page *page)
+{
+	BUG_ON(!PageLocked(page));
+	return swp_entry(SWP_HWPOISON, page_to_pfn(page));
+}
+
+static inline int is_hwpoison_entry(swp_entry_t entry)
+{
+	return swp_type(entry) == SWP_HWPOISON;
+}
+#else
+
+static inline swp_entry_t make_hwpoison_entry(struct page *page)
+{
+	return swp_entry(0, 0);
+}
+
+static inline int is_hwpoison_entry(swp_entry_t swp)
+{
+	return 0;
+}
+#endif
+
+#if defined(CONFIG_MEMORY_FAILURE) || defined(CONFIG_MIGRATION)
+static inline int non_swap_entry(swp_entry_t entry)
+{
+	return swp_type(entry) >= MAX_SWAPFILES;
+}
+#else
+static inline int non_swap_entry(swp_entry_t entry)
+{
+	return 0;
+}
+#endif
--- sound-2.6.orig/mm/swapfile.c
+++ sound-2.6/mm/swapfile.c
@@ -697,7 +697,7 @@ int free_swap_and_cache(swp_entry_t entr
 	struct swap_info_struct *p;
 	struct page *page = NULL;
 
-	if (is_migration_entry(entry))
+	if (non_swap_entry(entry))
 		return 1;
 
 	p = swap_info_get(entry);
@@ -2083,7 +2083,7 @@ static int __swap_duplicate(swp_entry_t 
 	int count;
 	bool has_cache;
 
-	if (is_migration_entry(entry))
+	if (non_swap_entry(entry))
 		return -EINVAL;
 
 	type = swp_type(entry);

-- 


^ permalink raw reply	[flat|nested] 44+ messages in thread

* [PATCH 04/15] HWPOISON: Add new SIGBUS error codes for hardware poison signals
  2009-06-20  3:16 ` Wu Fengguang
@ 2009-06-20  3:16   ` Wu Fengguang
  -1 siblings, 0 replies; 44+ messages in thread
From: Wu Fengguang @ 2009-06-20  3:16 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, Andi Kleen, Ingo Molnar, Minchan Kim, Mel Gorman, Wu,
	Fengguang, Thomas Gleixner, H. Peter Anvin, Peter Zijlstra,
	Nick Piggin, Hugh Dickins, Andi Kleen, riel, chris.mason,
	linux-mm

[-- Attachment #1: poison-signal --]
[-- Type: text/plain, Size: 2654 bytes --]

From: Andi Kleen <ak@linux.intel.com>

Add new SIGBUS codes for reporting machine checks as signals. When 
the hardware detects an uncorrected ECC error it can trigger these
signals.

This is needed for telling KVM's qemu about machine checks that happen to
guests, so that it can inject them, but might also be useful for other programs.
I find it useful in my test programs.

This patch merely defines the new types.

- Define two new si_codes for SIGBUS.  BUS_MCEERR_AO and BUS_MCEERR_AR
* BUS_MCEERR_AO is for "Action Optional" machine checks, which means that some
corruption has been detected in the background, but nothing has been consumed
so far. The program can ignore those if it wants (but most programs would
already get killed)
* BUS_MCEERR_AR is for "Action Required" machine checks. This happens
when corrupted data is consumed or the application ran into an area
which has been known to be corrupted earlier. These require immediate
action and cannot simply be returned to. Most programs would kill themselves.
- They report the address of the corruption in the user address space
in si_addr.
- Define a new si_addr_lsb field that reports the extent of the corruption
to user space. That's currently always a (small) page. The user application
cannot tell where in this page the corruption happened.

AK: I plan to write a man page update before anyone asks.
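
For illustration, a minimal sketch of a userspace handler consuming the new
codes (field and constant names as defined by this patch; a real handler must
of course stick to async-signal-safe functions):

	#include <signal.h>
	#include <stdio.h>

	static void sigbus_handler(int sig, siginfo_t *si, void *ctx)
	{
		if (si->si_code == BUS_MCEERR_AO)
			fprintf(stderr, "corruption near %p, low %d address bits unknown\n",
				si->si_addr, (int)si->si_addr_lsb);
		/* BUS_MCEERR_AR: the faulting access cannot be resumed;
		 * unmap or replace the page, or exit */
	}

	int main(void)
	{
		struct sigaction sa = { 0 };

		sa.sa_sigaction = sigbus_handler;
		sa.sa_flags = SA_SIGINFO;
		sigaction(SIGBUS, &sa, NULL);
		/* ... run the real workload ... */
		return 0;
	}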

Signed-off-by: Andi Kleen <ak@linux.intel.com>

---
 include/asm-generic/siginfo.h |    8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

--- sound-2.6.orig/include/asm-generic/siginfo.h
+++ sound-2.6/include/asm-generic/siginfo.h
@@ -82,6 +82,7 @@ typedef struct siginfo {
 #ifdef __ARCH_SI_TRAPNO
 			int _trapno;	/* TRAP # which caused the signal */
 #endif
+			short _addr_lsb; /* LSB of the reported address */
 		} _sigfault;
 
 		/* SIGPOLL */
@@ -112,6 +113,7 @@ typedef struct siginfo {
 #ifdef __ARCH_SI_TRAPNO
 #define si_trapno	_sifields._sigfault._trapno
 #endif
+#define si_addr_lsb	_sifields._sigfault._addr_lsb
 #define si_band		_sifields._sigpoll._band
 #define si_fd		_sifields._sigpoll._fd
 
@@ -192,7 +194,11 @@ typedef struct siginfo {
 #define BUS_ADRALN	(__SI_FAULT|1)	/* invalid address alignment */
 #define BUS_ADRERR	(__SI_FAULT|2)	/* non-existant physical address */
 #define BUS_OBJERR	(__SI_FAULT|3)	/* object specific hardware error */
-#define NSIGBUS		3
+/* hardware memory error consumed on a machine check: action required */
+#define BUS_MCEERR_AR	(__SI_FAULT|4)
+/* hardware memory error detected in process but not consumed: action optional*/
+#define BUS_MCEERR_AO	(__SI_FAULT|5)
+#define NSIGBUS		5
 
 /*
  * SIGTRAP si_codes

-- 


^ permalink raw reply	[flat|nested] 44+ messages in thread

* [PATCH 05/15] HWPOISON: Add basic support for poisoned pages in fault handler v3
  2009-06-20  3:16 ` Wu Fengguang
@ 2009-06-20  3:16   ` Wu Fengguang
  -1 siblings, 0 replies; 44+ messages in thread
From: Wu Fengguang @ 2009-06-20  3:16 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, Andi Kleen, Ingo Molnar, Minchan Kim, Mel Gorman, Wu,
	Fengguang, Thomas Gleixner, H. Peter Anvin, Peter Zijlstra,
	Nick Piggin, Hugh Dickins, Andi Kleen, riel, chris.mason,
	linux-mm

[-- Attachment #1: vm-fault-poison --]
[-- Type: text/plain, Size: 2658 bytes --]

From: Andi Kleen <ak@linux.intel.com>

- Add a new VM_FAULT_HWPOISON error code to handle_mm_fault. Right now
architectures have to explicitly enable poison page support, so
this is forward compatible with all architectures. They only need
to add it when they enable poison page support.
- Add poison page handling in the swap-in fault code

v2: Add missing delayacct_clear_flag (Hidehiro Kawai)
v3: Really use delayacct_clear_flag (Hidehiro Kawai)

Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>
Reviewed-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>

---
 include/linux/mm.h |    3 ++-
 mm/memory.c        |   18 +++++++++++++++---
 2 files changed, 17 insertions(+), 4 deletions(-)

--- sound-2.6.orig/mm/memory.c
+++ sound-2.6/mm/memory.c
@@ -1315,7 +1315,8 @@ int __get_user_pages(struct task_struct 
 				if (ret & VM_FAULT_ERROR) {
 					if (ret & VM_FAULT_OOM)
 						return i ? i : -ENOMEM;
-					else if (ret & VM_FAULT_SIGBUS)
+					if (ret &
+					    (VM_FAULT_HWPOISON|VM_FAULT_SIGBUS))
 						return i ? i : -EFAULT;
 					BUG();
 				}
@@ -2595,8 +2596,15 @@ static int do_swap_page(struct mm_struct
 		goto out;
 
 	entry = pte_to_swp_entry(orig_pte);
-	if (is_migration_entry(entry)) {
-		migration_entry_wait(mm, pmd, address);
+	if (unlikely(non_swap_entry(entry))) {
+		if (is_migration_entry(entry)) {
+			migration_entry_wait(mm, pmd, address);
+		} else if (is_hwpoison_entry(entry)) {
+			ret = VM_FAULT_HWPOISON;
+		} else {
+			print_bad_pte(vma, address, pte, NULL);
+			ret = VM_FAULT_OOM;
+		}
 		goto out;
 	}
 	delayacct_set_flag(DELAYACCT_PF_SWAPIN);
@@ -2620,6 +2628,10 @@ static int do_swap_page(struct mm_struct
 		/* Had to read the page from swap area: Major fault */
 		ret = VM_FAULT_MAJOR;
 		count_vm_event(PGMAJFAULT);
+	} else if (PageHWPoison(page)) {
+		ret = VM_FAULT_HWPOISON;
+		delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
+		goto out;
 	}
 
 	lock_page(page);
--- sound-2.6.orig/include/linux/mm.h
+++ sound-2.6/include/linux/mm.h
@@ -700,11 +700,12 @@ static inline int page_mapped(struct pag
 #define VM_FAULT_SIGBUS	0x0002
 #define VM_FAULT_MAJOR	0x0004
 #define VM_FAULT_WRITE	0x0008	/* Special case for get_user_pages */
+#define VM_FAULT_HWPOISON 0x0010	/* Hit poisoned page */
 
 #define VM_FAULT_NOPAGE	0x0100	/* ->fault installed the pte, not return page */
 #define VM_FAULT_LOCKED	0x0200	/* ->fault locked the returned page */
 
-#define VM_FAULT_ERROR	(VM_FAULT_OOM | VM_FAULT_SIGBUS)
+#define VM_FAULT_ERROR	(VM_FAULT_OOM | VM_FAULT_SIGBUS | VM_FAULT_HWPOISON)
 
 /*
  * Can be called by the pagefault handler when it gets a VM_FAULT_OOM.

-- 


^ permalink raw reply	[flat|nested] 44+ messages in thread

* [PATCH 06/15] HWPOISON: x86: Add VM_FAULT_HWPOISON handling to x86 page fault handler v2
  2009-06-20  3:16 ` Wu Fengguang
@ 2009-06-20  3:16   ` Wu Fengguang
  -1 siblings, 0 replies; 44+ messages in thread
From: Wu Fengguang @ 2009-06-20  3:16 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, Andi Kleen, Ingo Molnar, Minchan Kim, Mel Gorman, Wu,
	Fengguang, Thomas Gleixner, H. Peter Anvin, Peter Zijlstra,
	Nick Piggin, Hugh Dickins, Andi Kleen, riel, chris.mason,
	linux-mm

[-- Attachment #1: x86-vmfault-poison --]
[-- Type: text/plain, Size: 2025 bytes --]

From: Andi Kleen <ak@linux.intel.com>

Add VM_FAULT_HWPOISON handling to the x86 page fault handler. This is
very similar to VM_FAULT_OOM; the only difference is that a different
si_code is passed to user space and the new si_addr_lsb field is initialized.
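
Concretely, with 4K pages PAGE_SHIFT is 12, so for an action-required fault
user space sees si_code == BUS_MCEERR_AR and si_addr_lsb == 12, meaning the
low 12 bits of si_addr are not significant and the whole surrounding page
must be treated as corrupted.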

v2: Make the printk more verbose/unique

Signed-off-by: Andi Kleen <ak@linux.intel.com>

---
 arch/x86/mm/fault.c |   19 +++++++++++++++----
 1 file changed, 15 insertions(+), 4 deletions(-)

--- sound-2.6.orig/arch/x86/mm/fault.c
+++ sound-2.6/arch/x86/mm/fault.c
@@ -167,6 +167,7 @@ force_sig_info_fault(int si_signo, int s
 	info.si_errno	= 0;
 	info.si_code	= si_code;
 	info.si_addr	= (void __user *)address;
+	info.si_addr_lsb = si_code == BUS_MCEERR_AR ? PAGE_SHIFT : 0;
 
 	force_sig_info(si_signo, &info, tsk);
 }
@@ -798,10 +799,12 @@ out_of_memory(struct pt_regs *regs, unsi
 }
 
 static void
-do_sigbus(struct pt_regs *regs, unsigned long error_code, unsigned long address)
+do_sigbus(struct pt_regs *regs, unsigned long error_code, unsigned long address,
+	  unsigned int fault)
 {
 	struct task_struct *tsk = current;
 	struct mm_struct *mm = tsk->mm;
+	int code = BUS_ADRERR;
 
 	up_read(&mm->mmap_sem);
 
@@ -817,7 +820,15 @@ do_sigbus(struct pt_regs *regs, unsigned
 	tsk->thread.error_code	= error_code;
 	tsk->thread.trap_no	= 14;
 
-	force_sig_info_fault(SIGBUS, BUS_ADRERR, address, tsk);
+#ifdef CONFIG_MEMORY_FAILURE
+	if (fault & VM_FAULT_HWPOISON) {
+		printk(KERN_ERR
+	"MCE: Killing %s:%d due to hardware memory corruption fault at %lx\n",
+			tsk->comm, tsk->pid, address);
+		code = BUS_MCEERR_AR;
+	}
+#endif
+	force_sig_info_fault(SIGBUS, code, address, tsk);
 }
 
 static noinline void
@@ -827,8 +838,8 @@ mm_fault_error(struct pt_regs *regs, uns
 	if (fault & VM_FAULT_OOM) {
 		out_of_memory(regs, error_code, address);
 	} else {
-		if (fault & VM_FAULT_SIGBUS)
-			do_sigbus(regs, error_code, address);
+		if (fault & (VM_FAULT_SIGBUS|VM_FAULT_HWPOISON))
+			do_sigbus(regs, error_code, address, fault);
 		else
 			BUG();
 	}

-- 


^ permalink raw reply	[flat|nested] 44+ messages in thread

* [PATCH 07/15] HWPOISON: define VM_FAULT_HWPOISON to 0 when feature is disabled
  2009-06-20  3:16 ` Wu Fengguang
@ 2009-06-20  3:16   ` Wu Fengguang
  -1 siblings, 0 replies; 44+ messages in thread
From: Wu Fengguang @ 2009-06-20  3:16 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, Nick Piggin, Wu Fengguang, Ingo Molnar, Minchan Kim,
	Mel Gorman, Thomas Gleixner, H. Peter Anvin, Peter Zijlstra,
	Hugh Dickins, Andi Kleen, riel, chris.mason, linux-mm

[-- Attachment #1: hwpoison-remove-ifdef.patch --]
[-- Type: text/plain, Size: 1531 bytes --]

From: Wu Fengguang <fengguang.wu@intel.com>

So as to eliminate one #ifdef in the C source.

Proposed by Nick Piggin.

Acked-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 arch/x86/mm/fault.c |    3 +--
 include/linux/mm.h  |    7 ++++++-
 2 files changed, 7 insertions(+), 3 deletions(-)

--- sound-2.6.orig/arch/x86/mm/fault.c
+++ sound-2.6/arch/x86/mm/fault.c
@@ -820,14 +820,13 @@ do_sigbus(struct pt_regs *regs, unsigned
 	tsk->thread.error_code	= error_code;
 	tsk->thread.trap_no	= 14;
 
-#ifdef CONFIG_MEMORY_FAILURE
 	if (fault & VM_FAULT_HWPOISON) {
 		printk(KERN_ERR
 	"MCE: Killing %s:%d due to hardware memory corruption fault at %lx\n",
 			tsk->comm, tsk->pid, address);
 		code = BUS_MCEERR_AR;
 	}
-#endif
+
 	force_sig_info_fault(SIGBUS, code, address, tsk);
 }
 
--- sound-2.6.orig/include/linux/mm.h
+++ sound-2.6/include/linux/mm.h
@@ -700,11 +700,16 @@ static inline int page_mapped(struct pag
 #define VM_FAULT_SIGBUS	0x0002
 #define VM_FAULT_MAJOR	0x0004
 #define VM_FAULT_WRITE	0x0008	/* Special case for get_user_pages */
-#define VM_FAULT_HWPOISON 0x0010	/* Hit poisoned page */
 
 #define VM_FAULT_NOPAGE	0x0100	/* ->fault installed the pte, not return page */
 #define VM_FAULT_LOCKED	0x0200	/* ->fault locked the returned page */
 
+#ifdef CONFIG_MEMORY_FAILURE
+#define VM_FAULT_HWPOISON 0x0010	/* Hit poisoned page */
+#else
+#define VM_FAULT_HWPOISON 0
+#endif
+
 #define VM_FAULT_ERROR	(VM_FAULT_OOM | VM_FAULT_SIGBUS | VM_FAULT_HWPOISON)
 
 /*

-- 


^ permalink raw reply	[flat|nested] 44+ messages in thread

* [PATCH 08/15] HWPOISON: Use bitmask/action code for try_to_unmap behaviour
  2009-06-20  3:16 ` Wu Fengguang
@ 2009-06-20  3:16   ` Wu Fengguang
  -1 siblings, 0 replies; 44+ messages in thread
From: Wu Fengguang @ 2009-06-20  3:16 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, Lee.Schermerhorn, npiggin, Andi Kleen, Ingo Molnar,
	Minchan Kim, Mel Gorman, Wu, Fengguang, Thomas Gleixner,
	H. Peter Anvin, Peter Zijlstra, Hugh Dickins, Andi Kleen, riel,
	chris.mason, linux-mm

[-- Attachment #1: try-to-unmap-flags --]
[-- Type: text/plain, Size: 8566 bytes --]

From: Andi Kleen <ak@linux.intel.com>

try_to_unmap currently has multiple modes (migration, munlock, normal unmap)
which are selected by magic flag variables. The logic is not very
straightforward, because each of these flags changes multiple behaviours (e.g.
migration not only sets up migration ptes but also turns off aging, etc.).
Also the different flags interact in magic ways.

A later patch in this series adds another mode to try_to_unmap, so
this quickly becomes unmanageable.

Replace the different flags with an action code (migration, munlock, munmap)
and some additional flags as modifiers (ignore mlock, ignore aging).
This makes the logic more straightforward and allows easier extension
to new behaviours. Change all the callers to declare what they want to
do.

This patch is supposed to be a nop in behaviour. If anyone can prove
otherwise, that would be a bug.
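
As a sketch of the resulting convention (the two wrapper helpers are purely
illustrative and not part of the patch; the flag values are the ones
introduced in the include/linux/rmap.h hunk below):

	/*
	 * The low byte of ttu_flags carries exactly one action code;
	 * the bits above it are independent modifiers.
	 */
	static inline int ttu_is_migration(enum ttu_flags flags)
	{
		return TTU_ACTION(flags) == TTU_MIGRATION;
	}

	static inline int ttu_ignores_aging(enum ttu_flags flags)
	{
		return !!(flags & TTU_IGNORE_ACCESS);
	}

A caller then spells out its intent, e.g. the migration path below passes
TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS, while the pageout path
passes plain TTU_UNMAP.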

Cc: Lee.Schermerhorn@hp.com
Cc: npiggin@suse.de
Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>

---
 include/linux/rmap.h |   14 +++++++++++++-
 mm/migrate.c         |    2 +-
 mm/rmap.c            |   40 ++++++++++++++++++++++------------------
 mm/vmscan.c          |    2 +-
 4 files changed, 37 insertions(+), 21 deletions(-)

--- sound-2.6.orig/include/linux/rmap.h
+++ sound-2.6/include/linux/rmap.h
@@ -85,7 +85,19 @@ static inline void page_dup_rmap(struct 
  */
 int page_referenced(struct page *, int is_locked,
 			struct mem_cgroup *cnt, unsigned long *vm_flags);
-int try_to_unmap(struct page *, int ignore_refs);
+
+enum ttu_flags {
+	TTU_UNMAP = 0,			/* unmap mode */
+	TTU_MIGRATION = 1,		/* migration mode */
+	TTU_MUNLOCK = 2,		/* munlock mode */
+	TTU_ACTION_MASK = 0xff,
+
+	TTU_IGNORE_MLOCK = (1 << 8),	/* ignore mlock */
+	TTU_IGNORE_ACCESS = (1 << 9),	/* don't age */
+};
+#define TTU_ACTION(x) ((x) & TTU_ACTION_MASK)
+
+int try_to_unmap(struct page *, enum ttu_flags flags);
 
 /*
  * Called from mm/filemap_xip.c to unmap empty zero page
--- sound-2.6.orig/mm/rmap.c
+++ sound-2.6/mm/rmap.c
@@ -912,7 +912,7 @@ void page_remove_rmap(struct page *page)
  * repeatedly from either try_to_unmap_anon or try_to_unmap_file.
  */
 static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
-				int migration)
+				enum ttu_flags flags)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	unsigned long address;
@@ -934,11 +934,13 @@ static int try_to_unmap_one(struct page 
 	 * If it's recently referenced (perhaps page_referenced
 	 * skipped over this mm) then we should reactivate it.
 	 */
-	if (!migration) {
+	if (!(flags & TTU_IGNORE_MLOCK)) {
 		if (vma->vm_flags & VM_LOCKED) {
 			ret = SWAP_MLOCK;
 			goto out_unmap;
 		}
+	}
+	if (!(flags & TTU_IGNORE_ACCESS)) {
 		if (ptep_clear_flush_young_notify(vma, address, pte)) {
 			ret = SWAP_FAIL;
 			goto out_unmap;
@@ -978,12 +980,12 @@ static int try_to_unmap_one(struct page 
 			 * pte. do_swap_page() will wait until the migration
 			 * pte is removed and then restart fault handling.
 			 */
-			BUG_ON(!migration);
+			BUG_ON(TTU_ACTION(flags) != TTU_MIGRATION);
 			entry = make_migration_entry(page, pte_write(pteval));
 		}
 		set_pte_at(mm, address, pte, swp_entry_to_pte(entry));
 		BUG_ON(pte_file(*pte));
-	} else if (PAGE_MIGRATION && migration) {
+	} else if (PAGE_MIGRATION && (TTU_ACTION(flags) == TTU_MIGRATION)) {
 		/* Establish migration entry for a file page */
 		swp_entry_t entry;
 		entry = make_migration_entry(page, pte_write(pteval));
@@ -1152,12 +1154,13 @@ static int try_to_mlock_page(struct page
  * vm_flags for that VMA.  That should be OK, because that vma shouldn't be
  * 'LOCKED.
  */
-static int try_to_unmap_anon(struct page *page, int unlock, int migration)
+static int try_to_unmap_anon(struct page *page, enum ttu_flags flags)
 {
 	struct anon_vma *anon_vma;
 	struct vm_area_struct *vma;
 	unsigned int mlocked = 0;
 	int ret = SWAP_AGAIN;
+	int unlock = TTU_ACTION(flags) == TTU_MUNLOCK;
 
 	if (MLOCK_PAGES && unlikely(unlock))
 		ret = SWAP_SUCCESS;	/* default for try_to_munlock() */
@@ -1173,7 +1176,7 @@ static int try_to_unmap_anon(struct page
 				continue;  /* must visit all unlocked vmas */
 			ret = SWAP_MLOCK;  /* saw at least one mlocked vma */
 		} else {
-			ret = try_to_unmap_one(page, vma, migration);
+			ret = try_to_unmap_one(page, vma, flags);
 			if (ret == SWAP_FAIL || !page_mapped(page))
 				break;
 		}
@@ -1197,8 +1200,7 @@ static int try_to_unmap_anon(struct page
 /**
  * try_to_unmap_file - unmap/unlock file page using the object-based rmap method
  * @page: the page to unmap/unlock
- * @unlock:  request for unlock rather than unmap [unlikely]
- * @migration:  unmapping for migration - ignored if @unlock
+ * @flags: action and flags
  *
  * Find all the mappings of a page using the mapping pointer and the vma chains
  * contained in the address_space struct it points to.
@@ -1210,7 +1212,7 @@ static int try_to_unmap_anon(struct page
  * vm_flags for that VMA.  That should be OK, because that vma shouldn't be
  * 'LOCKED.
  */
-static int try_to_unmap_file(struct page *page, int unlock, int migration)
+static int try_to_unmap_file(struct page *page, enum ttu_flags flags)
 {
 	struct address_space *mapping = page->mapping;
 	pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
@@ -1222,6 +1224,7 @@ static int try_to_unmap_file(struct page
 	unsigned long max_nl_size = 0;
 	unsigned int mapcount;
 	unsigned int mlocked = 0;
+	int unlock = TTU_ACTION(flags) == TTU_MUNLOCK;
 
 	if (MLOCK_PAGES && unlikely(unlock))
 		ret = SWAP_SUCCESS;	/* default for try_to_munlock() */
@@ -1234,7 +1237,7 @@ static int try_to_unmap_file(struct page
 				continue;	/* must visit all vmas */
 			ret = SWAP_MLOCK;
 		} else {
-			ret = try_to_unmap_one(page, vma, migration);
+			ret = try_to_unmap_one(page, vma, flags);
 			if (ret == SWAP_FAIL || !page_mapped(page))
 				goto out;
 		}
@@ -1259,7 +1262,8 @@ static int try_to_unmap_file(struct page
 			ret = SWAP_MLOCK;	/* leave mlocked == 0 */
 			goto out;		/* no need to look further */
 		}
-		if (!MLOCK_PAGES && !migration && (vma->vm_flags & VM_LOCKED))
+		if (!MLOCK_PAGES && !(flags & TTU_IGNORE_MLOCK) &&
+			(vma->vm_flags & VM_LOCKED))
 			continue;
 		cursor = (unsigned long) vma->vm_private_data;
 		if (cursor > max_nl_cursor)
@@ -1293,7 +1297,7 @@ static int try_to_unmap_file(struct page
 	do {
 		list_for_each_entry(vma, &mapping->i_mmap_nonlinear,
 						shared.vm_set.list) {
-			if (!MLOCK_PAGES && !migration &&
+			if (!MLOCK_PAGES && !(flags & TTU_IGNORE_MLOCK) &&
 			    (vma->vm_flags & VM_LOCKED))
 				continue;
 			cursor = (unsigned long) vma->vm_private_data;
@@ -1333,7 +1337,7 @@ out:
 /**
  * try_to_unmap - try to remove all page table mappings to a page
  * @page: the page to get unmapped
- * @migration: migration flag
+ * @flags: action and flags
  *
  * Tries to remove all the page table entries which are mapping this
  * page, used in the pageout path.  Caller must hold the page lock.
@@ -1344,16 +1348,16 @@ out:
  * SWAP_FAIL	- the page is unswappable
  * SWAP_MLOCK	- page is mlocked.
  */
-int try_to_unmap(struct page *page, int migration)
+int try_to_unmap(struct page *page, enum ttu_flags flags)
 {
 	int ret;
 
 	BUG_ON(!PageLocked(page));
 
 	if (PageAnon(page))
-		ret = try_to_unmap_anon(page, 0, migration);
+		ret = try_to_unmap_anon(page, flags);
 	else
-		ret = try_to_unmap_file(page, 0, migration);
+		ret = try_to_unmap_file(page, flags);
 	if (ret != SWAP_MLOCK && !page_mapped(page))
 		ret = SWAP_SUCCESS;
 	return ret;
@@ -1378,8 +1382,8 @@ int try_to_munlock(struct page *page)
 	VM_BUG_ON(!PageLocked(page) || PageLRU(page));
 
 	if (PageAnon(page))
-		return try_to_unmap_anon(page, 1, 0);
+		return try_to_unmap_anon(page, TTU_MUNLOCK);
 	else
-		return try_to_unmap_file(page, 1, 0);
+		return try_to_unmap_file(page, TTU_MUNLOCK);
 }
 
--- sound-2.6.orig/mm/vmscan.c
+++ sound-2.6/mm/vmscan.c
@@ -661,7 +661,7 @@ static unsigned long shrink_page_list(st
 		 * processes. Try to unmap it here.
 		 */
 		if (page_mapped(page) && mapping) {
-			switch (try_to_unmap(page, 0)) {
+			switch (try_to_unmap(page, TTU_UNMAP)) {
 			case SWAP_FAIL:
 				goto activate_locked;
 			case SWAP_AGAIN:
--- sound-2.6.orig/mm/migrate.c
+++ sound-2.6/mm/migrate.c
@@ -669,7 +669,7 @@ static int unmap_and_move(new_page_t get
 	}
 
 	/* Establish migration ptes or remove ptes */
-	try_to_unmap(page, 1);
+	try_to_unmap(page, TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS);
 
 	if (!page_mapped(page))
 		rc = move_to_new_page(newpage, page);

-- 


^ permalink raw reply	[flat|nested] 44+ messages in thread

* [PATCH 09/15] HWPOISON: Handle hardware poisoned pages in try_to_unmap
  2009-06-20  3:16 ` Wu Fengguang
@ 2009-06-20  3:16   ` Wu Fengguang
  -1 siblings, 0 replies; 44+ messages in thread
From: Wu Fengguang @ 2009-06-20  3:16 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, Andi Kleen, Ingo Molnar, Minchan Kim, Mel Gorman, Wu,
	Fengguang, Thomas Gleixner, H. Peter Anvin, Peter Zijlstra,
	Nick Piggin, Hugh Dickins, Andi Kleen, riel, chris.mason,
	linux-mm

[-- Attachment #1: try-to-unmap-poison --]
[-- Type: text/plain, Size: 1547 bytes --]

From: Andi Kleen <ak@linux.intel.com>

When a page has the poison bit set, replace the PTE with a poison entry.
This causes the right error handling to be done later, when a process runs
into it.

v2: add a new flag to not do that (needed for the memory-failure handler
later) (Fengguang)
v3: remove unnecessary is_migration_entry() test (Fengguang, Minchan)
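
For illustration, the "later" half of this could look roughly as follows in
the swap-in fault path (a sketch only: is_hwpoison_entry() is assumed as the
counterpart of the make_hwpoison_entry() used below, and VM_FAULT_HWPOISON is
the fault code that the x86 do_sigbus() path turns into a SIGBUS with
si_code BUS_MCEERR_AR):

	/* Sketch: recognise the poison entry when the process faults on it. */
	swp_entry_t entry = pte_to_swp_entry(orig_pte);

	if (unlikely(is_hwpoison_entry(entry)))
		return VM_FAULT_HWPOISON;	/* arch code raises SIGBUS */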

Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>

---
 include/linux/rmap.h |    1 +
 mm/rmap.c            |    9 ++++++++-
 2 files changed, 9 insertions(+), 1 deletion(-)

--- sound-2.6.orig/mm/rmap.c
+++ sound-2.6/mm/rmap.c
@@ -958,7 +958,14 @@ static int try_to_unmap_one(struct page 
 	/* Update high watermark before we lower rss */
 	update_hiwater_rss(mm);
 
-	if (PageAnon(page)) {
+	if (PageHWPoison(page) && !(flags & TTU_IGNORE_HWPOISON)) {
+		if (PageAnon(page))
+			dec_mm_counter(mm, anon_rss);
+		else
+			dec_mm_counter(mm, file_rss);
+		set_pte_at(mm, address, pte,
+				swp_entry_to_pte(make_hwpoison_entry(page)));
+	} else if (PageAnon(page)) {
 		swp_entry_t entry = { .val = page_private(page) };
 
 		if (PageSwapCache(page)) {
--- sound-2.6.orig/include/linux/rmap.h
+++ sound-2.6/include/linux/rmap.h
@@ -94,6 +94,7 @@ enum ttu_flags {
 
 	TTU_IGNORE_MLOCK = (1 << 8),	/* ignore mlock */
 	TTU_IGNORE_ACCESS = (1 << 9),	/* don't age */
+	TTU_IGNORE_HWPOISON = (1 << 10),/* corrupted page is recoverable */
 };
 #define TTU_ACTION(x) ((x) & TTU_ACTION_MASK)
 

-- 


^ permalink raw reply	[flat|nested] 44+ messages in thread

* [PATCH 10/15] HWPOISON: check and isolate corrupted free pages v3
  2009-06-20  3:16 ` Wu Fengguang
@ 2009-06-20  3:16   ` Wu Fengguang
  -1 siblings, 0 replies; 44+ messages in thread
From: Wu Fengguang @ 2009-06-20  3:16 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, Wu Fengguang, Andi Kleen, Ingo Molnar, Minchan Kim,
	Mel Gorman, Thomas Gleixner, H. Peter Anvin, Peter Zijlstra,
	Nick Piggin, Hugh Dickins, Andi Kleen, riel, chris.mason,
	linux-mm

[-- Attachment #1: free-pages-poison --]
[-- Type: text/plain, Size: 2389 bytes --]

From: Wu Fengguang <fengguang.wu@intel.com>

If memory corruption hits free buddy pages, we can safely ignore them.
No one will access them until page allocation time; prep_new_page() will
then automatically check and isolate the PG_hwpoison page for us (for
0-order allocations).

This patch expands prep_new_page() to check every component page in a high
order page allocation, in order to completely stop PG_hwpoison pages from
being recirculated.

Note that the common case -- allocating a single page -- doesn't do any
more work than before. Allocating at order > 0 does a bit more work, but
that's relatively uncommon.

This simple implementation may drop some innocent neighbor pages; hopefully
that is not a big problem, because the event should be rare enough.

This patch adds some runtime costs to high order page users.

[AK: Improved description]

v2: Andi Kleen:
Port to -mm code
Move check into separate function.
Don't dump stack in bad_pages for hwpoisoned pages.
v3: Fengguang:
But still taint the kernel: PG_hwpoison might be set by a software bug.
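
The shape of the new check, pulled out of the allocator for illustration
(page_is_bad() here is a hypothetical stand-in for the real per-page tests
done by check_new_page() below):

	/* Reject a block if any of its 1 << order component pages is bad. */
	static int block_has_bad_page(struct page *page, unsigned int order)
	{
		unsigned int i;

		for (i = 0; i < (1U << order); i++)
			if (page_is_bad(page + i))
				return 1;	/* don't hand this block out */
		return 0;
	}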

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>

---
 mm/page_alloc.c |   18 +++++++++++++++++-
 1 file changed, 17 insertions(+), 1 deletion(-)

--- sound-2.6.orig/mm/page_alloc.c
+++ sound-2.6/mm/page_alloc.c
@@ -233,6 +233,10 @@ static void bad_page(struct page *page)
 	static unsigned long nr_shown;
 	static unsigned long nr_unshown;
 
+	/* Don't complain about poisoned pages */
+	if (PageHWPoison(page))
+		goto out;
+
 	/*
 	 * Allow a burst of 60 reports, then keep quiet for that minute;
 	 * or allow a steady drip of one report per second.
@@ -646,7 +650,7 @@ static inline void expand(struct zone *z
 /*
  * This page is about to be returned from the page allocator
  */
-static int prep_new_page(struct page *page, int order, gfp_t gfp_flags)
+static inline int check_new_page(struct page *page)
 {
 	if (unlikely(page_mapcount(page) |
 		(page->mapping != NULL)  |
@@ -655,6 +659,18 @@ static int prep_new_page(struct page *pa
 		bad_page(page);
 		return 1;
 	}
+	return 0;
+}
+
+static int prep_new_page(struct page *page, int order, gfp_t gfp_flags)
+{
+	int i;
+
+	for (i = 0; i < (1 << order); i++) {
+		struct page *p = page + i;
+		if (unlikely(check_new_page(p)))
+			return 1;
+	}
 
 	set_page_private(page, 0);
 	set_page_refcounted(page);

-- 


^ permalink raw reply	[flat|nested] 44+ messages in thread

* [PATCH 11/15] HWPOISON: The high level memory error handler in the VM v8
  2009-06-20  3:16 ` Wu Fengguang
@ 2009-06-20  3:16   ` Wu Fengguang
  -1 siblings, 0 replies; 44+ messages in thread
From: Wu Fengguang @ 2009-06-20  3:16 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, hugh.dickins, npiggin, chris.mason, Rik van Riel,
	Wu Fengguang, Andi Kleen, Ingo Molnar, Minchan Kim, Mel Gorman,
	Thomas Gleixner, H. Peter Anvin, Peter Zijlstra, Andi Kleen,
	linux-mm

[-- Attachment #1: high-level --]
[-- Type: text/plain, Size: 33546 bytes --]

From: Andi Kleen <ak@linux.intel.com>

Add the high level memory error handler that poisons pages
that got corrupted by hardware (typically by a two-bit flip in a DIMM
or a cache) at the Linux level. The goal is to prevent everyone
from accessing these pages in the future.

This is done at the VM level by marking a page hwpoisoned
and taking the appropriate action based on the type of page
it is.

The code that does this is portable and lives in mm/memory-failure.c

To quote the overview comment:

 * High level machine check handler. Handles pages reported by the
 * hardware as being corrupted usually due to a 2bit ECC memory or cache
 * failure.
 *
 * This focuses on pages detected as corrupted in the background.
 * When the current CPU tries to consume corruption the currently
 * running process can just be killed directly instead. This implies
 * that if the error cannot be handled for some reason it's safe to
 * just ignore it because no corruption has been consumed yet. Instead
 * when that happens another machine check will happen.
 *
 * Handles page cache pages in various states. The tricky part
 * here is that we can access any page asynchronous to other VM
 * users, because memory failures could happen anytime and anywhere,
 * possibly violating some of their assumptions. This is why this code
 * has to be extremely careful. Generally it tries to use normal locking
 * rules, as in get the standard locks, even if that means the
 * error handling takes potentially a long time.
 *
 * Some of the operations here are somewhat inefficient and have non
 * linear algorithmic complexity, because the data structures have not
 * been optimized for this case. This is in particular the case
 * for the mapping from a vma to a process. Since this case is expected
 * to be rare we hope we can get away with this.

There are in principle two strategies for killing processes on poison:
- just unmap the data and wait for an actual reference before killing;
- kill as soon as corruption is detected.
Both have advantages and disadvantages and should be used
in different situations. Right now both are implemented and can
be switched with the prctl(PR_MEMORY_FAILURE_EARLY_KILL, 0/1, ...) syscall,
to be added in the next patch.  The default will be late kill.
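
A userspace sketch of how an application could opt in to early kill and tell
the two signals apart (PR_MEMORY_FAILURE_EARLY_KILL is only added by the next
patch, and BUS_MCEERR_AO/BUS_MCEERR_AR by the SIGBUS patch earlier in the
series, so the constants are assumed here rather than taken from released
headers):

	#include <signal.h>
	#include <string.h>
	#include <unistd.h>
	#include <sys/prctl.h>

	static void sigbus_handler(int sig, siginfo_t *si, void *ctx)
	{
		const char *msg = (si->si_code == BUS_MCEERR_AO) ?
			"early kill: page went bad, drop any copies of it\n" :
			"late kill: consumed poisoned data\n";

		/* si->si_addr says which address; a real handler would
		   unmap the page or exit. */
		write(STDERR_FILENO, msg, strlen(msg));
	}

	int main(void)
	{
		struct sigaction sa;

		memset(&sa, 0, sizeof(sa));
		sa.sa_sigaction = sigbus_handler;
		sa.sa_flags = SA_SIGINFO;
		sigaction(SIGBUS, &sa, NULL);

		/* Ask for the optional early (action optional) SIGBUS. */
		prctl(PR_MEMORY_FAILURE_EARLY_KILL, 1, 0, 0, 0);

		/* ... map and use large caches here ... */
		return 0;
	}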

The patch does some rmap data structure walking on its own to collect
processes to kill. This is unusual because normally all rmap data structure
knowledge is in rmap.c only. I put it here for now to keep everything
together, and rmap knowledge has been seeping out anyway.

v2: Fix anon vma unlock crash (noticed by Johannes Weiner <hannes@cmpxchg.org>)
Handle pages on free list correctly (also noticed by Johannes)
Fix inverted try_to_release_page check (found by Chris Mason)
Add documentation for the new sysctl.
Various other cleanups/comment fixes.
v3: Use blockable signal for AO SIGBUS for better qemu handling.
Numerous fixes from Fengguang Wu:
New code layout for the table (redone by AK)
Move the hwpoison bit setting before the lock (Fengguang Wu)
Some code cleanups (Fengguang Wu, AK)
Add missing lru_drain (Fengguang Wu)
Do more checks for valid mappings (inspired by patch from Fengguang)
Handle free pages and fixes for clean pages (Fengguang)
Removed swap cache handling for now, needs more work
Better mapping checks to avoid races (Fengguang)
Fix swapcache (Fengguang)
Handle private2 pages too (Fengguang)
v4: Various fixes based on review comments from Nick Piggin
Document locking order.
Improved comments.
Slightly improved description
Remove bogus hunk.
Wait properly for writeback pages (Nick Piggin)
v5: Improve various comments
Handle page_address_in_vma() failure better by SIGKILL and also
	make message debugging only
Clean up printks
Remove redundant PageWriteback check (Nick Piggin)
Add missing clear_page_mlock
Reformat state table to be <80 columns again
Use truncate helper instead of manual truncate in me_pagecache_*
Check for metadata buffer pages and reject them.
A few cleanups.
v6:
Fix a printk broken in the last round of cleanups.
More minor cleanups and fixes based on comments from Fengguang Wu.
Rename /proc/meminfo Header to "HardwareCorrupted"
Add a printk for the failed mapping case (Fengguang Wu)
Better clean page check (Fengguang Wu)
v7:
Fix kernel oops by action_result(invalid pfn) (Fengguang Wu)
make tasklist_lock/anon_vma locking order straight (Nick Piggin)
use the safer invalidate page for possible metadata pages (Nick Piggin)
v8:
check for page_mapped_in_vma() on anon pages (Hugh, Fengguang)
test and use page->mapping instead of page_mapping() (Fengguang)
cleanup some early kill comments (Fengguang)
introduce invalidate_inode_page() and don't remove dirty/writeback pages
from page cache (Nick, Fengguang)
account all poisoned pages in mce_bad_pages (Fengguang)

Cc: hugh.dickins@tiscali.co.uk
Cc: npiggin@suse.de
Cc: chris.mason@oracle.com
Acked-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>

---
 fs/proc/meminfo.c    |    9 
 include/linux/mm.h   |    6 
 include/linux/rmap.h |    1 
 mm/Kconfig           |    3 
 mm/Makefile          |    1 
 mm/filemap.c         |    4 
 mm/memory-failure.c  |  776 +++++++++++++++++++++++++++++++++++++++++
 mm/rmap.c            |    6 
 mm/truncate.c        |   23 -
 9 files changed, 821 insertions(+), 8 deletions(-)

--- sound-2.6.orig/mm/Makefile
+++ sound-2.6/mm/Makefile
@@ -42,5 +42,6 @@ obj-$(CONFIG_SMP) += allocpercpu.o
 endif
 obj-$(CONFIG_QUICKLIST) += quicklist.o
 obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o page_cgroup.o
+obj-$(CONFIG_MEMORY_FAILURE) += memory-failure.o
 obj-$(CONFIG_DEBUG_KMEMLEAK) += kmemleak.o
 obj-$(CONFIG_DEBUG_KMEMLEAK_TEST) += kmemleak-test.o
--- /dev/null
+++ sound-2.6/mm/memory-failure.c
@@ -0,0 +1,776 @@
+/*
+ * linux/mm/memory-failure.c
+ *
+ * High level machine check handler.
+ *
+ * Copyright (C) 2008, 2009 Intel Corporation
+ * Authors: Andi Kleen, Fengguang Wu
+ *
+ * This software may be redistributed and/or modified under the terms of
+ * the GNU General Public License ("GPL") version 2 only as published by the
+ * Free Software Foundation.
+ *
+ * Pages are reported by the hardware as being corrupted usually due to a
+ * 2bit ECC memory or cache failure. Machine check can either be raised when
+ * corruption is found in background memory scrubbing, or when someone tries to
+ * consume the corruption. This code focuses on the former case.  If it cannot
+ * handle the error for some reason it's safe to just ignore it because no
+ * corruption has been consumed yet. Instead when that happens another (deadly)
+ * machine check will happen.
+ *
+ * The tricky part here is that we can access any page asynchronous to other VM
+ * users, because memory failures could happen anytime and anywhere, possibly
+ * violating some of their assumptions. This is why this code has to be
+ * extremely careful. Generally it tries to use normal locking rules, as in get
+ * the standard locks, even if that means the error handling takes potentially
+ * a long time.
+ *
+ * We don't aim to rescue 100% of corruptions. That's an unreasonable goal - the
+ * kernel text and slab pages (including the big dcache/icache) are almost
+ * impossible to isolate. We also try to keep the code clean by ignoring the
+ * other thousands of small corruption windows.
+ *
+ * When the corrupted page data is not recoverable, the tasks that mapped the page
+ * have to be killed. We offer two kill options:
+ * - early kill with SIGBUS.BUS_MCEERR_AO (optional)
+ * - late  kill with SIGBUS.BUS_MCEERR_AR (mandatory)
+ * A task will be early killed as soon as corruption is found in its virtual
+ * address space, if it has called prctl(PR_MEMORY_FAILURE_EARLY_KILL, 1, ...);
+ * A task will always be late killed when it tries to access its corrupted
+ * virtual address. The early kill option offers KVM or other apps with large
+ * caches an opportunity to isolate the corrupted page from its internal cache,
+ * so as to avoid being late killed.
+ */
+
+/*
+ * Notebook:
+ * - hugetlb needs more code
+ * - kcore/oldmem/vmcore/mem/kmem check for hwpoison pages
+ * - pass bad pages to kdump next kernel
+ */
+#define DEBUG 1
+#include <linux/kernel.h>
+#include <linux/mm.h>
+#include <linux/page-flags.h>
+#include <linux/sched.h>
+#include <linux/rmap.h>
+#include <linux/pagemap.h>
+#include <linux/swap.h>
+#include <linux/backing-dev.h>
+#include "internal.h"
+
+atomic_long_t mce_bad_pages __read_mostly = ATOMIC_LONG_INIT(0);
+
+/*
+ * Send all the processes who have the page mapped an ``action optional''
+ * signal.
+ */
+static int kill_proc_ao(struct task_struct *t, unsigned long addr, int trapno,
+			unsigned long pfn)
+{
+	struct siginfo si;
+	int ret;
+
+	printk(KERN_ERR
+	"MCE %#lx: Killing %s:%d early due to hardware memory corruption\n",
+		pfn, t->comm, t->pid);
+	si.si_signo = SIGBUS;
+	si.si_errno = 0;
+	si.si_code = BUS_MCEERR_AO;
+	si.si_addr = (void *)addr;
+#ifdef __ARCH_SI_TRAPNO
+	si.si_trapno = trapno;
+#endif
+	si.si_addr_lsb = PAGE_SHIFT;
+	/*
+	 * Don't use force here, it's convenient if the signal
+	 * can be temporarily blocked.
+	 * This could cause a loop when the user sets SIGBUS
+	 * to SIG_IGN, but hopefully no one will do that?
+	 */
+	ret = send_sig_info(SIGBUS, &si, t);  /* synchronous? */
+	if (ret < 0)
+		printk(KERN_INFO "MCE: Error sending signal to %s:%d: %d\n",
+		       t->comm, t->pid, ret);
+	return ret;
+}
+
+/*
+ * Kill all processes that have a poisoned page mapped and then isolate
+ * the page.
+ *
+ * General strategy:
+ * Find all processes having the page mapped and kill them.
+ * But we keep a page reference around so that the page is not
+ * actually freed yet.
+ * Then stash the page away
+ *
+ * There's no convenient way to get back to mapped processes
+ * from the VMAs. So do a brute-force search over all
+ * running processes.
+ *
+ * Remember that machine checks are not common (or rather
+ * if they are common you have other problems), so this shouldn't
+ * be a performance issue.
+ *
+ * Also there are some races possible while we get from the
+ * error detection to actually handle it.
+ */
+
+struct to_kill {
+	struct list_head nd;
+	struct task_struct *tsk;
+	unsigned long addr;
+	unsigned addr_valid:1;
+};
+
+/*
+ * Failure handling: if we can't find or can't kill a process there's
+ * not much we can do.	We just print a message and ignore otherwise.
+ */
+
+/*
+ * Schedule a process for later kill.
+ */
+static void add_to_kill(struct task_struct *tsk, struct page *p,
+			struct vm_area_struct *vma,
+			struct list_head *to_kill,
+			struct to_kill **tkc)
+{
+	struct to_kill *tk;
+
+	if (*tkc) {
+		tk = *tkc;
+		*tkc = NULL;
+	} else {
+		tk = kmalloc(sizeof(struct to_kill), GFP_ATOMIC);
+		if (!tk) {
+			printk(KERN_ERR
+		"MCE: Out of memory while machine check handling\n");
+			return;
+		}
+	}
+	tk->addr = page_address_in_vma(p, vma);
+	tk->addr_valid = 1;
+
+	/*
+	 * In theory we don't have to kill when the page was
+	 * munmapped. But it could also be a mremap. Since that's
+	 * likely very rare, kill anyway just out of paranoia, but use
+	 * a SIGKILL because the error is not contained anymore.
+	 */
+	if (tk->addr == -EFAULT) {
+		pr_debug("MCE: Unable to find user space address %lx in %s\n",
+			page_to_pfn(p), tsk->comm);
+		tk->addr_valid = 0;
+	}
+	get_task_struct(tsk);
+	tk->tsk = tsk;
+	list_add_tail(&tk->nd, to_kill);
+}
+
+/*
+ * Kill the processes that have been collected earlier.
+ *
+ * Only do anything when DOIT is set, otherwise just free the list
+ * (this is used for clean pages which do not need killing)
+ * Also when FAIL is set do a force kill because something went
+ * wrong earlier.
+ */
+static void kill_procs_ao(struct list_head *to_kill, int doit, int trapno,
+			  int fail, unsigned long pfn)
+{
+	struct to_kill *tk, *next;
+
+	list_for_each_entry_safe (tk, next, to_kill, nd) {
+		if (doit) {
+			/*
+			 * In case something went wrong with munmapping,
+			 * make sure the process doesn't catch the
+			 * signal and then access the memory anyway.
+			 * Just kill it.
+			 */
+			if (fail || tk->addr_valid == 0) {
+				printk(KERN_ERR
+"MCE %#lx: forcibly killing %s:%d because of failure to unmap corrupted page\n",
+					pfn, tk->tsk->comm, tk->tsk->pid);
+				force_sig(SIGKILL, tk->tsk);
+			}
+
+			/*
+			 * In theory the process could have mapped
+			 * something else on the address in-between. We could
+			 * check for that, but we need to tell the
+			 * process anyways.
+			 */
+			else if (kill_proc_ao(tk->tsk, tk->addr, trapno,
+					      pfn) < 0)
+				printk(KERN_ERR
+	"MCE %#lx: Cannot send advisory machine check signal to %s:%d\n",
+					pfn, tk->tsk->comm, tk->tsk->pid);
+		}
+		put_task_struct(tk->tsk);
+		kfree(tk);
+	}
+}
+
+/*
+ * Collect processes when the error hit an anonymous page.
+ */
+static void collect_procs_anon(struct page *page, struct list_head *to_kill,
+			       struct to_kill **tkc)
+{
+	struct vm_area_struct *vma;
+	struct task_struct *tsk;
+	struct anon_vma *av;
+
+	read_lock(&tasklist_lock);
+
+	av = page_lock_anon_vma(page);
+	if (av == NULL)	/* Not actually mapped anymore */
+		goto out;
+
+	for_each_process (tsk) {
+		if (!tsk->mm)
+			continue;
+		list_for_each_entry (vma, &av->head, anon_vma_node) {
+			if (!page_mapped_in_vma(page, vma))
+				continue;
+
+			if (vma->vm_mm == tsk->mm)
+				add_to_kill(tsk, page, vma, to_kill, tkc);
+		}
+	}
+	page_unlock_anon_vma(av);
+out:
+	read_unlock(&tasklist_lock);
+}
+
+/*
+ * Collect processes when the error hit a file mapped page.
+ */
+static void collect_procs_file(struct page *page, struct list_head *to_kill,
+			       struct to_kill **tkc)
+{
+	struct vm_area_struct *vma;
+	struct task_struct *tsk;
+	struct prio_tree_iter iter;
+	struct address_space *mapping = page->mapping;
+
+	/*
+	 * A note on the locking order between the two locks.
+	 * We don't rely on this particular order.
+	 * If you have some other code that needs a different order
+	 * feel free to switch them around. Or add a reverse link
+	 * from mm_struct to task_struct, then this could be all
+	 * done without taking tasklist_lock and looping over all tasks.
+	 */
+
+	read_lock(&tasklist_lock);
+	spin_lock(&mapping->i_mmap_lock);
+	for_each_process(tsk) {
+		pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
+
+		if (!tsk->mm)
+			continue;
+
+		vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff,
+				      pgoff)
+			/*
+			 * Send early kill signal to tasks whose vma covers
+			 * the page but not necessarily mapped it in its pte.
+			 * Applications who requested early kill normally want
+			 * to be informed of such data corruptions.
+			 */
+			if (vma->vm_mm == tsk->mm)
+				add_to_kill(tsk, page, vma, to_kill, tkc);
+	}
+	spin_unlock(&mapping->i_mmap_lock);
+	read_unlock(&tasklist_lock);
+}
+
+/*
+ * Collect the processes who have the corrupted page mapped to kill.
+ *
+ * The operation to map back from RMAP chains to processes has to walk
+ * the complete process list and has non linear complexity with the number
+ * mappings. In short it can be quite slow. But since memory corruptions
+ * are rare and only tasks flagged PF_EARLY_KILL will be searched, we hope to
+ * get away with this.
+ */
+static void collect_procs(struct page *page, struct list_head *tokill)
+{
+	struct to_kill *tk;
+
+	/*
+	 * First preallocate one to_kill structure outside the spin locks,
+	 * so that we can kill at least one process reasonably reliable.
+	 */
+	tk = kmalloc(sizeof(struct to_kill), GFP_NOIO);
+
+	if (PageAnon(page))
+		collect_procs_anon(page, tokill, &tk);
+	else
+		collect_procs_file(page, tokill, &tk);
+	kfree(tk);
+}
+
+/*
+ * Error handlers for various types of pages.
+ */
+
+enum outcome {
+	FAILED,		/* Error handling failed */
+	DELAYED,	/* Will be handled later */
+	IGNORED,	/* Error safely ignored */
+	RECOVERED,	/* Successfully recovered */
+};
+
+static const char *action_name[] = {
+	[FAILED] = "Failed",
+	[DELAYED] = "Delayed",
+	[IGNORED] = "Ignored",
+	[RECOVERED] = "Recovered",
+};
+
+/*
+ * Error hit kernel page.
+ * Do nothing, try to be lucky and not touch this instead. For a few cases we
+ * could be more sophisticated.
+ */
+static int me_kernel(struct page *p, unsigned long pfn)
+{
+	return DELAYED;
+}
+
+/*
+ * Already poisoned page.
+ */
+static int me_ignore(struct page *p, unsigned long pfn)
+{
+	return IGNORED;
+}
+
+/*
+ * Page in unknown state. Do nothing.
+ */
+static int me_unknown(struct page *p, unsigned long pfn)
+{
+	printk(KERN_ERR "MCE %#lx: Unknown page state\n", pfn);
+	return FAILED;
+}
+
+/*
+ * Free memory
+ */
+static int me_free(struct page *p, unsigned long pfn)
+{
+	return DELAYED;
+}
+
+/*
+ * Clean (or cleaned) page cache page.
+ */
+static int me_pagecache_clean(struct page *p, unsigned long pfn)
+{
+	struct address_space *mapping;
+
+	if (!isolate_lru_page(p))
+		page_cache_release(p);
+
+	mapping = page_mapping(p);
+	if (mapping == NULL)
+		return RECOVERED;
+
+	/*
+	 * Now remove it from page cache.
+	 * Currently we only remove clean, unused page for the sake of safety.
+	 */
+	if (!invalidate_inode_page(mapping, p)) {
+		printk(KERN_ERR
+		       "MCE %#lx: failed to remove from page cache\n", pfn);
+		return FAILED;
+	}
+	return RECOVERED;
+}
+
+/*
+ * Dirty page cache page.
+ * Issues: when the error hits a hole page the error is not properly
+ * propagated.
+ */
+static int me_pagecache_dirty(struct page *p, unsigned long pfn)
+{
+	struct address_space *mapping = page_mapping(p);
+
+	SetPageError(p);
+	/* TBD: print more information about the file. */
+	if (mapping) {
+		/*
+		 * IO error will be reported by write(), fsync(), etc.
+		 * who check the mapping.
+		 * This way the application knows that something went
+		 * wrong with its dirty file data.
+		 *
+		 * There's one open issue:
+		 *
+		 * The EIO will be only reported on the next IO
+		 * operation and then cleared through the IO map.
+		 * Normally Linux has two mechanisms to pass IO error
+		 * first through the AS_EIO flag in the address space
+		 * and then through the PageError flag in the page.
+		 * Since we drop pages on memory failure handling the
+		 * only mechanism open to use is through AS_EIO.
+		 *
+		 * This has the disadvantage that it gets cleared on
+		 * the first operation that returns an error, while
+		 * the PageError bit is more sticky and only cleared
+		 * when the page is reread or dropped.  If an
+		 * application assumes it will always get error on
+		 * fsync, but does other operations on the fd before
+		 * and the page is dropped inbetween then the error
+		 * will not be properly reported.
+		 *
+		 * This can already happen even without hwpoisoned
+		 * pages: first on metadata IO errors (which only
+		 * report through AS_EIO) or when the page is dropped
+		 * at the wrong time.
+		 *
+		 * So right now we assume that the application DTRT on
+		 * the first EIO, but we're not worse than other parts
+		 * of the kernel.
+		 */
+		mapping_set_error(mapping, EIO);
+	}
+
+	return me_pagecache_clean(p, pfn);
+}
+
+/*
+ * Clean and dirty swap cache.
+ *
+ * Dirty swap cache page is tricky to handle. The page could live both in page
+ * cache and swap cache(ie. page is freshly swapped in). So it could be
+ * referenced concurrently by 2 types of PTEs:
+ * normal PTEs and swap PTEs. We try to handle them consistently by calling
+ * try_to_unmap(TTU_IGNORE_HWPOISON) to convert the normal PTEs to swap PTEs,
+ * and then
+ *      - clear dirty bit to prevent IO
+ *      - remove from LRU
+ *      - but keep in the swap cache, so that when we return to it on
+ *        a later page fault, we know the application is accessing
+ *        corrupted data and shall be killed (we installed simple
+ *        interception code in do_swap_page to catch it).
+ *
+ * Clean swap cache pages can be directly isolated. A later page fault will
+ * bring in the known good data from disk.
+ */
+static int me_swapcache_dirty(struct page *p, unsigned long pfn)
+{
+	ClearPageDirty(p);
+	/* Trigger EIO in shmem: */
+	ClearPageUptodate(p);
+
+	if (!isolate_lru_page(p))
+		page_cache_release(p);
+
+	return DELAYED;
+}
+
+static int me_swapcache_clean(struct page *p, unsigned long pfn)
+{
+	if (!isolate_lru_page(p))
+		page_cache_release(p);
+
+	delete_from_swap_cache(p);
+
+	return RECOVERED;
+}
+
+/*
+ * Huge pages. Needs work.
+ * Issues:
+ * No rmap support so we cannot find the original mapper. In theory could walk
+ * all MMs and look for the mappings, but that would be non atomic and racy.
+ * Need rmap for hugepages for this. Alternatively we could employ a heuristic,
+ * like just walking the current process and hoping it has it mapped (that
+ * should be usually true for the common "shared database cache" case)
+ * Should handle free huge pages and dequeue them too, but this needs to
+ * handle huge page accounting correctly.
+ */
+static int me_huge_page(struct page *p, unsigned long pfn)
+{
+	return FAILED;
+}
+
+/*
+ * Various page states we can handle.
+ *
+ * A page state is defined by its current page->flags bits.
+ * The table matches them in order and calls the right handler.
+ *
+ * This is quite tricky because we can access a page at any time
+ * in its life cycle, so all accesses have to be extremely careful.
+ *
+ * This is not complete. More states could be added.
+ * For any missing state don't attempt recovery.
+ */
+
+#define dirty		(1UL << PG_dirty)
+#define sc		(1UL << PG_swapcache)
+#define unevict		(1UL << PG_unevictable)
+#define mlock		(1UL << PG_mlocked)
+#define writeback	(1UL << PG_writeback)
+#define lru		(1UL << PG_lru)
+#define swapbacked	(1UL << PG_swapbacked)
+#define head		(1UL << PG_head)
+#define tail		(1UL << PG_tail)
+#define compound	(1UL << PG_compound)
+#define slab		(1UL << PG_slab)
+#define buddy		(1UL << PG_buddy)
+#define reserved	(1UL << PG_reserved)
+
+static struct page_state {
+	unsigned long mask;
+	unsigned long res;
+	char *msg;
+	int (*action)(struct page *p, unsigned long pfn);
+} error_states[] = {
+	{ reserved,	reserved,	"reserved kernel",	me_ignore },
+	{ buddy,	buddy,		"free kernel",	me_free },
+
+	/*
+	 * Could in theory check if slab page is free or if we can drop
+	 * currently unused objects without touching them. But just
+	 * treat it as standard kernel for now.
+	 */
+	{ slab,		slab,		"kernel slab",	me_kernel },
+
+#ifdef CONFIG_PAGEFLAGS_EXTENDED
+	{ head,		head,		"huge",		me_huge_page },
+	{ tail,		tail,		"huge",		me_huge_page },
+#else
+	{ compound,	compound,	"huge",		me_huge_page },
+#endif
+
+	{ sc|dirty,	sc|dirty,	"swapcache",	me_swapcache_dirty },
+	{ sc|dirty,	sc,		"swapcache",	me_swapcache_clean },
+
+#ifdef CONFIG_UNEVICTABLE_LRU
+	{ unevict|dirty, unevict|dirty,	"unevictable LRU", me_pagecache_dirty},
+	{ unevict,	unevict,	"unevictable LRU", me_pagecache_clean},
+#endif
+
+#ifdef CONFIG_HAVE_MLOCKED_PAGE_BIT
+	{ mlock|dirty,	mlock|dirty,	"mlocked LRU",	me_pagecache_dirty },
+	{ mlock,	mlock,		"mlocked LRU",	me_pagecache_clean },
+#endif
+
+	{ lru|dirty,	lru|dirty,	"LRU",		me_pagecache_dirty },
+	{ lru|dirty,	lru,		"clean LRU",	me_pagecache_clean },
+	{ swapbacked,	swapbacked,	"anonymous",	me_pagecache_clean },
+
+	/*
+	 * Catchall entry: must be at end.
+	 */
+	{ 0,		0,		"unknown page state",	me_unknown },
+};
+
+static void action_result(unsigned long pfn, char *msg, int result)
+{
+	printk(KERN_ERR "MCE %#lx: %s%s page recovery: %s\n",
+		pfn, PageDirty(pfn_to_page(pfn)) ? "dirty " : "",
+		msg, action_name[result]);
+}
+
+static void page_action(struct page_state *ps, struct page *p,
+			unsigned long pfn)
+{
+	int result;
+
+	result = ps->action(p, pfn);
+	action_result(pfn, ps->msg, result);
+	if (page_count(p) != 1)
+		printk(KERN_ERR
+		       "MCE %#lx: %s page still referenced by %d users\n",
+		       pfn, ps->msg, page_count(p) - 1);
+
+	/*
+	 * Could adjust zone counters here to correct for the missing page.
+	 */
+}
+
+#define N_UNMAP_TRIES 5
+
+/*
+ * Do all that is necessary to remove user space mappings. Unmap
+ * the pages and send SIGBUS to the processes if the data was dirty.
+ */
+static void hwpoison_user_mappings(struct page *p, unsigned long pfn,
+				  int trapno)
+{
+	enum ttu_flags ttu = TTU_UNMAP | TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS;
+	struct address_space *mapping;
+	LIST_HEAD(tokill);
+	int kill = 1;
+	int ret;
+	int i;
+
+	if (PageReserved(p) || PageCompound(p) || PageSlab(p))
+		return;
+
+	if (!PageLRU(p))
+		lru_add_drain();
+
+	/*
+	 * This check implies we don't kill processes if their pages
+	 * are in the swap cache early. Those are always late kills.
+	 */
+	if (!page_mapped(p))
+		return;
+
+	if (PageSwapCache(p)) {
+		printk(KERN_ERR
+		       "MCE %#lx: keeping poisoned page in swap cache\n", pfn);
+		ttu |= TTU_IGNORE_HWPOISON;
+	}
+
+	/*
+	 * Propagate the dirty bit from PTEs to struct page first, because we
+	 * need this to decide if we should kill or just drop the page.
+	 */
+	mapping = page_mapping(p);
+	if (!PageDirty(p) && mapping && mapping_cap_writeback_dirty(mapping)) {
+		if (page_mkclean(p))
+			SetPageDirty(p);
+		else {
+			kill = 0;
+			ttu |= TTU_IGNORE_HWPOISON;
+			printk(KERN_INFO
+	"MCE %#lx: corrupted page was clean: dropped without side effects\n",
+				pfn);
+		}
+	}
+
+	/*
+	 * First collect all the processes that have the page
+	 * mapped.  This has to be done before try_to_unmap,
+	 * because ttu takes the rmap data structures down.
+	 *
+	 * Error handling: We ignore errors here because
+	 * there's nothing that can be done.
+	 */
+	if (kill && p->mapping)
+		collect_procs(p, &tokill);
+
+	/*
+	 * try_to_unmap can fail temporarily due to races.
+	 * Try a few times (RED-PEN better strategy?)
+	 */
+	for (i = 0; i < N_UNMAP_TRIES; i++) {
+		ret = try_to_unmap(p, ttu);
+		if (ret == SWAP_SUCCESS)
+			break;
+		pr_debug("MCE %#lx: try_to_unmap retry needed %d\n", pfn,  ret);
+	}
+
+	if (ret != SWAP_SUCCESS)
+		printk(KERN_ERR "MCE %#lx: failed to unmap page (mapcount=%d)\n",
+				pfn, page_mapcount(p));
+
+	/*
+	 * Now that the dirty bit has been propagated to the
+	 * struct page and all unmaps done we can decide if
+	 * killing is needed or not.  Only kill when the page
+	 * was dirty, otherwise the tokill list is merely
+	 * freed.  When there was a problem unmapping earlier
+	 * use a more forceful uncatchable kill to prevent
+	 * any accesses to the poisoned memory.
+	 */
+	kill_procs_ao(&tokill, !!PageDirty(p), trapno,
+		      ret != SWAP_SUCCESS, pfn);
+}
+
+/**
+ * memory_failure - Handle memory failure of a page.
+ * @pfn: Page Number of the corrupted page
+ * @trapno: Trap number reported in the signal to user space.
+ *
+ * This function is called by the low level machine check code
+ * of an architecture when it detects hardware memory corruption
+ * of a page. It tries its best to recover, which includes
+ * dropping pages, killing processes etc.
+ *
+ * The function is primarily of use for corruptions that
+ * happen outside the current execution context (e.g. when
+ * detected by a background scrubber).
+ *
+ * Must run in process context (e.g. a work queue) with interrupts
+ * enabled and no spinlocks held.
+ */
+void memory_failure(unsigned long pfn, int trapno)
+{
+	struct page_state *ps;
+	struct page *p;
+
+	if (!pfn_valid(pfn)) {
+		printk(KERN_ERR
+		       "MCE %#lx: memory outside kernel control: Ignored\n",
+		       pfn);
+		return;
+	}
+
+	p = pfn_to_page(pfn);
+	if (TestSetPageHWPoison(p)) {
+		action_result(pfn, "already hardware poisoned", IGNORED);
+		return;
+	}
+
+	atomic_long_add(1, &mce_bad_pages);
+
+	/*
+	 * We need/can do nothing about count=0 pages.
+	 * 1) it's a free page, and therefore in safe hands:
+	 *    prep_new_page() will be the gate keeper.
+	 * 2) it's part of a non-compound high order page.
+	 *    Implies some kernel user: cannot stop them from
+	 *    R/W the page; let's pray that the page has been
+	 *    used and will be freed some time later.
+	 * In fact it's dangerous to directly bump up page count from 0,
+	 * that may make page_freeze_refs()/page_unfreeze_refs() mismatch.
+	 */
+	if (!get_page_unless_zero(compound_head(p))) {
+		action_result(pfn, "free or high order kernel", IGNORED);
+		return;
+	}
+
+	/*
+	 * Lock the page and wait for writeback to finish.
+	 * It's very difficult to mess with pages currently under IO
+	 * and in many cases impossible, so we just avoid it here.
+	 */
+	lock_page_nosync(p);
+	wait_on_page_writeback(p);
+
+	/*
+	 * Now take care of user space mappings.
+	 */
+	hwpoison_user_mappings(p, pfn, trapno);
+
+	/*
+	 * Torn down by someone else?
+	 */
+	if (PageLRU(p) && !PageSwapCache(p) && p->mapping == NULL) {
+		action_result(pfn, "already truncated LRU", IGNORED);
+		goto out;
+	}
+
+	for (ps = error_states;; ps++) {
+		if ((p->flags & ps->mask) == ps->res) {
+			page_action(ps, p, pfn);
+			break;
+		}
+	}
+out:
+	unlock_page(p);
+}
--- sound-2.6.orig/include/linux/mm.h
+++ sound-2.6/include/linux/mm.h
@@ -814,6 +814,8 @@ static inline void unmap_shared_mapping_
 extern int vmtruncate(struct inode * inode, loff_t offset);
 extern int vmtruncate_range(struct inode * inode, loff_t offset, loff_t end);
 
+int invalidate_inode_page(struct address_space *mapping, struct page *page);
+
 #ifdef CONFIG_MMU
 extern int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 			unsigned long address, int write_access);
@@ -1325,5 +1327,9 @@ void vmemmap_populate_print_last(void);
 extern int account_locked_memory(struct mm_struct *mm, struct rlimit *rlim,
 				 size_t size);
 extern void refund_locked_memory(struct mm_struct *mm, size_t size);
+
+extern void memory_failure(unsigned long pfn, int trapno);
+extern atomic_long_t mce_bad_pages;
+
 #endif /* __KERNEL__ */
 #endif /* _LINUX_MM_H */
--- sound-2.6.orig/fs/proc/meminfo.c
+++ sound-2.6/fs/proc/meminfo.c
@@ -95,7 +95,11 @@ static int meminfo_proc_show(struct seq_
 		"Committed_AS:   %8lu kB\n"
 		"VmallocTotal:   %8lu kB\n"
 		"VmallocUsed:    %8lu kB\n"
-		"VmallocChunk:   %8lu kB\n",
+		"VmallocChunk:   %8lu kB\n"
+#ifdef CONFIG_MEMORY_FAILURE
+		"HardwareCorrupted: %8lu kB\n"
+#endif
+		,
 		K(i.totalram),
 		K(i.freeram),
 		K(i.bufferram),
@@ -140,6 +144,9 @@ static int meminfo_proc_show(struct seq_
 		(unsigned long)VMALLOC_TOTAL >> 10,
 		vmi.used >> 10,
 		vmi.largest_chunk >> 10
+#ifdef CONFIG_MEMORY_FAILURE
+		,atomic_long_read(&mce_bad_pages) << (PAGE_SHIFT - 10)
+#endif
 		);
 
 	hugetlb_report_meminfo(m);
--- sound-2.6.orig/mm/Kconfig
+++ sound-2.6/mm/Kconfig
@@ -239,6 +239,9 @@ config KSM
 	  Enable the KSM kernel module to allow page sharing of equal pages
 	  among different tasks.
 
+config MEMORY_FAILURE
+	bool
+
 config NOMMU_INITIAL_TRIM_EXCESS
 	int "Turn on mmap() excess space trimming before booting"
 	depends on !MMU
--- sound-2.6.orig/mm/filemap.c
+++ sound-2.6/mm/filemap.c
@@ -105,6 +105,10 @@
  *
  *  ->task->proc_lock
  *    ->dcache_lock		(proc_pid_lookup)
+ *
+ *  (code doesn't rely on that order, so you could switch it around)
+ *  ->tasklist_lock             (memory_failure, collect_procs_ao)
+ *    ->i_mmap_lock
  */
 
 /*
--- sound-2.6.orig/mm/rmap.c
+++ sound-2.6/mm/rmap.c
@@ -36,6 +36,10 @@
  *                 mapping->tree_lock (widely used, in set_page_dirty,
  *                           in arch-dependent flush_dcache_mmap_lock,
  *                           within inode_lock in __sync_single_inode)
+ *
+ * (code doesn't rely on that order so it could be switched around)
+ * ->tasklist_lock
+ *   anon_vma->lock      (memory_failure, collect_procs_anon)
  */
 
 #include <linux/mm.h>
@@ -311,7 +315,7 @@ pte_t *page_check_address(struct page *p
  * if the page is not mapped into the page tables of this VMA.  Only
  * valid for normal file or anonymous VMAs.
  */
-static int page_mapped_in_vma(struct page *page, struct vm_area_struct *vma)
+int page_mapped_in_vma(struct page *page, struct vm_area_struct *vma)
 {
 	unsigned long address;
 	pte_t *pte;
--- sound-2.6.orig/mm/truncate.c
+++ sound-2.6/mm/truncate.c
@@ -135,6 +135,21 @@ invalidate_complete_page(struct address_
 	return ret;
 }
 
+/*
+ * Safely invalidate one page from its pagecache mapping.
+ * It only drops clean, unused pages. The page must be locked.
+ *
+ * Returns 1 if the page is successfully invalidated, otherwise 0.
+ */
+int invalidate_inode_page(struct address_space *mapping, struct page *page)
+{
+	if (PageDirty(page) || PageWriteback(page))
+		return 0;
+	if (page_mapped(page))
+		return 0;
+	return invalidate_complete_page(mapping, page);
+}
+
 /**
  * truncate_inode_pages - truncate range of pages specified by start & end byte offsets
  * @mapping: mapping to truncate
@@ -311,12 +326,8 @@ unsigned long invalidate_mapping_pages(s
 			if (lock_failed)
 				continue;
 
-			if (PageDirty(page) || PageWriteback(page))
-				goto unlock;
-			if (page_mapped(page))
-				goto unlock;
-			ret += invalidate_complete_page(mapping, page);
-unlock:
+			ret += invalidate_inode_page(mapping, page);
+
 			unlock_page(page);
 			if (next > end)
 				break;
--- sound-2.6.orig/include/linux/rmap.h
+++ sound-2.6/include/linux/rmap.h
@@ -134,6 +134,7 @@ int page_wrprotect(struct page *page, in
  */
 struct anon_vma *page_lock_anon_vma(struct page *page);
 void page_unlock_anon_vma(struct anon_vma *anon_vma);
+int page_mapped_in_vma(struct page *page, struct vm_area_struct *vma);
 
 #else	/* !CONFIG_MMU */
 

-- 


^ permalink raw reply	[flat|nested] 44+ messages in thread

* [PATCH 11/15] HWPOISON: The high level memory error handler in the VM v8
@ 2009-06-20  3:16   ` Wu Fengguang
  0 siblings, 0 replies; 44+ messages in thread
From: Wu Fengguang @ 2009-06-20  3:16 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, hugh.dickins, npiggin, chris.mason, Rik van Riel,
	Wu Fengguang, Andi Kleen, Ingo Molnar, Minchan Kim, Mel Gorman,
	Thomas Gleixner, H. Peter Anvin, Peter Zijlstra, Andi Kleen,
	linux-mm

[-- Attachment #1: high-level --]
[-- Type: text/plain, Size: 33771 bytes --]

From: Andi Kleen <ak@linux.intel.com>

Add the high level memory handler that poisons pages
that got corrupted by hardware (typically by a two bit flip in a DIMM
or a cache) on the Linux level. The goal is to prevent everyone
from accessing these pages in the future.

This is done at the VM level by marking a page hwpoisoned
and taking the appropriate action based on the type of page
it is.

The code that does this is portable and lives in mm/memory-failure.c

To quote the overview comment:

 * High level machine check handler. Handles pages reported by the
 * hardware as being corrupted usually due to a 2bit ECC memory or cache
 * failure.
 *
 * This focuses on pages detected as corrupted in the background.
 * When the current CPU tries to consume corruption the currently
 * running process can just be killed directly instead. This implies
 * that if the error cannot be handled for some reason it's safe to
 * just ignore it because no corruption has been consumed yet. Instead
 * when that happens another machine check will happen.
 *
 * Handles page cache pages in various states. The tricky part
 * here is that we can access any page asynchronous to other VM
 * users, because memory failures could happen anytime and anywhere,
 * possibly violating some of their assumptions. This is why this code
 * has to be extremely careful. Generally it tries to use normal locking
 * rules, as in get the standard locks, even if that means the
 * error handling takes potentially a long time.
 *
 * Some of the operations here are somewhat inefficient and have non
 * linear algorithmic complexity, because the data structures have not
 * been optimized for this case. This is in particular the case
 * for the mapping from a vma to a process. Since this case is expected
 * to be rare we hope we can get away with this.

There are in principle two strategies to kill processes on poison:
- just unmap the data and wait for an actual reference before killing;
- kill as soon as corruption is detected.
Both have advantages and disadvantages and should be used
in different situations. Right now both are implemented and can
be switched with the prctl(PR_MEMORY_FAILURE_EARLY_KILL, 0/1, ...) syscall,
which will be implemented in the next patch.  The default will be late kill.
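
For illustration only (not part of this patch), here is a rough userspace
sketch of how an early-kill aware application such as KVM might consume the
action optional signal. It assumes the BUS_MCEERR_AO si_code and the siginfo
si_addr_lsb field introduced earlier in this series are visible to userspace;
drop_from_cache() stands in for application-specific recovery logic.

#include <signal.h>
#include <stddef.h>
#include <unistd.h>

static void drop_from_cache(void *addr, size_t len)
{
	/* application specific: discard or rebuild the cached data at addr */
	(void)addr;
	(void)len;
}

static void sigbus_handler(int sig, siginfo_t *si, void *ctx)
{
	(void)sig;
	(void)ctx;
	/* si_addr_lsb is PAGE_SHIFT for poisoned pages, per kill_proc_ao() */
	if (si->si_code == BUS_MCEERR_AO)
		drop_from_cache(si->si_addr, (size_t)1 << si->si_addr_lsb);
}

int main(void)
{
	struct sigaction sa = { 0 };

	sa.sa_sigaction = sigbus_handler;
	sa.sa_flags = SA_SIGINFO;
	sigaction(SIGBUS, &sa, NULL);
	/* opt in with prctl(PR_MEMORY_FAILURE_EARLY_KILL, 1, ...) from the
	 * next patch, then run the normal workload */
	for (;;)
		pause();
}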

The patch does some rmap data structure walking on its own to collect
processes to kill. This is unusual because normally all rmap data structure
knowledge is in rmap.c only. I put it here for now to keep
everything together, and rmap knowledge has been seeping out anyway.

v2: Fix anon vma unlock crash (noticed by Johannes Weiner <hannes@cmpxchg.org>)
Handle pages on free list correctly (also noticed by Johannes)
Fix inverted try_to_release_page check (found by Chris Mason)
Add documentation for the new sysctl.
Various other cleanups/comment fixes.
v3: Use blockable signal for AO SIGBUS for better qemu handling.
Numerous fixes from Fengguang Wu:
New code layout for the table (redone by AK)
Move the hwpoison bit setting before the lock (Fengguang Wu)
Some code cleanups (Fengguang Wu, AK)
Add missing lru_drain (Fengguang Wu)
Do more checks for valid mappings (inspired by patch from Fengguang)
Handle free pages and fixes for clean pages (Fengguang)
Removed swap cache handling for now, needs more work
Better mapping checks to avoid races (Fengguang)
Fix swapcache (Fengguang)
Handle private2 pages too (Fengguang)
v4: Various fixes based on review comments from Nick Piggin
Document locking order.
Improved comments.
Slightly improved description
Remove bogus hunk.
Wait properly for writeback pages (Nick Piggin)
v5: Improve various comments
Handle page_address_in_vma() failure better by SIGKILL and also
	make message debugging only
Clean up printks
Remove redundant PageWriteback check (Nick Piggin)
Add missing clear_page_mlock
Reformat state table to be <80 columns again
Use truncate helper instead of manual truncate in me_pagecache_*
Check for metadata buffer pages and reject them.
A few cleanups.
v6:
Fix a printk broken in the last round of cleanups.
More minor cleanups and fixes based on comments from Fengguang Wu.
Rename /proc/meminfo Header to "HardwareCorrupted"
Add a printk for the failed mapping case (Fengguang Wu)
Better clean page check (Fengguang Wu)
v7:
Fix kernel oops by action_result(invalid pfn) (Fengguang Wu)
make tasklist_lock/anon_vma locking order straight (Nick Piggin)
use the safer invalidate page for possible metadata pages (Nick Piggin)
v8:
check for page_mapped_in_vma() on anon pages (Hugh, Fengguang)
test and use page->mapping instead of page_mapping() (Fengguang)
cleanup some early kill comments (Fengguang)
introduce invalidate_inode_page() and don't remove dirty/writeback pages
from page cache (Nick, Fengguang)
account all poisoned pages in mce_bad_pages (Fengguang)

Cc: hugh.dickins@tiscali.co.uk
Cc: npiggin@suse.de
Cc: chris.mason@oracle.com
Acked-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>

---
 fs/proc/meminfo.c    |    9 
 include/linux/mm.h   |    6 
 include/linux/rmap.h |    1 
 mm/Kconfig           |    3 
 mm/Makefile          |    1 
 mm/filemap.c         |    4 
 mm/memory-failure.c  |  776 +++++++++++++++++++++++++++++++++++++++++
 mm/rmap.c            |    6 
 mm/truncate.c        |   23 -
 9 files changed, 821 insertions(+), 8 deletions(-)

--- sound-2.6.orig/mm/Makefile
+++ sound-2.6/mm/Makefile
@@ -42,5 +42,6 @@ obj-$(CONFIG_SMP) += allocpercpu.o
 endif
 obj-$(CONFIG_QUICKLIST) += quicklist.o
 obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o page_cgroup.o
+obj-$(CONFIG_MEMORY_FAILURE) += memory-failure.o
 obj-$(CONFIG_DEBUG_KMEMLEAK) += kmemleak.o
 obj-$(CONFIG_DEBUG_KMEMLEAK_TEST) += kmemleak-test.o
--- /dev/null
+++ sound-2.6/mm/memory-failure.c
@@ -0,0 +1,776 @@
+/*
+ * linux/mm/memory-failure.c
+ *
+ * High level machine check handler.
+ *
+ * Copyright (C) 2008, 2009 Intel Corporation
+ * Authors: Andi Kleen, Fengguang Wu
+ *
+ * This software may be redistributed and/or modified under the terms of
+ * the GNU General Public License ("GPL") version 2 only as published by the
+ * Free Software Foundation.
+ *
+ * Pages are reported by the hardware as being corrupted usually due to a
+ * 2bit ECC memory or cache failure. Machine check can either be raised when
+ * corruption is found in background memory scrubbing, or when someone tries to
+ * consume the corruption. This code focuses on the former case.  If it cannot
+ * handle the error for some reason it's safe to just ignore it because no
+ * corruption has been consumed yet. Instead when that happens another (deadly)
+ * machine check will happen.
+ *
+ * The tricky part here is that we can access any page asynchronous to other VM
+ * users, because memory failures could happen anytime and anywhere, possibly
+ * violating some of their assumptions. This is why this code has to be
+ * extremely careful. Generally it tries to use normal locking rules, as in get
+ * the standard locks, even if that means the error handling takes potentially
+ * a long time.
+ *
+ * We don't aim to rescue 100% of corruptions. That's an unreasonable goal - the
+ * kernel text and slab pages (including the big dcache/icache) are almost
+ * impossible to isolate. We also try to keep the code clean by ignoring the
+ * other thousands of small corruption windows.
+ *
+ * When the corrupted page data is not recoverable, the tasks that mapped the page
+ * have to be killed. We offer two kill options:
+ * - early kill with SIGBUS.BUS_MCEERR_AO (optional)
+ * - late  kill with SIGBUS.BUS_MCEERR_AR (mandatory)
+ * A task will be early killed as soon as corruption is found in its virtual
+ * address space, if it has called prctl(PR_MEMORY_FAILURE_EARLY_KILL, 1, ...);
+ * A task will always be late killed when it tries to access its corrupted
+ * virtual address. The early kill option offers KVM or other apps with large
+ * caches an opportunity to isolate the corrupted page from its internal cache,
+ * so as to avoid being late killed.
+ */
+
+/*
+ * Notebook:
+ * - hugetlb needs more code
+ * - kcore/oldmem/vmcore/mem/kmem check for hwpoison pages
+ * - pass bad pages to kdump next kernel
+ */
+#define DEBUG 1
+#include <linux/kernel.h>
+#include <linux/mm.h>
+#include <linux/page-flags.h>
+#include <linux/sched.h>
+#include <linux/rmap.h>
+#include <linux/pagemap.h>
+#include <linux/swap.h>
+#include <linux/backing-dev.h>
+#include "internal.h"
+
+atomic_long_t mce_bad_pages __read_mostly = ATOMIC_LONG_INIT(0);
+
+/*
+ * Send all the processes who have the page mapped an ``action optional''
+ * signal.
+ */
+static int kill_proc_ao(struct task_struct *t, unsigned long addr, int trapno,
+			unsigned long pfn)
+{
+	struct siginfo si;
+	int ret;
+
+	printk(KERN_ERR
+	"MCE %#lx: Killing %s:%d early due to hardware memory corruption\n",
+		pfn, t->comm, t->pid);
+	si.si_signo = SIGBUS;
+	si.si_errno = 0;
+	si.si_code = BUS_MCEERR_AO;
+	si.si_addr = (void *)addr;
+#ifdef __ARCH_SI_TRAPNO
+	si.si_trapno = trapno;
+#endif
+	si.si_addr_lsb = PAGE_SHIFT;
+	/*
+	 * Don't use force here, it's convenient if the signal
+	 * can be temporarily blocked.
+	 * This could cause a loop when the user sets SIGBUS
+	 * to SIG_IGN, but hopefully no one will do that?
+	 */
+	ret = send_sig_info(SIGBUS, &si, t);  /* synchronous? */
+	if (ret < 0)
+		printk(KERN_INFO "MCE: Error sending signal to %s:%d: %d\n",
+		       t->comm, t->pid, ret);
+	return ret;
+}
+
+/*
+ * Kill all processes that have a poisoned page mapped and then isolate
+ * the page.
+ *
+ * General strategy:
+ * Find all processes having the page mapped and kill them.
+ * But we keep a page reference around so that the page is not
+ * actually freed yet.
+ * Then stash the page away
+ *
+ * There's no convenient way to get back to mapped processes
+ * from the VMAs. So do a brute-force search over all
+ * running processes.
+ *
+ * Remember that machine checks are not common (or rather
+ * if they are common you have other problems), so this shouldn't
+ * be a performance issue.
+ *
+ * Also there are some races possible while we get from the
+ * error detection to actually handle it.
+ */
+
+struct to_kill {
+	struct list_head nd;
+	struct task_struct *tsk;
+	unsigned long addr;
+	unsigned addr_valid:1;
+};
+
+/*
+ * Failure handling: if we can't find or can't kill a process there's
+ * not much we can do.	We just print a message and ignore otherwise.
+ */
+
+/*
+ * Schedule a process for later kill.
+ */
+static void add_to_kill(struct task_struct *tsk, struct page *p,
+			struct vm_area_struct *vma,
+			struct list_head *to_kill,
+			struct to_kill **tkc)
+{
+	struct to_kill *tk;
+
+	if (*tkc) {
+		tk = *tkc;
+		*tkc = NULL;
+	} else {
+		tk = kmalloc(sizeof(struct to_kill), GFP_ATOMIC);
+		if (!tk) {
+			printk(KERN_ERR
+		"MCE: Out of memory while machine check handling\n");
+			return;
+		}
+	}
+	tk->addr = page_address_in_vma(p, vma);
+	tk->addr_valid = 1;
+
+	/*
+	 * In theory we don't have to kill when the page was
+	 * munmapped. But it could also be a mremap. Since that's
+	 * likely very rare kill anyways just out of paranoia, but use
+	 * a SIGKILL because the error is not contained anymore.
+	 */
+	if (tk->addr == -EFAULT) {
+		pr_debug("MCE: Unable to find user space address %lx in %s\n",
+			page_to_pfn(p), tsk->comm);
+		tk->addr_valid = 0;
+	}
+	get_task_struct(tsk);
+	tk->tsk = tsk;
+	list_add_tail(&tk->nd, to_kill);
+}
+
+/*
+ * Kill the processes that have been collected earlier.
+ *
+ * Only do anything when DOIT is set, otherwise just free the list
+ * (this is used for clean pages which do not need killing)
+ * Also when FAIL is set do a force kill because something went
+ * wrong earlier.
+ */
+static void kill_procs_ao(struct list_head *to_kill, int doit, int trapno,
+			  int fail, unsigned long pfn)
+{
+	struct to_kill *tk, *next;
+
+	list_for_each_entry_safe (tk, next, to_kill, nd) {
+		if (doit) {
+			/*
+			 * In case something went wrong with munmapping,
+			 * make sure the process doesn't catch the signal
+			 * in its signal handlers and then access the
+			 * memory. Just kill it.
+			 */
+			if (fail || tk->addr_valid == 0) {
+				printk(KERN_ERR
+"MCE %#lx: forcibly killing %s:%d because of failure to unmap corrupted page\n",
+					pfn, tk->tsk->comm, tk->tsk->pid);
+				force_sig(SIGKILL, tk->tsk);
+			}
+
+			/*
+			 * In theory the process could have mapped
+			 * something else on the address in-between. We could
+			 * check for that, but we need to tell the
+			 * process anyways.
+			 */
+			else if (kill_proc_ao(tk->tsk, tk->addr, trapno,
+					      pfn) < 0)
+				printk(KERN_ERR
+	"MCE %#lx: Cannot send advisory machine check signal to %s:%d\n",
+					pfn, tk->tsk->comm, tk->tsk->pid);
+		}
+		put_task_struct(tk->tsk);
+		kfree(tk);
+	}
+}
+
+/*
+ * Collect processes when the error hit an anonymous page.
+ */
+static void collect_procs_anon(struct page *page, struct list_head *to_kill,
+			       struct to_kill **tkc)
+{
+	struct vm_area_struct *vma;
+	struct task_struct *tsk;
+	struct anon_vma *av;
+
+	read_lock(&tasklist_lock);
+
+	av = page_lock_anon_vma(page);
+	if (av == NULL)	/* Not actually mapped anymore */
+		goto out;
+
+	for_each_process (tsk) {
+		if (!tsk->mm)
+			continue;
+		list_for_each_entry (vma, &av->head, anon_vma_node) {
+			if (!page_mapped_in_vma(page, vma))
+				continue;
+
+			if (vma->vm_mm == tsk->mm)
+				add_to_kill(tsk, page, vma, to_kill, tkc);
+		}
+	}
+	page_unlock_anon_vma(av);
+out:
+	read_unlock(&tasklist_lock);
+}
+
+/*
+ * Collect processes when the error hit a file mapped page.
+ */
+static void collect_procs_file(struct page *page, struct list_head *to_kill,
+			       struct to_kill **tkc)
+{
+	struct vm_area_struct *vma;
+	struct task_struct *tsk;
+	struct prio_tree_iter iter;
+	struct address_space *mapping = page->mapping;
+
+	/*
+	 * A note on the locking order between the two locks.
+	 * We don't rely on this particular order.
+	 * If you have some other code that needs a different order
+	 * feel free to switch them around. Or add a reverse link
+	 * from mm_struct to task_struct, then this could be all
+	 * done without taking tasklist_lock and looping over all tasks.
+	 */
+
+	read_lock(&tasklist_lock);
+	spin_lock(&mapping->i_mmap_lock);
+	for_each_process(tsk) {
+		pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
+
+		if (!tsk->mm)
+			continue;
+
+		vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff,
+				      pgoff)
+			/*
+			 * Send early kill signal to tasks whose vma covers
+			 * the page but not necessarily mapped it in its pte.
+			 * Applications who requested early kill normally want
+			 * to be informed of such data corruptions.
+			 */
+			if (vma->vm_mm == tsk->mm)
+				add_to_kill(tsk, page, vma, to_kill, tkc);
+	}
+	spin_unlock(&mapping->i_mmap_lock);
+	read_unlock(&tasklist_lock);
+}
+
+/*
+ * Collect the processes who have the corrupted page mapped to kill.
+ *
+ * The operation to map back from RMAP chains to processes has to walk
+ * the complete process list and has non linear complexity with the number of
+ * mappings. In short it can be quite slow. But since memory corruptions
+ * are rare and only tasks flagged PF_EARLY_KILL will be searched, we hope to
+ * get away with this.
+ */
+static void collect_procs(struct page *page, struct list_head *tokill)
+{
+	struct to_kill *tk;
+
+	/*
+	 * First preallocate one to_kill structure outside the spin locks,
+	 * so that we can kill at least one process reasonably reliably.
+	 */
+	tk = kmalloc(sizeof(struct to_kill), GFP_NOIO);
+
+	if (PageAnon(page))
+		collect_procs_anon(page, tokill, &tk);
+	else
+		collect_procs_file(page, tokill, &tk);
+	kfree(tk);
+}
+
+/*
+ * Error handlers for various types of pages.
+ */
+
+enum outcome {
+	FAILED,		/* Error handling failed */
+	DELAYED,	/* Will be handled later */
+	IGNORED,	/* Error safely ignored */
+	RECOVERED,	/* Successfully recovered */
+};
+
+static const char *action_name[] = {
+	[FAILED] = "Failed",
+	[DELAYED] = "Delayed",
+	[IGNORED] = "Ignored",
+	[RECOVERED] = "Recovered",
+};
+
+/*
+ * Error hit kernel page.
+ * Do nothing, try to be lucky and not touch this instead. For a few cases we
+ * could be more sophisticated.
+ */
+static int me_kernel(struct page *p, unsigned long pfn)
+{
+	return DELAYED;
+}
+
+/*
+ * Already poisoned page.
+ */
+static int me_ignore(struct page *p, unsigned long pfn)
+{
+	return IGNORED;
+}
+
+/*
+ * Page in unknown state. Do nothing.
+ */
+static int me_unknown(struct page *p, unsigned long pfn)
+{
+	printk(KERN_ERR "MCE %#lx: Unknown page state\n", pfn);
+	return FAILED;
+}
+
+/*
+ * Free memory
+ */
+static int me_free(struct page *p, unsigned long pfn)
+{
+	return DELAYED;
+}
+
+/*
+ * Clean (or cleaned) page cache page.
+ */
+static int me_pagecache_clean(struct page *p, unsigned long pfn)
+{
+	struct address_space *mapping;
+
+	if (!isolate_lru_page(p))
+		page_cache_release(p);
+
+	mapping = page_mapping(p);
+	if (mapping == NULL)
+		return RECOVERED;
+
+	/*
+	 * Now remove it from page cache.
+	 * Currently we only remove clean, unused page for the sake of safety.
+	 */
+	if (!invalidate_inode_page(mapping, p)) {
+		printk(KERN_ERR
+		       "MCE %#lx: failed to remove from page cache\n", pfn);
+		return FAILED;
+	}
+	return RECOVERED;
+}
+
+/*
+ * Dirty pagecache page.
+ * Issues: when the error hits a hole page, the error is not properly
+ * propagated.
+ */
+static int me_pagecache_dirty(struct page *p, unsigned long pfn)
+{
+	struct address_space *mapping = page_mapping(p);
+
+	SetPageError(p);
+	/* TBD: print more information about the file. */
+	if (mapping) {
+		/*
+		 * IO error will be reported by write(), fsync(), etc.
+		 * who check the mapping.
+		 * This way the application knows that something went
+		 * wrong with its dirty file data.
+		 *
+		 * There's one open issue:
+		 *
+		 * The EIO will be only reported on the next IO
+		 * operation and then cleared through the IO map.
+		 * Normally Linux has two mechanisms to pass IO error
+		 * first through the AS_EIO flag in the address space
+		 * and then through the PageError flag in the page.
+		 * Since we drop pages on memory failure handling the
+		 * only mechanism open to use is through AS_EIO.
+		 *
+		 * This has the disadvantage that it gets cleared on
+		 * the first operation that returns an error, while
+		 * the PageError bit is more sticky and only cleared
+		 * when the page is reread or dropped.  If an
+		 * application assumes it will always get error on
+		 * fsync, but does other operations on the fd before
+		 * and the page is dropped inbetween then the error
+		 * will not be properly reported.
+		 *
+		 * This can already happen even without hwpoisoned
+		 * pages: first on metadata IO errors (which only
+		 * report through AS_EIO) or when the page is dropped
+		 * at the wrong time.
+		 *
+		 * So right now we assume that the application DTRT on
+		 * the first EIO, but we're not worse than other parts
+		 * of the kernel.
+		 */
+		mapping_set_error(mapping, EIO);
+	}
+
+	return me_pagecache_clean(p, pfn);
+}
+
+/*
+ * Clean and dirty swap cache.
+ *
+ * Dirty swap cache page is tricky to handle. The page could live both in page
+ * cache and swap cache (i.e. the page is freshly swapped in). So it could be
+ * referenced concurrently by 2 types of PTEs:
+ * normal PTEs and swap PTEs. We try to handle them consistently by calling
+ * try_to_unmap(TTU_IGNORE_HWPOISON) to convert the normal PTEs to swap PTEs,
+ * and then
+ *      - clear dirty bit to prevent IO
+ *      - remove from LRU
+ *      - but keep in the swap cache, so that when we return to it on
+ *        a later page fault, we know the application is accessing
+ *        corrupted data and shall be killed (we installed simple
+ *        interception code in do_swap_page to catch it).
+ *
+ * Clean swap cache pages can be directly isolated. A later page fault will
+ * bring in the known good data from disk.
+ */
+static int me_swapcache_dirty(struct page *p, unsigned long pfn)
+{
+	ClearPageDirty(p);
+	/* Trigger EIO in shmem: */
+	ClearPageUptodate(p);
+
+	if (!isolate_lru_page(p))
+		page_cache_release(p);
+
+	return DELAYED;
+}
+
+static int me_swapcache_clean(struct page *p, unsigned long pfn)
+{
+	if (!isolate_lru_page(p))
+		page_cache_release(p);
+
+	delete_from_swap_cache(p);
+
+	return RECOVERED;
+}
+
+/*
+ * Huge pages. Needs work.
+ * Issues:
+ * No rmap support so we cannot find the original mapper. In theory could walk
+ * all MMs and look for the mappings, but that would be non atomic and racy.
+ * Need rmap for hugepages for this. Alternatively we could employ a heuristic,
+ * like just walking the current process and hoping it has it mapped (that
+ * should be usually true for the common "shared database cache" case)
+ * Should handle free huge pages and dequeue them too, but this needs to
+ * handle huge page accounting correctly.
+ */
+static int me_huge_page(struct page *p, unsigned long pfn)
+{
+	return FAILED;
+}
+
+/*
+ * Various page states we can handle.
+ *
+ * A page state is defined by its current page->flags bits.
+ * The table matches them in order and calls the right handler.
+ *
+ * This is quite tricky because we can access page at any time
+ * in its life cycle, so all accesses have to be extremely careful.
+ *
+ * This is not complete. More states could be added.
+ * For any missing state don't attempt recovery.
+ */
+
+#define dirty		(1UL << PG_dirty)
+#define sc		(1UL << PG_swapcache)
+#define unevict		(1UL << PG_unevictable)
+#define mlock		(1UL << PG_mlocked)
+#define writeback	(1UL << PG_writeback)
+#define lru		(1UL << PG_lru)
+#define swapbacked	(1UL << PG_swapbacked)
+#define head		(1UL << PG_head)
+#define tail		(1UL << PG_tail)
+#define compound	(1UL << PG_compound)
+#define slab		(1UL << PG_slab)
+#define buddy		(1UL << PG_buddy)
+#define reserved	(1UL << PG_reserved)
+
+static struct page_state {
+	unsigned long mask;
+	unsigned long res;
+	char *msg;
+	int (*action)(struct page *p, unsigned long pfn);
+} error_states[] = {
+	{ reserved,	reserved,	"reserved kernel",	me_ignore },
+	{ buddy,	buddy,		"free kernel",	me_free },
+
+	/*
+	 * Could in theory check if slab page is free or if we can drop
+	 * currently unused objects without touching them. But just
+	 * treat it as standard kernel for now.
+	 */
+	{ slab,		slab,		"kernel slab",	me_kernel },
+
+#ifdef CONFIG_PAGEFLAGS_EXTENDED
+	{ head,		head,		"huge",		me_huge_page },
+	{ tail,		tail,		"huge",		me_huge_page },
+#else
+	{ compound,	compound,	"huge",		me_huge_page },
+#endif
+
+	{ sc|dirty,	sc|dirty,	"swapcache",	me_swapcache_dirty },
+	{ sc|dirty,	sc,		"swapcache",	me_swapcache_clean },
+
+#ifdef CONFIG_UNEVICTABLE_LRU
+	{ unevict|dirty, unevict|dirty,	"unevictable LRU", me_pagecache_dirty},
+	{ unevict,	unevict,	"unevictable LRU", me_pagecache_clean},
+#endif
+
+#ifdef CONFIG_HAVE_MLOCKED_PAGE_BIT
+	{ mlock|dirty,	mlock|dirty,	"mlocked LRU",	me_pagecache_dirty },
+	{ mlock,	mlock,		"mlocked LRU",	me_pagecache_clean },
+#endif
+
+	{ lru|dirty,	lru|dirty,	"LRU",		me_pagecache_dirty },
+	{ lru|dirty,	lru,		"clean LRU",	me_pagecache_clean },
+	{ swapbacked,	swapbacked,	"anonymous",	me_pagecache_clean },
+
+	/*
+	 * Catchall entry: must be at end.
+	 */
+	{ 0,		0,		"unknown page state",	me_unknown },
+};
+
+static void action_result(unsigned long pfn, char *msg, int result)
+{
+	printk(KERN_ERR "MCE %#lx: %s%s page recovery: %s\n",
+		pfn, PageDirty(pfn_to_page(pfn)) ? "dirty " : "",
+		msg, action_name[result]);
+}
+
+static void page_action(struct page_state *ps, struct page *p,
+			unsigned long pfn)
+{
+	int result;
+
+	result = ps->action(p, pfn);
+	action_result(pfn, ps->msg, result);
+	if (page_count(p) != 1)
+		printk(KERN_ERR
+		       "MCE %#lx: %s page still referenced by %d users\n",
+		       pfn, ps->msg, page_count(p) - 1);
+
+	/*
+	 * Could adjust zone counters here to correct for the missing page.
+	 */
+}
+
+#define N_UNMAP_TRIES 5
+
+/*
+ * Do all that is necessary to remove user space mappings. Unmap
+ * the pages and send SIGBUS to the processes if the data was dirty.
+ */
+static void hwpoison_user_mappings(struct page *p, unsigned long pfn,
+				  int trapno)
+{
+	enum ttu_flags ttu = TTU_UNMAP | TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS;
+	struct address_space *mapping;
+	LIST_HEAD(tokill);
+	int kill = 1;
+	int ret;
+	int i;
+
+	if (PageReserved(p) || PageCompound(p) || PageSlab(p))
+		return;
+
+	if (!PageLRU(p))
+		lru_add_drain();
+
+	/*
+	 * This check implies we don't kill processes if their pages
+	 * are in the swap cache early. Those are always late kills.
+	 */
+	if (!page_mapped(p))
+		return;
+
+	if (PageSwapCache(p)) {
+		printk(KERN_ERR
+		       "MCE %#lx: keeping poisoned page in swap cache\n", pfn);
+		ttu |= TTU_IGNORE_HWPOISON;
+	}
+
+	/*
+	 * Propagate the dirty bit from PTEs to struct page first, because we
+	 * need this to decide if we should kill or just drop the page.
+	 */
+	mapping = page_mapping(p);
+	if (!PageDirty(p) && mapping && mapping_cap_writeback_dirty(mapping)) {
+		if (page_mkclean(p))
+			SetPageDirty(p);
+		else {
+			kill = 0;
+			ttu |= TTU_IGNORE_HWPOISON;
+			printk(KERN_INFO
+	"MCE %#lx: corrupted page was clean: dropped without side effects\n",
+				pfn);
+		}
+	}
+
+	/*
+	 * First collect all the processes that have the page
+	 * mapped.  This has to be done before try_to_unmap,
+	 * because ttu takes the rmap data structures down.
+	 *
+	 * Error handling: We ignore errors here because
+	 * there's nothing that can be done.
+	 */
+	if (kill && p->mapping)
+		collect_procs(p, &tokill);
+
+	/*
+	 * try_to_unmap can fail temporarily due to races.
+	 * Try a few times (RED-PEN better strategy?)
+	 */
+	for (i = 0; i < N_UNMAP_TRIES; i++) {
+		ret = try_to_unmap(p, ttu);
+		if (ret == SWAP_SUCCESS)
+			break;
+		pr_debug("MCE %#lx: try_to_unmap retry needed %d\n", pfn,  ret);
+	}
+
+	if (ret != SWAP_SUCCESS)
+		printk(KERN_ERR "MCE %#lx: failed to unmap page (mapcount=%d)\n",
+				pfn, page_mapcount(p));
+
+	/*
+	 * Now that the dirty bit has been propagated to the
+	 * struct page and all unmaps done we can decide if
+	 * killing is needed or not.  Only kill when the page
+	 * was dirty, otherwise the tokill list is merely
+	 * freed.  When there was a problem unmapping earlier
+	 * use a more forceful, uncatchable kill to prevent
+	 * any accesses to the poisoned memory.
+	 */
+	kill_procs_ao(&tokill, !!PageDirty(p), trapno,
+		      ret != SWAP_SUCCESS, pfn);
+}
+
+/**
+ * memory_failure - Handle memory failure of a page.
+ * @pfn: Page Number of the corrupted page
+ * @trapno: Trap number reported in the signal to user space.
+ *
+ * This function is called by the low level machine check code
+ * of an architecture when it detects hardware memory corruption
+ * of a page. It tries its best to recover, which includes
+ * dropping pages, killing processes etc.
+ *
+ * The function is primarily of use for corruptions that
+ * happen outside the current execution context (e.g. when
+ * detected by a background scrubber)
+ *
+ * Must run in process context (e.g. a work queue) with interrupts
+ * enabled and no spinlocks held.
+ */
+void memory_failure(unsigned long pfn, int trapno)
+{
+	struct page_state *ps;
+	struct page *p;
+
+	if (!pfn_valid(pfn)) {
+		printk(KERN_ERR
+		       "MCE %#lx: memory outside kernel control: Ignored\n",
+		       pfn);
+		return;
+	}
+
+	p = pfn_to_page(pfn);
+	if (TestSetPageHWPoison(p)) {
+		action_result(pfn, "already hardware poisoned", IGNORED);
+		return;
+	}
+
+	atomic_long_add(1, &mce_bad_pages);
+
+	/*
+	 * We need/can do nothing about count=0 pages.
+	 * 1) it's a free page, and therefore in safe hand:
+	 *    prep_new_page() will be the gate keeper.
+	 * 2) it's part of a non-compound high order page.
+	 *    Implies some kernel user: cannot stop them from
+	 *    R/W the page; let's pray that the page has been
+	 *    used and will be freed some time later.
+	 * In fact it's dangerous to directly bump up page count from 0,
+	 * that may make page_freeze_refs()/page_unfreeze_refs() mismatch.
+	 */
+	if (!get_page_unless_zero(compound_head(p))) {
+		action_result(pfn, "free or high order kernel", IGNORED);
+		return;
+	}
+
+	/*
+	 * Lock the page and wait for writeback to finish.
+	 * It's very difficult to mess with pages currently under IO
+	 * and in many cases impossible, so we just avoid it here.
+	 */
+	lock_page_nosync(p);
+	wait_on_page_writeback(p);
+
+	/*
+	 * Now take care of user space mappings.
+	 */
+	hwpoison_user_mappings(p, pfn, trapno);
+
+	/*
+	 * Torn down by someone else?
+	 */
+	if (PageLRU(p) && !PageSwapCache(p) && p->mapping == NULL) {
+		action_result(pfn, "already truncated LRU", IGNORED);
+		goto out;
+	}
+
+	for (ps = error_states;; ps++) {
+		if ((p->flags & ps->mask) == ps->res) {
+			page_action(ps, p, pfn);
+			break;
+		}
+	}
+out:
+	unlock_page(p);
+}
--- sound-2.6.orig/include/linux/mm.h
+++ sound-2.6/include/linux/mm.h
@@ -814,6 +814,8 @@ static inline void unmap_shared_mapping_
 extern int vmtruncate(struct inode * inode, loff_t offset);
 extern int vmtruncate_range(struct inode * inode, loff_t offset, loff_t end);
 
+int invalidate_inode_page(struct address_space *mapping, struct page *page);
+
 #ifdef CONFIG_MMU
 extern int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 			unsigned long address, int write_access);
@@ -1325,5 +1327,9 @@ void vmemmap_populate_print_last(void);
 extern int account_locked_memory(struct mm_struct *mm, struct rlimit *rlim,
 				 size_t size);
 extern void refund_locked_memory(struct mm_struct *mm, size_t size);
+
+extern void memory_failure(unsigned long pfn, int trapno);
+extern atomic_long_t mce_bad_pages;
+
 #endif /* __KERNEL__ */
 #endif /* _LINUX_MM_H */
--- sound-2.6.orig/fs/proc/meminfo.c
+++ sound-2.6/fs/proc/meminfo.c
@@ -95,7 +95,11 @@ static int meminfo_proc_show(struct seq_
 		"Committed_AS:   %8lu kB\n"
 		"VmallocTotal:   %8lu kB\n"
 		"VmallocUsed:    %8lu kB\n"
-		"VmallocChunk:   %8lu kB\n",
+		"VmallocChunk:   %8lu kB\n"
+#ifdef CONFIG_MEMORY_FAILURE
+		"HardwareCorrupted: %8lu kB\n"
+#endif
+		,
 		K(i.totalram),
 		K(i.freeram),
 		K(i.bufferram),
@@ -140,6 +144,9 @@ static int meminfo_proc_show(struct seq_
 		(unsigned long)VMALLOC_TOTAL >> 10,
 		vmi.used >> 10,
 		vmi.largest_chunk >> 10
+#ifdef CONFIG_MEMORY_FAILURE
+		,atomic_long_read(&mce_bad_pages) << (PAGE_SHIFT - 10)
+#endif
 		);
 
 	hugetlb_report_meminfo(m);
--- sound-2.6.orig/mm/Kconfig
+++ sound-2.6/mm/Kconfig
@@ -239,6 +239,9 @@ config KSM
 	  Enable the KSM kernel module to allow page sharing of equal pages
 	  among different tasks.
 
+config MEMORY_FAILURE
+	bool
+
 config NOMMU_INITIAL_TRIM_EXCESS
 	int "Turn on mmap() excess space trimming before booting"
 	depends on !MMU
--- sound-2.6.orig/mm/filemap.c
+++ sound-2.6/mm/filemap.c
@@ -105,6 +105,10 @@
  *
  *  ->task->proc_lock
  *    ->dcache_lock		(proc_pid_lookup)
+ *
+ *  (code doesn't rely on that order, so you could switch it around)
+ *  ->tasklist_lock             (memory_failure, collect_procs_ao)
+ *    ->i_mmap_lock
  */
 
 /*
--- sound-2.6.orig/mm/rmap.c
+++ sound-2.6/mm/rmap.c
@@ -36,6 +36,10 @@
  *                 mapping->tree_lock (widely used, in set_page_dirty,
  *                           in arch-dependent flush_dcache_mmap_lock,
  *                           within inode_lock in __sync_single_inode)
+ *
+ * (code doesn't rely on that order so it could be switched around)
+ * ->tasklist_lock
+ *   anon_vma->lock      (memory_failure, collect_procs_anon)
  */
 
 #include <linux/mm.h>
@@ -311,7 +315,7 @@ pte_t *page_check_address(struct page *p
  * if the page is not mapped into the page tables of this VMA.  Only
  * valid for normal file or anonymous VMAs.
  */
-static int page_mapped_in_vma(struct page *page, struct vm_area_struct *vma)
+int page_mapped_in_vma(struct page *page, struct vm_area_struct *vma)
 {
 	unsigned long address;
 	pte_t *pte;
--- sound-2.6.orig/mm/truncate.c
+++ sound-2.6/mm/truncate.c
@@ -135,6 +135,21 @@ invalidate_complete_page(struct address_
 	return ret;
 }
 
+/*
+ * Safely invalidate one page from its pagecache mapping.
+ * It only drops clean, unused pages. The page must be locked.
+ *
+ * Returns 1 if the page is successfully invalidated, otherwise 0.
+ */
+int invalidate_inode_page(struct address_space *mapping, struct page *page)
+{
+	if (PageDirty(page) || PageWriteback(page))
+		return 0;
+	if (page_mapped(page))
+		return 0;
+	return invalidate_complete_page(mapping, page);
+}
+
 /**
  * truncate_inode_pages - truncate range of pages specified by start & end byte offsets
  * @mapping: mapping to truncate
@@ -311,12 +326,8 @@ unsigned long invalidate_mapping_pages(s
 			if (lock_failed)
 				continue;
 
-			if (PageDirty(page) || PageWriteback(page))
-				goto unlock;
-			if (page_mapped(page))
-				goto unlock;
-			ret += invalidate_complete_page(mapping, page);
-unlock:
+			ret += invalidate_inode_page(mapping, page);
+
 			unlock_page(page);
 			if (next > end)
 				break;
--- sound-2.6.orig/include/linux/rmap.h
+++ sound-2.6/include/linux/rmap.h
@@ -134,6 +134,7 @@ int page_wrprotect(struct page *page, in
  */
 struct anon_vma *page_lock_anon_vma(struct page *page);
 void page_unlock_anon_vma(struct anon_vma *anon_vma);
+int page_mapped_in_vma(struct page *page, struct vm_area_struct *vma);
 
 #else	/* !CONFIG_MMU */
 

-- 


^ permalink raw reply	[flat|nested] 44+ messages in thread

* [PATCH 12/15] HWPOISON: per process early kill option prctl(PR_MEMORY_FAILURE_EARLY_KILL)
  2009-06-20  3:16 ` Wu Fengguang
@ 2009-06-20  3:16   ` Wu Fengguang
  -1 siblings, 0 replies; 44+ messages in thread
From: Wu Fengguang @ 2009-06-20  3:16 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, Nick Piggin, Wu Fengguang, Ingo Molnar, Minchan Kim,
	Mel Gorman, Thomas Gleixner, H. Peter Anvin, Peter Zijlstra,
	Hugh Dickins, Andi Kleen, riel, chris.mason, linux-mm

[-- Attachment #1: hwpoison-prctl-early-kill.patch --]
[-- Type: text/plain, Size: 3255 bytes --]

This allows an application to request early SIGBUS.BUS_MCEERR_AO
notification as soon as memory corruption in its virtual address space is
detected.

The default option is late kill, i.e. only kill the process when it actually
tries to access the corrupted data. But an admin can still request that a legacy
application be early killed by writing a wrapper tool which calls prctl()
and then execs the application:

	# this_app_shall_be_early_killed  legacy_app
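
A minimal sketch (illustration only, not part of the patch) of what such a
wrapper could look like; the fallback define is only for userspace headers
that do not carry the new constant yet:

#include <stdio.h>
#include <unistd.h>
#include <sys/prctl.h>

#ifndef PR_MEMORY_FAILURE_EARLY_KILL
#define PR_MEMORY_FAILURE_EARLY_KILL	33	/* value proposed below */
#endif

int main(int argc, char **argv)
{
	if (argc < 2) {
		fprintf(stderr, "usage: %s program [args...]\n", argv[0]);
		return 1;
	}
	if (prctl(PR_MEMORY_FAILURE_EARLY_KILL, 1, 0, 0, 0) < 0)
		perror("prctl");	/* kernel without this patch: EINVAL */
	execvp(argv[1], &argv[1]);	/* per the usage above, the flag stays
					   set across the exec */
	perror("execvp");
	return 127;
}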

KVM needs the early kill signal. At early kill time it has a good opportunity
to isolate the corruption in guest kernel pages. It will be too late to do
anything useful on late kill.

Proposed by Nick Piggin.

Cc: Nick Piggin <npiggin@suse.de>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 include/linux/prctl.h |    6 ++++++
 include/linux/sched.h |    1 +
 kernel/sys.c          |    6 ++++++
 mm/memory-failure.c   |   12 ++++++++++--
 4 files changed, 23 insertions(+), 2 deletions(-)

--- sound-2.6.orig/include/linux/prctl.h
+++ sound-2.6/include/linux/prctl.h
@@ -88,4 +88,10 @@
 #define PR_TASK_PERF_COUNTERS_DISABLE		31
 #define PR_TASK_PERF_COUNTERS_ENABLE		32
 
+/*
+ * Send early SIGBUS.BUS_MCEERR_AO notification on memory corruption?
+ * Useful for KVM and mission critical apps.
+ */
+#define PR_MEMORY_FAILURE_EARLY_KILL		33
+
 #endif /* _LINUX_PRCTL_H */
--- sound-2.6.orig/include/linux/sched.h
+++ sound-2.6/include/linux/sched.h
@@ -1666,6 +1666,7 @@ extern cputime_t task_gtime(struct task_
 #define PF_MEMALLOC	0x00000800	/* Allocating memory */
 #define PF_FLUSHER	0x00001000	/* responsible for disk writeback */
 #define PF_USED_MATH	0x00002000	/* if unset the fpu must be initialized before use */
+#define PF_EARLY_KILL	0x00004000	/* kill me early on memory failure */
 #define PF_NOFREEZE	0x00008000	/* this thread should not be frozen */
 #define PF_FROZEN	0x00010000	/* frozen for system suspend */
 #define PF_FSTRANS	0x00020000	/* inside a filesystem transaction */
--- sound-2.6.orig/kernel/sys.c
+++ sound-2.6/kernel/sys.c
@@ -1545,6 +1545,12 @@ SYSCALL_DEFINE5(prctl, int, option, unsi
 				current->timer_slack_ns = arg2;
 			error = 0;
 			break;
+		case PR_MEMORY_FAILURE_EARLY_KILL:
+			if (arg2)
+				me->flags |= PF_EARLY_KILL;
+			else
+				me->flags &= ~PF_EARLY_KILL;
+			break;
 		default:
 			error = -EINVAL;
 			break;
--- sound-2.6.orig/mm/memory-failure.c
+++ sound-2.6/mm/memory-failure.c
@@ -214,6 +214,14 @@ static void kill_procs_ao(struct list_he
 	}
 }
 
+static bool task_early_kill_elegible(struct task_struct *tsk)
+{
+	if (!tsk->mm)
+		return false;
+
+	return tsk->flags & PF_EARLY_KILL;
+}
+
 /*
  * Collect processes when the error hit an anonymous page.
  */
@@ -231,7 +239,7 @@ static void collect_procs_anon(struct pa
 		goto out;
 
 	for_each_process (tsk) {
-		if (!tsk->mm)
+		if (!task_early_kill_elegible(tsk))
 			continue;
 		list_for_each_entry (vma, &av->head, anon_vma_node) {
 			if (!page_mapped_in_vma(page, vma))
@@ -271,7 +279,7 @@ static void collect_procs_file(struct pa
 	for_each_process(tsk) {
 		pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
 
-		if (!tsk->mm)
+		if (!task_early_kill_elegible(tsk))
 			continue;
 
 		vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff,

-- 


^ permalink raw reply	[flat|nested] 44+ messages in thread

* [PATCH 12/15] HWPOISON: per process early kill option prctl(PR_MEMORY_FAILURE_EARLY_KILL)
@ 2009-06-20  3:16   ` Wu Fengguang
  0 siblings, 0 replies; 44+ messages in thread
From: Wu Fengguang @ 2009-06-20  3:16 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, Nick Piggin, Wu Fengguang, Ingo Molnar, Minchan Kim,
	Mel Gorman, Thomas Gleixner, H. Peter Anvin, Peter Zijlstra,
	Hugh Dickins, Andi Kleen, riel, chris.mason, linux-mm

[-- Attachment #1: hwpoison-prctl-early-kill.patch --]
[-- Type: text/plain, Size: 3254 bytes --]

This allows an application to request early SIGBUS.BUS_MCEERR_AO
notification as soon as memory corruption in its virtual address space is
detected.

The default option is late kill, i.e. only kill the process when it actually
tries to access the corrupted data. But an admin can still request that a legacy
application be early killed by writing a wrapper tool which calls prctl()
and then execs the application:

	# this_app_shall_be_early_killed  legacy_app

KVM needs the early kill signal. At early kill time it has a good opportunity
to isolate the corruption in guest kernel pages. It will be too late to do
anything useful on late kill.

Proposed by Nick Piggin.

Cc: Nick Piggin <npiggin@suse.de>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 include/linux/prctl.h |    6 ++++++
 include/linux/sched.h |    1 +
 kernel/sys.c          |    6 ++++++
 mm/memory-failure.c   |   12 ++++++++++--
 4 files changed, 23 insertions(+), 2 deletions(-)

--- sound-2.6.orig/include/linux/prctl.h
+++ sound-2.6/include/linux/prctl.h
@@ -88,4 +88,10 @@
 #define PR_TASK_PERF_COUNTERS_DISABLE		31
 #define PR_TASK_PERF_COUNTERS_ENABLE		32
 
+/*
+ * Send early SIGBUS.BUS_MCEERR_AO notification on memory corruption?
+ * Useful for KVM and mission critical apps.
+ */
+#define PR_MEMORY_FAILURE_EARLY_KILL		33
+
 #endif /* _LINUX_PRCTL_H */
--- sound-2.6.orig/include/linux/sched.h
+++ sound-2.6/include/linux/sched.h
@@ -1666,6 +1666,7 @@ extern cputime_t task_gtime(struct task_
 #define PF_MEMALLOC	0x00000800	/* Allocating memory */
 #define PF_FLUSHER	0x00001000	/* responsible for disk writeback */
 #define PF_USED_MATH	0x00002000	/* if unset the fpu must be initialized before use */
+#define PF_EARLY_KILL	0x00004000	/* kill me early on memory failure */
 #define PF_NOFREEZE	0x00008000	/* this thread should not be frozen */
 #define PF_FROZEN	0x00010000	/* frozen for system suspend */
 #define PF_FSTRANS	0x00020000	/* inside a filesystem transaction */
--- sound-2.6.orig/kernel/sys.c
+++ sound-2.6/kernel/sys.c
@@ -1545,6 +1545,12 @@ SYSCALL_DEFINE5(prctl, int, option, unsi
 				current->timer_slack_ns = arg2;
 			error = 0;
 			break;
+		case PR_MEMORY_FAILURE_EARLY_KILL:
+			if (arg2)
+				me->flags |= PF_EARLY_KILL;
+			else
+				me->flags &= ~PF_EARLY_KILL;
+			break;
 		default:
 			error = -EINVAL;
 			break;
--- sound-2.6.orig/mm/memory-failure.c
+++ sound-2.6/mm/memory-failure.c
@@ -214,6 +214,14 @@ static void kill_procs_ao(struct list_he
 	}
 }
 
+static bool task_early_kill_elegible(struct task_struct *tsk)
+{
+	if (!tsk->mm)
+		return false;
+
+	return tsk->flags & PF_EARLY_KILL;
+}
+
 /*
  * Collect processes when the error hit an anonymous page.
  */
@@ -231,7 +239,7 @@ static void collect_procs_anon(struct pa
 		goto out;
 
 	for_each_process (tsk) {
-		if (!tsk->mm)
+		if (!task_early_kill_elegible(tsk))
 			continue;
 		list_for_each_entry (vma, &av->head, anon_vma_node) {
 			if (!page_mapped_in_vma(page, vma))
@@ -271,7 +279,7 @@ static void collect_procs_file(struct pa
 	for_each_process(tsk) {
 		pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
 
-		if (!tsk->mm)
+		if (!task_early_kill_elegible(tsk))
 			continue;
 
 		vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff,

-- 

^ permalink raw reply	[flat|nested] 44+ messages in thread

* [PATCH 13/15] HWPOISON: Add madvise() based injector for hardware poisoned pages v3
  2009-06-20  3:16 ` Wu Fengguang
@ 2009-06-20  3:16   ` Wu Fengguang
  -1 siblings, 0 replies; 44+ messages in thread
From: Wu Fengguang @ 2009-06-20  3:16 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, Andi Kleen, Ingo Molnar, Minchan Kim, Mel Gorman, Wu,
	Fengguang, Thomas Gleixner, H. Peter Anvin, Peter Zijlstra,
	Nick Piggin, Hugh Dickins, Andi Kleen, riel, chris.mason,
	linux-mm

[-- Attachment #1: mf-inject-madvise --]
[-- Type: text/plain, Size: 2854 bytes --]

From: Andi Kleen <ak@linux.intel.com>

Impact: optional, useful for debugging

Add a new madvise sub-command to inject poison for some
pages in a process' address space.  This is useful for
testing the poison page handling.
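
A hypothetical test sketch (not part of the patch): it relies on the
MADV_HWPOISON value added below, must be run with CAP_SYS_ADMIN, and the
exact recovery/kill behaviour depends on the page state handling in patch 11.

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MADV_HWPOISON
#define MADV_HWPOISON	12	/* value added below */
#endif

int main(void)
{
	long psize = sysconf(_SC_PAGESIZE);
	char *p = mmap(NULL, psize, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return 1;
	memset(p, 0xaa, psize);			/* fault the page in */
	if (madvise(p, psize, MADV_HWPOISON) < 0)
		perror("madvise(MADV_HWPOISON)");	/* needs CAP_SYS_ADMIN */
	/* a later access to *p is expected to hit the poison handling
	 * (late kill with SIGBUS) */
	return 0;
}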

Open issues:

- This patch allows root to tie up arbitrary amounts of memory.
Should this be disabled inside containers?
- There's a small race window between getting the page and injecting.
The patch drops the ref count because otherwise memory_failure
complains about dangling references. In theory with a multi threaded
injector one could inject poison for a process foreign page this way.
Not a serious issue right now.

v2: Use write flag for get_user_pages to make sure to always get
a fresh page
v3: Don't request write mapping (Fengguang Wu)

Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>

---
 include/asm-generic/mman-common.h |    1 
 mm/madvise.c                      |   36 ++++++++++++++++++++++++++++
 2 files changed, 37 insertions(+)

--- sound-2.6.orig/mm/madvise.c
+++ sound-2.6/mm/madvise.c
@@ -207,6 +207,38 @@ static long madvise_remove(struct vm_are
 	return error;
 }
 
+#ifdef CONFIG_MEMORY_FAILURE
+/*
+ * Error injection support for memory error handling.
+ */
+static int madvise_hwpoison(unsigned long start, unsigned long end)
+{
+	/*
+	 * RED-PEN
+	 * This allows tying up arbitrary amounts of memory.
+	 * Might be a good idea to disable it inside containers even for root.
+	 */
+	if (!capable(CAP_SYS_ADMIN))
+		return -EPERM;
+	for (; start < end; start += PAGE_SIZE) {
+		struct page *p;
+		int ret = get_user_pages(current, current->mm, start, 1,
+						0, 0, &p, NULL);
+		if (ret != 1)
+			return ret;
+		put_page(p);
+		/*
+		 * RED-PEN page can be reused in a short window, but otherwise
+		 * we'll have to fight with the reference count.
+		 */
+		printk(KERN_INFO "Injecting memory failure for page %lx at %lx\n",
+		       page_to_pfn(p), start);
+		memory_failure(page_to_pfn(p), 0);
+	}
+	return 0;
+}
+#endif
+
 static long
 madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev,
 		unsigned long start, unsigned long end, int behavior)
@@ -307,6 +339,10 @@ SYSCALL_DEFINE3(madvise, unsigned long, 
 	int write;
 	size_t len;
 
+#ifdef CONFIG_MEMORY_FAILURE
+	if (behavior == MADV_HWPOISON)
+		return madvise_hwpoison(start, start+len_in);
+#endif
 	if (!madvise_behavior_valid(behavior))
 		return error;
 
--- sound-2.6.orig/include/asm-generic/mman-common.h
+++ sound-2.6/include/asm-generic/mman-common.h
@@ -34,6 +34,7 @@
 #define MADV_REMOVE	9		/* remove these pages & resources */
 #define MADV_DONTFORK	10		/* don't inherit across fork */
 #define MADV_DOFORK	11		/* do inherit across fork */
+#define MADV_HWPOISON	12		/* poison a page for testing */
 
 /* compatibility flags */
 #define MAP_FILE	0

-- 


^ permalink raw reply	[flat|nested] 44+ messages in thread

* [PATCH 13/15] HWPOISON: Add madvise() based injector for hardware poisoned pages v3
@ 2009-06-20  3:16   ` Wu Fengguang
  0 siblings, 0 replies; 44+ messages in thread
From: Wu Fengguang @ 2009-06-20  3:16 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, Andi Kleen, Ingo Molnar, Minchan Kim, Mel Gorman, Wu,
	Fengguang, Thomas Gleixner, H. Peter Anvin, Peter Zijlstra,
	Nick Piggin, Hugh Dickins, Andi Kleen, riel, chris.mason,
	linux-mm

[-- Attachment #1: mf-inject-madvise --]
[-- Type: text/plain, Size: 3079 bytes --]

From: Andi Kleen <ak@linux.intel.com>

Impact: optional, useful for debugging

Add a new madvise sub-command to inject poison for some
pages in a process' address space.  This is useful for
testing the poison page handling.

Open issues:

- This patch allows root to tie up arbitrary amounts of memory.
Should this be disabled inside containers?
- There's a small race window between getting the page and injecting.
The patch drops the ref count because otherwise memory_failure
complains about dangling references. In theory with a multi threaded
injector one could inject poison for a process foreign page this way.
Not a serious issue right now.

v2: Use write flag for get_user_pages to make sure to always get
a fresh page
v3: Don't request write mapping (Fengguang Wu)

Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>

---
 include/asm-generic/mman-common.h |    1 
 mm/madvise.c                      |   36 ++++++++++++++++++++++++++++
 2 files changed, 37 insertions(+)

--- sound-2.6.orig/mm/madvise.c
+++ sound-2.6/mm/madvise.c
@@ -207,6 +207,38 @@ static long madvise_remove(struct vm_are
 	return error;
 }
 
+#ifdef CONFIG_MEMORY_FAILURE
+/*
+ * Error injection support for memory error handling.
+ */
+static int madvise_hwpoison(unsigned long start, unsigned long end)
+{
+	/*
+	 * RED-PEN
+	 * This allows tying up arbitrary amounts of memory.
+	 * Might be a good idea to disable it inside containers even for root.
+	 */
+	if (!capable(CAP_SYS_ADMIN))
+		return -EPERM;
+	for (; start < end; start += PAGE_SIZE) {
+		struct page *p;
+		int ret = get_user_pages(current, current->mm, start, 1,
+						0, 0, &p, NULL);
+		if (ret != 1)
+			return ret;
+		put_page(p);
+		/*
+		 * RED-PEN page can be reused in a short window, but otherwise
+		 * we'll have to fight with the reference count.
+		 */
+		printk(KERN_INFO "Injecting memory failure for page %lx at %lx\n",
+		       page_to_pfn(p), start);
+		memory_failure(page_to_pfn(p), 0);
+	}
+	return 0;
+}
+#endif
+
 static long
 madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev,
 		unsigned long start, unsigned long end, int behavior)
@@ -307,6 +339,10 @@ SYSCALL_DEFINE3(madvise, unsigned long, 
 	int write;
 	size_t len;
 
+#ifdef CONFIG_MEMORY_FAILURE
+	if (behavior == MADV_HWPOISON)
+		return madvise_hwpoison(start, start+len_in);
+#endif
 	if (!madvise_behavior_valid(behavior))
 		return error;
 
--- sound-2.6.orig/include/asm-generic/mman-common.h
+++ sound-2.6/include/asm-generic/mman-common.h
@@ -34,6 +34,7 @@
 #define MADV_REMOVE	9		/* remove these pages & resources */
 #define MADV_DONTFORK	10		/* don't inherit across fork */
 #define MADV_DOFORK	11		/* do inherit across fork */
+#define MADV_HWPOISON	12		/* poison a page for testing */
 
 /* compatibility flags */
 #define MAP_FILE	0

-- 


^ permalink raw reply	[flat|nested] 44+ messages in thread

* [PATCH 14/15] HWPOISON: Add simple debugfs interface to inject hwpoison on arbitary PFNs
  2009-06-20  3:16 ` Wu Fengguang
@ 2009-06-20  3:16   ` Wu Fengguang
  -1 siblings, 0 replies; 44+ messages in thread
From: Wu Fengguang @ 2009-06-20  3:16 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, Andi Kleen, Ingo Molnar, Minchan Kim, Mel Gorman, Wu,
	Fengguang, Thomas Gleixner, H. Peter Anvin, Peter Zijlstra,
	Nick Piggin, Hugh Dickins, Andi Kleen, riel, chris.mason,
	linux-mm

[-- Attachment #1: pfn-inject --]
[-- Type: text/plain, Size: 2353 bytes --]

From: Andi Kleen <ak@linux.intel.com>

Useful for some testing scenarios, although specific testing is often
done better through MADV_HWPOISON.

This can be done with the x86 level MCE injector too, but this interface
allows it to be done independently of low-level x86 changes.
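
A hypothetical usage sketch (not part of the patch) that feeds a pfn to the
corrupt-pfn file created below; it assumes debugfs is mounted at
/sys/kernel/debug and that the caller has CAP_SYS_ADMIN:

#include <stdio.h>

int main(int argc, char **argv)
{
	FILE *f;

	if (argc < 2) {
		fprintf(stderr, "usage: %s <pfn>\n", argv[0]);
		return 1;
	}
	f = fopen("/sys/kernel/debug/hwpoison/corrupt-pfn", "w");
	if (!f) {
		perror("corrupt-pfn");
		return 1;
	}
	fprintf(f, "%s\n", argv[1]);	/* pfn of the page to poison */
	return fclose(f) ? 1 : 0;
}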

Open issues: 
Should be disabled for cgroups.

Signed-off-by: Andi Kleen <ak@linux.intel.com>

---
 mm/Kconfig           |    4 ++++
 mm/Makefile          |    1 +
 mm/hwpoison-inject.c |   41 +++++++++++++++++++++++++++++++++++++++++
 3 files changed, 46 insertions(+)

--- /dev/null
+++ sound-2.6/mm/hwpoison-inject.c
@@ -0,0 +1,41 @@
+/* Inject a hwpoison memory failure on an arbitrary pfn */
+#include <linux/module.h>
+#include <linux/debugfs.h>
+#include <linux/kernel.h>
+#include <linux/mm.h>
+
+static struct dentry *hwpoison_dir, *corrupt_pfn;
+
+static int hwpoison_inject(void *data, u64 val)
+{
+	if (!capable(CAP_SYS_ADMIN))
+		return -EPERM;
+	printk(KERN_INFO "Injecting memory failure at pfn %Lx\n", val);
+	memory_failure(val, 18);
+	return 0;
+}
+
+DEFINE_SIMPLE_ATTRIBUTE(hwpoison_fops, NULL, hwpoison_inject, "%lli\n");
+
+static void pfn_inject_exit(void)
+{
+	if (hwpoison_dir)
+		debugfs_remove_recursive(hwpoison_dir);
+}
+
+static int pfn_inject_init(void)
+{
+	hwpoison_dir = debugfs_create_dir("hwpoison", NULL);
+	if (hwpoison_dir == NULL)
+		return -ENOMEM;
+	corrupt_pfn = debugfs_create_file("corrupt-pfn", 0600, hwpoison_dir,
+					  NULL, &hwpoison_fops);
+	if (corrupt_pfn == NULL) {
+		pfn_inject_exit();
+		return -ENOMEM;
+	}
+	return 0;
+}
+
+module_init(pfn_inject_init);
+module_exit(pfn_inject_exit);
--- sound-2.6.orig/mm/Kconfig
+++ sound-2.6/mm/Kconfig
@@ -242,6 +242,10 @@ config KSM
 config MEMORY_FAILURE
 	bool
 
+config HWPOISON_INJECT
+	tristate "Hardware poison pages injector"
+	depends on MEMORY_FAILURE && DEBUG_KERNEL
+
 config NOMMU_INITIAL_TRIM_EXCESS
 	int "Turn on mmap() excess space trimming before booting"
 	depends on !MMU
--- sound-2.6.orig/mm/Makefile
+++ sound-2.6/mm/Makefile
@@ -43,5 +43,6 @@ endif
 obj-$(CONFIG_QUICKLIST) += quicklist.o
 obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o page_cgroup.o
 obj-$(CONFIG_MEMORY_FAILURE) += memory-failure.o
+obj-$(CONFIG_HWPOISON_INJECT) += hwpoison-inject.o
 obj-$(CONFIG_DEBUG_KMEMLEAK) += kmemleak.o
 obj-$(CONFIG_DEBUG_KMEMLEAK_TEST) += kmemleak-test.o

-- 


* [PATCH 15/15] HWPOISON: FOR TESTING: Enable memory failure code unconditionally
  2009-06-20  3:16 ` Wu Fengguang
@ 2009-06-20  3:16   ` Wu Fengguang
  -1 siblings, 0 replies; 44+ messages in thread
From: Wu Fengguang @ 2009-06-20  3:16 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, Andi Kleen, Ingo Molnar, Minchan Kim, Mel Gorman, Wu,
	Fengguang, Thomas Gleixner, H. Peter Anvin, Peter Zijlstra,
	Nick Piggin, Hugh Dickins, Andi Kleen, riel, chris.mason,
	linux-mm

[-- Attachment #1: hwpoison-enable --]
[-- Type: text/plain, Size: 566 bytes --]

From: Andi Kleen <ak@linux.intel.com>

Normally the memory-failure.c code is enabled by the architecture, but to
allow easier testing independent of architecture changes, enable it unconditionally.

This should not be merged into mainline.

Signed-off-by: Andi Kleen <ak@linux.intel.com>

---
 mm/Kconfig |    2 ++
 1 file changed, 2 insertions(+)

--- sound-2.6.orig/mm/Kconfig
+++ sound-2.6/mm/Kconfig
@@ -241,6 +241,8 @@ config KSM
 
 config MEMORY_FAILURE
 	bool
+	default y
+	depends on MMU
 
 config HWPOISON_INJECT
 	tristate "Hardware poison pages injector"

-- 


* Re: [PATCH 12/15] HWPOISON: per process early kill option prctl(PR_MEMORY_FAILURE_EARLY_KILL)
  2009-06-20  3:16   ` Wu Fengguang
@ 2009-06-21  8:52     ` Andi Kleen
  -1 siblings, 0 replies; 44+ messages in thread
From: Andi Kleen @ 2009-06-21  8:52 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, LKML, Nick Piggin, Ingo Molnar, Minchan Kim,
	Mel Gorman, Thomas Gleixner, H. Peter Anvin, Peter Zijlstra,
	Hugh Dickins, Andi Kleen, riel, chris.mason, linux-mm

On Sat, Jun 20, 2009 at 11:16:20AM +0800, Wu Fengguang wrote:
> The default option is late kill, ie. only kill the process when it actually
> tries to access the corrupted data. But an admin can still request a legacy
> application to be early killed by writing a wrapper tool which calls prctl()
> and exec the application:
> 
> 	# this_app_shall_be_early_killed  legacy_app
> 
> KVM needs the early kill signal. At early kill time it has good opportunity
> to isolate the corruption in guest kernel pages. It will be too late to do
> anything useful on late kill.
> 
> Proposed by Nick Piggin.

If anything you would need two flags per process: one to signify
that the application set the flag, and another for what the actual
value is.

Also you broke the existing qemu implementation, which obviously
doesn't know about this new flag.

I don't think we need this patch right now.

> +static bool task_early_kill_elegible(struct task_struct *tsk)
> +{
> +	if (!tsk->mm)
> +		return false;

I don't think this can happen.

> +
> +	return tsk->flags & PF_EARLY_KILL;

This type mixing is also dangerous: if someone creates e.g. a char bool
it would always be false.
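
For example, a small sketch of the hazard (illustrative only, not part of
the patch):

#include <stdio.h>

#define PF_EARLY_KILL	0x00004000

typedef char mybool;	/* e.g. an old char-sized bool typedef */

static mybool early_kill_bad(unsigned long flags)
{
	return flags & PF_EARLY_KILL;	/* 0x4000 truncated to char: always 0 */
}

static mybool early_kill_good(unsigned long flags)
{
	return !!(flags & PF_EARLY_KILL);	/* collapsed to 0 or 1 first */
}

int main(void)
{
	unsigned long flags = PF_EARLY_KILL;

	/* prints "bad=0 good=1" even though the flag is set */
	printf("bad=%d good=%d\n", early_kill_bad(flags), early_kill_good(flags));
	return 0;
}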

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

* Re: [PATCH 11/15] HWPOISON: The high level memory error handler in the VM v8
  2009-06-20  3:16   ` Wu Fengguang
@ 2009-06-21  8:57     ` Andi Kleen
  -1 siblings, 0 replies; 44+ messages in thread
From: Andi Kleen @ 2009-06-21  8:57 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, LKML, hugh.dickins, npiggin, chris.mason,
	Rik van Riel, Andi Kleen, Ingo Molnar, Minchan Kim, Mel Gorman,
	Thomas Gleixner, H. Peter Anvin, Peter Zijlstra, Andi Kleen,
	linux-mm

> v8:
> check for page_mapped_in_vma() on anon pages (Hugh, Fengguang)

This change was no good as discussed earlier.

> test and use page->mapping instead of page_mapping() (Fengguang)
> cleanup some early kill comments (Fengguang)

This stuff belongs in the manpage. I haven't written it yet,
but will. I don't think kernel source comments are the right place.

> introduce invalidate_inode_page() and don't remove dirty/writeback pages
> from page cache (Nick, Fengguang)

I'm still dubious this is a good idea; it means a lot of pages are
potentially not covered.

-Andi

* Re: [PATCH 11/15] HWPOISON: The high level memory error handler in the VM v8
  2009-06-21  8:57     ` Andi Kleen
@ 2009-06-22  9:02       ` Wu Fengguang
  -1 siblings, 0 replies; 44+ messages in thread
From: Wu Fengguang @ 2009-06-22  9:02 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Andrew Morton, LKML, hugh.dickins, npiggin, chris.mason,
	Rik van Riel, Andi Kleen, Ingo Molnar, Minchan Kim, Mel Gorman,
	Thomas Gleixner, H. Peter Anvin, Peter Zijlstra, linux-mm

On Sun, Jun 21, 2009 at 04:57:21PM +0800, Andi Kleen wrote:
> > v8:
> > check for page_mapped_in_vma() on anon pages (Hugh, Fengguang)
> 
> This change was no good as discussed earlier.
> 
> > test and use page->mapping instead of page_mapping() (Fengguang)
> > cleanup some early kill comments (Fengguang)
> 
> This stuff belongs into the manpage. I haven't written it yet,
> but will. I don't think kernel source comments is the right place.

OK. You may take this block

        + * When the corrupted page data is not recoverable, the tasks that mapped the page
        + * have to be killed. We offer two kill options:
        + * - early kill with SIGBUS.BUS_MCEERR_AO (optional)
        + * - late  kill with SIGBUS.BUS_MCEERR_AR (mandatory)
        + * A task will be early killed as soon as corruption is found in its virtual
        + * address space, if it has called prctl(PR_MEMORY_FAILURE_EARLY_KILL, 1, ...);
        + * Any task will be late killed when it tries to access its corrupted virtual
        + * address. The early kill option offers KVM or other apps with large caches an
        + * opportunity to isolate the corrupted page from its internal cache, so as to
        + * avoid being late killed.

and/or this one into the man page :)

        +=============================================================
        +
        +memory_failure_early_kill:
        +
        +Control how to kill processes when an uncorrected memory error (typically
        +a 2bit error in a memory module) that cannot be handled by the kernel is
        +detected in the background by hardware. In some cases (like the page
        +still having a valid copy on disk) the kernel will handle the failure
        +transparently without affecting any applications. But if there is
        +no other uptodate copy of the data it will kill processes to prevent any
        +data corruption from propagating.
        +
        +1: Kill all processes that have the corrupted and not reloadable page mapped
        +as soon as the corruption is detected.  Note this is not supported
        +for a few types of pages, like kernel internally allocated data or
        +the swap cache, but works for the majority of user pages.
        +
        +0: Only unmap the corrupted page from all processes and only kill a process
        +that tries to access it.
        +
        +The kill is done using a catchable SIGBUS with BUS_MCEERR_AO, so processes can
        +handle this if they want to.
        +
        +This is only active on architectures/platforms with advanced machine
        +check handling and depends on the hardware capabilities.
        +
         ==============================================================


But I think these new comments are OK?

        + * We don't aim to rescue 100% of corruptions. That's an unreasonable goal - the
        + * kernel text and slab pages (including the big dcache/icache) are almost
        + * impossible to isolate. We also try to keep the code clean by ignoring the
        + * other thousands of small corruption windows.
        + *

And this:

        @@ -275,6 +286,12 @@ static void collect_procs_file(struct pa

                        vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff,
                                              pgoff)
        +                       /*
        +                        * Send early kill signal to tasks whose vma covers
        +                        * the page but has not necessarily mapped it in its pte.
        +                        * Applications that requested early kill normally want
        +                        * to be informed of such data corruptions.
        +                        */
                                if (vma->vm_mm == tsk->mm)
                                        add_to_kill(tsk, page, vma, to_kill, tkc);
                }
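
On the application side, the catchable SIGBUS described above could be
handled along these lines (a rough sketch only; the numeric fallback for
BUS_MCEERR_AO is an assumption, and the process is assumed to have opted
in to early kill via the prctl from patch 12):

#define _POSIX_C_SOURCE 200112L
#include <signal.h>
#include <string.h>
#include <unistd.h>

#ifndef BUS_MCEERR_AO
#define BUS_MCEERR_AO	5	/* assumed value for the code added earlier in this series */
#endif

static void hwpoison_handler(int sig, siginfo_t *si, void *ctx)
{
	(void)sig;
	(void)ctx;
	if (si->si_code == BUS_MCEERR_AO) {
		/* si->si_addr points into the poisoned page: a cache-heavy
		 * application (e.g. KVM) would drop or reload its copy of
		 * that page here instead of waiting to be late killed */
		static const char msg[] = "early memory failure notification\n";
		write(2, msg, sizeof(msg) - 1);
	}
}

int main(void)
{
	struct sigaction sa;

	memset(&sa, 0, sizeof(sa));
	sa.sa_sigaction = hwpoison_handler;
	sa.sa_flags = SA_SIGINFO;
	sigaction(SIGBUS, &sa, NULL);

	pause();	/* normal application work would go here */
	return 0;
}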

> > introduce invalidate_inode_page() and don't remove dirty/writeback pages
> > from page cache (Nick, Fengguang)
> 
> I'm still dubious this is a good idea, it means potentially a lot 
> of pages not covered.

This is good for .31 I think. But for .32 we can sure cover more pages :)

Thanks,
Fengguang

* Re: [PATCH 12/15] HWPOISON: per process early kill option prctl(PR_MEMORY_FAILURE_EARLY_KILL)
  2009-06-21  8:52     ` Andi Kleen
@ 2009-06-22  9:27       ` Wu Fengguang
  -1 siblings, 0 replies; 44+ messages in thread
From: Wu Fengguang @ 2009-06-22  9:27 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Andrew Morton, LKML, Nick Piggin, Ingo Molnar, Minchan Kim,
	Mel Gorman, Thomas Gleixner, H. Peter Anvin, Peter Zijlstra,
	Hugh Dickins, riel, chris.mason, linux-mm

On Sun, Jun 21, 2009 at 04:52:12PM +0800, Andi Kleen wrote:
> On Sat, Jun 20, 2009 at 11:16:20AM +0800, Wu Fengguang wrote:
> > The default option is late kill, ie. only kill the process when it actually
> > tries to access the corrupted data. But an admin can still request a legacy
> > application to be early killed by writing a wrapper tool which calls prctl()
> > and exec the application:
> > 
> > 	# this_app_shall_be_early_killed  legacy_app
> > 
> > KVM needs the early kill signal. At early kill time it has good opportunity
> > to isolate the corruption in guest kernel pages. It will be too late to do
> > anything useful on late kill.
> > 
> > Proposed by Nick Piggin.
> 
> If anything you would need two flags per process: one to signify
> that the application set the flag and another what the actual
> value is.

OK.

> Also you broke the existing qemu implementation now which obviously
> doesn't know about this new flag.

Yes, it is.

> I don't think we need this patch right now.

There must be a policy and maybe a parameter.
So far three schemes have been proposed:

A) default to early kill; export vm.memory_failure_early_kill
B) default to late kill; export prctl() parameters
C) early kill if a SIGBUS handler is installed, otherwise late kill;
   no parameter is exported.

If we don't do this patch, what would be your preferred policy?

> > +static bool task_early_kill_elegible(struct task_struct *tsk)
> > +{
> > +	if (!tsk->mm)
> > +		return false;
> 
> I don't think this can happen.

It's for skipping kernel threads, you wrote that code :)

> > +
> > +	return tsk->flags & PF_EARLY_KILL;
> 
> This type mixing is also dangerous, if someone create e.g. a char bool
> it would be always false.

That's right. Here is the updated patch.

---
HWPOISON: per process early kill option prctl(PR_SET_MEMORY_FAILURE_EARLY_KILL)

This allows an application to request early SIGBUS.BUS_MCEERR_AO
notification as soon as memory corruption in its virtual address space is
detected.

The default option is late kill, i.e. only kill the process when it actually
tries to access the corrupted data. But an admin can still request that a
legacy application be early killed by writing a wrapper tool which calls
prctl() and then execs the application:

	# this_app_shall_be_early_killed  legacy_app
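
A sketch of what such a wrapper could look like, relying on the
PR_SET_MEMORY_FAILURE_EARLY_KILL value defined in the prctl.h hunk below:

#include <stdio.h>
#include <unistd.h>
#include <sys/prctl.h>

#ifndef PR_SET_MEMORY_FAILURE_EARLY_KILL
#define PR_SET_MEMORY_FAILURE_EARLY_KILL	33	/* from include/linux/prctl.h below */
#endif

int main(int argc, char **argv)
{
	if (argc < 2) {
		fprintf(stderr, "usage: %s <program> [args...]\n", argv[0]);
		return 1;
	}

	/* request early SIGBUS.BUS_MCEERR_AO notification; the task flag
	 * set here is kept across the exec into the legacy application */
	if (prctl(PR_SET_MEMORY_FAILURE_EARLY_KILL, 1, 0, 0, 0) < 0) {
		perror("prctl");
		return 1;
	}

	execvp(argv[1], argv + 1);
	perror("execvp");
	return 1;
}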

KVM needs the early kill signal. At early kill time it has a good opportunity
to isolate the corruption in guest kernel pages. It will be too late to do
anything useful on late kill.

Proposed by Nick Piggin.

Cc: Nick Piggin <npiggin@suse.de>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 include/linux/prctl.h |    7 +++++++
 include/linux/sched.h |    1 +
 kernel/sys.c          |    9 +++++++++
 mm/memory-failure.c   |   12 ++++++++++--
 4 files changed, 27 insertions(+), 2 deletions(-)

--- sound-2.6.orig/include/linux/prctl.h
+++ sound-2.6/include/linux/prctl.h
@@ -88,4 +88,11 @@
 #define PR_TASK_PERF_COUNTERS_DISABLE		31
 #define PR_TASK_PERF_COUNTERS_ENABLE		32
 
+/*
+ * Send early SIGBUS.BUS_MCEERR_AO notification on memory corruption?
+ * Useful for KVM and mission critical apps.
+ */
+#define PR_SET_MEMORY_FAILURE_EARLY_KILL	33
+#define PR_GET_MEMORY_FAILURE_EARLY_KILL	34
+
 #endif /* _LINUX_PRCTL_H */
--- sound-2.6.orig/include/linux/sched.h
+++ sound-2.6/include/linux/sched.h
@@ -1666,6 +1666,7 @@ extern cputime_t task_gtime(struct task_
 #define PF_MEMALLOC	0x00000800	/* Allocating memory */
 #define PF_FLUSHER	0x00001000	/* responsible for disk writeback */
 #define PF_USED_MATH	0x00002000	/* if unset the fpu must be initialized before use */
+#define PF_EARLY_KILL	0x00004000	/* kill me early on memory failure */
 #define PF_NOFREEZE	0x00008000	/* this thread should not be frozen */
 #define PF_FROZEN	0x00010000	/* frozen for system suspend */
 #define PF_FSTRANS	0x00020000	/* inside a filesystem transaction */
--- sound-2.6.orig/kernel/sys.c
+++ sound-2.6/kernel/sys.c
@@ -1545,6 +1545,15 @@ SYSCALL_DEFINE5(prctl, int, option, unsi
 				current->timer_slack_ns = arg2;
 			error = 0;
 			break;
+		case PR_GET_MEMORY_FAILURE_EARLY_KILL:
+			error = !!(me->flags & PF_EARLY_KILL);
+			break;
+		case PR_SET_MEMORY_FAILURE_EARLY_KILL:
+			if (arg2)
+				me->flags |= PF_EARLY_KILL;
+			else
+				me->flags &= ~PF_EARLY_KILL;
+			break;
 		default:
 			error = -EINVAL;
 			break;
--- sound-2.6.orig/mm/memory-failure.c
+++ sound-2.6/mm/memory-failure.c
@@ -214,6 +214,14 @@ static void kill_procs_ao(struct list_he
 	}
 }
 
+static bool task_early_kill_elegible(struct task_struct *tsk)
+{
+	if (!tsk->mm)
+		return false;
+
+	return !!(tsk->flags & PF_EARLY_KILL);
+}
+
 /*
  * Collect processes when the error hit an anonymous page.
  */
@@ -231,7 +239,7 @@ static void collect_procs_anon(struct pa
 		goto out;
 
 	for_each_process (tsk) {
-		if (!tsk->mm)
+		if (!task_early_kill_elegible(tsk))
 			continue;
 		list_for_each_entry (vma, &av->head, anon_vma_node) {
 			if (!page_mapped_in_vma(page, vma))
@@ -271,7 +279,7 @@ static void collect_procs_file(struct pa
 	for_each_process(tsk) {
 		pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
 
-		if (!tsk->mm)
+		if (!task_early_kill_elegible(tsk))
 			continue;
 
 		vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff,

* Re: [PATCH 11/15] HWPOISON: The high level memory error handler in the VM v8
  2009-06-21  8:57     ` Andi Kleen
@ 2009-06-22  9:37       ` Wu Fengguang
  -1 siblings, 0 replies; 44+ messages in thread
From: Wu Fengguang @ 2009-06-22  9:37 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Andrew Morton, LKML, hugh.dickins, npiggin, chris.mason,
	Rik van Riel, Andi Kleen, Ingo Molnar, Minchan Kim, Mel Gorman,
	Thomas Gleixner, H. Peter Anvin, Peter Zijlstra, linux-mm

On Sun, Jun 21, 2009 at 04:57:21PM +0800, Andi Kleen wrote:
> > v8:
> > check for page_mapped_in_vma() on anon pages (Hugh, Fengguang)
> 
> This change was no good as discussed earlier.

My understanding is that we don't do page_mapped_in_vma() for file pages.
But for anon pages, this check can avoid killing tasks that are possibly unaffected.

Thanks,
Fengguang

* Re: [PATCH 11/15] HWPOISON: The high level memory error handler in the VM v8
  2009-06-21  8:57     ` Andi Kleen
@ 2009-06-24  9:17       ` Hidehiro Kawai
  -1 siblings, 0 replies; 44+ messages in thread
From: Hidehiro Kawai @ 2009-06-24  9:17 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Wu Fengguang, Andrew Morton, LKML, hugh.dickins, npiggin,
	chris.mason, Rik van Riel, Andi Kleen, Ingo Molnar, Minchan Kim,
	Mel Gorman, Thomas Gleixner, H. Peter Anvin, Peter Zijlstra,
	linux-mm, Satoshi OSHIMA

Andi Kleen wrote:

>>introduce invalidate_inode_page() and don't remove dirty/writeback pages
>>from page cache (Nick, Fengguang)
> 
> I'm still dubious this is a good idea, it means potentially a lot 
> of pages not covered.

I think this is not a bad idea for now.
Certainly we become unable to recover from uncorrected memory errors
on dirty page cache pages, but it will be safer than the old patch.

As for the ext3 filesystem, unlike a usual I/O error, the I/O error
generated by the HWPOISON feature doesn't cause a journal abort and
read-only remount even if the fs has been mounted with the data=ordered
and data_err=abort options.  So we can re-read the old data and
update the fs after failing to write out dirty data, which may
cause an integrity problem.  And improper data can be exposed
by the re-read if it's a newly allocated data block.

The number of dirty pages is not so large because it is limited
by dirty_ratio or dirty_bytes.  The HWPOISON feature is still
useful even if it doesn't cover dirty page cache pages. :-)

Regards,
-- 
Hidehiro Kawai
Hitachi, Systems Development Laboratory
Linux Technology Center

