* [PATCH] [0/16] POISON: Intro
@ 2009-04-07 15:09 ` Andi Kleen
  0 siblings, 0 replies; 150+ messages in thread
From: Andi Kleen @ 2009-04-07 15:09 UTC (permalink / raw)
  To: linux-kernel, linux-mm, x86


Upcoming Intel CPUs have support for recovering from some memory errors. This
requires the OS to declare a page "poisoned", kill the processes associated
with it and avoid using it in the future. This patchkit implements
the necessary infrastructure in the VM.

To quote the overview comment:

 * High level machine check handler. Handles pages reported by the
 * hardware as being corrupted usually due to a 2bit ECC memory or cache
 * failure.
 *
 * This focusses on pages detected as corrupted in the background.
 * When the current CPU tries to consume corruption the currently
 * running process can just be killed directly instead. This implies
 * that if the error cannot be handled for some reason it's safe to
 * just ignore it because no corruption has been consumed yet. Instead
 * when that happens another machine check will happen.
 *
 * Handles page cache pages in various states. The tricky part
 * here is that we can access any page asynchronous to other VM
 * users, because memory failures could happen anytime and anywhere,
 * possibly violating some of their assumptions. This is why this code
 * has to be extremely careful. Generally it tries to use normal locking
 * rules, as in get the standard locks, even if that means the
 * error handling takes potentially a long time.
 *
 * Some of the operations here are somewhat inefficient and have non
 * linear algorithmic complexity, because the data structures have not
 * been optimized for this case. This is in particular the case
 * for the mapping from a vma to a process. Since this case is expected
 * to be rare we hope we can get away with this.
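The non-linear vma-to-process lookup mentioned in the comment can be modeled in userspace: with no reverse index, every VMA of every process has to be scanned to find who maps a given page frame. This is only an illustrative sketch with invented types, not the kernel's rmap code:

```c
#include <stddef.h>

/* Toy model: a "process" owns a few VMAs, each covering a pfn range.
 * Finding who maps a given pfn means scanning every VMA of every
 * process - there is no index optimized for this rare case. */
struct vma  { unsigned long pfn_start, pfn_end; };
struct proc { int pid; struct vma vmas[4]; int nvmas; };

/* Return the pid of the first process mapping `pfn`, or -1 if nobody
 * maps it (then the page can be isolated without killing anyone). */
static int find_mapper(const struct proc *procs, int nprocs, unsigned long pfn)
{
    for (int p = 0; p < nprocs; p++)
        for (int v = 0; v < procs[p].nvmas; v++)
            if (pfn >= procs[p].vmas[v].pfn_start &&
                pfn <  procs[p].vmas[v].pfn_end)
                return procs[p].pid;
    return -1;
}
```

The cost is O(processes × vmas), which is exactly the "non linear algorithmic complexity" the comment accepts because memory failures are rare.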

The code consists of the high level handler in mm/memory-failure.c,
a new page poison bit, and various checks in the VM to handle poisoned
pages.

The main target right now is KVM guests, but it works for all kinds
of applications.

For the KVM use case a new signal type was needed so that
KVM can inject the machine check into the guest with the proper
address. In theory this allows other applications to handle
memory failures too. The expectation is that nearly all applications
won't do that, but some very specialized ones might.

This is not fully complete yet; in particular there are still various
ways to access poisoned pages (crash dump, /proc/kcore etc.)
that need to be plugged too.

Also, undoubtedly the high level handler still has bugs and cases
it cannot recover from. For example, nonlinear mappings deadlock right now,
and a few other cases lose references. Huge pages are not supported
yet. Any additional testing, reviewing etc. is welcome.

The patch series requires the earlier x86 MCE feature series for the x86
specific action-optional part. The code can be tested without the x86 specific
part using the injector; this only requires enabling the Kconfig entry
manually in some Kconfig file (by default it is implicitly enabled
by the architecture).

-Andi

^ permalink raw reply	[flat|nested] 150+ messages in thread


* [PATCH] [1/16] POISON: Add support for high priority work items
  2009-04-07 15:09 ` Andi Kleen
@ 2009-04-07 15:09   ` Andi Kleen
  -1 siblings, 0 replies; 150+ messages in thread
From: Andi Kleen @ 2009-04-07 15:09 UTC (permalink / raw)
  To: linux-kernel, linux-mm, x86


The machine check poison handling needs to get to process context very
quickly.  Add a new high priority queueing mechanism for work items.
This should only be used in exceptional cases! (but a machine check
is definitely exceptional)

The insert is not fully O(1) with respect to other high priority
items, but those should be rather rare anyway.
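The ordering rule in the patch — a high priority item is inserted after the last high priority item already queued, ahead of all normal work — can be sketched with a toy array-backed queue. This is a simplification of the list_head code below, with invented names:

```c
#include <string.h>

/* Toy model of the queue ordering only: each slot holds 1 for a
 * high priority item, 0 for normal work. */
struct toyq { int pri[16]; int n; };

/* Enqueue one item. Normal work goes to the tail. High priority work
 * is inserted after the last high priority item already queued, so
 * urgent items cannot starve each other but still jump ahead of all
 * normal work. */
static void toyq_add(struct toyq *q, int highpri)
{
    int pos = q->n;                       /* default: tail */
    if (highpri) {
        pos = 0;
        while (pos < q->n && q->pri[pos]) /* skip existing highpri */
            pos++;
    }
    /* shift the normal items up and drop the new one in */
    memmove(&q->pri[pos + 1], &q->pri[pos], (q->n - pos) * sizeof(int));
    q->pri[pos] = highpri;
    q->n++;
}
```

The linear scan mirrors the list_for_each_entry() walk in the real patch: not O(1), but acceptable because high priority items are rare.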

Signed-off-by: Andi Kleen <ak@linux.intel.com>

---
 include/linux/workqueue.h |    3 +++
 kernel/workqueue.c        |   15 +++++++++++++++
 2 files changed, 18 insertions(+)

Index: linux/include/linux/workqueue.h
===================================================================
--- linux.orig/include/linux/workqueue.h	2009-04-07 16:39:28.000000000 +0200
+++ linux/include/linux/workqueue.h	2009-04-07 16:39:39.000000000 +0200
@@ -25,6 +25,7 @@
 struct work_struct {
 	atomic_long_t data;
 #define WORK_STRUCT_PENDING 0		/* T if work item pending execution */
+#define WORK_STRUCT_HIGHPRI 1		/* work is high priority */
 #define WORK_STRUCT_FLAG_MASK (3UL)
 #define WORK_STRUCT_WQ_DATA_MASK (~WORK_STRUCT_FLAG_MASK)
 	struct list_head entry;
@@ -163,6 +164,8 @@
 #define work_clear_pending(work) \
 	clear_bit(WORK_STRUCT_PENDING, work_data_bits(work))
 
+#define set_work_highpri(work) \
+	set_bit(WORK_STRUCT_HIGHPRI, work_data_bits(work))
 
 extern struct workqueue_struct *
 __create_workqueue_key(const char *name, int singlethread,
Index: linux/kernel/workqueue.c
===================================================================
--- linux.orig/kernel/workqueue.c	2009-04-07 16:39:28.000000000 +0200
+++ linux/kernel/workqueue.c	2009-04-07 16:39:39.000000000 +0200
@@ -132,6 +132,21 @@
 	 * result of list_add() below, see try_to_grab_pending().
 	 */
 	smp_wmb();
+	/*
+	 * Insert after last high priority item. This avoids
+	 * them starving each other.
+	 * High priority items should be rare, so it's ok to not have
+	 * O(1) insert for them.
+	 */
+	if (test_bit(WORK_STRUCT_HIGHPRI, work_data_bits(work)) &&
+		!list_empty(head)) {
+		struct work_struct *w;
+		list_for_each_entry (w, head, entry) {
+			if (!test_bit(WORK_STRUCT_HIGHPRI, work_data_bits(w)))
+				break;
+		}
+		head = &w->entry;
+	}
 	list_add_tail(&work->entry, head);
 	wake_up(&cwq->more_work);
 }



* [PATCH] [2/16] POISON: Add page flag for poisoned pages
  2009-04-07 15:09 ` Andi Kleen
@ 2009-04-07 15:09   ` Andi Kleen
  -1 siblings, 0 replies; 150+ messages in thread
From: Andi Kleen @ 2009-04-07 15:09 UTC (permalink / raw)
  To: linux-kernel, linux-mm, x86


Poisoned pages need special handling in the VM and shouldn't be touched 
again. This requires a new page flag. Define it here.

The page flags wars seem to be over, so it shouldn't be a problem
to get a new one. I hope.
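Conceptually the new flag is just one more bit in the per-page flags word, and adding __PG_POISON to the bad-flags mask (as the hunk below does) is what makes the free path reject such pages. A userspace sketch with invented bit positions, not the kernel's enum values:

```c
/* Toy model of the per-page flags word. Bit numbers are illustrative;
 * the real values come from enum pageflags in page-flags.h. */
enum { PG_LOCKED, PG_DIRTY, PG_POISON, NR_TOY_FLAGS };

#define BIT_OF(f) (1UL << (f))

/* Flags that must never be set on a page handed back to the
 * allocator - the poison bit joins this set in the patch. */
#define FLAGS_CHECK_AT_FREE (BIT_OF(PG_DIRTY) | BIT_OF(PG_POISON))

static int bad_page_at_free(unsigned long flags)
{
    return (flags & FLAGS_CHECK_AT_FREE) != 0;
}
```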

Signed-off-by: Andi Kleen <ak@linux.intel.com>

---
 include/linux/page-flags.h |   16 +++++++++++++++-
 1 file changed, 15 insertions(+), 1 deletion(-)

Index: linux/include/linux/page-flags.h
===================================================================
--- linux.orig/include/linux/page-flags.h	2009-04-07 16:39:27.000000000 +0200
+++ linux/include/linux/page-flags.h	2009-04-07 16:39:39.000000000 +0200
@@ -51,6 +51,9 @@
  * PG_buddy is set to indicate that the page is free and in the buddy system
  * (see mm/page_alloc.c).
  *
+ * PG_poison indicates that a page got corrupted in hardware and contains
+ * data with incorrect ECC bits that triggered a machine check. Accessing is
+ * not safe since it may cause another machine check. Don't touch!
  */
 
 /*
@@ -104,6 +107,9 @@
 #ifdef CONFIG_IA64_UNCACHED_ALLOCATOR
 	PG_uncached,		/* Page has been mapped as uncached */
 #endif
+#ifdef CONFIG_MEMORY_FAILURE
+	PG_poison,		/* poisoned page. Don't touch */
+#endif
 	__NR_PAGEFLAGS,
 
 	/* Filesystems */
@@ -273,6 +279,14 @@
 PAGEFLAG_FALSE(Uncached)
 #endif
 
+#ifdef CONFIG_MEMORY_FAILURE
+PAGEFLAG(Poison, poison)
+#define __PG_POISON (1UL << PG_poison)
+#else
+PAGEFLAG_FALSE(Poison)
+#define __PG_POISON 0
+#endif
+
 static inline int PageUptodate(struct page *page)
 {
 	int ret = test_bit(PG_uptodate, &(page)->flags);
@@ -403,7 +417,7 @@
 	 1 << PG_private | 1 << PG_private_2 | \
 	 1 << PG_buddy	 | 1 << PG_writeback | 1 << PG_reserved | \
 	 1 << PG_slab	 | 1 << PG_swapcache | 1 << PG_active | \
-	 __PG_UNEVICTABLE | __PG_MLOCKED)
+	 __PG_POISON  | __PG_UNEVICTABLE | __PG_MLOCKED)
 
 /*
  * Flags checked when a page is prepped for return by the page allocator.



* [PATCH] [3/16] POISON: Handle poisoned pages in page free
  2009-04-07 15:09 ` Andi Kleen
@ 2009-04-07 15:09   ` Andi Kleen
  -1 siblings, 0 replies; 150+ messages in thread
From: Andi Kleen @ 2009-04-07 15:09 UTC (permalink / raw)
  To: linux-kernel, linux-mm, x86


Make sure no poisoned pages are put back into the free page
lists.  This can happen with some races.

This is all in the slow path of the bad page bits handling, so another
check doesn't really matter.
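The effect of the hunk — deliberately leaking a poisoned page rather than recycling it — can be modeled with a trivial free list. Illustrative types only, not mm/page_alloc.c:

```c
#include <stddef.h>

struct toy_page {
    int poisoned;                 /* stands in for PagePoison() */
    struct toy_page *next;
};

/* Put a page back on the free list - unless it was marked poisoned,
 * in which case it is intentionally leaked so it can never be
 * allocated (and consumed) again. Returns 1 if the page was kept. */
static int toy_free_page(struct toy_page **free_list, struct toy_page *page)
{
    if (page->poisoned)
        return 0;                 /* dropped on the floor on purpose */
    page->next = *free_list;
    *free_list = page;
    return 1;
}
```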

Signed-off-by: Andi Kleen <ak@linux.intel.com>

---
 mm/page_alloc.c |    9 +++++++++
 1 file changed, 9 insertions(+)

Index: linux/mm/page_alloc.c
===================================================================
--- linux.orig/mm/page_alloc.c	2009-04-07 16:39:26.000000000 +0200
+++ linux/mm/page_alloc.c	2009-04-07 16:39:39.000000000 +0200
@@ -228,6 +228,15 @@
 	static unsigned long nr_unshown;
 
 	/*
+	 * Page may have been marked bad before process is freeing it.
+	 * Make sure it is not put back into the free page lists.
+	 */
+	if (PagePoison(page)) {
+		/* check more flags here... */
+		return;
+	}
+
+	/*
 	 * Allow a burst of 60 reports, then keep quiet for that minute;
 	 * or allow a steady drip of one report per second.
 	 */



* [PATCH] [4/16] POISON: Export some rmap vma locking to outside world
  2009-04-07 15:09 ` Andi Kleen
@ 2009-04-07 15:10   ` Andi Kleen
  -1 siblings, 0 replies; 150+ messages in thread
From: Andi Kleen @ 2009-04-07 15:10 UTC (permalink / raw)
  To: linux-kernel, linux-mm, x86


Needed for a later patch that walks rmap entries on its own.

This used to be very frowned upon, but memory-failure.c does
some rather specialized rmap walking and rmap has been stable
for quite some time, so I think it's ok now to export it.

Signed-off-by: Andi Kleen <ak@linux.intel.com>

---
 include/linux/rmap.h |    6 ++++++
 mm/rmap.c            |    4 ++--
 2 files changed, 8 insertions(+), 2 deletions(-)

Index: linux/include/linux/rmap.h
===================================================================
--- linux.orig/include/linux/rmap.h	2009-04-07 16:39:26.000000000 +0200
+++ linux/include/linux/rmap.h	2009-04-07 16:43:06.000000000 +0200
@@ -118,6 +118,12 @@
 }
 #endif
 
+/*
+ * Called by memory-failure.c to kill processes.
+ */
+struct anon_vma *page_lock_anon_vma(struct page *page);
+void page_unlock_anon_vma(struct anon_vma *anon_vma);
+
 #else	/* !CONFIG_MMU */
 
 #define anon_vma_init()		do {} while (0)
Index: linux/mm/rmap.c
===================================================================
--- linux.orig/mm/rmap.c	2009-04-07 16:39:26.000000000 +0200
+++ linux/mm/rmap.c	2009-04-07 16:43:06.000000000 +0200
@@ -191,7 +191,7 @@
  * Getting a lock on a stable anon_vma from a page off the LRU is
  * tricky: page_lock_anon_vma rely on RCU to guard against the races.
  */
-static struct anon_vma *page_lock_anon_vma(struct page *page)
+struct anon_vma *page_lock_anon_vma(struct page *page)
 {
 	struct anon_vma *anon_vma;
 	unsigned long anon_mapping;
@@ -211,7 +211,7 @@
 	return NULL;
 }
 
-static void page_unlock_anon_vma(struct anon_vma *anon_vma)
+void page_unlock_anon_vma(struct anon_vma *anon_vma)
 {
 	spin_unlock(&anon_vma->lock);
 	rcu_read_unlock();



* [PATCH] [5/16] POISON: Add support for poison swap entries
  2009-04-07 15:09 ` Andi Kleen
@ 2009-04-07 15:10   ` Andi Kleen
  -1 siblings, 0 replies; 150+ messages in thread
From: Andi Kleen @ 2009-04-07 15:10 UTC (permalink / raw)
  To: linux-kernel, linux-mm, x86


CPU migration uses special swap entry types to trigger special actions on page
faults. Extend this mechanism to also support poisoned swap entries, which
trigger poison handling on page faults. This allows follow-on patches to
prevent processes from faulting in poisoned pages again.
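The 5/27 type/offset split referenced in the hunk below can be sketched directly. This mirrors the idea behind swp_entry()/swp_type()/swp_offset(), with simplified names and without the architecture-specific pte encoding; the type value reserved for poison here is an arbitrary illustration:

```c
/* 5 bits of type, 27 bits of offset, as in the swap.h comment. */
#define TOY_TYPE_BITS   5
#define TOY_OFFSET_BITS 27
#define TOY_MAX_TYPE    ((1u << TOY_TYPE_BITS) - 1)
#define TOY_SWP_POISON  TOY_MAX_TYPE   /* stolen from the swap files */

typedef struct { unsigned long val; } toy_swp_entry_t;

static toy_swp_entry_t toy_swp_entry(unsigned type, unsigned long offset)
{
    toy_swp_entry_t e = { ((unsigned long)type << TOY_OFFSET_BITS) | offset };
    return e;
}

static unsigned toy_swp_type(toy_swp_entry_t e)
{
    return (unsigned)(e.val >> TOY_OFFSET_BITS);
}

static unsigned long toy_swp_offset(toy_swp_entry_t e)
{
    return e.val & ((1UL << TOY_OFFSET_BITS) - 1);
}

/* A poison entry stores the pfn in the offset field, so a later
 * fault can tell exactly which page frame was bad. */
static int toy_is_poison(toy_swp_entry_t e)
{
    return toy_swp_type(e) == TOY_SWP_POISON;
}
```

A fault on a pte carrying such an entry can then branch to poison handling instead of swap-in, which is exactly the hook the patch series needs.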

Signed-off-by: Andi Kleen <ak@linux.intel.com>

---
 include/linux/swap.h    |   34 ++++++++++++++++++++++++++++------
 include/linux/swapops.h |   38 ++++++++++++++++++++++++++++++++++++++
 mm/swapfile.c           |    4 ++--
 3 files changed, 68 insertions(+), 8 deletions(-)

Index: linux/include/linux/swap.h
===================================================================
--- linux.orig/include/linux/swap.h	2009-04-07 16:39:25.000000000 +0200
+++ linux/include/linux/swap.h	2009-04-07 16:39:39.000000000 +0200
@@ -34,16 +34,38 @@
  * the type/offset into the pte as 5/27 as well.
  */
 #define MAX_SWAPFILES_SHIFT	5
-#ifndef CONFIG_MIGRATION
-#define MAX_SWAPFILES		(1 << MAX_SWAPFILES_SHIFT)
+
+/*
+ * Use some of the swap files numbers for other purposes. This
+ * is a convenient way to hook into the VM to trigger special
+ * actions on faults.
+ */
+
+/*
+ * NUMA node memory migration support
+ */
+#ifdef CONFIG_MIGRATION
+#define SWP_MIGRATION_NUM 2
+#define SWP_MIGRATION_READ	(MAX_SWAPFILES + SWP_POISON_NUM + 1)
+#define SWP_MIGRATION_WRITE	(MAX_SWAPFILES + SWP_POISON_NUM + 2)
 #else
-/* Use last two entries for page migration swap entries */
-#define MAX_SWAPFILES		((1 << MAX_SWAPFILES_SHIFT)-2)
-#define SWP_MIGRATION_READ	MAX_SWAPFILES
-#define SWP_MIGRATION_WRITE	(MAX_SWAPFILES + 1)
+#define SWP_MIGRATION_NUM 0
 #endif
 
 /*
+ * Handling of poisoned pages with memory corruption.
+ */
+#ifdef CONFIG_MEMORY_FAILURE
+#define SWP_POISON_NUM 1
+#define SWP_POISON 		(MAX_SWAPFILES + 1)
+#else
+#define SWP_POISON_NUM 0
+#endif
+
+#define MAX_SWAPFILES \
+	((1 << MAX_SWAPFILES_SHIFT) - SWP_MIGRATION_NUM - SWP_POISON_NUM)
+
+/*
  * Magic header for a swap area. The first part of the union is
  * what the swap magic looks like for the old (limited to 128MB)
  * swap area format, the second part of the union adds - in the
Index: linux/include/linux/swapops.h
===================================================================
--- linux.orig/include/linux/swapops.h	2009-04-07 16:39:25.000000000 +0200
+++ linux/include/linux/swapops.h	2009-04-07 16:39:39.000000000 +0200
@@ -131,3 +131,41 @@
 
 #endif
 
+#ifdef CONFIG_MEMORY_FAILURE
+/*
+ * Support for poisoned pages
+ */
+static inline swp_entry_t make_poison_entry(struct page *page)
+{
+	BUG_ON(!PageLocked(page));
+	return swp_entry(SWP_POISON, page_to_pfn(page));
+}
+
+static inline int is_poison_entry(swp_entry_t entry)
+{
+	return swp_type(entry) == SWP_POISON;
+}
+#else
+
+static inline swp_entry_t make_poison_entry(struct page *page)
+{
+	return swp_entry(0, 0);
+}
+
+static inline int is_poison_entry(swp_entry_t swp)
+{
+	return 0;
+}
+#endif
+
+#if defined(CONFIG_MEMORY_FAILURE) || defined(CONFIG_MIGRATION)
+static inline int non_swap_entry(swp_entry_t entry)
+{
+	return swp_type(entry) > MAX_SWAPFILES;
+}
+#else
+static inline int non_swap_entry(swp_entry_t entry)
+{
+	return 0;
+}
+#endif
Index: linux/mm/swapfile.c
===================================================================
--- linux.orig/mm/swapfile.c	2009-04-07 16:39:25.000000000 +0200
+++ linux/mm/swapfile.c	2009-04-07 16:39:39.000000000 +0200
@@ -579,7 +579,7 @@
 	struct swap_info_struct *p;
 	struct page *page = NULL;
 
-	if (is_migration_entry(entry))
+	if (non_swap_entry(entry))
 		return 1;
 
 	p = swap_info_get(entry);
@@ -1949,7 +1949,7 @@
 	unsigned long offset, type;
 	int result = 0;
 
-	if (is_migration_entry(entry))
+	if (non_swap_entry(entry))
 		return 1;
 
 	type = swp_type(entry);



* [PATCH] [6/16] POISON: Add new SIGBUS error codes for poison signals
  2009-04-07 15:09 ` Andi Kleen
@ 2009-04-07 15:10   ` Andi Kleen
  -1 siblings, 0 replies; 150+ messages in thread
From: Andi Kleen @ 2009-04-07 15:10 UTC (permalink / raw)
  To: linux-abi, linux-kernel, linux-mm, x86


Add new SIGBUS codes for reporting machine checks as signals. When 
the hardware detects an uncorrected ECC error it can trigger these
signals.

This is needed to tell KVM's qemu about machine checks that happen to
guests, so that it can inject them, but it may also be useful for other
programs. I find it useful in my test programs.

This patch merely defines the new types.

- Define two new si_codes for SIGBUS: BUS_MCEERR_AO and BUS_MCEERR_AR.
* BUS_MCEERR_AO is for "Action Optional" machine checks, which means that some
corruption has been detected in the background, but nothing has been consumed
so far. The program can ignore those if it wants (though most programs would
already have been killed).
* BUS_MCEERR_AR is for "Action Required" machine checks. This happens
when corrupted data is consumed, or the application runs into an area
that was already known to be corrupted. These require immediate
action and cannot simply be returned from; most programs would kill themselves.
- Both report the address of the corruption in the user address space
in si_addr.
- Define a new si_addr_lsb field that reports the extent of the corruption
to user space. That is currently always a (small) page. The application
cannot tell where within that page the corruption happened.

AK: I plan to write a man page update before anyone asks.

Cc: linux-abi@vger.kernel.org

Signed-off-by: Andi Kleen <ak@linux.intel.com>

---
 include/asm-generic/siginfo.h |    8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

Index: linux/include/asm-generic/siginfo.h
===================================================================
--- linux.orig/include/asm-generic/siginfo.h	2009-04-07 16:39:24.000000000 +0200
+++ linux/include/asm-generic/siginfo.h	2009-04-07 16:39:39.000000000 +0200
@@ -82,6 +82,7 @@
 #ifdef __ARCH_SI_TRAPNO
 			int _trapno;	/* TRAP # which caused the signal */
 #endif
+			short _addr_lsb; /* LSB of the reported address */
 		} _sigfault;
 
 		/* SIGPOLL */
@@ -112,6 +113,7 @@
 #ifdef __ARCH_SI_TRAPNO
 #define si_trapno	_sifields._sigfault._trapno
 #endif
+#define si_addr_lsb	_sifields._sigfault._addr_lsb
 #define si_band		_sifields._sigpoll._band
 #define si_fd		_sifields._sigpoll._fd
 
@@ -192,7 +194,11 @@
 #define BUS_ADRALN	(__SI_FAULT|1)	/* invalid address alignment */
 #define BUS_ADRERR	(__SI_FAULT|2)	/* non-existant physical address */
 #define BUS_OBJERR	(__SI_FAULT|3)	/* object specific hardware error */
-#define NSIGBUS		3
+/* hardware memory error consumed on a machine check: action required */
+#define BUS_MCEERR_AR	(__SI_FAULT|4)
+/* hardware memory error detected in process but not consumed: action optional*/
+#define BUS_MCEERR_AO	(__SI_FAULT|5)
+#define NSIGBUS		5
 
 /*
  * SIGTRAP si_codes

^ permalink raw reply	[flat|nested] 150+ messages in thread

* [PATCH] [7/16] POISON: Add basic support for poisoned pages in fault handler
  2009-04-07 15:09 ` Andi Kleen
@ 2009-04-07 15:10   ` Andi Kleen
  -1 siblings, 0 replies; 150+ messages in thread
From: Andi Kleen @ 2009-04-07 15:10 UTC (permalink / raw)
  To: linux-kernel, linux-mm, x86


- Add a new VM_FAULT_POISON error code to handle_mm_fault. Right now
architectures have to explicitly enable poison page support, so
this is forward compatible with all architectures. They only need
to handle the new code when they enable poison page support.
- Add poison page handling to the swap-in fault code.

Signed-off-by: Andi Kleen <ak@linux.intel.com>

---
 include/linux/mm.h |    3 ++-
 mm/memory.c        |   17 ++++++++++++++---
 2 files changed, 16 insertions(+), 4 deletions(-)

Index: linux/mm/memory.c
===================================================================
--- linux.orig/mm/memory.c	2009-04-07 16:39:24.000000000 +0200
+++ linux/mm/memory.c	2009-04-07 16:43:06.000000000 +0200
@@ -1315,7 +1315,8 @@
 				if (ret & VM_FAULT_ERROR) {
 					if (ret & VM_FAULT_OOM)
 						return i ? i : -ENOMEM;
-					else if (ret & VM_FAULT_SIGBUS)
+					if (ret &
+					    (VM_FAULT_POISON|VM_FAULT_SIGBUS))
 						return i ? i : -EFAULT;
 					BUG();
 				}
@@ -2426,8 +2427,15 @@
 		goto out;
 
 	entry = pte_to_swp_entry(orig_pte);
-	if (is_migration_entry(entry)) {
-		migration_entry_wait(mm, pmd, address);
+	if (unlikely(non_swap_entry(entry))) {
+		if (is_migration_entry(entry)) {
+			migration_entry_wait(mm, pmd, address);
+		} else if (is_poison_entry(entry)) {
+			ret = VM_FAULT_POISON;
+		} else {
+			print_bad_pte(vma, address, pte, NULL);
+			ret = VM_FAULT_OOM;
+		}
 		goto out;
 	}
 	delayacct_set_flag(DELAYACCT_PF_SWAPIN);
@@ -2451,6 +2459,9 @@
 		/* Had to read the page from swap area: Major fault */
 		ret = VM_FAULT_MAJOR;
 		count_vm_event(PGMAJFAULT);
+	} else if (PagePoison(page)) {
+		ret = VM_FAULT_POISON;
+		goto out;
 	}
 
 	lock_page(page);
Index: linux/include/linux/mm.h
===================================================================
--- linux.orig/include/linux/mm.h	2009-04-07 16:39:24.000000000 +0200
+++ linux/include/linux/mm.h	2009-04-07 16:43:05.000000000 +0200
@@ -702,11 +702,12 @@
 #define VM_FAULT_SIGBUS	0x0002
 #define VM_FAULT_MAJOR	0x0004
 #define VM_FAULT_WRITE	0x0008	/* Special case for get_user_pages */
+#define VM_FAULT_POISON 0x0010	/* Hit poisoned page */
 
 #define VM_FAULT_NOPAGE	0x0100	/* ->fault installed the pte, not return page */
 #define VM_FAULT_LOCKED	0x0200	/* ->fault locked the returned page */
 
-#define VM_FAULT_ERROR	(VM_FAULT_OOM | VM_FAULT_SIGBUS)
+#define VM_FAULT_ERROR	(VM_FAULT_OOM | VM_FAULT_SIGBUS | VM_FAULT_POISON)
 
 /*
  * Can be called by the pagefault handler when it gets a VM_FAULT_OOM.

^ permalink raw reply	[flat|nested] 150+ messages in thread

* [PATCH] [8/16] POISON: Add various poison checks in mm/memory.c
  2009-04-07 15:09 ` Andi Kleen
@ 2009-04-07 15:10   ` Andi Kleen
  -1 siblings, 0 replies; 150+ messages in thread
From: Andi Kleen @ 2009-04-07 15:10 UTC (permalink / raw)
  To: linux-kernel, linux-mm, x86


Bail out early when poisoned pages are found in page fault handling.
Since they are poisoned they should not be mapped freshly
into processes.

This is generally handled in the same way as OOM, except that a different
error code is returned to the architecture code.

Signed-off-by: Andi Kleen <ak@linux.intel.com>

---
 mm/memory.c |    7 +++++++
 1 file changed, 7 insertions(+)

Index: linux/mm/memory.c
===================================================================
--- linux.orig/mm/memory.c	2009-04-07 16:39:39.000000000 +0200
+++ linux/mm/memory.c	2009-04-07 16:39:39.000000000 +0200
@@ -2560,6 +2560,10 @@
 		goto oom;
 	__SetPageUptodate(page);
 
+	/* Kludge for now until we take poisoned pages out of the free lists */
+	if (unlikely(PagePoison(page)))
+		return VM_FAULT_POISON;
+
 	if (mem_cgroup_newpage_charge(page, mm, GFP_KERNEL))
 		goto oom_free_page;
 
@@ -2625,6 +2629,9 @@
 	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE)))
 		return ret;
 
+	if (unlikely(PagePoison(vmf.page)))
+		return VM_FAULT_POISON;
+
 	/*
 	 * For consistency in subsequent calls, make the faulted page always
 	 * locked.

^ permalink raw reply	[flat|nested] 150+ messages in thread

* [PATCH] [9/16] POISON: x86: Add VM_FAULT_POISON handling to x86 page fault handler
  2009-04-07 15:09 ` Andi Kleen
@ 2009-04-07 15:10   ` Andi Kleen
  -1 siblings, 0 replies; 150+ messages in thread
From: Andi Kleen @ 2009-04-07 15:10 UTC (permalink / raw)
  To: linux-kernel, linux-mm, x86


Add VM_FAULT_POISON handling to the x86 page fault handler. This is
very similar to VM_FAULT_OOM handling; the only differences are that a
different si_code is passed to user space and that the new si_addr_lsb
field is initialized.

Signed-off-by: Andi Kleen <ak@linux.intel.com>

---
 arch/x86/mm/fault.c |   18 ++++++++++++++----
 1 file changed, 14 insertions(+), 4 deletions(-)

Index: linux/arch/x86/mm/fault.c
===================================================================
--- linux.orig/arch/x86/mm/fault.c	2009-04-07 16:39:23.000000000 +0200
+++ linux/arch/x86/mm/fault.c	2009-04-07 16:39:39.000000000 +0200
@@ -189,6 +189,7 @@
 	info.si_errno	= 0;
 	info.si_code	= si_code;
 	info.si_addr	= (void __user *)address;
+	info.si_addr_lsb = si_code == BUS_MCEERR_AR ? PAGE_SHIFT : 0;
 
 	force_sig_info(si_signo, &info, tsk);
 }
@@ -827,10 +828,12 @@
 }
 
 static void
-do_sigbus(struct pt_regs *regs, unsigned long error_code, unsigned long address)
+do_sigbus(struct pt_regs *regs, unsigned long error_code, unsigned long address,
+	  unsigned int fault)
 {
 	struct task_struct *tsk = current;
 	struct mm_struct *mm = tsk->mm;
+	int code = BUS_ADRERR;
 
 	up_read(&mm->mmap_sem);
 
@@ -846,7 +849,14 @@
 	tsk->thread.error_code	= error_code;
 	tsk->thread.trap_no	= 14;
 
-	force_sig_info_fault(SIGBUS, BUS_ADRERR, address, tsk);
+#ifdef CONFIG_MEMORY_FAILURE
+	if (fault & VM_FAULT_POISON) {
+		printk(KERN_ERR "MCE: Killing %s:%d due to hardware memory corruption\n",
+			tsk->comm, tsk->pid);
+		code = BUS_MCEERR_AR;
+	}
+#endif
+	force_sig_info_fault(SIGBUS, code, address, tsk);
 }
 
 static noinline void
@@ -856,8 +866,8 @@
 	if (fault & VM_FAULT_OOM) {
 		out_of_memory(regs, error_code, address);
 	} else {
-		if (fault & VM_FAULT_SIGBUS)
-			do_sigbus(regs, error_code, address);
+		if (fault & (VM_FAULT_SIGBUS|VM_FAULT_POISON))
+			do_sigbus(regs, error_code, address, fault);
 		else
 			BUG();
 	}

^ permalink raw reply	[flat|nested] 150+ messages in thread

* [PATCH] [10/16] POISON: Use bitmask/action code for try_to_unmap behaviour
  2009-04-07 15:09 ` Andi Kleen
@ 2009-04-07 15:10   ` Andi Kleen
  -1 siblings, 0 replies; 150+ messages in thread
From: Andi Kleen @ 2009-04-07 15:10 UTC (permalink / raw)
  To: Lee.Schermerhorn, npiggin, linux-kernel, linux-mm, x86


try_to_unmap currently has multiple modes (migration, munlock, normal unmap)
which are selected by magic flag variables. The logic is not very
straightforward, because each of these flags changes multiple behaviours
(e.g. migration turns off aging, not only sets up migration ptes, etc.),
and the different flags interact in magic ways.

A later patch in this series adds another mode to try_to_unmap, so
this quickly becomes unmanageable.

Replace the different flags with an action code (migration, munlock, munmap)
and some additional flags as modifiers (ignore mlock, ignore aging).
This makes the logic more straightforward and allows easier extension
to new behaviours. Change all the callers to declare what they want to
do.

This patch is intended to be a no-op in behaviour; any behavioural
change it introduces would be a bug.

Cc: Lee.Schermerhorn@hp.com
Cc: npiggin@suse.de

Signed-off-by: Andi Kleen <ak@linux.intel.com>

---
 include/linux/rmap.h |   14 +++++++++++++-
 mm/migrate.c         |    2 +-
 mm/rmap.c            |   40 ++++++++++++++++++++++------------------
 mm/vmscan.c          |    2 +-
 4 files changed, 37 insertions(+), 21 deletions(-)

Index: linux/include/linux/rmap.h
===================================================================
--- linux.orig/include/linux/rmap.h	2009-04-07 16:39:39.000000000 +0200
+++ linux/include/linux/rmap.h	2009-04-07 16:39:39.000000000 +0200
@@ -84,7 +84,19 @@
  * Called from mm/vmscan.c to handle paging out
  */
 int page_referenced(struct page *, int is_locked, struct mem_cgroup *cnt);
-int try_to_unmap(struct page *, int ignore_refs);
+
+enum ttu_flags {
+	TTU_UNMAP = 0,			/* unmap mode */
+	TTU_MIGRATION = 1,		/* migration mode */
+	TTU_MUNLOCK = 2,		/* munlock mode */
+	TTU_ACTION_MASK = 0xff,
+
+	TTU_IGNORE_MLOCK = (1 << 8),	/* ignore mlock */
+	TTU_IGNORE_ACCESS = (1 << 9),	/* don't age */
+};
+#define TTU_ACTION(x) ((x) & TTU_ACTION_MASK)
+
+int try_to_unmap(struct page *, enum ttu_flags flags);
 
 /*
  * Called from mm/filemap_xip.c to unmap empty zero page
Index: linux/mm/rmap.c
===================================================================
--- linux.orig/mm/rmap.c	2009-04-07 16:39:39.000000000 +0200
+++ linux/mm/rmap.c	2009-04-07 16:43:06.000000000 +0200
@@ -755,7 +755,7 @@
  * repeatedly from either try_to_unmap_anon or try_to_unmap_file.
  */
 static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
-				int migration)
+				enum ttu_flags flags)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	unsigned long address;
@@ -777,11 +777,13 @@
 	 * If it's recently referenced (perhaps page_referenced
 	 * skipped over this mm) then we should reactivate it.
 	 */
-	if (!migration) {
+	if (!(flags & TTU_IGNORE_MLOCK)) {
 		if (vma->vm_flags & VM_LOCKED) {
 			ret = SWAP_MLOCK;
 			goto out_unmap;
 		}
+	}
+	if (!(flags & TTU_IGNORE_ACCESS)) {
 		if (ptep_clear_flush_young_notify(vma, address, pte)) {
 			ret = SWAP_FAIL;
 			goto out_unmap;
@@ -821,12 +823,12 @@
 			 * pte. do_swap_page() will wait until the migration
 			 * pte is removed and then restart fault handling.
 			 */
-			BUG_ON(!migration);
+			BUG_ON(TTU_ACTION(flags) != TTU_MIGRATION);
 			entry = make_migration_entry(page, pte_write(pteval));
 		}
 		set_pte_at(mm, address, pte, swp_entry_to_pte(entry));
 		BUG_ON(pte_file(*pte));
-	} else if (PAGE_MIGRATION && migration) {
+	} else if (PAGE_MIGRATION && (TTU_ACTION(flags) == TTU_MIGRATION)) {
 		/* Establish migration entry for a file page */
 		swp_entry_t entry;
 		entry = make_migration_entry(page, pte_write(pteval));
@@ -995,12 +997,13 @@
  * vm_flags for that VMA.  That should be OK, because that vma shouldn't be
  * 'LOCKED.
  */
-static int try_to_unmap_anon(struct page *page, int unlock, int migration)
+static int try_to_unmap_anon(struct page *page, enum ttu_flags flags)
 {
 	struct anon_vma *anon_vma;
 	struct vm_area_struct *vma;
 	unsigned int mlocked = 0;
 	int ret = SWAP_AGAIN;
+	int unlock = TTU_ACTION(flags) == TTU_MUNLOCK;
 
 	if (MLOCK_PAGES && unlikely(unlock))
 		ret = SWAP_SUCCESS;	/* default for try_to_munlock() */
@@ -1016,7 +1019,7 @@
 				continue;  /* must visit all unlocked vmas */
 			ret = SWAP_MLOCK;  /* saw at least one mlocked vma */
 		} else {
-			ret = try_to_unmap_one(page, vma, migration);
+			ret = try_to_unmap_one(page, vma, flags);
 			if (ret == SWAP_FAIL || !page_mapped(page))
 				break;
 		}
@@ -1040,8 +1043,7 @@
 /**
  * try_to_unmap_file - unmap/unlock file page using the object-based rmap method
  * @page: the page to unmap/unlock
- * @unlock:  request for unlock rather than unmap [unlikely]
- * @migration:  unmapping for migration - ignored if @unlock
+ * @flags: action and flags
  *
  * Find all the mappings of a page using the mapping pointer and the vma chains
  * contained in the address_space struct it points to.
@@ -1053,7 +1055,7 @@
  * vm_flags for that VMA.  That should be OK, because that vma shouldn't be
  * 'LOCKED.
  */
-static int try_to_unmap_file(struct page *page, int unlock, int migration)
+static int try_to_unmap_file(struct page *page, enum ttu_flags flags)
 {
 	struct address_space *mapping = page->mapping;
 	pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
@@ -1065,6 +1067,7 @@
 	unsigned long max_nl_size = 0;
 	unsigned int mapcount;
 	unsigned int mlocked = 0;
+	int unlock = TTU_ACTION(flags) == TTU_MUNLOCK;
 
 	if (MLOCK_PAGES && unlikely(unlock))
 		ret = SWAP_SUCCESS;	/* default for try_to_munlock() */
@@ -1077,7 +1080,7 @@
 				continue;	/* must visit all vmas */
 			ret = SWAP_MLOCK;
 		} else {
-			ret = try_to_unmap_one(page, vma, migration);
+			ret = try_to_unmap_one(page, vma, flags);
 			if (ret == SWAP_FAIL || !page_mapped(page))
 				goto out;
 		}
@@ -1102,7 +1105,8 @@
 			ret = SWAP_MLOCK;	/* leave mlocked == 0 */
 			goto out;		/* no need to look further */
 		}
-		if (!MLOCK_PAGES && !migration && (vma->vm_flags & VM_LOCKED))
+		if (!MLOCK_PAGES && !(flags & TTU_IGNORE_MLOCK) &&
+			(vma->vm_flags & VM_LOCKED))
 			continue;
 		cursor = (unsigned long) vma->vm_private_data;
 		if (cursor > max_nl_cursor)
@@ -1136,7 +1140,7 @@
 	do {
 		list_for_each_entry(vma, &mapping->i_mmap_nonlinear,
 						shared.vm_set.list) {
-			if (!MLOCK_PAGES && !migration &&
+			if (!MLOCK_PAGES && !(flags & TTU_IGNORE_MLOCK) &&
 			    (vma->vm_flags & VM_LOCKED))
 				continue;
 			cursor = (unsigned long) vma->vm_private_data;
@@ -1176,7 +1180,7 @@
 /**
  * try_to_unmap - try to remove all page table mappings to a page
  * @page: the page to get unmapped
- * @migration: migration flag
+ * @flags: action and flags
  *
  * Tries to remove all the page table entries which are mapping this
  * page, used in the pageout path.  Caller must hold the page lock.
@@ -1187,16 +1191,16 @@
  * SWAP_FAIL	- the page is unswappable
  * SWAP_MLOCK	- page is mlocked.
  */
-int try_to_unmap(struct page *page, int migration)
+int try_to_unmap(struct page *page, enum ttu_flags flags)
 {
 	int ret;
 
 	BUG_ON(!PageLocked(page));
 
 	if (PageAnon(page))
-		ret = try_to_unmap_anon(page, 0, migration);
+		ret = try_to_unmap_anon(page, flags);
 	else
-		ret = try_to_unmap_file(page, 0, migration);
+		ret = try_to_unmap_file(page, flags);
 	if (ret != SWAP_MLOCK && !page_mapped(page))
 		ret = SWAP_SUCCESS;
 	return ret;
@@ -1222,8 +1226,8 @@
 	VM_BUG_ON(!PageLocked(page) || PageLRU(page));
 
 	if (PageAnon(page))
-		return try_to_unmap_anon(page, 1, 0);
+		return try_to_unmap_anon(page, TTU_MUNLOCK);
 	else
-		return try_to_unmap_file(page, 1, 0);
+		return try_to_unmap_file(page, TTU_MUNLOCK);
 }
 #endif
Index: linux/mm/vmscan.c
===================================================================
--- linux.orig/mm/vmscan.c	2009-04-07 16:39:23.000000000 +0200
+++ linux/mm/vmscan.c	2009-04-07 16:39:39.000000000 +0200
@@ -663,7 +663,7 @@
 		 * processes. Try to unmap it here.
 		 */
 		if (page_mapped(page) && mapping) {
-			switch (try_to_unmap(page, 0)) {
+			switch (try_to_unmap(page, TTU_UNMAP)) {
 			case SWAP_FAIL:
 				goto activate_locked;
 			case SWAP_AGAIN:
Index: linux/mm/migrate.c
===================================================================
--- linux.orig/mm/migrate.c	2009-04-07 16:39:23.000000000 +0200
+++ linux/mm/migrate.c	2009-04-07 16:39:39.000000000 +0200
@@ -669,7 +669,7 @@
 	}
 
 	/* Establish migration ptes or remove ptes */
-	try_to_unmap(page, 1);
+	try_to_unmap(page, TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS);
 
 	if (!page_mapped(page))
 		rc = move_to_new_page(newpage, page);

^ permalink raw reply	[flat|nested] 150+ messages in thread

  * contained in the address_space struct it points to.
@@ -1053,7 +1055,7 @@
  * vm_flags for that VMA.  That should be OK, because that vma shouldn't be
  * 'LOCKED.
  */
-static int try_to_unmap_file(struct page *page, int unlock, int migration)
+static int try_to_unmap_file(struct page *page, enum ttu_flags flags)
 {
 	struct address_space *mapping = page->mapping;
 	pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
@@ -1065,6 +1067,7 @@
 	unsigned long max_nl_size = 0;
 	unsigned int mapcount;
 	unsigned int mlocked = 0;
+	int unlock = TTU_ACTION(flags) == TTU_MUNLOCK;
 
 	if (MLOCK_PAGES && unlikely(unlock))
 		ret = SWAP_SUCCESS;	/* default for try_to_munlock() */
@@ -1077,7 +1080,7 @@
 				continue;	/* must visit all vmas */
 			ret = SWAP_MLOCK;
 		} else {
-			ret = try_to_unmap_one(page, vma, migration);
+			ret = try_to_unmap_one(page, vma, flags);
 			if (ret == SWAP_FAIL || !page_mapped(page))
 				goto out;
 		}
@@ -1102,7 +1105,8 @@
 			ret = SWAP_MLOCK;	/* leave mlocked == 0 */
 			goto out;		/* no need to look further */
 		}
-		if (!MLOCK_PAGES && !migration && (vma->vm_flags & VM_LOCKED))
+		if (!MLOCK_PAGES && !(flags & TTU_IGNORE_MLOCK) &&
+			(vma->vm_flags & VM_LOCKED))
 			continue;
 		cursor = (unsigned long) vma->vm_private_data;
 		if (cursor > max_nl_cursor)
@@ -1136,7 +1140,7 @@
 	do {
 		list_for_each_entry(vma, &mapping->i_mmap_nonlinear,
 						shared.vm_set.list) {
-			if (!MLOCK_PAGES && !migration &&
+			if (!MLOCK_PAGES && !(flags & TTU_IGNORE_MLOCK) &&
 			    (vma->vm_flags & VM_LOCKED))
 				continue;
 			cursor = (unsigned long) vma->vm_private_data;
@@ -1176,7 +1180,7 @@
 /**
  * try_to_unmap - try to remove all page table mappings to a page
  * @page: the page to get unmapped
- * @migration: migration flag
+ * @flags: action and flags
  *
  * Tries to remove all the page table entries which are mapping this
  * page, used in the pageout path.  Caller must hold the page lock.
@@ -1187,16 +1191,16 @@
  * SWAP_FAIL	- the page is unswappable
  * SWAP_MLOCK	- page is mlocked.
  */
-int try_to_unmap(struct page *page, int migration)
+int try_to_unmap(struct page *page, enum ttu_flags flags)
 {
 	int ret;
 
 	BUG_ON(!PageLocked(page));
 
 	if (PageAnon(page))
-		ret = try_to_unmap_anon(page, 0, migration);
+		ret = try_to_unmap_anon(page, flags);
 	else
-		ret = try_to_unmap_file(page, 0, migration);
+		ret = try_to_unmap_file(page, flags);
 	if (ret != SWAP_MLOCK && !page_mapped(page))
 		ret = SWAP_SUCCESS;
 	return ret;
@@ -1222,8 +1226,8 @@
 	VM_BUG_ON(!PageLocked(page) || PageLRU(page));
 
 	if (PageAnon(page))
-		return try_to_unmap_anon(page, 1, 0);
+		return try_to_unmap_anon(page, TTU_MUNLOCK);
 	else
-		return try_to_unmap_file(page, 1, 0);
+		return try_to_unmap_file(page, TTU_MUNLOCK);
 }
 #endif
Index: linux/mm/vmscan.c
===================================================================
--- linux.orig/mm/vmscan.c	2009-04-07 16:39:23.000000000 +0200
+++ linux/mm/vmscan.c	2009-04-07 16:39:39.000000000 +0200
@@ -663,7 +663,7 @@
 		 * processes. Try to unmap it here.
 		 */
 		if (page_mapped(page) && mapping) {
-			switch (try_to_unmap(page, 0)) {
+			switch (try_to_unmap(page, TTU_UNMAP)) {
 			case SWAP_FAIL:
 				goto activate_locked;
 			case SWAP_AGAIN:
Index: linux/mm/migrate.c
===================================================================
--- linux.orig/mm/migrate.c	2009-04-07 16:39:23.000000000 +0200
+++ linux/mm/migrate.c	2009-04-07 16:39:39.000000000 +0200
@@ -669,7 +669,7 @@
 	}
 
 	/* Establish migration ptes or remove ptes */
-	try_to_unmap(page, 1);
+	try_to_unmap(page, TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS);
 
 	if (!page_mapped(page))
 		rc = move_to_new_page(newpage, page);

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <dont@kvack.org>


* [PATCH] [11/16] POISON: Handle poisoned pages in try_to_unmap
  2009-04-07 15:09 ` Andi Kleen
@ 2009-04-07 15:10   ` Andi Kleen
  -1 siblings, 0 replies; 150+ messages in thread
From: Andi Kleen @ 2009-04-07 15:10 UTC (permalink / raw)
  To: Lee.Schermerhorn, linux-kernel, linux-mm, x86


When a page has the poison bit set, replace the PTE with a poison entry.
This causes the correct error handling to be done later when a process
faults on it.

Cc: Lee.Schermerhorn@hp.com
Signed-off-by: Andi Kleen <ak@linux.intel.com>

---
 mm/rmap.c |    9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

Index: linux/mm/rmap.c
===================================================================
--- linux.orig/mm/rmap.c	2009-04-07 16:39:39.000000000 +0200
+++ linux/mm/rmap.c	2009-04-07 16:39:39.000000000 +0200
@@ -801,7 +801,14 @@
 	/* Update high watermark before we lower rss */
 	update_hiwater_rss(mm);
 
-	if (PageAnon(page)) {
+	if (PagePoison(page)) {
+		if (PageAnon(page))
+			dec_mm_counter(mm, anon_rss);
+		else if (!is_migration_entry(pte_to_swp_entry(*pte)))
+			dec_mm_counter(mm, file_rss);
+		set_pte_at(mm, address, pte,
+				swp_entry_to_pte(make_poison_entry(page)));
+	} else if (PageAnon(page)) {
 		swp_entry_t entry = { .val = page_private(page) };
 
 		if (PageSwapCache(page)) {

* [PATCH] [12/16] POISON: Handle poisoned pages in set_page_dirty()
  2009-04-07 15:09 ` Andi Kleen
@ 2009-04-07 15:10   ` Andi Kleen
  -1 siblings, 0 replies; 150+ messages in thread
From: Andi Kleen @ 2009-04-07 15:10 UTC (permalink / raw)
  To: linux-kernel, linux-mm, x86


Bail out early in set_page_dirty() for poisoned pages. We don't want any
dirty accounting done or file system writeback started, because the
page will just be thrown away.

Signed-off-by: Andi Kleen <ak@linux.intel.com>

---
 mm/page-writeback.c |    4 ++++
 1 file changed, 4 insertions(+)

Index: linux/mm/page-writeback.c
===================================================================
--- linux.orig/mm/page-writeback.c	2009-04-07 16:39:22.000000000 +0200
+++ linux/mm/page-writeback.c	2009-04-07 16:39:39.000000000 +0200
@@ -1277,6 +1277,10 @@
 {
 	struct address_space *mapping = page_mapping(page);
 
+	if (unlikely(PagePoison(page))) {
+		SetPageDirty(page);
+		return 0;
+	}
 	if (likely(mapping)) {
 		int (*spd)(struct page *) = mapping->a_ops->set_page_dirty;
 #ifdef CONFIG_BLOCK

* [PATCH] [13/16] POISON: The high level memory error handler in the VM
  2009-04-07 15:09 ` Andi Kleen
@ 2009-04-07 15:10   ` Andi Kleen
  -1 siblings, 0 replies; 150+ messages in thread
From: Andi Kleen @ 2009-04-07 15:10 UTC (permalink / raw)
  To: hugh, npiggin, riel, lee.schermerhorn, akpm, linux-kernel, linux-mm, x86


This patch adds the high-level memory error handler that poisons pages.
It is portable code and lives in mm/memory-failure.c.

To quote the overview comment:

 * High level machine check handler. Handles pages reported by the
 * hardware as being corrupted usually due to a 2bit ECC memory or cache
 * failure.
 *
 * This focusses on pages detected as corrupted in the background.
 * When the current CPU tries to consume corruption the currently
 * running process can just be killed directly instead. This implies
 * that if the error cannot be handled for some reason it's safe to
 * just ignore it because no corruption has been consumed yet. Instead
 * when that happens another machine check will happen.
 *
 * Handles page cache pages in various states. The tricky part
 * here is that we can access any page asynchronous to other VM
 * users, because memory failures could happen anytime and anywhere,
 * possibly violating some of their assumptions. This is why this code
 * has to be extremely careful. Generally it tries to use normal locking
 * rules, as in get the standard locks, even if that means the
 * error handling takes potentially a long time.
 *
 * Some of the operations here are somewhat inefficient and have non
 * linear algorithmic complexity, because the data structures have not
 * been optimized for this case. This is in particular the case
 * for the mapping from a vma to a process. Since this case is expected
 * to be rare we hope we can get away with this.

There are in principle two strategies to kill processes on poison:
- just unmap the data and wait for an actual reference before killing
- kill as soon as corruption is detected.
Both have advantages and disadvantages and should be used in
different situations. Right now both are implemented and can be
switched with a new sysctl, vm.memory_failure_early_kill. The
default is early kill.
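With the patch applied and CONFIG_MEMORY_FAILURE set, the policy would be switched like any other VM sysctl; a sketch of a sysctl.conf fragment:

```
# /etc/sysctl.conf
# 0 = late kill: only unmap on error, kill on actual access
# 1 = early kill (default): kill mappers as soon as corruption is found
vm.memory_failure_early_kill = 0
```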

The patch does some rmap data structure walking on its own to collect
processes to kill. This is unusual because normally all rmap data
structure knowledge is in rmap.c only. I put it here for now to keep
everything together, and rmap knowledge has been seeping out anyway.

This isn't complete yet. The biggest gap is the missing hugepage
handling, along with a few other corner cases. The code is also not
yet able in all cases to get rid of all references.

This is rather tricky code and needs a lot of review. Undoubtedly it still
has bugs.

Cc: hugh@veritas.com
Cc: npiggin@suse.de
Cc: riel@redhat.com
Cc: lee.schermerhorn@hp.com
Cc: akpm@linux-foundation.org
Signed-off-by: Andi Kleen <ak@linux.intel.com>

---
 fs/proc/meminfo.c   |    9 
 include/linux/mm.h  |    4 
 kernel/sysctl.c     |   14 +
 mm/Kconfig          |    3 
 mm/Makefile         |    1 
 mm/memory-failure.c |  575 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 6 files changed, 605 insertions(+), 1 deletion(-)

Index: linux/mm/Makefile
===================================================================
--- linux.orig/mm/Makefile	2009-04-07 16:39:21.000000000 +0200
+++ linux/mm/Makefile	2009-04-07 16:39:39.000000000 +0200
@@ -38,3 +38,4 @@
 endif
 obj-$(CONFIG_QUICKLIST) += quicklist.o
 obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o page_cgroup.o
+obj-$(CONFIG_MEMORY_FAILURE) += memory-failure.o
Index: linux/mm/memory-failure.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux/mm/memory-failure.c	2009-04-07 16:39:39.000000000 +0200
@@ -0,0 +1,575 @@
+/*
+ * Copyright (C) 2008, 2009 Intel Corporation
+ * Author: Andi Kleen
+ *
+ * This software may be redistributed and/or modified under the terms of
+ * the GNU General Public License ("GPL") version 2 only as published by the
+ * Free Software Foundation.
+ *
+ * High level machine check handler. Handles pages reported by the
+ * hardware as being corrupted usually due to a 2bit ECC memory or cache
+ * failure.
+ *
+ * This focuses on pages detected as corrupted in the background.
+ * When the current CPU tries to consume corruption the currently
+ * running process can just be killed directly instead. This implies
+ * that if the error cannot be handled for some reason it's safe to
+ * just ignore it because no corruption has been consumed yet. Instead
+ * when that happens another machine check will happen.
+ *
+ * Handles page cache pages in various states.	The tricky part
+ * here is that we can access any page asynchronous to other VM
+ * users, because memory failures could happen anytime and anywhere,
+ * possibly violating some of their assumptions. This is why this code
+ * has to be extremely careful. Generally it tries to use normal locking
+ * rules, as in get the standard locks, even if that means the
+ * error handling takes potentially a long time.
+ *
+ * Some of the operations here are somewhat inefficient and have non
+ * linear algorithmic complexity, because the data structures have not
+ * been optimized for this case. This is in particular the case
+ * for the mapping from a VMA to a process. Since this case is expected
+ * to be rare we hope we can get away with this.
+ */
+
+/*
+ * Notebook:
+ * - hugetlb needs more code
+ * - nonlinear
+ * - remap races
+ * - anonymous (tinject):
+ *   + left over references when process catches signal?
+ * - error reporting on EIO missing (tinject)
+ * - kcore/oldmem/vmcore/mem/kmem check for poison pages
+ * - pass bad pages to kdump next kernel
+ */
+#include <linux/kernel.h>
+#include <linux/mm.h>
+#include <linux/page-flags.h>
+#include <linux/sched.h>
+#include <linux/rmap.h>
+#include <linux/pagemap.h>
+#include <linux/swap.h>
+#include "internal.h"
+
+#define Dprintk(x...) printk(x)
+
+int sysctl_memory_failure_early_kill __read_mostly = 1;
+
+atomic_long_t mce_bad_pages;
+
+/*
+ * Send all the processes who have the page mapped an ``action optional''
+ * signal.
+ */
+static int kill_proc_ao(struct task_struct *t, unsigned long addr, int trapno)
+{
+	struct siginfo si;
+	int ret;
+
+	printk(KERN_ERR
+		"MCE: Killing %s:%d due to hardware memory corruption\n",
+		t->comm, t->pid);
+	si.si_signo = SIGBUS;
+	si.si_errno = 0;
+	si.si_code = BUS_MCEERR_AO;
+	si.si_addr = (void *)addr;
+#ifdef __ARCH_SI_TRAPNO
+	si.si_trapno = trapno;
+#endif
+	si.si_addr_lsb = PAGE_SHIFT;
+	ret = force_sig_info(SIGBUS, &si, t);  /* synchronous? */
+	if (ret < 0)
+		printk(KERN_INFO "MCE: Error sending signal to %s:%d: %d\n",
+		       t->comm, t->pid, ret);
+	return ret;
+}
+
+/*
+ * Kill all processes that have a poisoned page mapped and then isolate
+ * the page.
+ *
+ * General strategy:
+ * Find all processes having the page mapped and kill them.
+ * But we keep a page reference around so that the page is not
+ * actually freed yet.
+ * Then stash the page away
+ *
+ * There's no convenient way to get back to mapped processes
+ * from the VMAs. So do a brute-force search over all
+ * running processes.
+ *
+ * Remember that machine checks are not common (or rather
+ * if they are common you have other problems), so this shouldn't
+ * be a performance issue.
+ *
+ * Also there are some races possible while we get from the
+ * error detection to actually handle it.
+ */
+
+struct to_kill {
+	struct list_head nd;
+	struct task_struct *tsk;
+	unsigned long addr;
+};
+
+/*
+ * Failure handling: if we can't find or can't kill a process there's
+ * not much we can do.  We just print a message and ignore otherwise.
+ */
+
+/*
+ * Schedule a process for later kill.
+ * Uses GFP_ATOMIC allocations to avoid potential recursions in the VM.
+ * TBD would GFP_NOIO be enough?
+ */
+static void add_to_kill(struct task_struct *tsk, struct page *p,
+		       struct vm_area_struct *vma,
+		       struct list_head *to_kill,
+		       struct to_kill **tkc)
+{
+	int fail = 0;
+	struct to_kill *tk;
+
+	if (*tkc) {
+		tk = *tkc;
+		*tkc = NULL;
+	} else {
+		tk = kmalloc(sizeof(struct to_kill), GFP_ATOMIC);
+		if (!tk) {
+			printk(KERN_ERR "MCE: Out of memory while machine check handling\n");
+			return;
+		}
+	}
+	tk->addr = page_address_in_vma(p, vma);
+	if (tk->addr == -EFAULT) {
+		printk(KERN_INFO "MCE: Failed to get address in VMA\n");
+		tk->addr = 0;
+		fail = 1;
+	}
+	get_task_struct(tsk);
+	tk->tsk = tsk;
+	list_add_tail(&tk->nd, to_kill);
+}
+
+/*
+ * Kill the processes that have been collected earlier.
+ */
+static void
+kill_procs_ao(struct list_head *to_kill, int doit, int trapno, int fail)
+{
+	struct to_kill *tk, *next;
+
+	list_for_each_entry_safe (tk, next, to_kill, nd) {
+		if (doit) {
+			/*
+			 * In case something went wrong with munmaping
+			 * make sure the process doesn't catch the
+			 * signal and then access the memory. So reset
+			 * the signal handlers
+			 */
+			if (fail)
+				flush_signal_handlers(tk->tsk, 1);
+
+			/*
+			 * In theory the process could have mapped
+			 * something else on the address in-between. We could
+			 * check for that, but we need to tell the
+			 * process anyways.
+			 */
+			if (kill_proc_ao(tk->tsk, tk->addr, trapno) < 0)
+				printk(KERN_ERR
+		"MCE: Cannot send advisory machine check signal to %s:%d\n",
+						 tk->tsk->comm, tk->tsk->pid);
+		}
+		put_task_struct(tk->tsk);
+		kfree(tk);
+	}
+}
+
+/*
+ * Collect processes when the error hit an anonymous page.
+ */
+static void collect_procs_anon(struct page *page, struct list_head *to_kill,
+			      struct to_kill **tkc)
+{
+	struct vm_area_struct *vma;
+	struct task_struct *tsk;
+	struct anon_vma *av = page_lock_anon_vma(page);
+
+	if (av == NULL)	/* Not actually mapped anymore */
+		goto out;
+
+	read_lock(&tasklist_lock);
+	for_each_process (tsk) {
+		if (!tsk->mm)
+			continue;
+		list_for_each_entry (vma, &av->head, anon_vma_node) {
+			if (vma->vm_mm == tsk->mm)
+				add_to_kill(tsk, page, vma, to_kill, tkc);
+		}
+	}
+	read_unlock(&tasklist_lock);
+out:
+	page_unlock_anon_vma(av);
+}
+
+/*
+ * Collect processes when the error hit a file mapped page.
+ */
+static void collect_procs_file(struct page *page, struct list_head *to_kill,
+			      struct to_kill **tkc)
+{
+	struct vm_area_struct *vma;
+	struct task_struct *tsk;
+	struct prio_tree_iter iter;
+	struct address_space *mapping = page_mapping(page);
+
+	read_lock(&tasklist_lock);
+	spin_lock(&mapping->i_mmap_lock);
+	for_each_process(tsk) {
+		pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
+
+		if (!tsk->mm)
+			continue;
+
+		vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff,
+				      pgoff)
+			if (vma->vm_mm == tsk->mm)
+				add_to_kill(tsk, page, vma, to_kill, tkc);
+	}
+	spin_unlock(&mapping->i_mmap_lock);
+	read_unlock(&tasklist_lock);
+}
+
+/*
+ * Collect the processes who have the corrupted page mapped to kill.
+ * This is done in two steps for locking reasons.
+ * First preallocate one tokill structure outside the spin locks,
+ * so that we can kill at least one process reasonably reliably.
+ */
+static void collect_procs(struct page *page, struct list_head *tokill)
+{
+	struct to_kill *tk;
+
+	tk = kmalloc(sizeof(struct to_kill), GFP_KERNEL);
+	/* memory allocation failure is implicitly handled */
+	if (PageAnon(page))
+		collect_procs_anon(page, tokill, &tk);
+	else
+		collect_procs_file(page, tokill, &tk);
+	kfree(tk);
+}
+
+/*
+ * Error handlers for various types of pages.
+ */
+
+enum outcome {
+	FAILED,
+	DELAYED,
+	IGNORED,
+	RECOVERED,
+};
+
+static const char *action_name[] = {
+	[FAILED] = "Failed",
+	[DELAYED] = "Delayed",
+	[IGNORED] = "Ignored",
+	[RECOVERED] = "Recovered",
+};
+
+/*
+ * Error hit kernel page.
+ * Do nothing, try to be lucky and not touch this instead. For a few cases we
+ * could be more sophisticated.
+ */
+static int me_kernel(struct page *p)
+{
+	return DELAYED;
+}
+
+/*
+ * Already poisoned page.
+ */
+static int me_ignore(struct page *p)
+{
+	return IGNORED;
+}
+
+/*
+ * Page in unknown state. Do nothing.
+ */
+static int me_unknown(struct page *p)
+{
+	printk(KERN_ERR "MCE: Unknown state page %lx flags %lx, count %d\n",
+	       page_to_pfn(p), p->flags, page_count(p));
+	return FAILED;
+}
+
+/*
+ * Free memory
+ */
+static int me_free(struct page *p)
+{
+	/* TBD Should delete page from buddy here. */
+	return IGNORED;
+}
+
+/*
+ * Clean (or cleaned) page cache page.
+ */
+static int me_pagecache_clean(struct page *p)
+{
+	struct address_space *mapping;
+
+	if (PagePrivate(p))
+		do_invalidatepage(p, 0);
+	mapping = page_mapping(p);
+	if (mapping) {
+		if (!remove_mapping(mapping, p))
+			return FAILED;
+	}
+	return RECOVERED;
+}
+
+/*
+ * Dirty page cache page.
+ * Issues: when the error hit a hole page the error is not properly
+ * propagated.
+ */
+static int me_pagecache_dirty(struct page *p)
+{
+	struct address_space *mapping = page_mapping(p);
+
+	SetPageError(p);
+	/* TBD: print more information about the file. */
+	printk(KERN_ERR "MCE: Hardware memory corruption on dirty file page: write error\n");
+	if (mapping) {
+		/* CHECKME: does that report the error in all cases? */
+		mapping_set_error(mapping, EIO);
+	}
+	if (PagePrivate(p)) {
+		if (try_to_release_page(p, GFP_KERNEL)) {
+			/*
+			 * Normally this should not happen because we
+			 * have the lock.  What should we do
+			 * here. wait on the page? (TBD)
+			 */
+			printk(KERN_ERR
+			       "MCE: Trying to release dirty page failed\n");
+			return FAILED;
+		}
+	} else if (mapping) {
+		cancel_dirty_page(p, PAGE_CACHE_SIZE);
+	}
+	return me_pagecache_clean(p);
+}
+
+/*
+ * Dirty swap cache.
+ * Cannot map back to the process because the rmaps are gone. Instead we rely
+ * on any subsequent re-fault to run into the Poison bit. This is not optimal.
+ */
+static int me_swapcache_dirty(struct page *p)
+{
+	delete_from_swap_cache(p);
+	return DELAYED;
+}
+
+/*
+ * Clean swap cache.
+ */
+static int me_swapcache_clean(struct page *p)
+{
+	delete_from_swap_cache(p);
+	return RECOVERED;
+}
+
+/*
+ * Huge pages. Needs work.
+ * Issues:
+ * No rmap support so we cannot find the original mapper. In theory could walk
+ * all MMs and look for the mappings, but that would be non atomic and racy.
+ * Need rmap for hugepages for this. Alternatively we could employ a heuristic,
+ * like just walking the current process and hoping it has it mapped (that
+ * should be usually true for the common "shared database cache" case)
+ * Should handle free huge pages and dequeue them too, but this needs to
+ * handle huge page accounting correctly.
+ */
+static int me_huge_page(struct page *p)
+{
+	return FAILED;
+}
+
+/*
+ * Various page states we can handle.
+ *
+ * This is quite tricky because we can access a page at any time
+ * in its life cycle.
+ *
+ * This is not complete. More states could be added.
+ */
+static struct page_state {
+	unsigned long mask;
+	unsigned long res;
+	char *msg;
+	int (*action)(struct page *p);
+} error_states[] = {
+#define F(x) (1UL << PG_ ## x)
+	{ F(reserved), F(reserved), "reserved kernel", me_ignore },
+	{ F(buddy), F(buddy), "free kernel", me_free },
+	/*
+	 * Could in theory check if slab page is free or if we can drop
+	 * currently unused objects without touching them. But just
+	 * treat it as standard kernel for now.
+	 */
+	{ F(slab), F(slab), "kernel slab", me_kernel },
+#ifdef CONFIG_PAGEFLAGS_EXTENDED
+	{ F(head), F(head), "hugetlb", me_huge_page },
+	{ F(tail), F(tail), "hugetlb", me_huge_page },
+#else
+	{ F(compound), F(compound), "hugetlb", me_huge_page },
+#endif
+	{ F(swapcache)|F(dirty), F(swapcache)|F(dirty), "dirty swapcache",
+	  me_swapcache_dirty },
+	{ F(swapcache)|F(dirty), F(swapcache), "clean swapcache",
+	  me_swapcache_clean },
+#ifdef CONFIG_UNEVICTABLE_LRU
+	{ F(unevictable)|F(dirty), F(unevictable)|F(dirty),
+	  "unevictable dirty page cache", me_pagecache_dirty },
+	{ F(unevictable), F(unevictable), "unevictable page cache",
+	  me_pagecache_clean },
+#endif
+#ifdef CONFIG_HAVE_MLOCKED_PAGE_BIT
+	{ F(mlocked)|F(dirty), F(mlocked)|F(dirty), "mlocked dirty page cache",
+	  me_pagecache_dirty },
+	{ F(mlocked), F(mlocked), "mlocked page cache", me_pagecache_clean },
+#endif
+	{ F(lru)|F(dirty), F(lru)|F(dirty), "dirty lru", me_pagecache_dirty },
+	{ F(lru)|F(dirty), F(lru), "clean lru", me_pagecache_clean },
+	{ F(swapbacked), F(swapbacked), "anonymous", me_pagecache_clean },
+	/*
+	 * More states could be added here.
+	 */
+	{ 0, 0, "unknown page state", me_unknown },  /* must be at end */
+#undef F
+};
+
+static void page_action(char *msg, struct page *p, int (*action)(struct page *),
+			unsigned long pfn)
+{
+	int ret;
+
+	printk(KERN_ERR
+	       "MCE: Starting recovery on %s page %lx corrupted by hardware\n",
+	       msg, pfn);
+	ret = action(p);
+	printk(KERN_ERR "MCE: Recovery of %s page %lx: %s\n",
+	       msg, pfn, action_name[ret]);
+	if (page_count(p) != 1)
+		printk(KERN_ERR
+       "MCE: Page %lx (flags %lx) still referenced by %d users after recovery\n",
+		       pfn, p->flags, page_count(p));
+
+	/* Could do more checks here if page looks ok */
+	atomic_long_add(1, &mce_bad_pages);
+
+	/*
+	 * Could adjust zone counters here to correct for the missing page.
+	 */
+}
+
+#define N_UNMAP_TRIES 5
+
+static int poison_page_prepare(struct page *p, unsigned long pfn, int trapno)
+{
+	if (PagePoison(p)) {
+		printk(KERN_ERR
+		       "MCE: Error for already poisoned page at %lx\n", pfn);
+		return -1;
+	}
+	SetPagePoison(p);
+
+	if (!PageReserved(p) && !PageSlab(p) && page_mapped(p)) {
+		LIST_HEAD(tokill);
+		int ret;
+		int i;
+
+		/*
+		 * First collect all the processes that have the page
+		 * mapped.  This has to be done before try_to_unmap,
+		 * because ttu takes the rmap data structures down.
+		 *
+		 * Error handling: We ignore errors here because
+		 * there's nothing that can be done.
+		 *
+		 * RED-PEN some cases in process exit seem to deadlock
+		 * on the page lock. drop it or add poison checks?
+		 */
+		if (sysctl_memory_failure_early_kill)
+			collect_procs(p, &tokill);
+
+		/*
+		 * try_to_unmap can fail temporarily due to races.
+		 * Try a few times (RED-PEN better strategy?)
+		 */
+		for (i = 0; i < N_UNMAP_TRIES; i++) {
+			ret = try_to_unmap(p, TTU_UNMAP|
+					   TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS);
+			if (ret == SWAP_SUCCESS)
+				break;
+			Dprintk("MCE: try_to_unmap retry needed %d\n", ret);
+		}
+
+		/*
* [PATCH] [13/16] POISON: The high level memory error handler in the VM
@ 2009-04-07 15:10   ` Andi Kleen
  0 siblings, 0 replies; 150+ messages in thread
From: Andi Kleen @ 2009-04-07 15:10 UTC (permalink / raw)
  To: hugh, npiggin, riel, lee.schermerhorn, akpm, linux-kernel, linux-mm, x86


This patch adds the high level memory handler that poisons pages. 
It is portable code and lives in mm/memory-failure.c

To quote the overview comment:

 * High level machine check handler. Handles pages reported by the
 * hardware as being corrupted usually due to a 2bit ECC memory or cache
 * failure.
 *
 * This focuses on pages detected as corrupted in the background.
 * When the current CPU tries to consume corruption the currently
 * running process can just be killed directly instead. This implies
 * that if the error cannot be handled for some reason it's safe to
 * just ignore it because no corruption has been consumed yet. Instead
 * when that happens another machine check will happen.
 *
 * Handles page cache pages in various states. The tricky part
 * here is that we can access any page asynchronous to other VM
 * users, because memory failures could happen anytime and anywhere,
 * possibly violating some of their assumptions. This is why this code
 * has to be extremely careful. Generally it tries to use normal locking
 * rules, as in get the standard locks, even if that means the
 * error handling takes potentially a long time.
 *
 * Some of the operations here are somewhat inefficient and have non
 * linear algorithmic complexity, because the data structures have not
 * been optimized for this case. This is in particular the case
 * for the mapping from a vma to a process. Since this case is expected
 * to be rare we hope we can get away with this.

There are in principle two strategies to kill processes on poison:
- just unmap the data and wait for an actual reference before killing
- kill as soon as the corruption is detected.
Both have advantages and disadvantages and should be used
in different situations. Right now both are implemented and can
be switched with the new sysctl vm.memory_failure_early_kill.
The default is early kill.

The patch does some rmap data structure walking on its own to collect
processes to kill. This is unusual because normally all rmap data structure
knowledge is in rmap.c only. I put it here for now to keep
everything together, and rmap knowledge has been seeping out anyway.

This isn't complete yet. The biggest gap is the missing hugepage 
handling, plus a few other corner cases. The code is not yet able
to get rid of all references in all cases.

This is rather tricky code and needs a lot of review. Undoubtedly it still
has bugs.

Cc: hugh@veritas.com
Cc: npiggin@suse.de
Cc: riel@redhat.com
Cc: lee.schermerhorn@hp.com
Cc: akpm@linux-foundation.org
Signed-off-by: Andi Kleen <ak@linux.intel.com>

---
 fs/proc/meminfo.c   |    9 
 include/linux/mm.h  |    4 
 kernel/sysctl.c     |   14 +
 mm/Kconfig          |    3 
 mm/Makefile         |    1 
 mm/memory-failure.c |  575 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 6 files changed, 605 insertions(+), 1 deletion(-)

Index: linux/mm/Makefile
===================================================================
--- linux.orig/mm/Makefile	2009-04-07 16:39:21.000000000 +0200
+++ linux/mm/Makefile	2009-04-07 16:39:39.000000000 +0200
@@ -38,3 +38,4 @@
 endif
 obj-$(CONFIG_QUICKLIST) += quicklist.o
 obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o page_cgroup.o
+obj-$(CONFIG_MEMORY_FAILURE) += memory-failure.o
Index: linux/mm/memory-failure.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux/mm/memory-failure.c	2009-04-07 16:39:39.000000000 +0200
@@ -0,0 +1,575 @@
+/*
+ * Copyright (C) 2008, 2009 Intel Corporation
+ * Author: Andi Kleen
+ *
+ * This software may be redistributed and/or modified under the terms of
+ * the GNU General Public License ("GPL") version 2 only as published by the
+ * Free Software Foundation.
+ *
+ * High level machine check handler. Handles pages reported by the
+ * hardware as being corrupted usually due to a 2bit ECC memory or cache
+ * failure.
+ *
+ * This focuses on pages detected as corrupted in the background.
+ * When the current CPU tries to consume corruption the currently
+ * running process can just be killed directly instead. This implies
+ * that if the error cannot be handled for some reason it's safe to
+ * just ignore it because no corruption has been consumed yet. Instead
+ * when that happens another machine check will happen.
+ *
+ * Handles page cache pages in various states.	The tricky part
+ * here is that we can access any page asynchronous to other VM
+ * users, because memory failures could happen anytime and anywhere,
+ * possibly violating some of their assumptions. This is why this code
+ * has to be extremely careful. Generally it tries to use normal locking
+ * rules, as in get the standard locks, even if that means the
+ * error handling takes potentially a long time.
+ *
+ * Some of the operations here are somewhat inefficient and have non
+ * linear algorithmic complexity, because the data structures have not
+ * been optimized for this case. This is in particular the case
+ * for the mapping from a VMA to a process. Since this case is expected
+ * to be rare we hope we can get away with this.
+ */
+
+/*
+ * Notebook:
+ * - hugetlb needs more code
+ * - nonlinear
+ * - remap races
+ * - anonymous (tinject):
+ *   + left over references when process catches signal?
+ * - error reporting on EIO missing (tinject)
+ * - kcore/oldmem/vmcore/mem/kmem check for poison pages
+ * - pass bad pages to kdump next kernel
+ */
+#include <linux/kernel.h>
+#include <linux/mm.h>
+#include <linux/page-flags.h>
+#include <linux/sched.h>
+#include <linux/rmap.h>
+#include <linux/pagemap.h>
+#include <linux/swap.h>
+#include "internal.h"
+
+#define Dprintk(x...) printk(x)
+
+int sysctl_memory_failure_early_kill __read_mostly = 1;
+
+atomic_long_t mce_bad_pages;
+
+/*
+ * Send an ``action optional'' signal to a process that has the page
+ * mapped.
+ */
+static int kill_proc_ao(struct task_struct *t, unsigned long addr, int trapno)
+{
+	struct siginfo si;
+	int ret;
+
+	printk(KERN_ERR
+		"MCE: Killing %s:%d due to hardware memory corruption\n",
+		t->comm, t->pid);
+	si.si_signo = SIGBUS;
+	si.si_errno = 0;
+	si.si_code = BUS_MCEERR_AO;
+	si.si_addr = (void *)addr;
+#ifdef __ARCH_SI_TRAPNO
+	si.si_trapno = trapno;
+#endif
+	si.si_addr_lsb = PAGE_SHIFT;
+	ret = force_sig_info(SIGBUS, &si, t);  /* synchronous? */
+	if (ret < 0)
+		printk(KERN_INFO "MCE: Error sending signal to %s:%d: %d\n",
+		       t->comm, t->pid, ret);
+	return ret;
+}
+
+/*
+ * Kill all processes that have a poisoned page mapped and then isolate
+ * the page.
+ *
+ * General strategy:
+ * Find all processes having the page mapped and kill them.
+ * But we keep a page reference around so that the page is not
+ * actually freed yet.
+ * Then stash the page away
+ *
+ * There's no convenient way to get back to mapped processes
+ * from the VMAs. So do a brute-force search over all
+ * running processes.
+ *
+ * Remember that machine checks are not common (or rather
+ * if they are common you have other problems), so this shouldn't
+ * be a performance issue.
+ *
+ * Also there are some races possible while we get from the
+ * error detection to actually handle it.
+ */
+
+struct to_kill {
+	struct list_head nd;
+	struct task_struct *tsk;
+	unsigned long addr;
+};
+
+/*
+ * Failure handling: if we can't find or can't kill a process there's
+ * not much we can do.  We just print a message and otherwise ignore it.
+ */
+
+/*
+ * Schedule a process for later kill.
+ * Uses GFP_ATOMIC allocations to avoid potential recursions in the VM.
+ * TBD would GFP_NOIO be enough?
+ */
+static void add_to_kill(struct task_struct *tsk, struct page *p,
+		       struct vm_area_struct *vma,
+		       struct list_head *to_kill,
+		       struct to_kill **tkc)
+{
+	int fail = 0;
+	struct to_kill *tk;
+
+	if (*tkc) {
+		tk = *tkc;
+		*tkc = NULL;
+	} else {
+		tk = kmalloc(sizeof(struct to_kill), GFP_ATOMIC);
+		if (!tk) {
+			printk(KERN_ERR "MCE: Out of memory during machine check handling\n");
+			return;
+		}
+	}
+	tk->addr = page_address_in_vma(p, vma);
+	if (tk->addr == -EFAULT) {
+		printk(KERN_INFO "MCE: Failed to get address in VMA\n");
+		tk->addr = 0;
+		fail = 1;
+	}
+	get_task_struct(tsk);
+	tk->tsk = tsk;
+	list_add_tail(&tk->nd, to_kill);
+}
+
+/*
+ * Kill the processes that have been collected earlier.
+ */
+static void
+kill_procs_ao(struct list_head *to_kill, int doit, int trapno, int fail)
+{
+	struct to_kill *tk, *next;
+
+	list_for_each_entry_safe (tk, next, to_kill, nd) {
+		if (doit) {
+			/*
+			 * In case something went wrong with unmapping
+			 * make sure the process doesn't catch the
+			 * signal and then access the memory. So reset
+			 * the signal handlers
+			 */
+			if (fail)
+				flush_signal_handlers(tk->tsk, 1);
+
+			/*
+			 * In theory the process could have mapped
+			 * something else on the address in-between. We could
+			 * check for that, but we need to tell the
+			 * process anyways.
+			 */
+			if (kill_proc_ao(tk->tsk, tk->addr, trapno) < 0)
+				printk(KERN_ERR
+		"MCE: Cannot send advisory machine check signal to %s:%d\n",
+						 tk->tsk->comm, tk->tsk->pid);
+		}
+		put_task_struct(tk->tsk);
+		kfree(tk);
+	}
+}
+
+/*
+ * Collect processes when the error hit an anonymous page.
+ */
+static void collect_procs_anon(struct page *page, struct list_head *to_kill,
+			      struct to_kill **tkc)
+{
+	struct vm_area_struct *vma;
+	struct task_struct *tsk;
+	struct anon_vma *av = page_lock_anon_vma(page);
+
+	if (av == NULL)	/* Not actually mapped anymore */
+		goto out;
+
+	read_lock(&tasklist_lock);
+	for_each_process (tsk) {
+		if (!tsk->mm)
+			continue;
+		list_for_each_entry (vma, &av->head, anon_vma_node) {
+			if (vma->vm_mm == tsk->mm)
+				add_to_kill(tsk, page, vma, to_kill, tkc);
+		}
+	}
+	read_unlock(&tasklist_lock);
+out:
+	page_unlock_anon_vma(av);
+}
+
+/*
+ * Collect processes when the error hit a file mapped page.
+ */
+static void collect_procs_file(struct page *page, struct list_head *to_kill,
+			      struct to_kill **tkc)
+{
+	struct vm_area_struct *vma;
+	struct task_struct *tsk;
+	struct prio_tree_iter iter;
+	struct address_space *mapping = page_mapping(page);
+
+	read_lock(&tasklist_lock);
+	spin_lock(&mapping->i_mmap_lock);
+	for_each_process(tsk) {
+		pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
+
+		if (!tsk->mm)
+			continue;
+
+		vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff,
+				      pgoff)
+			if (vma->vm_mm == tsk->mm)
+				add_to_kill(tsk, page, vma, to_kill, tkc);
+	}
+	spin_unlock(&mapping->i_mmap_lock);
+	read_unlock(&tasklist_lock);
+}
+
+/*
+ * Collect the processes that have the corrupted page mapped, to kill them.
+ * This is done in two steps for locking reasons.
+ * First preallocate one tokill structure outside the spin locks,
+ * so that we can kill at least one process reasonably reliably.
+ */
+static void collect_procs(struct page *page, struct list_head *tokill)
+{
+	struct to_kill *tk;
+
+	tk = kmalloc(sizeof(struct to_kill), GFP_KERNEL);
+	/* memory allocation failure is implicitly handled */
+	if (PageAnon(page))
+		collect_procs_anon(page, tokill, &tk);
+	else
+		collect_procs_file(page, tokill, &tk);
+	kfree(tk);
+}
+
+/*
+ * Error handlers for various types of pages.
+ */
+
+enum outcome {
+	FAILED,
+	DELAYED,
+	IGNORED,
+	RECOVERED,
+};
+
+static const char *action_name[] = {
+	[FAILED] = "Failed",
+	[DELAYED] = "Delayed",
+	[IGNORED] = "Ignored",
+	[RECOVERED] = "Recovered",
+};
+
+/*
+ * Error hit kernel page.
+ * Do nothing and hope the kernel never touches the page again. For a
+ * few cases we could be more sophisticated.
+ */
+static int me_kernel(struct page *p)
+{
+	return DELAYED;
+}
+
+/*
+ * Already poisoned page.
+ */
+static int me_ignore(struct page *p)
+{
+	return IGNORED;
+}
+
+/*
+ * Page in unknown state. Do nothing.
+ */
+static int me_unknown(struct page *p)
+{
+	printk(KERN_ERR "MCE: Unknown state page %lx flags %lx, count %d\n",
+	       page_to_pfn(p), p->flags, page_count(p));
+	return FAILED;
+}
+
+/*
+ * Free memory
+ */
+static int me_free(struct page *p)
+{
+	/* TBD Should delete page from buddy here. */
+	return IGNORED;
+}
+
+/*
+ * Clean (or cleaned) page cache page.
+ */
+static int me_pagecache_clean(struct page *p)
+{
+	struct address_space *mapping;
+
+	if (PagePrivate(p))
+		do_invalidatepage(p, 0);
+	mapping = page_mapping(p);
+	if (mapping) {
+		if (!remove_mapping(mapping, p))
+			return FAILED;
+	}
+	return RECOVERED;
+}
+
+/*
+ * Dirty page cache page.
+ * Issues: when the error hit a hole page the error is not properly
+ * propagated.
+ */
+static int me_pagecache_dirty(struct page *p)
+{
+	struct address_space *mapping = page_mapping(p);
+
+	SetPageError(p);
+	/* TBD: print more information about the file. */
+	printk(KERN_ERR "MCE: Hardware memory corruption on dirty file page: write error\n");
+	if (mapping) {
+		/* CHECKME: does that report the error in all cases? */
+		mapping_set_error(mapping, EIO);
+	}
+	if (PagePrivate(p)) {
+		if (try_to_release_page(p, GFP_KERNEL)) {
+			/*
+			 * Normally this should not happen because we
+			 * hold the page lock.  What should we do
+			 * here?  Wait on the page? (TBD)
+			 */
+			printk(KERN_ERR
+			       "MCE: Trying to release dirty page failed\n");
+			return FAILED;
+		}
+	} else if (mapping) {
+		cancel_dirty_page(p, PAGE_CACHE_SIZE);
+	}
+	return me_pagecache_clean(p);
+}
+
+/*
+ * Dirty swap cache.
+ * Cannot map back to the process because the rmaps are gone. Instead we rely
+ * on any subsequent re-fault to run into the Poison bit. This is not optimal.
+ */
+static int me_swapcache_dirty(struct page *p)
+{
+	delete_from_swap_cache(p);
+	return DELAYED;
+}
+
+/*
+ * Clean swap cache.
+ */
+static int me_swapcache_clean(struct page *p)
+{
+	delete_from_swap_cache(p);
+	return RECOVERED;
+}
+
+/*
+ * Huge pages. Needs work.
+ * Issues:
+ * No rmap support so we cannot find the original mapper. In theory could walk
+ * all MMs and look for the mappings, but that would be non-atomic and racy.
+ * Need rmap for hugepages for this. Alternatively we could employ a heuristic,
+ * like just walking the current process and hoping it has it mapped (that
+ * should be usually true for the common "shared database cache" case)
+ * Should handle free huge pages and dequeue them too, but this needs to
+ * handle huge page accounting correctly.
+ */
+static int me_huge_page(struct page *p)
+{
+	return FAILED;
+}
+
+/*
+ * Various page states we can handle.
+ *
+ * This is quite tricky because we can access a page at any time
+ * in its life cycle.
+ *
+ * This is not complete. More states could be added.
+ */
+static struct page_state {
+	unsigned long mask;
+	unsigned long res;
+	char *msg;
+	int (*action)(struct page *p);
+} error_states[] = {
+#define F(x) (1UL << PG_ ## x)
+	{ F(reserved), F(reserved), "reserved kernel", me_ignore },
+	{ F(buddy), F(buddy), "free kernel", me_free },
+	/*
+	 * Could in theory check if slab page is free or if we can drop
+	 * currently unused objects without touching them. But just
+	 * treat it as standard kernel for now.
+	 */
+	{ F(slab), F(slab), "kernel slab", me_kernel },
+#ifdef CONFIG_PAGEFLAGS_EXTENDED
+	{ F(head), F(head), "hugetlb", me_huge_page },
+	{ F(tail), F(tail), "hugetlb", me_huge_page },
+#else
+	{ F(compound), F(compound), "hugetlb", me_huge_page },
+#endif
+	{ F(swapcache)|F(dirty), F(swapcache)|F(dirty), "dirty swapcache",
+	  me_swapcache_dirty },
+	{ F(swapcache)|F(dirty), F(swapcache), "clean swapcache",
+	  me_swapcache_clean },
+#ifdef CONFIG_UNEVICTABLE_LRU
+	{ F(unevictable)|F(dirty), F(unevictable)|F(dirty),
+	  "unevictable dirty page cache", me_pagecache_dirty },
+	{ F(unevictable), F(unevictable), "unevictable page cache",
+	  me_pagecache_clean },
+#endif
+#ifdef CONFIG_HAVE_MLOCKED_PAGE_BIT
+	{ F(mlocked)|F(dirty), F(mlocked)|F(dirty), "mlocked dirty page cache",
+	  me_pagecache_dirty },
+	{ F(mlocked), F(mlocked), "mlocked page cache", me_pagecache_clean },
+#endif
+	{ F(lru)|F(dirty), F(lru)|F(dirty), "dirty lru", me_pagecache_dirty },
+	{ F(lru)|F(dirty), F(lru), "clean lru", me_pagecache_clean },
+	{ F(swapbacked), F(swapbacked), "anonymous", me_pagecache_clean },
+	/*
+	 * More states could be added here.
+	 */
+	{ 0, 0, "unknown page state", me_unknown },  /* must be at end */
+#undef F
+};
+
+static void page_action(char *msg, struct page *p, int (*action)(struct page *),
+			unsigned long pfn)
+{
+	int ret;
+
+	printk(KERN_ERR
+	       "MCE: Starting recovery on %s page %lx corrupted by hardware\n",
+	       msg, pfn);
+	ret = action(p);
+	printk(KERN_ERR "MCE: Recovery of %s page %lx: %s\n",
+	       msg, pfn, action_name[ret]);
+	if (page_count(p) != 1)
+		printk(KERN_ERR
+       "MCE: Page %lx (flags %lx) still referenced by %d users after recovery\n",
+		       pfn, p->flags, page_count(p));
+
+	/* Could do more checks here if page looks ok */
+	atomic_long_add(1, &mce_bad_pages);
+
+	/*
+	 * Could adjust zone counters here to correct for the missing page.
+	 */
+}
+
+#define N_UNMAP_TRIES 5
+
+static int poison_page_prepare(struct page *p, unsigned long pfn, int trapno)
+{
+	if (PagePoison(p)) {
+		printk(KERN_ERR
+		       "MCE: Error for already poisoned page at %lx\n", pfn);
+		return -1;
+	}
+	SetPagePoison(p);
+
+	if (!PageReserved(p) && !PageSlab(p) && page_mapped(p)) {
+		LIST_HEAD(tokill);
+		int ret;
+		int i;
+
+		/*
+		 * First collect all the processes that have the page
+		 * mapped.  This has to be done before try_to_unmap,
+		 * because ttu takes the rmap data structures down.
+		 *
+		 * Error handling: We ignore errors here because
+		 * there's nothing that can be done.
+		 *
+		 * RED-PEN some cases in process exit seem to deadlock
+		 * on the page lock. drop it or add poison checks?
+		 */
+		if (sysctl_memory_failure_early_kill)
+			collect_procs(p, &tokill);
+
+		/*
+		 * try_to_unmap can fail temporarily due to races.
+		 * Try a few times (RED-PEN better strategy?)
+		 */
+		for (i = 0; i < N_UNMAP_TRIES; i++) {
+			ret = try_to_unmap(p, TTU_UNMAP|
+					   TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS);
+			if (ret == SWAP_SUCCESS)
+				break;
+			Dprintk("MCE: try_to_unmap retry needed %d\n", ret);
+		}
+
+		/*
+		 * Now that the dirty bit has been propagated to the
+		 * struct page and all unmaps done we can decide if
+		 * killing is needed or not.  Only kill when the page
+		 * was dirty, otherwise the tokill list is merely
+		 * freed.  When there was a problem unmapping earlier
+		 * use a more forceful uncatchable kill to prevent
+		 * any accesses to the poisoned memory.
+		 */
+		kill_procs_ao(&tokill, !!PageDirty(p), trapno,
+			      ret != SWAP_SUCCESS);
+	}
+
+	return 0;
+}
+
+/**
+ * memory_failure - Handle memory failure of a page.
+ * @pfn: page frame number of the corrupted page
+ * @trapno: trap number passed through to the SIGBUS siginfo
+ */
+void memory_failure(unsigned long pfn, int trapno)
+{
+	Dprintk("memory failure %lx\n", pfn);
+
+	if (!pfn_valid(pfn)) {
+		printk(KERN_ERR
+   "MCE: Hardware memory corruption in memory outside kernel control at %lx\n",
+		       pfn);
+	} else {
+		struct page *p = pfn_to_page(pfn);
+		struct page_state *ps;
+
+		/*
+		 * Make sure no one frees the page outside our control.
+		 */
+		get_page(p);
+		lock_page_nosync(p);
+
+		if (poison_page_prepare(p, pfn, trapno) < 0)
+			goto out;
+
+		for (ps = error_states;; ps++) {
+			if ((p->flags & ps->mask) == ps->res) {
+				page_action(ps->msg, p, ps->action, pfn);
+				break;
+			}
+		}
+out:
+		unlock_page(p);
+	}
+}
Index: linux/include/linux/mm.h
===================================================================
--- linux.orig/include/linux/mm.h	2009-04-07 16:39:39.000000000 +0200
+++ linux/include/linux/mm.h	2009-04-07 16:39:39.000000000 +0200
@@ -1322,6 +1322,10 @@
 
 extern void *alloc_locked_buffer(size_t size);
 extern void free_locked_buffer(void *buffer, size_t size);
+
+extern void memory_failure(unsigned long pfn, int trapno);
+extern int sysctl_memory_failure_early_kill;
+extern atomic_long_t mce_bad_pages;
 extern void release_locked_buffer(void *buffer, size_t size);
 #endif /* __KERNEL__ */
 #endif /* _LINUX_MM_H */
Index: linux/kernel/sysctl.c
===================================================================
--- linux.orig/kernel/sysctl.c	2009-04-07 16:39:21.000000000 +0200
+++ linux/kernel/sysctl.c	2009-04-07 16:39:39.000000000 +0200
@@ -1266,6 +1266,20 @@
 		.extra2		= &one,
 	},
 #endif
+#ifdef CONFIG_MEMORY_FAILURE
+	{
+		.ctl_name	= CTL_UNNUMBERED,
+		.procname	= "memory_failure_early_kill",
+		.data		= &sysctl_memory_failure_early_kill,
+		.maxlen		= sizeof(vm_highmem_is_dirtyable),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec_minmax,
+		.strategy	= &sysctl_intvec,
+		.extra1		= &zero,
+		.extra2		= &one,
+	},
+#endif
+
 /*
  * NOTE: do not add new entries to this table unless you have read
  * Documentation/sysctl/ctl_unnumbered.txt
Index: linux/fs/proc/meminfo.c
===================================================================
--- linux.orig/fs/proc/meminfo.c	2009-04-07 16:39:21.000000000 +0200
+++ linux/fs/proc/meminfo.c	2009-04-07 16:39:39.000000000 +0200
@@ -97,7 +97,11 @@
 		"Committed_AS:   %8lu kB\n"
 		"VmallocTotal:   %8lu kB\n"
 		"VmallocUsed:    %8lu kB\n"
-		"VmallocChunk:   %8lu kB\n",
+		"VmallocChunk:   %8lu kB\n"
+#ifdef CONFIG_MEMORY_FAILURE
+		"BadPages:       %8lu kB\n"
+#endif
+		,
 		K(i.totalram),
 		K(i.freeram),
 		K(i.bufferram),
@@ -144,6 +148,9 @@
 		(unsigned long)VMALLOC_TOTAL >> 10,
 		vmi.used >> 10,
 		vmi.largest_chunk >> 10
+#ifdef CONFIG_MEMORY_FAILURE
+		,atomic_long_read(&mce_bad_pages) << (PAGE_SHIFT - 10)
+#endif
 		);
 
 	hugetlb_report_meminfo(m);
Index: linux/mm/Kconfig
===================================================================
--- linux.orig/mm/Kconfig	2009-04-07 16:39:21.000000000 +0200
+++ linux/mm/Kconfig	2009-04-07 16:39:39.000000000 +0200
@@ -223,3 +223,6 @@
 
 config MMU_NOTIFIER
 	bool
+
+config MEMORY_FAILURE
+	bool

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 150+ messages in thread

* [PATCH] [14/16] x86: MCE: Rename mce_notify_user to mce_notify_irq
  2009-04-07 15:09 ` Andi Kleen
@ 2009-04-07 15:10   ` Andi Kleen
  -1 siblings, 0 replies; 150+ messages in thread
From: Andi Kleen @ 2009-04-07 15:10 UTC (permalink / raw)
  To: linux-kernel, linux-mm, x86


Impact: cleanup

Rename the mce_notify_user function to mce_notify_irq. The next
patch will split the wakeup handling for interrupt context from the
handling for process context, so the interrupt-side function gets a
clearer name now.

Signed-off-by: Andi Kleen <ak@linux.intel.com>

---
 arch/x86/include/asm/mce.h                |    2 +-
 arch/x86/kernel/cpu/mcheck/mce-inject.c   |    2 +-
 arch/x86/kernel/cpu/mcheck/mce_64.c       |   10 +++++-----
 arch/x86/kernel/cpu/mcheck/mce_intel_64.c |    2 +-
 arch/x86/kernel/signal.c                  |    2 +-
 5 files changed, 9 insertions(+), 9 deletions(-)

Index: linux/arch/x86/kernel/cpu/mcheck/mce_64.c
===================================================================
--- linux.orig/arch/x86/kernel/cpu/mcheck/mce_64.c	2009-04-07 16:39:21.000000000 +0200
+++ linux/arch/x86/kernel/cpu/mcheck/mce_64.c	2009-04-07 16:43:04.000000000 +0200
@@ -303,14 +303,14 @@
 	ack_APIC_irq();
 	exit_idle();
 	irq_enter();
-	mce_notify_user();
+	mce_notify_irq();
 	irq_exit();
 }
 
 static void mce_report_event(struct pt_regs *regs)
 {
 	if (regs->flags & (X86_VM_MASK|X86_EFLAGS_IF)) {
-		mce_notify_user();
+		mce_notify_irq();
 		return;
 	}
 
@@ -904,7 +904,7 @@
 	 * polling interval, otherwise increase the polling interval.
 	 */
 	n = &__get_cpu_var(next_interval);
-	if (mce_notify_user()) {
+	if (mce_notify_irq()) {
 		*n = max(*n/2, HZ/100);
 	} else {
 		*n = min(*n*2, (int)round_jiffies_relative(check_interval*HZ));
@@ -926,7 +926,7 @@
  * Can be called from interrupt context, but not from machine check/NMI
  * context.
  */
-int mce_notify_user(void)
+int mce_notify_irq(void)
 {
 	/* Not more than two messages every minute */
 	static DEFINE_RATELIMIT_STATE(ratelimit, 60*HZ, 2);
@@ -950,7 +950,7 @@
 	}
 	return 0;
 }
-EXPORT_SYMBOL_GPL(mce_notify_user);
+EXPORT_SYMBOL_GPL(mce_notify_irq);
 
 /*
  * Initialize Machine Checks for a CPU.
Index: linux/arch/x86/include/asm/mce.h
===================================================================
--- linux.orig/arch/x86/include/asm/mce.h	2009-04-07 16:39:21.000000000 +0200
+++ linux/arch/x86/include/asm/mce.h	2009-04-07 16:43:04.000000000 +0200
@@ -162,7 +162,7 @@
 };
 extern void machine_check_poll(enum mcp_flags flags, mce_banks_t *b);
 
-extern int mce_notify_user(void);
+extern int mce_notify_irq(void);
 
 #endif /* !CONFIG_X86_32 */
 
Index: linux/arch/x86/kernel/cpu/mcheck/mce-inject.c
===================================================================
--- linux.orig/arch/x86/kernel/cpu/mcheck/mce-inject.c	2009-04-07 16:39:21.000000000 +0200
+++ linux/arch/x86/kernel/cpu/mcheck/mce-inject.c	2009-04-07 16:39:39.000000000 +0200
@@ -65,7 +65,7 @@
 		memset(&b, 0xff, sizeof(mce_banks_t));
 		printk(KERN_INFO "Starting machine check poll CPU %d\n", cpu);
 		machine_check_poll(0, &b);
-		mce_notify_user();
+		mce_notify_irq();
 		printk(KERN_INFO "Finished machine check poll on CPU %d\n",
 		       cpu);
 	}
Index: linux/arch/x86/kernel/signal.c
===================================================================
--- linux.orig/arch/x86/kernel/signal.c	2009-04-07 16:39:21.000000000 +0200
+++ linux/arch/x86/kernel/signal.c	2009-04-07 16:43:04.000000000 +0200
@@ -860,7 +860,7 @@
 #if defined(CONFIG_X86_64) && defined(CONFIG_X86_MCE)
 	/* notify userspace of pending MCEs */
 	if (thread_info_flags & _TIF_MCE_NOTIFY)
-		mce_notify_user();
+		mce_notify_irq();
 #endif /* CONFIG_X86_64 && CONFIG_X86_MCE */
 
 	/* deal with pending signal delivery */
Index: linux/arch/x86/kernel/cpu/mcheck/mce_intel_64.c
===================================================================
--- linux.orig/arch/x86/kernel/cpu/mcheck/mce_intel_64.c	2009-04-07 16:39:21.000000000 +0200
+++ linux/arch/x86/kernel/cpu/mcheck/mce_intel_64.c	2009-04-07 16:39:39.000000000 +0200
@@ -132,7 +132,7 @@
 static void intel_threshold_interrupt(void)
 {
 	machine_check_poll(MCP_TIMESTAMP, &__get_cpu_var(mce_banks_owned));
-	mce_notify_user();
+	mce_notify_irq();
 }
 
 static void print_update(char *type, int *hdr, int num)

^ permalink raw reply	[flat|nested] 150+ messages in thread

* [PATCH] [14/16] x86: MCE: Rename mce_notify_user to mce_notify_irq
@ 2009-04-07 15:10   ` Andi Kleen
  0 siblings, 0 replies; 150+ messages in thread
From: Andi Kleen @ 2009-04-07 15:10 UTC (permalink / raw)
  To: linux-kernel, linux-mm, x86


Impact: cleanup

Rename the mce_notify_user function to mce_notify_irq. The next
patch will split the wakeup handling of interrupt context 
and of process context and it's better to give it a clearer
name for this.

Signed-off-by: Andi Kleen <ak@linux.intel.com>

---
 arch/x86/include/asm/mce.h                |    2 +-
 arch/x86/kernel/cpu/mcheck/mce-inject.c   |    2 +-
 arch/x86/kernel/cpu/mcheck/mce_64.c       |   10 +++++-----
 arch/x86/kernel/cpu/mcheck/mce_intel_64.c |    2 +-
 arch/x86/kernel/signal.c                  |    2 +-
 5 files changed, 9 insertions(+), 9 deletions(-)

Index: linux/arch/x86/kernel/cpu/mcheck/mce_64.c
===================================================================
--- linux.orig/arch/x86/kernel/cpu/mcheck/mce_64.c	2009-04-07 16:39:21.000000000 +0200
+++ linux/arch/x86/kernel/cpu/mcheck/mce_64.c	2009-04-07 16:43:04.000000000 +0200
@@ -303,14 +303,14 @@
 	ack_APIC_irq();
 	exit_idle();
 	irq_enter();
-	mce_notify_user();
+	mce_notify_irq();
 	irq_exit();
 }
 
 static void mce_report_event(struct pt_regs *regs)
 {
 	if (regs->flags & (X86_VM_MASK|X86_EFLAGS_IF)) {
-		mce_notify_user();
+		mce_notify_irq();
 		return;
 	}
 
@@ -904,7 +904,7 @@
 	 * polling interval, otherwise increase the polling interval.
 	 */
 	n = &__get_cpu_var(next_interval);
-	if (mce_notify_user()) {
+	if (mce_notify_irq()) {
 		*n = max(*n/2, HZ/100);
 	} else {
 		*n = min(*n*2, (int)round_jiffies_relative(check_interval*HZ));
@@ -926,7 +926,7 @@
  * Can be called from interrupt context, but not from machine check/NMI
  * context.
  */
-int mce_notify_user(void)
+int mce_notify_irq(void)
 {
 	/* Not more than two messages every minute */
 	static DEFINE_RATELIMIT_STATE(ratelimit, 60*HZ, 2);
@@ -950,7 +950,7 @@
 	}
 	return 0;
 }
-EXPORT_SYMBOL_GPL(mce_notify_user);
+EXPORT_SYMBOL_GPL(mce_notify_irq);
 
 /*
  * Initialize Machine Checks for a CPU.
Index: linux/arch/x86/include/asm/mce.h
===================================================================
--- linux.orig/arch/x86/include/asm/mce.h	2009-04-07 16:39:21.000000000 +0200
+++ linux/arch/x86/include/asm/mce.h	2009-04-07 16:43:04.000000000 +0200
@@ -162,7 +162,7 @@
 };
 extern void machine_check_poll(enum mcp_flags flags, mce_banks_t *b);
 
-extern int mce_notify_user(void);
+extern int mce_notify_irq(void);
 
 #endif /* !CONFIG_X86_32 */
 
Index: linux/arch/x86/kernel/cpu/mcheck/mce-inject.c
===================================================================
--- linux.orig/arch/x86/kernel/cpu/mcheck/mce-inject.c	2009-04-07 16:39:21.000000000 +0200
+++ linux/arch/x86/kernel/cpu/mcheck/mce-inject.c	2009-04-07 16:39:39.000000000 +0200
@@ -65,7 +65,7 @@
 		memset(&b, 0xff, sizeof(mce_banks_t));
 		printk(KERN_INFO "Starting machine check poll CPU %d\n", cpu);
 		machine_check_poll(0, &b);
-		mce_notify_user();
+		mce_notify_irq();
 		printk(KERN_INFO "Finished machine check poll on CPU %d\n",
 		       cpu);
 	}
Index: linux/arch/x86/kernel/signal.c
===================================================================
--- linux.orig/arch/x86/kernel/signal.c	2009-04-07 16:39:21.000000000 +0200
+++ linux/arch/x86/kernel/signal.c	2009-04-07 16:43:04.000000000 +0200
@@ -860,7 +860,7 @@
 #if defined(CONFIG_X86_64) && defined(CONFIG_X86_MCE)
 	/* notify userspace of pending MCEs */
 	if (thread_info_flags & _TIF_MCE_NOTIFY)
-		mce_notify_user();
+		mce_notify_irq();
 #endif /* CONFIG_X86_64 && CONFIG_X86_MCE */
 
 	/* deal with pending signal delivery */
Index: linux/arch/x86/kernel/cpu/mcheck/mce_intel_64.c
===================================================================
--- linux.orig/arch/x86/kernel/cpu/mcheck/mce_intel_64.c	2009-04-07 16:39:21.000000000 +0200
+++ linux/arch/x86/kernel/cpu/mcheck/mce_intel_64.c	2009-04-07 16:39:39.000000000 +0200
@@ -132,7 +132,7 @@
 static void intel_threshold_interrupt(void)
 {
 	machine_check_poll(MCP_TIMESTAMP, &__get_cpu_var(mce_banks_owned));
-	mce_notify_user();
+	mce_notify_irq();
 }
 
 static void print_update(char *type, int *hdr, int num)

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: dont@kvack.org


* [PATCH] [15/16] x86: MCE: Support action-optional machine checks
  2009-04-07 15:09 ` Andi Kleen
@ 2009-04-07 15:10   ` Andi Kleen
  -1 siblings, 0 replies; 150+ messages in thread
From: Andi Kleen @ 2009-04-07 15:10 UTC (permalink / raw)
  To: linux-kernel, linux-mm, x86


Newer Intel CPUs support a new class of machine checks called recoverable
action optional.

Action Optional means that the CPU detected some form of corruption in
the background and tells the OS about it using a machine check
exception. The OS can then take appropriate action, such as killing the
process with the corrupted data or logging the event properly to disk.

This is done by the new generic high level memory failure handler added in an
earlier patch. The high level handler takes the address of the failed
memory and takes the appropriate action, such as killing the process.

The high level handler cannot be called directly from the machine check
exception though, because it has to run in a defined process context to be able
to sleep when taking VM locks (it is not expected to sleep for a long time,
only in exceptional cases like lock contention).

Thus the MCE handler has to queue a work item for process context,
trigger process context and then call the high level handler from there.

This patch adds two paths to process context: a per thread kernel exit
notify_user() callback and a high priority work item.  The first
runs when the process exits back to user space, the other when it goes
to sleep and there is no higher priority process.

The machine check handler schedules both, and whichever runs first
grabs the event. This is done because a quick reaction to the
event is critical to avoid a potentially more fatal machine check
when the corruption is consumed.

A simple lockless ring buffer queues the corrupted
addresses between the exception handler and the process context handler.
In process context the handler then just calls the high level VM code with
the corrupted PFNs.

The patch adds the required code to extract the failed address from
the CPU's machine check registers. It doesn't try to handle all
possible cases -- the specification has six different ways to report a
memory address -- but only physical addresses.

Most of the required checking has already been done earlier by the
mce_severity rule checking engine.  Following the Intel
recommendations, Action Optional errors are only enabled for known
situations (encoded in MCACODs). Otherwise the errors are ignored,
which is safe because they are action optional.

Signed-off-by: Andi Kleen <ak@linux.intel.com>

---
 arch/x86/Kconfig                          |    1 
 arch/x86/include/asm/irq_vectors.h        |    1 
 arch/x86/include/asm/mce.h                |    1 
 arch/x86/kernel/cpu/mcheck/mce-severity.c |    8 +-
 arch/x86/kernel/cpu/mcheck/mce_64.c       |  114 ++++++++++++++++++++++++++++++
 arch/x86/kernel/signal.c                  |    2 
 6 files changed, 125 insertions(+), 2 deletions(-)

Index: linux/arch/x86/kernel/cpu/mcheck/mce_64.c
===================================================================
--- linux.orig/arch/x86/kernel/cpu/mcheck/mce_64.c	2009-04-07 16:39:39.000000000 +0200
+++ linux/arch/x86/kernel/cpu/mcheck/mce_64.c	2009-04-07 16:39:39.000000000 +0200
@@ -14,6 +14,7 @@
 #include <linux/sched.h>
 #include <linux/string.h>
 #include <linux/rcupdate.h>
+#include <linux/mm.h>
 #include <linux/kallsyms.h>
 #include <linux/sysdev.h>
 #include <linux/miscdevice.h>
@@ -79,6 +80,8 @@
 	[0 ... BITS_TO_LONGS(MAX_NR_BANKS)-1] = ~0UL
 };
 
+static DEFINE_PER_CPU(struct work_struct, mce_work);
+
 /* Do initial initialization of a struct mce */
 void mce_setup(struct mce *m)
 {
@@ -273,6 +276,52 @@
 	wrmsrl(msr, v);
 }
 
+/*
+ * Simple lockless ring to communicate PFNs from the exception handler to the
+ * process context work function. This is vastly simplified because there's
+ * only a single reader and a single writer.
+ */
+#define MCE_RING_SIZE 16	/* we use one entry less */
+
+struct mce_ring {
+	unsigned short start;
+	unsigned short end;
+	unsigned long ring[MCE_RING_SIZE];
+};
+static DEFINE_PER_CPU(struct mce_ring, mce_ring);
+
+static int mce_ring_empty(void)
+{
+	struct mce_ring *r = &__get_cpu_var(mce_ring);
+
+	return r->start == r->end;
+}
+
+static int mce_ring_get(unsigned long *pfn)
+{
+	struct mce_ring *r = &__get_cpu_var(mce_ring);
+
+	if (r->start == r->end)
+		return 0;
+	*pfn = r->ring[r->start];
+	r->start = (r->start + 1) % MCE_RING_SIZE;
+	return 1;
+}
+
+static int mce_ring_add(unsigned long pfn)
+{
+	struct mce_ring *r = &__get_cpu_var(mce_ring);
+	unsigned next;
+
+	next = (r->end + 1) % MCE_RING_SIZE;
+	if (next == r->start)
+		return -1;
+	r->ring[r->end] = pfn;
+	wmb();
+	r->end = next;
+	return 0;
+}
+
 int mce_available(struct cpuinfo_x86 *c)
 {
 	if (mce_dont_init)
@@ -293,6 +342,15 @@
 		m->ip = mce_rdmsrl(rip_msr);
 }
 
+static void mce_schedule_work(void)
+{
+	if (!mce_ring_empty()) {
+		struct work_struct *work = &__get_cpu_var(mce_work);
+		if (!work_pending(work))
+			schedule_work(work);
+	}
+}
+
 /*
  * Called after interrupts have been reenabled again
  * when a MCE happened during an interrupts off region
@@ -304,6 +362,7 @@
 	exit_idle();
 	irq_enter();
 	mce_notify_irq();
+	mce_schedule_work();
 	irq_exit();
 }
 
@@ -311,6 +370,13 @@
 {
 	if (regs->flags & (X86_VM_MASK|X86_EFLAGS_IF)) {
 		mce_notify_irq();
+		/*
+		 * Triggering the work queue here is just an insurance
+		 * policy in case the syscall exit notify handler
+		 * doesn't run soon enough or ends up running on the
+		 * wrong CPU (can happen when audit sleeps)
+		 */
+		mce_schedule_work();
 		return;
 	}
 
@@ -669,6 +735,23 @@
 	return ret;
 }
 
+/*
+ * Check if the address reported by the CPU is in a format we can parse.
+ * It would be possible to add code for most other cases, but all would
+ * be somewhat complicated (e.g. segment offset would require an instruction
+ * parser). So only support physical addresses up to page granularity for now.
+ */
+static int mce_usable_address(struct mce *m)
+{
+	if (!(m->status & MCI_STATUS_MISCV) || !(m->status & MCI_STATUS_ADDRV))
+		return 0;
+	if ((m->misc & 0x3f) > PAGE_SHIFT)
+		return 0;
+	if (((m->misc >> 6) & 7) != MCM_ADDR_PHYS)
+		return 0;
+	return 1;
+}
+
 static void mce_clear_state(unsigned long *toclear)
 {
 	int i;
@@ -802,6 +885,16 @@
 		if (m.status & MCI_STATUS_ADDRV)
 			m.addr = mce_rdmsrl(MSR_IA32_MC0_ADDR + i*4);
 
+		/*
+		 * Action optional error. Queue address for later processing.
+		 * When the ring overflows we just ignore the AO error.
+		 * RED-PEN add some logging mechanism when
+		 * mce_usable_address or mce_ring_add fails.
+		 * RED-PEN don't ignore overflow for tolerant == 0
+		 */
+		if (severity == MCE_AO_SEVERITY && mce_usable_address(&m))
+			mce_ring_add(m.addr >> PAGE_SHIFT);
+
 		mce_get_rip(&m, regs);
 		mce_log(&m);
 
@@ -852,6 +945,26 @@
 }
 EXPORT_SYMBOL_GPL(do_machine_check);
 
+/*
+ * Called after mce notification in process context. This code
+ * is allowed to sleep. Call the high level VM handler to process
+ * any corrupted pages.
+ * Assume that the work queue code only calls this one at a time
+ * per CPU.
+ */
+void mce_notify_process(void)
+{
+	unsigned long pfn;
+	mce_notify_irq();
+	while (mce_ring_get(&pfn))
+		memory_failure(pfn, MCE_VECTOR);
+}
+
+static void mce_process_work(struct work_struct *dummy)
+{
+	mce_notify_process();
+}
+
 #ifdef CONFIG_X86_MCE_INTEL
 /***
  * mce_log_therm_throt_event - Logs the thermal throttling event to mcelog
@@ -1088,6 +1201,7 @@
 	mce_init();
 	mce_cpu_features(c);
 	mce_init_timer();
+	INIT_WORK(&__get_cpu_var(mce_work), mce_process_work);
 }
 
 /*
Index: linux/arch/x86/include/asm/mce.h
===================================================================
--- linux.orig/arch/x86/include/asm/mce.h	2009-04-07 16:39:39.000000000 +0200
+++ linux/arch/x86/include/asm/mce.h	2009-04-07 16:39:39.000000000 +0200
@@ -163,6 +163,7 @@
 extern void machine_check_poll(enum mcp_flags flags, mce_banks_t *b);
 
 extern int mce_notify_irq(void);
+extern void mce_notify_process(void);
 
 #endif /* !CONFIG_X86_32 */
 
Index: linux/arch/x86/kernel/signal.c
===================================================================
--- linux.orig/arch/x86/kernel/signal.c	2009-04-07 16:39:39.000000000 +0200
+++ linux/arch/x86/kernel/signal.c	2009-04-07 16:39:39.000000000 +0200
@@ -860,7 +860,7 @@
 #if defined(CONFIG_X86_64) && defined(CONFIG_X86_MCE)
 	/* notify userspace of pending MCEs */
 	if (thread_info_flags & _TIF_MCE_NOTIFY)
-		mce_notify_irq();
+		mce_notify_process();
 #endif /* CONFIG_X86_64 && CONFIG_X86_MCE */
 
 	/* deal with pending signal delivery */
Index: linux/arch/x86/kernel/cpu/mcheck/mce-severity.c
===================================================================
--- linux.orig/arch/x86/kernel/cpu/mcheck/mce-severity.c	2009-04-07 16:39:00.000000000 +0200
+++ linux/arch/x86/kernel/cpu/mcheck/mce-severity.c	2009-04-07 16:39:39.000000000 +0200
@@ -67,7 +67,13 @@
 	     "Action required; unknown MCACOD", SER),
 	MASK(MCI_STATUS_OVER|MCI_UC_SAR, MCI_STATUS_OVER|MCI_UC_SAR, PANIC,
 	     "Action required with lost events", SER),
-	/* AO add known MCACODs here */
+
+	/* known AO MCACODs: handle by calling high level handler */
+	MASK(MCI_UC_SAR|0xfff0, MCI_UC_S|0xc0, AO,
+	     "Action optional: memory scrubbing error", SER),
+	MASK(MCI_UC_SAR|MCACOD, MCI_UC_S|0x17a, AO,
+	     "Action optional: last level cache writeback error", SER),
+
 	MASK(MCI_STATUS_OVER|MCI_UC_SAR, MCI_UC_S, SOME,
 	     "Action optional unknown MCACOD", SER),
 	MASK(MCI_STATUS_OVER|MCI_UC_SAR, MCI_UC_S|MCI_STATUS_OVER, SOME,
Index: linux/arch/x86/include/asm/irq_vectors.h
===================================================================
--- linux.orig/arch/x86/include/asm/irq_vectors.h	2009-04-07 16:39:00.000000000 +0200
+++ linux/arch/x86/include/asm/irq_vectors.h	2009-04-07 16:39:39.000000000 +0200
@@ -25,6 +25,7 @@
  */
 
 #define NMI_VECTOR			0x02
+#define MCE_VECTOR			0x12
 
 /*
  * IDT vectors usable for external interrupt sources start
Index: linux/arch/x86/Kconfig
===================================================================
--- linux.orig/arch/x86/Kconfig	2009-04-07 16:39:00.000000000 +0200
+++ linux/arch/x86/Kconfig	2009-04-07 16:39:39.000000000 +0200
@@ -760,6 +760,7 @@
 
 config X86_MCE
 	bool "Machine Check Exception"
+	select MEMORY_FAILURE
 	---help---
 	  Machine Check Exception support allows the processor to notify the
 	  kernel if it detects a problem (e.g. overheating, component failure).


* [PATCH] [16/16] POISON: Add madvise() based injector for poisoned data
  2009-04-07 15:09 ` Andi Kleen
@ 2009-04-07 15:10   ` Andi Kleen
  -1 siblings, 0 replies; 150+ messages in thread
From: Andi Kleen @ 2009-04-07 15:10 UTC (permalink / raw)
  To: linux-kernel, linux-mm, x86


Impact: optional, useful for debugging

Add a new madvise subcommand to inject poison for some
pages in a process' address space.  This is useful for
testing the poison page handling.

Open issues:

- This patch allows root to tie up arbitrary amounts of memory.
Should this be disabled inside containers?
- There's a small race window between getting the page and injecting.
The patch drops the reference count because otherwise memory_failure()
complains about dangling references. In theory, with a multi-threaded
injector, one could inject poison for another process' page this way.
Not a serious issue right now.

Signed-off-by: Andi Kleen <ak@linux.intel.com>

---
 include/asm-generic/mman.h |    1 +
 mm/madvise.c               |   37 +++++++++++++++++++++++++++++++++++++
 2 files changed, 38 insertions(+)

Index: linux/mm/madvise.c
===================================================================
--- linux.orig/mm/madvise.c	2009-04-07 16:36:29.000000000 +0200
+++ linux/mm/madvise.c	2009-04-07 16:39:39.000000000 +0200
@@ -208,6 +208,38 @@
 	return error;
 }
 
+#ifdef CONFIG_MEMORY_FAILURE
+/*
+ * Error injection support for memory error handling.
+ */
+static int madvise_poison(unsigned long start, unsigned long end)
+{
+	/*
+	 * RED-PEN
+	 * This allows root to tie up arbitrary amounts of memory.
+	 * Might be a good idea to disable it inside containers even for root.
+	 */
+	if (!capable(CAP_SYS_ADMIN))
+		return -EPERM;
+	for (; start < end; start += PAGE_SIZE) {
+		struct page *p;
+		int ret = get_user_pages(current, current->mm, start, 1,
+						0, 0, &p, NULL);
+		if (ret != 1)
+			return ret;
+		put_page(p);
+		/*
+		 * RED-PEN the page can be reused, but otherwise we'd have to
+		 * fight with the refcnt
+		 */
+		printk(KERN_INFO "Injecting memory failure for page %lx at %lx\n",
+		       page_to_pfn(p), start);
+		memory_failure(page_to_pfn(p), 0);
+	}
+	return 0;
+}
+#endif
+
 static long
 madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev,
 		unsigned long start, unsigned long end, int behavior)
@@ -290,6 +322,11 @@
 	int write;
 	size_t len;
 
+#ifdef CONFIG_MEMORY_FAILURE
+	if (behavior == MADV_POISON)
+		return madvise_poison(start, start+len_in);
+#endif
+
 	write = madvise_need_mmap_write(behavior);
 	if (write)
 		down_write(&current->mm->mmap_sem);
Index: linux/include/asm-generic/mman.h
===================================================================
--- linux.orig/include/asm-generic/mman.h	2009-04-07 16:36:29.000000000 +0200
+++ linux/include/asm-generic/mman.h	2009-04-07 16:39:39.000000000 +0200
@@ -34,6 +34,7 @@
 #define MADV_REMOVE	9		/* remove these pages & resources */
 #define MADV_DONTFORK	10		/* don't inherit across fork */
 #define MADV_DOFORK	11		/* do inherit across fork */
+#define MADV_POISON	12		/* poison the page (root only) */
 
 /* compatibility flags */
 #define MAP_FILE	0


* Re: [PATCH] [13/16] POISON: The high level memory error handler in the VM
  2009-04-07 15:10   ` Andi Kleen
@ 2009-04-07 16:03     ` Rik van Riel
  -1 siblings, 0 replies; 150+ messages in thread
From: Rik van Riel @ 2009-04-07 16:03 UTC (permalink / raw)
  To: Andi Kleen
  Cc: hugh, npiggin, lee.schermerhorn, akpm, linux-kernel, linux-mm, x86

Andi Kleen wrote:

> This is rather tricky code and needs a lot of review. Undoubtedly it still
> has bugs.

It's just complex enough that it looks like it might have
more bugs, but I sure couldn't find any.

Hitting a bug in this code seems preferable to hitting
guaranteed memory corruption, so I hope Andrew or Ingo
will merge this into one of their trees.

> Signed-off-by: Andi Kleen <ak@linux.intel.com>

Acked-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed.


* Re: [PATCH] [13/16] POISON: The high level memory error handler in the VM
  2009-04-07 16:03     ` Rik van Riel
@ 2009-04-07 16:30       ` Andi Kleen
  -1 siblings, 0 replies; 150+ messages in thread
From: Andi Kleen @ 2009-04-07 16:30 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Andi Kleen, hugh, npiggin, lee.schermerhorn, akpm, linux-kernel,
	linux-mm, x86

On Tue, Apr 07, 2009 at 12:03:00PM -0400, Rik van Riel wrote:
> Andi Kleen wrote:
> 
> >This is rather tricky code and needs a lot of review. Undoubtedly it still
> >has bugs.
> 
> It's just complex enough that it looks like it might have
> more bugs, but I sure couldn't find any.

Thanks for the review.

Perhaps I didn't put it strongly enough: I know there are still bugs
in there (e.g. the nonlinear mappings deadlock, and there are some cases
where the reference count of the page doesn't drop to zero).

> Hitting a bug in this code seems favorable to hitting
> guaranteed memory corruption, so I hope Andrew or Ingo

Yes, the alternative is always a panic() when the hardware detects
that corruption has been consumed and bails out.  So even if this code
is buggy it's very likely still an improvement, and it would be
reasonable to do a relatively early merge and improve further in tree.

> >Signed-off-by: Andi Kleen <ak@linux.intel.com>
> 
> Acked-by: Rik van Riel <riel@redhat.com>

Thanks, added.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH] [13/16] POISON: The high level memory error handler in the VM
  2009-04-07 15:10   ` Andi Kleen
@ 2009-04-07 18:51     ` Johannes Weiner
  -1 siblings, 0 replies; 150+ messages in thread
From: Johannes Weiner @ 2009-04-07 18:51 UTC (permalink / raw)
  To: Andi Kleen
  Cc: hugh, npiggin, riel, lee.schermerhorn, akpm, linux-kernel, linux-mm, x86

Hi Andi,

On Tue, Apr 07, 2009 at 05:10:10PM +0200, Andi Kleen wrote:

> +static void collect_procs_anon(struct page *page, struct list_head *to_kill,
> +			      struct to_kill **tkc)
> +{
> +	struct vm_area_struct *vma;
> +	struct task_struct *tsk;
> +	struct anon_vma *av = page_lock_anon_vma(page);
> +
> +	if (av == NULL)	/* Not actually mapped anymore */
> +		goto out;
> +
> +	read_lock(&tasklist_lock);
> +	for_each_process (tsk) {
> +		if (!tsk->mm)
> +			continue;
> +		list_for_each_entry (vma, &av->head, anon_vma_node) {
> +			if (vma->vm_mm == tsk->mm)
> +				add_to_kill(tsk, page, vma, to_kill, tkc);
> +		}
> +	}
> +	read_unlock(&tasklist_lock);
> +out:
> +	page_unlock_anon_vma(av);

If !av, this doesn't need an unlock and in fact crashes due to
dereferencing NULL.

> +static int poison_page_prepare(struct page *p, unsigned long pfn, int trapno)
> +{
> +	if (PagePoison(p)) {
> +		printk(KERN_ERR
> +		       "MCE: Error for already poisoned page at %lx\n", pfn);
> +		return -1;
> +	}
> +	SetPagePoison(p);

TestSetPagePoison()?

^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH] [8/16] POISON: Add various poison checks in mm/memory.c
  2009-04-07 15:10   ` Andi Kleen
@ 2009-04-07 19:03     ` Johannes Weiner
  -1 siblings, 0 replies; 150+ messages in thread
From: Johannes Weiner @ 2009-04-07 19:03 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-kernel, linux-mm, x86

On Tue, Apr 07, 2009 at 05:10:05PM +0200, Andi Kleen wrote:
> 
> Bail out early when poisoned pages are found in page fault handling.
> Since they are poisoned they should not be mapped freshly
> into processes.
> 
> This is generally handled in the same way as OOM, just a different
> error code is returned to the architecture code.
> 
> Signed-off-by: Andi Kleen <ak@linux.intel.com>
> 
> ---
>  mm/memory.c |    7 +++++++
>  1 file changed, 7 insertions(+)
> 
> Index: linux/mm/memory.c
> ===================================================================
> --- linux.orig/mm/memory.c	2009-04-07 16:39:39.000000000 +0200
> +++ linux/mm/memory.c	2009-04-07 16:39:39.000000000 +0200
> @@ -2560,6 +2560,10 @@
>  		goto oom;
>  	__SetPageUptodate(page);
>  
> +	/* Kludge for now until we take poisoned pages out of the free lists */
> +	if (unlikely(PagePoison(page)))
> +		return VM_FAULT_POISON;
> +

When memory_failure() hits a page still on the free list
(!page_count()) then the get_page() in memory_failure() will trigger a
VM_BUG.  So either this check is unneeded or it should be
get_page_unless_zero() in memory_failure()?

^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH] [0/16] POISON: Intro
  2009-04-07 15:09 ` Andi Kleen
@ 2009-04-07 19:13   ` Robin Holt
  -1 siblings, 0 replies; 150+ messages in thread
From: Robin Holt @ 2009-04-07 19:13 UTC (permalink / raw)
  To: Andi Kleen, Russ Anderson; +Cc: linux-kernel, linux-mm, x86

How does this overlap with the bad page quarantine that ia64 uses
following an MCA?

Robin

On Tue, Apr 07, 2009 at 05:09:56PM +0200, Andi Kleen wrote:
> 
> Upcoming Intel CPUs have support for recovering from some memory errors. This
> requires the OS to declare a page "poisoned", kill the processes associated
> with it and avoid using it in the future. This patchkit implements
> the necessary infrastructure in the VM.
> 
> To quote the overview comment:
> 
>  * High level machine check handler. Handles pages reported by the
>  * hardware as being corrupted usually due to a 2bit ECC memory or cache
>  * failure.
>  *
>  * This focusses on pages detected as corrupted in the background.
>  * When the current CPU tries to consume corruption the currently
>  * running process can just be killed directly instead. This implies
>  * that if the error cannot be handled for some reason it's safe to
>  * just ignore it because no corruption has been consumed yet. Instead
>  * when that happens another machine check will happen.
>  *
>  * Handles page cache pages in various states. The tricky part
>  * here is that we can access any page asynchronous to other VM
>  * users, because memory failures could happen anytime and anywhere,
>  * possibly violating some of their assumptions. This is why this code
>  * has to be extremely careful. Generally it tries to use normal locking
>  * rules, as in get the standard locks, even if that means the
>  * error handling takes potentially a long time.
>  *
>  * Some of the operations here are somewhat inefficient and have non
>  * linear algorithmic complexity, because the data structures have not
>  * been optimized for this case. This is in particular the case
>  * for the mapping from a vma to a process. Since this case is expected
>  * to be rare we hope we can get away with this.
> 
> The code consists of the high level handler in mm/memory-failure.c, 
> a new page poison bit and various checks in the VM to handle poisoned
> pages.
> 
> The main target right now is KVM guests, but it works for all kinds
> of applications.
> 
> For the KVM use there was need for a new signal type so that
> KVM can inject the machine check into the guest with the proper
> address. This in theory allows other applications to handle
> memory failures too. The expectation is that nearly all applications
> won't do that, but some very specialized ones might. 
> 
> This is not fully complete yet; in particular there are still various
> ways to access poison (crash dump, /proc/kcore, etc.) that need to be
> plugged too.
> 
> Also undoubtedly the high level handler still has bugs and cases
> it cannot recover from. For example nonlinear mappings deadlock right now
> and a few other cases lose references. Huge pages are not supported
> yet. Any additional testing, reviewing etc. welcome. 
> 
> The patch series requires the earlier x86 MCE feature series for the x86
> specific action optional part. The code can be tested without the x86 specific
> part using the injector; this only requires enabling the Kconfig entry
> manually in some Kconfig file (by default it is implicitly enabled
> by the architecture).
> 
> -Andi
> 

^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH] [8/16] POISON: Add various poison checks in mm/memory.c
  2009-04-07 19:03     ` Johannes Weiner
@ 2009-04-07 19:31       ` Andi Kleen
  -1 siblings, 0 replies; 150+ messages in thread
From: Andi Kleen @ 2009-04-07 19:31 UTC (permalink / raw)
  To: Johannes Weiner; +Cc: Andi Kleen, linux-kernel, linux-mm, x86

On Tue, Apr 07, 2009 at 09:03:30PM +0200, Johannes Weiner wrote:
> On Tue, Apr 07, 2009 at 05:10:05PM +0200, Andi Kleen wrote:
> > 
> > Bail out early when poisoned pages are found in page fault handling.
> > Since they are poisoned they should not be mapped freshly
> > into processes.
> > 
> > This is generally handled in the same way as OOM, just a different
> > error code is returned to the architecture code.
> > 
> > Signed-off-by: Andi Kleen <ak@linux.intel.com>
> > 
> > ---
> >  mm/memory.c |    7 +++++++
> >  1 file changed, 7 insertions(+)
> > 
> > Index: linux/mm/memory.c
> > ===================================================================
> > --- linux.orig/mm/memory.c	2009-04-07 16:39:39.000000000 +0200
> > +++ linux/mm/memory.c	2009-04-07 16:39:39.000000000 +0200
> > @@ -2560,6 +2560,10 @@
> >  		goto oom;
> >  	__SetPageUptodate(page);
> >  
> > +	/* Kludge for now until we take poisoned pages out of the free lists */
> > +	if (unlikely(PagePoison(page)))
> > +		return VM_FAULT_POISON;
> > +
> 
> When memory_failure() hits a page still on the free list

It won't free it then. Later on it will take it out of the free lists,
but that code is not written yet.

> (!page_count()) then the get_page() in memory_failure() will trigger a
> VM_BUG.  So either this check is unneeded or it should be

So no bug
> get_page_unless_zero() in memory_failure()?

That's not what this is handling.  The issue is that sometimes the
process can still be freeing it, and we need to make sure it never
hits the free lists.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH] [0/16] POISON: Intro
  2009-04-07 19:13   ` Robin Holt
@ 2009-04-07 19:38     ` Andi Kleen
  -1 siblings, 0 replies; 150+ messages in thread
From: Andi Kleen @ 2009-04-07 19:38 UTC (permalink / raw)
  To: Robin Holt; +Cc: Andi Kleen, Russ Anderson, linux-kernel, linux-mm, x86

On Tue, Apr 07, 2009 at 02:13:00PM -0500, Robin Holt wrote:
> How does this overlap with the bad page quarantine that ia64 uses
> following an MCA?

It's much more comprehensive than what ia64 has, mostly due to
differing requirements. It also doesn't limit itself to user-mapped
anonymous pages.

-Andi

^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH] [13/16] POISON: The high level memory error handler in the VM
  2009-04-07 18:51     ` Johannes Weiner
@ 2009-04-07 19:40       ` Andi Kleen
  -1 siblings, 0 replies; 150+ messages in thread
From: Andi Kleen @ 2009-04-07 19:40 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andi Kleen, hugh, npiggin, riel, lee.schermerhorn, akpm,
	linux-kernel, linux-mm, x86

On Tue, Apr 07, 2009 at 08:51:46PM +0200, Johannes Weiner wrote:
> > +
> > +	if (av == NULL)	/* Not actually mapped anymore */
> > +		goto out;
> > +
> > +	read_lock(&tasklist_lock);
> > +	for_each_process (tsk) {
> > +		if (!tsk->mm)
> > +			continue;
> > +		list_for_each_entry (vma, &av->head, anon_vma_node) {
> > +			if (vma->vm_mm == tsk->mm)
> > +				add_to_kill(tsk, page, vma, to_kill, tkc);
> > +		}
> > +	}
> > +	read_unlock(&tasklist_lock);
> > +out:
> > +	page_unlock_anon_vma(av);
> 
> If !av, this doesn't need an unlock and in fact crashes due to
> dereferencing NULL.

Good point. Fixed. Thanks.
> 
> > +static int poison_page_prepare(struct page *p, unsigned long pfn, int trapno)
> > +{
> > +	if (PagePoison(p)) {
> > +		printk(KERN_ERR
> > +		       "MCE: Error for already poisoned page at %lx\n", pfn);
> > +		return -1;
> > +	}
> > +	SetPagePoison(p);
> 
> TestSetPagePoison()?

It doesn't matter in this case because it doesn't need to be atomic.
The normal reason for TestSet is atomicity requirements. If someone
feels strongly about it I can add it.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH] [8/16] POISON: Add various poison checks in mm/memory.c
  2009-04-07 19:31       ` Andi Kleen
@ 2009-04-07 20:17         ` Johannes Weiner
  -1 siblings, 0 replies; 150+ messages in thread
From: Johannes Weiner @ 2009-04-07 20:17 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-kernel, linux-mm, x86

On Tue, Apr 07, 2009 at 09:31:45PM +0200, Andi Kleen wrote:
> On Tue, Apr 07, 2009 at 09:03:30PM +0200, Johannes Weiner wrote:
> > On Tue, Apr 07, 2009 at 05:10:05PM +0200, Andi Kleen wrote:
> > > 
> > > Bail out early when poisoned pages are found in page fault handling.
> > > Since they are poisoned they should not be mapped freshly
> > > into processes.
> > > 
> > > This is generally handled in the same way as OOM, just a different
> > > error code is returned to the architecture code.
> > > 
> > > Signed-off-by: Andi Kleen <ak@linux.intel.com>
> > > 
> > > ---
> > >  mm/memory.c |    7 +++++++
> > >  1 file changed, 7 insertions(+)
> > > 
> > > Index: linux/mm/memory.c
> > > ===================================================================
> > > --- linux.orig/mm/memory.c	2009-04-07 16:39:39.000000000 +0200
> > > +++ linux/mm/memory.c	2009-04-07 16:39:39.000000000 +0200
> > > @@ -2560,6 +2560,10 @@
> > >  		goto oom;
> > >  	__SetPageUptodate(page);
> > >  
> > > +	/* Kludge for now until we take poisoned pages out of the free lists */
> > > +	if (unlikely(PagePoison(page)))
> > > +		return VM_FAULT_POISON;
> > > +
> > 
> > When memory_failure() hits a page still on the free list
> 
> It won't free it then. Later on it will take it out of the free lists,
> but that code is not written yet.
> 
> > (!page_count()) then the get_page() in memory_failure() will trigger a
> > VM_BUG.  So either this check is unneeded or it should be
> 
> So no bug
> > get_page_unless_zero() in memory_failure()?
> 
> That's not what this is handling.  The issue is that sometimes
> the process can still be freeing it and we need to make sure it
> never hits the free lists.

I think we missed each other here.  I wasn't talking about _why_ you
take that reference -- that is clear.  But I see these two
possibilities:

  a) memory_failure() is called on a page on the free list, the
  get_page() will trigger a bug because the refcount is 0

  b) if that is not possible, the above check is not needed

^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH] [8/16] POISON: Add various poison checks in mm/memory.c
  2009-04-07 20:17         ` Johannes Weiner
@ 2009-04-07 20:24           ` Andi Kleen
  -1 siblings, 0 replies; 150+ messages in thread
From: Andi Kleen @ 2009-04-07 20:24 UTC (permalink / raw)
  To: Johannes Weiner; +Cc: Andi Kleen, linux-kernel, linux-mm, x86

> I think we missed each other here.  I wasn't talking about _why_ you
> take that reference -- that is clear.  But I see these two
> possibilities:
> 
>   a) memory_failure() is called on a page on the free list, the
>   get_page() will trigger a bug because the refcount is 0

Ah, got it now. Sorry for misreading you. That's indeed a problem.
Fixing.

Free pages were something my injector-based test suite didn't cover :/

>   b) if that is not possible, the above check is not needed

There was at least one case where the process could free it anyway,
I think. Or maybe that was something I fixed in a different way.
It's possible this check is not needed, but it's probably safer
to keep it (and it's all in the super slow path).

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH] [8/16] POISON: Add various poison checks in mm/memory.c
  2009-04-07 20:24           ` Andi Kleen
@ 2009-04-07 20:36             ` Johannes Weiner
  -1 siblings, 0 replies; 150+ messages in thread
From: Johannes Weiner @ 2009-04-07 20:36 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-kernel, linux-mm, x86

On Tue, Apr 07, 2009 at 10:24:49PM +0200, Andi Kleen wrote:
> > I think we missed each other here.  I wasn't talking about _why_ you
> > take that reference -- that is clear.  But I see these two
> > possibilities:
> > 
> >   a) memory_failure() is called on a page on the free list, the
> >   get_page() will trigger a bug because the refcount is 0
> 
> Ah got it now. Sorry for misreading you. That's indeed a problem.
> Fixing.
> 
> free pages was something my injector based test suite didn't cover :/

Hm, perhaps walking mem_map and poisoning pages at random? :)

> >   b) if that is not possible, the above check is not needed
> 
> There was at least one case where the process could free it anyways.
> I think. Or maybe that was something I fixed in a different way.
> It's possible this check is not needed, but it's probably safer
> to keep it (and it's all super slow path)

Ok.  I first thought it could be useful to shrink the race window
between allocating the page and installing the pte but the rest of the
poisoning code should be able to cope.

	Hannes

^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH] [2/16] POISON: Add page flag for poisoned pages
  2009-04-07 15:09   ` Andi Kleen
@ 2009-04-07 21:07     ` Christoph Lameter
  -1 siblings, 0 replies; 150+ messages in thread
From: Christoph Lameter @ 2009-04-07 21:07 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-kernel, linux-mm, x86




Acked-by: Christoph Lameter <cl@linux.com>



^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH] [5/16] POISON: Add support for poison swap entries
  2009-04-07 15:10   ` Andi Kleen
@ 2009-04-07 21:11     ` Christoph Lameter
  -1 siblings, 0 replies; 150+ messages in thread
From: Christoph Lameter @ 2009-04-07 21:11 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-kernel, linux-mm, x86


Could you separate the semantic changes to flag checking for migration
out for easier review?



^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH] [10/16] POISON: Use bitmask/action code for try_to_unmap behaviour
  2009-04-07 15:10   ` Andi Kleen
@ 2009-04-07 21:19     ` Christoph Lameter
  -1 siblings, 0 replies; 150+ messages in thread
From: Christoph Lameter @ 2009-04-07 21:19 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Lee.Schermerhorn, npiggin, linux-kernel, linux-mm, x86

On Tue, 7 Apr 2009, Andi Kleen wrote:

> +
> +enum ttu_flags {
> +	TTU_UNMAP = 0,			/* unmap mode */
> +	TTU_MIGRATION = 1,		/* migration mode */
> +	TTU_MUNLOCK = 2,		/* munlock mode */
> +	TTU_ACTION_MASK = 0xff,
> +
> +	TTU_IGNORE_MLOCK = (1 << 8),	/* ignore mlock */


Ignoring MLOCK? This means we are violating POSIX which says that an
MLOCKed page cannot be unmapped from a process? Note that page migration
does this under special pte entries so that the page will never appear to
be unmapped to user space.

How does that work for the poisoning case? We substitute a fresh page?


^ permalink raw reply	[flat|nested] 150+ messages in thread
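The flag layout quoted above splits the value into an action byte and modifier bits OR'ed on top. A minimal sketch of how callers would combine and decode them (the enum values come from the patch; the helper names are my own):

```c
#include <assert.h>

/* The ttu_flags layout from the quoted hunk: the low byte selects one
 * action, the high bits are modifiers OR'ed on top. */
enum ttu_flags {
	TTU_UNMAP	 = 0,		/* unmap mode */
	TTU_MIGRATION	 = 1,		/* migration mode */
	TTU_MUNLOCK	 = 2,		/* munlock mode */
	TTU_ACTION_MASK	 = 0xff,

	TTU_IGNORE_MLOCK = (1 << 8),	/* ignore mlock */
};

/* Extract the action code from a combined flags word. */
static int ttu_action(int flags)
{
	return flags & TTU_ACTION_MASK;
}

/* Test a modifier bit independently of the action. */
static int ttu_ignores_mlock(int flags)
{
	return (flags & TTU_IGNORE_MLOCK) != 0;
}
```

So a caller can request e.g. `TTU_UNMAP | TTU_IGNORE_MLOCK` and the unmap code decodes the action and modifiers separately.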

* Re: [PATCH] [5/16] POISON: Add support for poison swap entries
  2009-04-07 21:11     ` Christoph Lameter
@ 2009-04-07 21:56       ` Andi Kleen
  -1 siblings, 0 replies; 150+ messages in thread
From: Andi Kleen @ 2009-04-07 21:56 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Andi Kleen, linux-kernel, linux-mm, x86

On Tue, Apr 07, 2009 at 05:11:26PM -0400, Christoph Lameter wrote:
> 
> Could you separate the semantic changes to flag checking for migration

You mean to try_to_unmap? 

> out for easier review?

That's already done. The first patch doesn't change any semantics,
just how the flags/actions are checked. Or rather, any semantic
change in there would be a bug.

Only the two later ttu patches add to the semantics.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH] [5/16] POISON: Add support for poison swap entries
  2009-04-07 21:56       ` Andi Kleen
@ 2009-04-07 21:56         ` Christoph Lameter
  -1 siblings, 0 replies; 150+ messages in thread
From: Christoph Lameter @ 2009-04-07 21:56 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-kernel, linux-mm, x86

On Tue, 7 Apr 2009, Andi Kleen wrote:

> On Tue, Apr 07, 2009 at 05:11:26PM -0400, Christoph Lameter wrote:
> >
> > Could you separate the semantic changes to flag checking for migration
>
> You mean to try_to_unmap?

I mean the changes to checking the pte contents for a migratable /
swappable page. Those are significant independently of this patchset and
would be useful to review on their own.


^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH] [10/16] POISON: Use bitmask/action code for try_to_unmap behaviour
  2009-04-07 21:19     ` Christoph Lameter
@ 2009-04-07 21:59       ` Andi Kleen
  -1 siblings, 0 replies; 150+ messages in thread
From: Andi Kleen @ 2009-04-07 21:59 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andi Kleen, Lee.Schermerhorn, npiggin, linux-kernel, linux-mm, x86

On Tue, Apr 07, 2009 at 05:19:19PM -0400, Christoph Lameter wrote:
> On Tue, 7 Apr 2009, Andi Kleen wrote:
> 
> > +
> > +enum ttu_flags {
> > +	TTU_UNMAP = 0,			/* unmap mode */
> > +	TTU_MIGRATION = 1,		/* migration mode */
> > +	TTU_MUNLOCK = 2,		/* munlock mode */
> > +	TTU_ACTION_MASK = 0xff,
> > +
> > +	TTU_IGNORE_MLOCK = (1 << 8),	/* ignore mlock */
> 
> 
> Ignoring MLOCK? This means we are violating POSIX which says that an
> MLOCKed page cannot be unmapped from a process? 

I'm sure you can find sufficiently vague language in the document
to standards-lawyer around that requirement :)

The alternative would be to panic. 

> Note that page migration
> does this under special pte entries so that the page will never appear to
> be unmapped to user space.
> 
> How does that work for the poisoning case? We substitute a fresh page?

It depends on the state of the page. If it was a clean disk-mapped
page, yes (it's just invalidated and can be reloaded). If it's a dirty anon
page the process is normally killed first (with advisory mode on) or only
killed when it hits the corrupted page. The process can also
catch the signal if it chooses to. The late killing works with
a special entry similar to the migration case, but one that results
in a special SIGBUS.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH] [10/16] POISON: Use bitmask/action code for try_to_unmap behaviour
  2009-04-07 21:59       ` Andi Kleen
@ 2009-04-07 22:04         ` Christoph Lameter
  -1 siblings, 0 replies; 150+ messages in thread
From: Christoph Lameter @ 2009-04-07 22:04 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Lee.Schermerhorn, npiggin, linux-kernel, linux-mm, x86

On Tue, 7 Apr 2009, Andi Kleen wrote:

> > Ignoring MLOCK? This means we are violating POSIX which says that an
> > MLOCKed page cannot be unmapped from a process?
>
> I'm sure if you can find sufficiently vague language in the document
> to standards lawyer around that requirement @)
>
> The alternative would be to panic.


If you unmap an MLOCKed page then you may get memory corruption because
e.g. the Infiniband layer is doing DMA to that page.

> > How does that work for the poisoning case? We substitute a fresh page?
>
> It depends on the state of the page. If it was a clean disk mapped
> page yes (it's just invalidated and can be reloaded). If it's a dirty anon
> page the process is normally killed first (with advisory mode on) or only
> killed when it hits the corrupted page. The process can also
> catch the signal if it choses so. The late killing works with
> a special entry similar to the migration case, but that results
> in a special SIGBUS.

I think a process needs to be killed if any MLOCKed page gets corrupted
because the OS cannot keep the POSIX guarantees.


^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH] [5/16] POISON: Add support for poison swap entries
  2009-04-07 21:56         ` Christoph Lameter
@ 2009-04-07 22:25           ` Andi Kleen
  -1 siblings, 0 replies; 150+ messages in thread
From: Andi Kleen @ 2009-04-07 22:25 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Andi Kleen, linux-kernel, linux-mm, x86

On Tue, Apr 07, 2009 at 05:56:28PM -0400, Christoph Lameter wrote:
> On Tue, 7 Apr 2009, Andi Kleen wrote:
> 
> > On Tue, Apr 07, 2009 at 05:11:26PM -0400, Christoph Lameter wrote:
> > >
> > > Could you separate the semantic changes to flag checking for migration
> >
> > You mean to try_to_unmap?
> 
> I mean the changes to checking the pte contents for a migratable /
> swappable page. Those are significant independent from this patchset and
> would be useful to review independently.

Sorry I'm still not quite sure what you're asking for.

Are you asking about the fault path or about try_to_unmap or some
other path?

And why do you want a separate patchset versus merely a separate patch?
(afaik the patches to generic code are already pretty separated)

I don't really change the semantics of the migration or swap code itself
for example. At least not consciously. If I did that would be a bug.

e.g. the changes to try_to_unmap come in two stages:
- add the flags/action code. Everything should still behave the same;
the flags are just passed around differently.
- add a check for an already-poisoned page and insert a poison
swap entry for it.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH] [10/16] POISON: Use bitmask/action code for try_to_unmap behaviour
  2009-04-07 22:04         ` Christoph Lameter
@ 2009-04-07 22:35           ` Andi Kleen
  -1 siblings, 0 replies; 150+ messages in thread
From: Andi Kleen @ 2009-04-07 22:35 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andi Kleen, Lee.Schermerhorn, npiggin, linux-kernel, linux-mm, x86

On Tue, Apr 07, 2009 at 06:04:39PM -0400, Christoph Lameter wrote:
> On Tue, 7 Apr 2009, Andi Kleen wrote:
> 
> > > Ignoring MLOCK? This means we are violating POSIX which says that an
> > > MLOCKed page cannot be unmapped from a process?
> >
> > I'm sure if you can find sufficiently vague language in the document
> > to standards lawyer around that requirement @)
> >
> > The alternative would be to panic.
> 
> 
> If you unmmap a MLOCKed page then you may get memory corruption because
> f.e. the Infiniband layer is doing DMA to that page.

The page is not going away, it's poisoned in hardware and software 
and stays. There is currently no mechanism to unpoison pages without
rebooting.

DMA should actually cause a bus abort on the hardware level, 
at least for RMW.

I currently don't have a cancel mechanism for such kinds of mappings
though. It just does cancel_dirty_page(); IO that is already in flight
cannot be cancelled.

In theory one could add a more forceful IO cancel mechanism using
special driver callbacks, but I'm not sure it's worth it. Normally the 
hardware should abort on hitting poison (although some might do strange things)
and you'll get some more (recoverable) machine checks.

> > > How does that work for the poisoning case? We substitute a fresh page?
> >
> > It depends on the state of the page. If it was a clean disk mapped
> > page yes (it's just invalidated and can be reloaded). If it's a dirty anon
> > page the process is normally killed first (with advisory mode on) or only
> > killed when it hits the corrupted page. The process can also
> > catch the signal if it choses so. The late killing works with
> > a special entry similar to the migration case, but that results
> > in a special SIGBUS.
> 
> I think a process needs to be killed if any MLOCKed page gets corrupted
> because the OS cannot keep the POSIX guarantees.

That's the default behaviour with vm.memory_failure_early_kill = 1.
However, the process can catch the signal if it wants.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH] [3/16] POISON: Handle poisoned pages in page free
  2009-04-07 15:09   ` Andi Kleen
@ 2009-04-07 23:21     ` Minchan Kim
  -1 siblings, 0 replies; 150+ messages in thread
From: Minchan Kim @ 2009-04-07 23:21 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-kernel, linux-mm, x86

Hi, Andi.

On Wed, Apr 8, 2009 at 12:09 AM, Andi Kleen <andi@firstfloor.org> wrote:
>
> Make sure no poisoned pages are put back into the free page
> lists.  This can happen with some races.
>
> This is all slow path in the bad page bits path, so another
> check doesn't really matter.
>
> Signed-off-by: Andi Kleen <ak@linux.intel.com>
>
> ---
>  mm/page_alloc.c |    9 +++++++++
>  1 file changed, 9 insertions(+)
>
> Index: linux/mm/page_alloc.c
> ===================================================================
> --- linux.orig/mm/page_alloc.c  2009-04-07 16:39:26.000000000 +0200
> +++ linux/mm/page_alloc.c       2009-04-07 16:39:39.000000000 +0200
> @@ -228,6 +228,15 @@
>        static unsigned long nr_unshown;
>
>        /*
> +        * Page may have been marked bad before process is freeing it.
> +        * Make sure it is not put back into the free page lists.
> +        */
> +       if (PagePoison(page)) {
> +               /* check more flags here... */

How about adding a warning with some information (e.g. pfn, flags, ...)?


> +               return;
> +       }
> +
> +       /*
>         * Allow a burst of 60 reports, then keep quiet for that minute;
>         * or allow a steady drip of one report per second.
>         */
>



-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 150+ messages in thread
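Minchan's suggestion amounts to printing diagnostics before bailing out. A user-space model of the check from the quoted hunk (the PG_POISON bit position, the `pfn` field and all names here are illustrative, not the kernel's):

```c
#include <assert.h>
#include <stdio.h>

/* Model of the free-path check from the quoted hunk, with the
 * diagnostic Minchan asks for. */
#define PG_POISON (1UL << 5)	/* illustrative bit position */

struct page {
	unsigned long flags;
	unsigned long pfn;
};

static int page_poison(const struct page *page)
{
	return (page->flags & PG_POISON) != 0;
}

/* Returns 1 when the page must be kept off the free lists. */
static int intercept_poisoned_free(const struct page *page)
{
	if (page_poison(page)) {
		fprintf(stderr,
			"bad_page: pfn %lu poisoned (flags %#lx), not freeing\n",
			page->pfn, page->flags);
		return 1;
	}
	return 0;
}
```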

* Re: [PATCH] [2/16] POISON: Add page flag for poisoned pages
  2009-04-07 15:09   ` Andi Kleen
@ 2009-04-08  0:29     ` Russ Anderson
  -1 siblings, 0 replies; 150+ messages in thread
From: Russ Anderson @ 2009-04-08  0:29 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-kernel, linux-mm, x86, rja

On Tue, Apr 07, 2009 at 05:09:58PM +0200, Andi Kleen wrote:
> 
> Poisoned pages need special handling in the VM and shouldn't be touched 
> again. This requires a new page flag. Define it here.
> 
> The page flags wars seem to be over, so it shouldn't be a problem
> to get a new one. I hope.
> 
> Signed-off-by: Andi Kleen <ak@linux.intel.com>
> 
> ---
>  include/linux/page-flags.h |   16 +++++++++++++++-
>  1 file changed, 15 insertions(+), 1 deletion(-)
> 
> Index: linux/include/linux/page-flags.h
> ===================================================================
> --- linux.orig/include/linux/page-flags.h	2009-04-07 16:39:27.000000000 +0200
> +++ linux/include/linux/page-flags.h	2009-04-07 16:39:39.000000000 +0200
> @@ -51,6 +51,9 @@
>   * PG_buddy is set to indicate that the page is free and in the buddy system
>   * (see mm/page_alloc.c).
>   *
> + * PG_poison indicates that a page got corrupted in hardware and contains
> + * data with incorrect ECC bits that triggered a machine check. Accessing is
> + * not safe since it may cause another machine check. Don't touch!
>   */
>  
>  /*
> @@ -104,6 +107,9 @@
>  #ifdef CONFIG_IA64_UNCACHED_ALLOCATOR
>  	PG_uncached,		/* Page has been mapped as uncached */
>  #endif
> +#ifdef CONFIG_MEMORY_FAILURE

Is it necessary to have this under CONFIG_MEMORY_FAILURE?

> +	PG_poison,		/* poisoned page. Don't touch */
> +#endif
>  	__NR_PAGEFLAGS,
>  
>  	/* Filesystems */
> @@ -273,6 +279,14 @@
>  PAGEFLAG_FALSE(Uncached)
>  #endif
>  
> +#ifdef CONFIG_MEMORY_FAILURE
> +PAGEFLAG(Poison, poison)
> +#define __PG_POISON (1UL << PG_poison)
> +#else
> +PAGEFLAG_FALSE(Poison)
> +#define __PG_POISON 0
> +#endif
> +
>  static inline int PageUptodate(struct page *page)
>  {
>  	int ret = test_bit(PG_uptodate, &(page)->flags);
> @@ -403,7 +417,7 @@
>  	 1 << PG_private | 1 << PG_private_2 | \
>  	 1 << PG_buddy	 | 1 << PG_writeback | 1 << PG_reserved | \
>  	 1 << PG_slab	 | 1 << PG_swapcache | 1 << PG_active | \
> -	 __PG_UNEVICTABLE | __PG_MLOCKED)
> +	 __PG_POISON  | __PG_UNEVICTABLE | __PG_MLOCKED)
>  
>  /*
>   * Flags checked when a page is prepped for return by the page allocator.
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

-- 
Russ Anderson, OS RAS/Partitioning Project Lead  
SGI - Silicon Graphics Inc          rja@sgi.com

^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH] [2/16] POISON: Add page flag for poisoned pages
  2009-04-07 15:09   ` Andi Kleen
@ 2009-04-08  5:14     ` Andrew Morton
  -1 siblings, 0 replies; 150+ messages in thread
From: Andrew Morton @ 2009-04-08  5:14 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-kernel, linux-mm, x86

On Tue,  7 Apr 2009 17:09:58 +0200 (CEST) Andi Kleen <andi@firstfloor.org> wrote:

> Poisoned pages need special handling in the VM and shouldn't be touched 
> again. This requires a new page flag. Define it here.

I wish this patchset didn't change/abuse the well-understood meaning of
the word "poison".

> The page flags wars seem to be over, so it shouldn't be a problem
> to get a new one. I hope.

They are?  How did it all get addressed?

^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH] [0/16] POISON: Intro
  2009-04-07 15:09 ` Andi Kleen
@ 2009-04-08  5:15   ` Andrew Morton
  -1 siblings, 0 replies; 150+ messages in thread
From: Andrew Morton @ 2009-04-08  5:15 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-kernel, linux-mm, x86

On Tue,  7 Apr 2009 17:09:56 +0200 (CEST) Andi Kleen <andi@firstfloor.org> wrote:

> Upcoming Intel CPUs have support for recovering from some memory errors. This
> requires the OS to declare a page "poisoned", kill the processes associated
> with it and avoid using it in the future. This patchkit implements
> the necessary infrastructure in the VM.

If the page is clean then we can just toss it and grab a new one from
backing store without killing anyone.

Does the patchset do that?

^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH] [0/16] POISON: Intro
  2009-04-07 15:09 ` Andi Kleen
@ 2009-04-08  5:47   ` Andrew Morton
  -1 siblings, 0 replies; 150+ messages in thread
From: Andrew Morton @ 2009-04-08  5:47 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-kernel, linux-mm, x86

On Tue,  7 Apr 2009 17:09:56 +0200 (CEST) Andi Kleen <andi@firstfloor.org> wrote:

> Upcoming Intel CPUs have support for recovering from some memory errors. This
> requires the OS to declare a page "poisoned", kill the processes associated
> with it and avoid using it in the future. This patchkit implements
> the necessary infrastructure in the VM.

Seems that this feature is crying out for a testing framework (perhaps
it already has one?).  A simplistic approach would be

	echo some-pfn > /proc/bad-pfn-goes-here

A slightly more sophisticated version might do the deed from within a
timer interrupt, just to get a bit more coverage.


^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH] [0/16] POISON: Intro
  2009-04-08  5:15   ` Andrew Morton
@ 2009-04-08  6:15     ` Andi Kleen
  -1 siblings, 0 replies; 150+ messages in thread
From: Andi Kleen @ 2009-04-08  6:15 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Andi Kleen, linux-kernel, linux-mm, x86

On Tue, Apr 07, 2009 at 10:15:42PM -0700, Andrew Morton wrote:
> On Tue,  7 Apr 2009 17:09:56 +0200 (CEST) Andi Kleen <andi@firstfloor.org> wrote:
> 
> > Upcoming Intel CPUs have support for recovering from some memory errors. This
> > requires the OS to declare a page "poisoned", kill the processes associated
> > with it and avoid using it in the future. This patchkit implements
> > the necessary infrastructure in the VM.
> 
> If the page is clean then we can just toss it and grab a new one from
> backing store without killing anyone.
> 
> Does the patchset do that?

Yes. But it only really works for shared mmap; anonymous and private
mappings tend to be nearly always dirty.

You can also disable even the early kill and only request a kill
on access.

It also does some other tricks, like triggering an IO error for a
dirty file page (although I must admit the dirty handling is rather
tricky and I would appreciate very careful review of that part).

A few other known recovery tricks are not implemented yet
(like handling free memory[1]), but will be over time.

-Andi

[1] I didn't consider that one high priority since production
systems with long uptime shouldn't have much free memory.

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH] [0/16] POISON: Intro
  2009-04-08  5:47   ` Andrew Morton
@ 2009-04-08  6:21     ` Andi Kleen
  -1 siblings, 0 replies; 150+ messages in thread
From: Andi Kleen @ 2009-04-08  6:21 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Andi Kleen, linux-kernel, linux-mm, x86

On Tue, Apr 07, 2009 at 10:47:09PM -0700, Andrew Morton wrote:
> On Tue,  7 Apr 2009 17:09:56 +0200 (CEST) Andi Kleen <andi@firstfloor.org> wrote:
> 
> > Upcoming Intel CPUs have support for recovering from some memory errors. This
> > requires the OS to declare a page "poisoned", kill the processes associated
> > with it and avoid using it in the future. This patchkit implements
> > the necessary infrastructure in the VM.
> 
> Seems that this feature is crying out for a testing framework (perhaps
> it already has one?). 

Multiple ones in fact.

One of them is 

git://git.kernel.org/pub/scm/utils/cpu/mce/mce-test.git
(test suite covering various cases)

git://git.kernel.org/pub/scm/utils/cpu/mce/mce-inject.git
(injector using the x86 specific error injection hooks I posted
earlier)

Then I have some tests using the madvise MADV_POISON hook
(which tests the various cases from a process standpoint
and recovers). This is still a little hackish, but if there's
interest I can put it out. It has at least one test case
that is known to hang (nonlinear mappings); still looking
at that.

Long term plan was to put both mce-test above and the
MADV_POISON test into LTP.

And a few random hacks. But coverage is still not 100%

> A simplistic approach would be

Random kill anywhere is hard to test because your system will
die regularly and randomly. mce-test.git does some automated
testing of fatal errors by catching them using kexec, but we haven't
tried that for full recovery.

> 
> 	echo some-pfn > /proc/bad-pfn-goes-here
> 
> A slightly more sophisticated version might do the deed from within a
> timer interrupt, just to get a bit more coverage.

mce-test/inject does it from other CPUs with smp_call_function_single(),
so it's really relatively random. I've considered using NMIs too,
but the high level recovery code synchronizes to work queue context
first anyway, so that doesn't buy us too much.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH] [2/16] POISON: Add page flag for poisoned pages
  2009-04-08  5:14     ` Andrew Morton
@ 2009-04-08  6:24       ` Andi Kleen
  -1 siblings, 0 replies; 150+ messages in thread
From: Andi Kleen @ 2009-04-08  6:24 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Andi Kleen, linux-kernel, linux-mm, x86

On Tue, Apr 07, 2009 at 10:14:21PM -0700, Andrew Morton wrote:
> On Tue,  7 Apr 2009 17:09:58 +0200 (CEST) Andi Kleen <andi@firstfloor.org> wrote:
> 
> > Poisoned pages need special handling in the VM and shouldn't be touched 
> > again. This requires a new page flag. Define it here.
> 
> I wish this patchset didn't change/abuse the well-understood meaning of
> the word "poison".

Sorry, that's the terminology on the hardware side.

If there's much confusion I could rename it HwPoison or somesuch?

> > The page flags wars seem to be over, so it shouldn't be a problem
> > to get a new one. I hope.
> 
> They are?  How did it all get addressed?

Allowing 64-bit to use more bits and using [V]SPARSEMEM to limit the
flags used for zones. I think.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH] [2/16] POISON: Add page flag for poisoned pages
  2009-04-08  0:29     ` Russ Anderson
@ 2009-04-08  6:26       ` Andi Kleen
  -1 siblings, 0 replies; 150+ messages in thread
From: Andi Kleen @ 2009-04-08  6:26 UTC (permalink / raw)
  To: Russ Anderson; +Cc: Andi Kleen, linux-kernel, linux-mm, x86

> > @@ -104,6 +107,9 @@
> >  #ifdef CONFIG_IA64_UNCACHED_ALLOCATOR
> >  	PG_uncached,		/* Page has been mapped as uncached */
> >  #endif
> > +#ifdef CONFIG_MEMORY_FAILURE
> 
> Is it necessary to have this under CONFIG_MEMORY_FAILURE?

That was mainly so that !MEMORY_FAILURE 32-bit NUMA architectures which
might not use sparsemem/vmemmap get a few more zone bits in page flags
to play with. Not sure those really exist, so it might indeed be
redundant, but it seemed safer.


-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH] [3/16] POISON: Handle poisoned pages in page free
  2009-04-07 23:21     ` Minchan Kim
@ 2009-04-08  6:51       ` Andi Kleen
  -1 siblings, 0 replies; 150+ messages in thread
From: Andi Kleen @ 2009-04-08  6:51 UTC (permalink / raw)
  To: Minchan Kim; +Cc: Andi Kleen, linux-kernel, linux-mm, x86

> >
> >        /*
> > +        * Page may have been marked bad before process is freeing it.
> > +        * Make sure it is not put back into the free page lists.
> > +        */
> > +       if (PagePoison(page)) {
> > +               /* check more flags here... */
> 
> How about adding WARNING with some information(ex, pfn, flags..).

The memory_failure() code is already quite chatty. Don't think more
noise is needed currently.

Or are you worrying about the case where a page gets corrupted
by software and suddenly has Poison bits set? (e.g. 0xff everywhere).
That would deserve a printk, but I'm not sure how to reliably test for
that. After all a lot of flag combinations are valid.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH] [2/16] POISON: Add page flag for poisoned pages
  2009-04-08  6:24       ` Andi Kleen
@ 2009-04-08  7:00         ` Andrew Morton
  -1 siblings, 0 replies; 150+ messages in thread
From: Andrew Morton @ 2009-04-08  7:00 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-kernel, linux-mm, x86

On Wed, 8 Apr 2009 08:24:41 +0200 Andi Kleen <andi@firstfloor.org> wrote:

> On Tue, Apr 07, 2009 at 10:14:21PM -0700, Andrew Morton wrote:
> > On Tue,  7 Apr 2009 17:09:58 +0200 (CEST) Andi Kleen <andi@firstfloor.org> wrote:
> > 
> > > Poisoned pages need special handling in the VM and shouldn't be touched 
> > > again. This requires a new page flag. Define it here.
> > 
> > I wish this patchset didn't change/abuse the well-understood meaning of
> > the word "poison".
> 
> Sorry, that's the terminology on the hardware side.
> 
> If there's much confusion I could rename it HwPoison or somesuch?

I understand that'd be a PITA but I suspect it would be best,
long-term.  Having this conflict in core MM is really pretty bad.

> > > The page flags wars seem to be over, so it shouldn't be a problem
> > > to get a new one. I hope.
> > 
> > They are?  How did it all get addressed?
> 
> Allowing 64bit to use more and using [V]SPARSEMAP to limit flags
> use for zones. I think.

Nobody ever seems to be able to work out how many we actually have
left.


^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH] [3/16] POISON: Handle poisoned pages in page free
  2009-04-08  6:51       ` Andi Kleen
@ 2009-04-08  7:39         ` Minchan Kim
  -1 siblings, 0 replies; 150+ messages in thread
From: Minchan Kim @ 2009-04-08  7:39 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-kernel, linux-mm, x86

On Wed, Apr 8, 2009 at 3:51 PM, Andi Kleen <andi@firstfloor.org> wrote:
>> >
>> >        /*
>> > +        * Page may have been marked bad before process is freeing it.
>> > +        * Make sure it is not put back into the free page lists.
>> > +        */
>> > +       if (PagePoison(page)) {
>> > +               /* check more flags here... */
>>
>> How about adding WARNING with some information(ex, pfn, flags..).
>
> The memory_failure() code is already quite chatty. Don't think more
> noise is needed currently.

Sure.

> Or are you worrying about the case where a page gets corrupted
> by software and suddenly has Poison bits set? (e.g. 0xff everywhere).
> That would deserve a printk, but I'm not sure how to reliably test for
> that. After all a lot of flag combinations are valid.

I misunderstood your code.
That's because you added the check in bad_page().

As you commented, your intention was to prevent a bad page from
returning to the buddy allocator. Is that right?
If so, how about adding the prevention code to free_pages_check()?
As it stands, bad_page() is for reporting why a page is bad;
I don't like an emergency exit in bad_page().

> -Andi
>
> --
> ak@linux.intel.com -- Speaking for myself only.
>



-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH] [2/16] POISON: Add page flag for poisoned pages
  2009-04-08  7:00         ` Andrew Morton
@ 2009-04-08  9:38           ` Andi Kleen
  -1 siblings, 0 replies; 150+ messages in thread
From: Andi Kleen @ 2009-04-08  9:38 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Andi Kleen, linux-kernel, linux-mm, x86

On Wed, Apr 08, 2009 at 12:00:18AM -0700, Andrew Morton wrote:
> On Wed, 8 Apr 2009 08:24:41 +0200 Andi Kleen <andi@firstfloor.org> wrote:
> 
> > On Tue, Apr 07, 2009 at 10:14:21PM -0700, Andrew Morton wrote:
> > > On Tue,  7 Apr 2009 17:09:58 +0200 (CEST) Andi Kleen <andi@firstfloor.org> wrote:
> > > 
> > > > Poisoned pages need special handling in the VM and shouldn't be touched 
> > > > again. This requires a new page flag. Define it here.
> > > 
> > > I wish this patchset didn't change/abuse the well-understood meaning of
> > > the word "poison".
> > 
> > Sorry, that's the terminology on the hardware side.
> > 
> > If there's much confusion I could rename it HwPoison or somesuch?
> 
> I understand that'd be a PITA but I suspect it would be best,
> long-term.  Having this conflict in core MM is really pretty bad.

Ok. I'll rename it to HWPoison().

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH] [3/16] POISON: Handle poisoned pages in page free
  2009-04-08  7:39         ` Minchan Kim
@ 2009-04-08  9:41           ` Andi Kleen
  -1 siblings, 0 replies; 150+ messages in thread
From: Andi Kleen @ 2009-04-08  9:41 UTC (permalink / raw)
  To: Minchan Kim; +Cc: Andi Kleen, linux-kernel, linux-mm, x86

On Wed, Apr 08, 2009 at 04:39:17PM +0900, Minchan Kim wrote:
> On Wed, Apr 8, 2009 at 3:51 PM, Andi Kleen <andi@firstfloor.org> wrote:
> >> >
> >> >        /*
> >> > +        * Page may have been marked bad before process is freeing it.
> >> > +        * Make sure it is not put back into the free page lists.
> >> > +        */
> >> > +       if (PagePoison(page)) {
> >> > +               /* check more flags here... */
> >>
> >> How about adding WARNING with some information(ex, pfn, flags..).
> >
> > The memory_failure() code is already quite chatty. Don't think more
> > noise is needed currently.
> 
> Sure.
> 
> > Or are you worrying about the case where a page gets corrupted
> > by software and suddenly has Poison bits set? (e.g. 0xff everywhere).
> > That would deserve a printk, but I'm not sure how to reliably test for
> > that. After all a lot of flag combinations are valid.
> 
> I misunderstood your code.
> That's because you add the code in bad_page.
> 
> As you commented, your intention was to prevent bad page from returning buddy.
> Is right ?

Yes. Well, actually it should not happen anymore. Perhaps I should
make it a BUG().

> If it is right, how about adding prevention code to free_pages_check ?
> Now, bad_page is for showing the information that why it is bad page
> I don't like emergency exit in bad_page.

There's already one in there, so I just reused that one. It was a convenient
way to keep things out of the fast path.

-Andi

ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH] [3/16] POISON: Handle poisoned pages in page free
  2009-04-08  9:41           ` Andi Kleen
@ 2009-04-08 10:05             ` Minchan Kim
  -1 siblings, 0 replies; 150+ messages in thread
From: Minchan Kim @ 2009-04-08 10:05 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-kernel, linux-mm, x86

On Wed, Apr 8, 2009 at 6:41 PM, Andi Kleen <andi@firstfloor.org> wrote:
> On Wed, Apr 08, 2009 at 04:39:17PM +0900, Minchan Kim wrote:
>> On Wed, Apr 8, 2009 at 3:51 PM, Andi Kleen <andi@firstfloor.org> wrote:
>> >> >
>> >> >        /*
>> >> > +        * Page may have been marked bad before process is freeing it.
>> >> > +        * Make sure it is not put back into the free page lists.
>> >> > +        */
>> >> > +       if (PagePoison(page)) {
>> >> > +               /* check more flags here... */
>> >>
>> >> How about adding WARNING with some information(ex, pfn, flags..).
>> >
>> > The memory_failure() code is already quite chatty. Don't think more
>> > noise is needed currently.
>>
>> Sure.
>>
>> > Or are you worrying about the case where a page gets corrupted
>> > by software and suddenly has Poison bits set? (e.g. 0xff everywhere).
>> > That would deserve a printk, but I'm not sure how to reliably test for
>> > that. After all a lot of flag combinations are valid.
>>
>> I misunderstood your code.
>> That's because you add the code in bad_page.
>>
>> As you commented, your intention was to prevent bad page from returning buddy.
>> Is right ?
>
> Yes. Well actually it should not happen anymore. Perhaps I should
> make it a BUG()
>
>> If it is right, how about adding prevention code to free_pages_check ?
>> Now, bad_page is for showing the information that why it is bad page
>> I don't like emergency exit in bad_page.
>
> There's already one in there, so i just reused that one. It was a convenient
> way to keep things out of the fast path


Sorry for my vague previous comment.
I meant that bad_page's role now is just to print why the page is bad.
Anyone can use bad_page to show that information.
If someone begins to add a side branch in bad_page, other people might
add their own exception cases to it as well.

So I think it would be better to check PagePoison in free_pages_check,
not in bad_page. :)
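A rough userspace model of the suggested structure (the flag values, struct, and helper names below are all invented for illustration; the real kernel's page flags and free_pages_check() look quite different):

```c
#include <assert.h>

/* Toy model only: PG_* bits and toy_page are invented stand-ins. */
enum { PG_locked = 1 << 0, PG_dirty = 1 << 1, PG_poison = 1 << 2 };

struct toy_page { unsigned long flags; };

/* Returns nonzero when the page must NOT go back to the free lists.
 * The poison test sits in the check function itself, so bad_page can
 * stay a pure reporting helper with no emergency exit. */
static int toy_free_pages_check(struct toy_page *page)
{
    if (page->flags & PG_poison) {
        /* Leak the page on purpose: report it, never free it. */
        return 1;
    }
    /* ... the usual bad-flag checks would follow here ... */
    return 0;
}
```

The point of the sketch is just the placement: the caller that frees pages consults the check, and the reporting path never has to special-case poison.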

> -Andi
>
> ak@linux.intel.com -- Speaking for myself only.
>



-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH] [13/16] POISON: The high level memory error handler in the VM
  2009-04-07 15:10   ` Andi Kleen
@ 2009-04-08 17:03     ` Chris Mason
  -1 siblings, 0 replies; 150+ messages in thread
From: Chris Mason @ 2009-04-08 17:03 UTC (permalink / raw)
  To: Andi Kleen
  Cc: hugh, npiggin, riel, lee.schermerhorn, akpm, linux-kernel, linux-mm, x86

On Tue, 2009-04-07 at 17:10 +0200, Andi Kleen wrote:
> This patch adds the high level memory handler that poisons pages. 
> It is portable code and lives in mm/memory-failure.c

I think this is an important feature, thanks for doing all this work
Andi.

> Index: linux/mm/memory-failure.c
> ===================================================================
> --- /dev/null	1970-01-01 00:00:00.000000000 +0000
> +++ linux/mm/memory-failure.c	2009-04-07 16:39:39.000000000 +0200
> +
> +/*
> + * Clean (or cleaned) page cache page.
> + */
> +static int me_pagecache_clean(struct page *p)
> +{
> +	struct address_space *mapping;
> +
> +	if (PagePrivate(p))
> +		do_invalidatepage(p, 0);
> +	mapping = page_mapping(p);
> +	if (mapping) {
> +		if (!remove_mapping(mapping, p))
> +			return FAILED;
> +	}
> +	return RECOVERED;
> +}
> +
> +/*
> + * Dirty cache page page
> + * Issues: when the error hit a hole page the error is not properly
> + * propagated.
> + */
> +static int me_pagecache_dirty(struct page *p)
> +{
> +	struct address_space *mapping = page_mapping(p);
> +
> +	SetPageError(p);
> +	/* TBD: print more information about the file. */
> +	printk(KERN_ERR "MCE: Hardware memory corruption on dirty file page: write error\n");
> +	if (mapping) {
> +		/* CHECKME: does that report the error in all cases? */
> +		mapping_set_error(mapping, EIO);
> +	}
> +	if (PagePrivate(p)) {
> +		if (try_to_release_page(p, GFP_KERNEL)) {

So, try_to_release_page returns 1 when it works.  I know this only
because I have to read it every time to remember ;)

try_to_release_page is also very likely to fail if the page is dirty or
under writeback.  At the end of the day, we'll probably need a call into
the FS to tell it a given page isn't coming back, and to clean it at all
cost.

invalidatepage is close, but ext3/reiserfs will keep the buffer heads
and let the page->mapping go to null in an ugly data=ordered corner
case.  The buffer heads pin the page and it won't be freed until the IO
is done.
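As a reminder of the convention described above, nonzero from try_to_release_page() means success, so the caller's success path is the nonzero branch. A toy sketch (the stub below merely stands in for the real function, which takes a page and a gfp mask and may fail for filesystem-internal reasons such as data=ordered buffer heads):

```c
#include <assert.h>
#include <string.h>

/* Stand-in for the real try_to_release_page(): returns 1 when the
 * private data (buffer heads etc.) was dropped, 0 when the page is
 * still pinned. The argument models "page has busy buffers". */
static int toy_try_to_release_page(int buffers_busy)
{
    return buffers_busy ? 0 : 1;
}

/* Caller side: note that the *success* path is the nonzero branch. */
static const char *toy_handle_dirty_page(int buffers_busy)
{
    if (toy_try_to_release_page(buffers_busy))
        return "released";  /* private state gone, safe to drop */
    return "failed";        /* still pinned, e.g. data=ordered */
}
```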

-chris



^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH] [0/16] POISON: Intro
  2009-04-08  6:15     ` Andi Kleen
@ 2009-04-08 17:29       ` Roland Dreier
  -1 siblings, 0 replies; 150+ messages in thread
From: Roland Dreier @ 2009-04-08 17:29 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Andrew Morton, linux-kernel, linux-mm, x86

 > [1] I didn't consider that one high priority since production
 > systems with long uptime shouldn't have much free memory.

Surely there are windows after a big job exits where lots of memory
might be free.  Not sure how big those windows are in practice but it
does seem if a process using 128GB exits then it might take a while
before that memory all gets used again.

 - R.

^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH] [0/16] POISON: Intro
  2009-04-08 17:29       ` Roland Dreier
@ 2009-04-09  7:22         ` Andi Kleen
  -1 siblings, 0 replies; 150+ messages in thread
From: Andi Kleen @ 2009-04-09  7:22 UTC (permalink / raw)
  To: Roland Dreier; +Cc: Andi Kleen, Andrew Morton, linux-kernel, linux-mm, x86

On Wed, Apr 08, 2009 at 10:29:34AM -0700, Roland Dreier wrote:
>  > [1] I didn't consider that one high priority since production
>  > systems with long uptime shouldn't have much free memory.
> 
> Surely there are windows after a big job exits where lots of memory
> might be free.  Not sure how big those windows are in practice but it
> does seem if a process using 128GB exits then it might take a while
> before that memory all gets used again.

Yes, it's definitely something to be fixed at some point.
Basically it just needs a new entry point into the page_alloc
buddy allocator to unfree a page. The trickier part
is actually finding a good injector design for testing it;
there's no natural race-free way to get hold of a free page.
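The "unfree" entry point would essentially unlink the page from a free list before anyone can allocate it. A simplified userspace model of that idea (the list layout and names are invented; the real buddy allocator keeps per-order free lists under the zone lock):

```c
#include <assert.h>
#include <stddef.h>

struct free_page {
    unsigned long pfn;
    struct free_page *next;
};

/* Unlink the page with the given pfn from a singly linked free list,
 * as a poison handler would have to do before the page is reused.
 * Returns 1 if the page was found and removed, 0 otherwise. */
static int toy_take_page_off_freelist(struct free_page **head,
                                      unsigned long pfn)
{
    for (struct free_page **pp = head; *pp; pp = &(*pp)->next) {
        if ((*pp)->pfn == pfn) {
            *pp = (*pp)->next;  /* splice the page out of the list */
            return 1;
        }
    }
    return 0;  /* already allocated: handle via the normal paths */
}
```

In the kernel the lookup would of course be by struct page rather than a list walk, and would have to happen under the zone lock to stay race-free.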

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH] [13/16] POISON: The high level memory error handler in the VM
  2009-04-08 17:03     ` Chris Mason
@ 2009-04-09  7:29       ` Andi Kleen
  -1 siblings, 0 replies; 150+ messages in thread
From: Andi Kleen @ 2009-04-09  7:29 UTC (permalink / raw)
  To: Chris Mason
  Cc: Andi Kleen, hugh, npiggin, riel, lee.schermerhorn, akpm,
	linux-kernel, linux-mm, x86

On Wed, Apr 08, 2009 at 01:03:59PM -0400, Chris Mason wrote:

Hi Chris,

Thanks for the review.

> So, try_to_release_page returns 1 when it works.  I know this only
> because I have to read it every time to remember ;)

Argh. I think I read that, but then somehow the code still came out
wrong and the tester didn't catch the failure.

> 
> try_to_release_page is also very likely to fail if the page is dirty or
> under writeback.  At the end of the day, we'll probably need a call into

Would you recommend a retry step? If it fails, call cancel_dirty_page() and then
retry?

Ideally I would like to stop the write back before it starts (it will
result in a hardware bus abort or even a machine check if the CPU
touches the data), but I realize it's difficult for anything with
private page state. I just cancel dirty for !Private at least.
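The policy described here, cancel the dirty bit for pages without private filesystem state so the corrupt data never reaches disk, could be modeled roughly like this (purely illustrative; the real handler works on struct page under the page lock):

```c
#include <assert.h>

enum { TOY_PRIVATE = 1 << 0, TOY_DIRTY = 1 << 1 };

/* Pages without private filesystem state just get their dirty bit
 * cancelled so the corrupt data is never written back; pages with
 * private state are left for the harder filesystem-specific path.
 * Returns 1 when write-back of the corrupt data was averted. */
static int toy_cancel_dirty(unsigned *flags)
{
    if (*flags & TOY_PRIVATE)
        return 0;            /* needs filesystem cooperation */
    *flags &= ~TOY_DIRTY;    /* corrupt data will not be written */
    return 1;
}
```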

> the FS to tell it a given page isn't coming back, and to clean it at all
> cost.
> 
> invalidatepage is close, but ext3/reiserfs will keep the buffer heads
> and let the page->mapping go to null in an ugly data=ordered corner
> case.  The buffer heads pin the page and it won't be freed until the IO
> is done.

invalidate_mapping_pages() ? 

I had this in an earlier version, but took it out because it seemed
problematic to rely on a specific inode. Should i reconsider it?


-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH] [13/16] POISON: The high level memory error handler in the VM II
  2009-04-09  7:29       ` Andi Kleen
@ 2009-04-09  7:58         ` Andi Kleen
  -1 siblings, 0 replies; 150+ messages in thread
From: Andi Kleen @ 2009-04-09  7:58 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Chris Mason, hugh, npiggin, riel, lee.schermerhorn, akpm,
	linux-kernel, linux-mm, x86


I double-checked the try_to_release_page logic. My assumption was that the
writeback case could never trigger, because during writeback the page
should be locked, so it's excluded by the earlier lock_page_nosync().

Is that a correct assumption?

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH] [13/16] POISON: The high level memory error handler in the VM II
  2009-04-09  7:58         ` Andi Kleen
@ 2009-04-09 13:30           ` Chris Mason
  -1 siblings, 0 replies; 150+ messages in thread
From: Chris Mason @ 2009-04-09 13:30 UTC (permalink / raw)
  To: Andi Kleen
  Cc: hugh, npiggin, riel, lee.schermerhorn, akpm, linux-kernel, linux-mm, x86

On Thu, 2009-04-09 at 09:58 +0200, Andi Kleen wrote:
> Double checked the try_to_release_page logic. My assumption was that the 
> writeback case could never trigger, because during write back the page
> should be locked and so it's excluded with the earlier lock_page_nosync().
> 
> Is that a correct assumption?

Yes, the page won't become writeback when you're holding the page lock.
But the FS usually treats try_to_releasepage as a polite request.
It might fail internally for a bunch of reasons.

To make things even more fun, the page won't become writeback magically,
but ext3 and reiser maintain lists of buffer heads for data=ordered, and
they do the data=ordered IO on the buffer heads directly.  writepage is
never called and the page lock is never taken, but the buffer heads go
to disk.  I don't think any of the other filesystems do it this way.

At least for Ext3 (and reiser3), try_to_releasepage is required to fail
for some data=ordered corner cases, and the only way it'll end up
passing is if you commit the transaction (which writes the buffer_head)
and try again.  Even invalidatepage will just end up setting
page->mapping to null but leaving the page around for ext3 to finish
processing.

If we really want the page gone, we'll have to tell the FS
drop-this-or-else... sorry, it's some ugly stuff.

The good news is, it is pretty rare.  I wouldn't hold up the whole patch
set just for this problem.  We could document the future fun required
and fix the return value check and concentrate on something other than
this ugly corner ;)

-chris



^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH] [13/16] POISON: The high level memory error handler in the VM II
  2009-04-09 13:30           ` Chris Mason
@ 2009-04-09 14:02             ` Andi Kleen
  -1 siblings, 0 replies; 150+ messages in thread
From: Andi Kleen @ 2009-04-09 14:02 UTC (permalink / raw)
  To: Chris Mason
  Cc: Andi Kleen, hugh, npiggin, riel, lee.schermerhorn, akpm,
	linux-kernel, linux-mm, x86

On Thu, Apr 09, 2009 at 09:30:29AM -0400, Chris Mason wrote:
> > Is that a correct assumption?
> 
> Yes, the page won't become writeback when you're holding the page lock.
> But, the FS usually thinks of try_to_releasepage as a polite request.
> It might fail internally for a bunch of reasons.
> 
> To make things even more fun, the page won't become writeback magically,
> but ext3 and reiser maintain lists of buffer heads for data=ordered, and
> they do the data=ordered IO on the buffer heads directly.  writepage is
> never called and the page lock is never taken, but the buffer heads go
> to disk.  I don't think any of the other filesystems do it this way.

Ok, so do you think my code handles this correctly?

> If we really want the page gone, we'll have to tell the FS
> drop-this-or-else....sorry, its some ugly stuff.

I would like to give a very strong hint at least. If it fails
we can still ignore it, but it will likely have negative consequences later.

> 
> The good news is, it is pretty rare.  I wouldn't hold up the whole patch

You mean pages with the Private bit are rare? Are you suggesting to just
ignore those? How common is it to have Private pages which are not
locked by someone else?

I keep thinking about doing some instrumentation to figure out
how common the various page types are under different loads, but I haven't
written that bit so far.

> set just for this problem.  We could document the future fun required
> and fix the return value check 

I fixed the return value check. Thanks.

> and concentrate on something other than
> this ugly corner ;)

Any suggestions welcome.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH] [13/16] POISON: The high level memory error handler in the VM II
  2009-04-09 14:02             ` Andi Kleen
@ 2009-04-09 14:37               ` Chris Mason
  -1 siblings, 0 replies; 150+ messages in thread
From: Chris Mason @ 2009-04-09 14:37 UTC (permalink / raw)
  To: Andi Kleen
  Cc: hugh, npiggin, riel, lee.schermerhorn, akpm, linux-kernel, linux-mm, x86

On Thu, 2009-04-09 at 16:02 +0200, Andi Kleen wrote:
> On Thu, Apr 09, 2009 at 09:30:29AM -0400, Chris Mason wrote:
> > > Is that a correct assumption?
> > 
> > Yes, the page won't become writeback when you're holding the page lock.
> > But, the FS usually thinks of try_to_releasepage as a polite request.
> > It might fail internally for a bunch of reasons.
> > 
> > To make things even more fun, the page won't become writeback magically,
> > but ext3 and reiser maintain lists of buffer heads for data=ordered, and
> > they do the data=ordered IO on the buffer heads directly.  writepage is
> > never called and the page lock is never taken, but the buffer heads go
> > to disk.  I don't think any of the other filesystems do it this way.
> 
> Ok, so do you think my code handles this correctly?

Even though try_to_releasepage only checks page_writeback(), the lower
filesystems all bail on dirty pages or dirty buffers (see the checks
done by try_to_free_buffers).

It looks like the only way we have to clean a page and all the buffers
in it is the invalidatepage call.  But that doesn't return success or
failure, so maybe invalidatepage followed by releasepage?

I'll have to read harder next week; the FS invalidatepage may expect
truncate to be the only caller.
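To make the invalidate-then-release ordering concrete, here is a minimal userspace model of the logic being discussed (purely illustrative: the `toy_*` names are made up, and it deliberately mirrors only the behavior described above, i.e. release bails on dirty/writeback state, invalidate cleans unconditionally and reports no success or failure):

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of a page with buffer state; illustrative only,
 * not the kernel's struct page. */
struct toy_page {
	bool dirty;      /* page or buffer dirty */
	bool writeback;  /* under writeback */
	bool private_;   /* buffer heads attached (PG_private) */
};

/* Models try_to_releasepage(): bails on writeback or dirty buffers. */
static bool toy_release(struct toy_page *p)
{
	if (p->writeback || p->dirty)
		return false;
	p->private_ = false;	/* buffers freed */
	return true;
}

/* Models invalidatepage(): forcibly cleans buffers, returns nothing,
 * mirroring that the real hook reports no success/failure. */
static void toy_invalidate(struct toy_page *p)
{
	p->dirty = false;
}

/* The sequence proposed above: invalidate first, then try to release,
 * using the release return value as the only success indication. */
static bool invalidate_then_release(struct toy_page *p)
{
	toy_invalidate(p);
	return toy_release(p);
}
```

In this model a dirty page fails a plain release but succeeds after the invalidate step, which is exactly what the combined call is meant to buy; whether the real invalidatepage is safe with callers other than truncate is the open question above.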

> 
> > If we really want the page gone, we'll have to tell the FS
> > drop-this-or-else....sorry, its some ugly stuff.
> 
> I would like to give a very strong hint at least. If it fails
> we can still ignore it, but it will likely have negative consequences later.
> 

Nod.

> > 
> > The good news is, it is pretty rare.  I wouldn't hold up the whole patch
> 
> You mean pages with Private bit are rare? Are you suggesting to just
> ignore those? How common is it to have Private pages which are not
> locked by someone else?
> 

PagePrivate is very common.  try_to_releasepage failing on a clean page
without the writeback bit set and without dirty/locked buffers will be
pretty rare.

-chris



^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH] [13/16] POISON: The high level memory error handler in the VM II
  2009-04-09 14:37               ` Chris Mason
@ 2009-04-09 14:57                 ` Andi Kleen
  -1 siblings, 0 replies; 150+ messages in thread
From: Andi Kleen @ 2009-04-09 14:57 UTC (permalink / raw)
  To: Chris Mason
  Cc: Andi Kleen, hugh, npiggin, riel, lee.schermerhorn, akpm,
	linux-kernel, linux-mm, x86

> Even though try_to_releasepage only checks page_writeback() the lower
> filesystems all bail on dirty pages or dirty buffers (see the checks
> done by try_to_free_buffers).
> 
> It looks like the only way we have to clean a page and all the buffers
> in it is the invalidatepage call.  But that doesn't return success or
> failure, so maybe invalidatepage followed by releasepage?

Ok. I'll poke at it more.

> 
> I'll have to read harder next week, the FS invalidatepage may expect
> truncate to be the only caller.

I have to be careful with locks; taking another lock would deadlock.
I could drop the page lock temporarily, but that would risk someone
else coming in unexpectedly.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH] [0/16] POISON: Intro
  2009-04-08  5:47   ` Andrew Morton
@ 2009-04-13 13:18     ` Wu Fengguang
  -1 siblings, 0 replies; 150+ messages in thread
From: Wu Fengguang @ 2009-04-13 13:18 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Andi Kleen, linux-kernel, linux-mm, x86

On Tue, Apr 07, 2009 at 10:47:09PM -0700, Andrew Morton wrote:
> On Tue,  7 Apr 2009 17:09:56 +0200 (CEST) Andi Kleen <andi@firstfloor.org> wrote:
> 
> > Upcoming Intel CPUs have support for recovering from some memory errors. This
> > requires the OS to declare a page "poisoned", kill the processes associated
> > with it and avoid using it in the future. This patchkit implements
> > the necessary infrastructure in the VM.
> 
> Seems that this feature is crying out for a testing framework (perhaps
> it already has one?).  A simplistic approach would be
> 
> 	echo some-pfn > /proc/bad-pfn-goes-here

How about reusing the /proc/kpageflags interface, i.e. making it writable?

It may sound crazy and way too _hacky_, but it is possible to
attach actions to the state transition of some page flags ;)

PG_poison      0 => 1: call memory_failure()

PG_active      1 => 0: move page into inactive lru
PG_unevictable 1 => 0: move page out of unevictable lru
PG_swapcache   1 => 0: remove page from swap cache
PG_lru         1 => 0: reclaim page
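A sketch of the transition table this implies (a userspace model, not kernel code; the `toy_*`/`do_*` names are invented, and the handlers just record which action would fire where the real hook would call memory_failure() and friends):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Flags mirror the list above. */
enum toy_flag { TOY_POISON, TOY_ACTIVE, TOY_UNEVICTABLE, TOY_SWAPCACHE, TOY_LRU };

static int last_action = -1;	/* records which handler fired */

static void do_memory_failure(void)  { last_action = TOY_POISON; }
static void do_deactivate(void)      { last_action = TOY_ACTIVE; }
static void do_make_evictable(void)  { last_action = TOY_UNEVICTABLE; }
static void do_drop_swapcache(void)  { last_action = TOY_SWAPCACHE; }
static void do_reclaim(void)         { last_action = TOY_LRU; }

struct transition {
	enum toy_flag flag;
	bool from, to;
	void (*action)(void);
};

/* One row per transition in the list above. */
static const struct transition table[] = {
	{ TOY_POISON,      false, true,  do_memory_failure },
	{ TOY_ACTIVE,      true,  false, do_deactivate },
	{ TOY_UNEVICTABLE, true,  false, do_make_evictable },
	{ TOY_SWAPCACHE,   true,  false, do_drop_swapcache },
	{ TOY_LRU,         true,  false, do_reclaim },
};

/* Called when a write flips a flag; returns true if an action ran. */
static bool apply_transition(enum toy_flag flag, bool from, bool to)
{
	for (size_t i = 0; i < sizeof(table) / sizeof(table[0]); i++) {
		if (table[i].flag == flag && table[i].from == from &&
		    table[i].to == to) {
			table[i].action();
			return true;
		}
	}
	return false;	/* transition has no attached action */
}
```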

Thanks,
Fengguang

> A slightly more sophisticated version might do the deed from within a
> timer interrupt, just to get a bit more coverage.
> 

^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH] [13/16] POISON: The high level memory error handler in the VM II
  2009-04-09 14:37               ` Chris Mason
@ 2009-04-29  8:16                 ` Wu Fengguang
  -1 siblings, 0 replies; 150+ messages in thread
From: Wu Fengguang @ 2009-04-29  8:16 UTC (permalink / raw)
  To: Chris Mason
  Cc: Andi Kleen, hugh, npiggin, riel, lee.schermerhorn, akpm,
	linux-kernel, linux-mm, x86

On Thu, Apr 09, 2009 at 10:37:39AM -0400, Chris Mason wrote:
> On Thu, 2009-04-09 at 16:02 +0200, Andi Kleen wrote:
> > On Thu, Apr 09, 2009 at 09:30:29AM -0400, Chris Mason wrote:
> > > > Is that a correct assumption?
> > > 
> > > Yes, the page won't become writeback when you're holding the page lock.
> > > But, the FS usually thinks of try_to_releasepage as a polite request.
> > > It might fail internally for a bunch of reasons.
> > > 
> > > To make things even more fun, the page won't become writeback magically,
> > > but ext3 and reiser maintain lists of buffer heads for data=ordered, and
> > > they do the data=ordered IO on the buffer heads directly.  writepage is
> > > never called and the page lock is never taken, but the buffer heads go
> > > to disk.  I don't think any of the other filesystems do it this way.
> > 
> > Ok, so do you think my code handles this correctly?
> 
> Even though try_to_releasepage only checks page_writeback() the lower
> filesystems all bail on dirty pages or dirty buffers (see the checks
> done by try_to_free_buffers).
> 
> It looks like the only way we have to clean a page and all the buffers
> in it is the invalidatepage call.  But that doesn't return success or
> failure, so maybe invalidatepage followed by releasepage?
> 
> I'll have to read harder next week, the FS invalidatepage may expect
> truncate to be the only caller.

If direct de-dirtying is hard for some pages, how about just ignoring
them? There are the PG_writeback pages anyway. We can inject code to
intercept them at the last stage of IO request dispatching.

Some perceivable problems and solutions are
1) the intercepting overheads could be costly => inject code at runtime.
2) there are cases that the dirty page could be copied for IO:
   2.1) jbd2 has two copy-out cases => should be rare. just ignore them?
     2.1.1) do_get_write_access(): buffer sits in two active commits
     2.1.2) jbd2_journal_write_metadata_buffer(): buffer happens to start
            with JBD2_MAGIC_NUMBER
   2.2) btrfs has to read the page for compression/encryption
     Chris: is btrfs_zlib_compress_pages() a good place for detecting
     poisoned pages? Or is it necessary at all for btrfs? (i.e. it's
     already relatively easy to de-dirty btrfs pages.)
   2.3) maybe more cases...
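The interception at dispatch time could be modeled like this (a userspace sketch under my own assumptions, not a kernel API; the point is only that a poisoned page in a request is skipped rather than read for IO):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model of one page attached to an IO request. */
struct toy_req_page {
	bool poisoned;	/* would be the page poison bit */
	bool submitted;	/* set once the page goes to disk */
};

/* Last stage of dispatching a request: skip any poisoned page so its
 * corrupted contents are never consumed.  Returns the number of pages
 * actually submitted. */
static size_t dispatch_request(struct toy_req_page *pages, size_t n)
{
	size_t submitted = 0;

	for (size_t i = 0; i < n; i++) {
		if (pages[i].poisoned)
			continue;	/* intercept: never read a poisoned page */
		pages[i].submitted = true;
		submitted++;
	}
	return submitted;
}
```

The copy-out cases in 2.1/2.2 above are exactly the places where a page is read *before* this final stage, which is why they would each need their own check.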

> > 
> > > If we really want the page gone, we'll have to tell the FS
> > > drop-this-or-else....sorry, its some ugly stuff.
> > 
> > I would like to give a very strong hint at least. If it fails
> > we can still ignore it, but it will likely have negative consequences later.
> > 
> 
> Nod.
> 
> > > 
> > > The good news is, it is pretty rare.  I wouldn't hold up the whole patch
> > 
> > You mean pages with Private bit are rare? Are you suggesting to just
> > ignore those? How common is it to have Private pages which are not
> > locked by someone else?
> > 
> 
> PagePrivate is very common.  try_to_releasepage failing on a clean page
> without the writeback bit set and without dirty/locked buffers will be
> pretty rare.

Yup. btrfs seems to tag most (if not all) dirty pages with PG_private,
while ext4 doesn't.

Thanks,
Fengguang


^ permalink raw reply	[flat|nested] 150+ messages in thread

* btrfs BUG on creating huge sparse file
  2009-04-29  8:16                 ` Wu Fengguang
@ 2009-04-29  8:21                   ` Wu Fengguang
  -1 siblings, 0 replies; 150+ messages in thread
From: Wu Fengguang @ 2009-04-29  8:21 UTC (permalink / raw)
  To: Chris Mason
  Cc: Andi Kleen, hugh, npiggin, riel, lee.schermerhorn, akpm,
	linux-kernel, linux-mm, x86, linux-btrfs

On Wed, Apr 29, 2009 at 04:16:16PM +0800, Wu Fengguang wrote:
> On Thu, Apr 09, 2009 at 10:37:39AM -0400, Chris Mason wrote:
[snip]
> > PagePrivate is very common.  try_to_releasepage failing on a clean page
> > without the writeback bit set and without dirty/locked buffers will be
> > pretty rare.
> 
> Yup. btrfs seems to tag most(if not all) dirty pages with PG_private.
> While ext4 won't.

Chris, I ran into a btrfs BUG() when doing

        dd if=/dev/zero of=/b/sparse bs=1k count=1 seek=104857512345

The half created sparse file is

        -rw-r--r-- 1 root root 98T 2009-04-29 14:54 /b/sparse
        Or
        -rw-r--r-- 1 root root 107374092641280 2009-04-29 14:54 /b/sparse

Below is the kernel messages. I can test patches you throw at me :-)

Thanks,
Fengguang

[ 1067.530868] btrfs allocation failed flags 1, wanted 4096
[ 1067.536313] space_info has 0 free, is full
[ 1067.540533] space_info total=4049600512, pinned=0, delalloc=4096, may_use=0, used=4049600512
[ 1067.549280] block group 12582912 has 8388608 bytes, 8388608 used 0 pinned 0 reserved
[ 1067.557172] 0 blocks of free space at or bigger than bytes is
[ 1067.563020] block group 255918080 has 453181440 bytes, 453181440 used 0 pinned 0 reserved
[ 1067.571334] 0 blocks of free space at or bigger than bytes is
[ 1067.577159] block group 709099520 has 453181440 bytes, 453181440 used 0 pinned 0 reserved
[ 1067.585459] 0 blocks of free space at or bigger than bytes is
[ 1067.591271] block group 1162280960 has 453181440 bytes, 453181440 used 0 pinned 0 reserved
[ 1067.599641] 0 blocks of free space at or bigger than bytes is
[ 1067.605491] block group 1615462400 has 453181440 bytes, 453181440 used 0 pinned 0 reserved
[ 1067.613858] 0 blocks of free space at or bigger than bytes is
[ 1067.619684] block group 2068643840 has 453181440 bytes, 453181440 used 0 pinned 0 reserved
[ 1067.628069] 0 blocks of free space at or bigger than bytes is
[ 1067.633893] block group 2521825280 has 453181440 bytes, 453181440 used 0 pinned 0 reserved
[ 1067.642277] 0 blocks of free space at or bigger than bytes is
[ 1067.648099] block group 2975006720 has 453181440 bytes, 453181440 used 0 pinned 0 reserved
[ 1067.656483] 0 blocks of free space at or bigger than bytes is
[ 1067.662295] block group 3428188160 has 453181440 bytes, 453181440 used 0 pinned 0 reserved
[ 1067.670666] 0 blocks of free space at or bigger than bytes is
[ 1067.676508] block group 3881369600 has 415760384 bytes, 415760384 used 0 pinned 0 reserved
[ 1067.684877] 0 blocks of free space at or bigger than bytes is
[ 1067.690747] ------------[ cut here ]------------
[ 1067.695435] kernel BUG at fs/btrfs/extent-tree.c:2872!
[ 1067.700646] invalid opcode: 0000 [#1] SMP
[ 1067.704873] last sysfs file: /sys/devices/LNXSYSTM:00/device:00/PNP0C0A:00/power_supply/C23B/charge_full
[ 1067.714473] CPU 0
[ 1067.716575] Modules linked in: drm iwlagn iwlcore snd_hda_codec_analog snd_hda_intel snd_hda_codec snd_hwdep snd_pcm snd_seq snd_timer snd_seq_device snd soundcore snd_page_alloc video
[ 1067.733699] Pid: 3358, comm: dd Not tainted 2.6.30-rc2-next-20090417 #202 HP Compaq 6910p
[ 1067.741975] RIP: 0010:[<ffffffff81201b23>]  [<ffffffff81201b23>] __btrfs_reserve_extent+0x213/0x300
[ 1067.751185] RSP: 0018:ffff8800791c77f8  EFLAGS: 00010292
[ 1067.756581] RAX: 0000000000022533 RBX: ffff88007b8c5030 RCX: 0000000000000006
[ 1067.763777] RDX: ffffffff81ccffa0 RSI: ffff8800791c1db0 RDI: 0000000000000286
[ 1067.770984] RBP: ffff8800791c7878 R08: 0000000000000000 R09: 0000000000000000
[ 1067.778203] R10: 0000000000000001 R11: 0000000000000001 R12: ffff88007b38e4b8
[ 1067.785440] R13: 0000000000001000 R14: ffff88007b38e6a8 R15: ffff88007b38e658
[ 1067.792657] FS:  00007f5801f136f0(0000) GS:ffff880005a00000(0000) knlGS:0000000000000000
[ 1067.800851] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 1067.806668] CR2: 00007f58017c1622 CR3: 000000007bb62000 CR4: 00000000000006e0
[ 1067.813882] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1067.821087] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 1067.828304] Process dd (pid: 3358, threadinfo ffff8800791c6000, task ffff8800791c1600)
[ 1067.836319] Stack:
[ 1067.838389]  0000000000000000 ffff8800791c7948 0000000000000000 0000000000000000
[ 1067.845792]  0000000000000001 0000000000000000 0000000000000000 0000000000000000
[ 1067.853464]  ffff88007bbe4000 0000000100000000 0000000000000001 ffff8800791c7948
[ 1067.861360] Call Trace:
[ 1067.863863]  [<ffffffff81201e0b>] btrfs_reserve_extent+0x3b/0x70
[ 1067.869984]  [<ffffffff81218feb>] cow_file_range+0x21b/0x3d0
[ 1067.875745]  [<ffffffff8122f519>] ? test_range_bit+0xb9/0x180
[ 1067.881616]  [<ffffffff81219be2>] run_delalloc_range+0x302/0x3b0
[ 1067.887727]  [<ffffffff8122f519>] ? test_range_bit+0xb9/0x180
[ 1067.893583]  [<ffffffff8123352f>] ? find_lock_delalloc_range+0x12f/0x1c0
[ 1067.900396]  [<ffffffff81233c45>] __extent_writepage+0x175/0x990
[ 1067.906502]  [<ffffffff810794a8>] ? mark_held_locks+0x68/0x90
[ 1067.912361]  [<ffffffff810ca581>] ? clear_page_dirty_for_io+0x171/0x190
[ 1067.919080]  [<ffffffff810797fd>] ? trace_hardirqs_on_caller+0x16d/0x1c0
[ 1067.925891]  [<ffffffff812308ce>] extent_write_cache_pages+0x1ee/0x400
[ 1067.932529]  [<ffffffff8122e970>] ? flush_write_bio+0x0/0x40
[ 1067.938288]  [<ffffffff81233ad0>] ? __extent_writepage+0x0/0x990
[ 1067.944404]  [<ffffffff810794a8>] ? mark_held_locks+0x68/0x90
[ 1067.950254]  [<ffffffff810f88e5>] ? kmem_cache_free+0x145/0x260
[ 1067.956287]  [<ffffffff810797fd>] ? trace_hardirqs_on_caller+0x16d/0x1c0
[ 1067.963091]  [<ffffffff81230b22>] extent_writepages+0x42/0x70
[ 1067.968957]  [<ffffffff81217020>] ? btrfs_get_extent+0x0/0x960
[ 1067.974891]  [<ffffffff81216e58>] btrfs_writepages+0x28/0x30
[ 1067.980663]  [<ffffffff8122b940>] btrfs_fdatawrite_range+0x50/0x60
[ 1067.986942]  [<ffffffff8122c2c6>] btrfs_wait_ordered_range+0xb6/0x170
[ 1067.993508]  [<ffffffff8121cce4>] btrfs_truncate+0x74/0x160
[ 1067.999183]  [<ffffffff810dd46d>] vmtruncate+0xad/0x110
[ 1068.004529]  [<ffffffff81117095>] inode_setattr+0x35/0x180
[ 1068.010116]  [<ffffffff8121d3ab>] btrfs_setattr+0x6b/0xd0
[ 1068.015616]  [<ffffffff81117301>] notify_change+0x121/0x330
[ 1068.021298]  [<ffffffff810fd1aa>] do_truncate+0x6a/0x90
[ 1068.026623]  [<ffffffff810fd2c0>] sys_ftruncate+0xf0/0x130
[ 1068.032220]  [<ffffffff8100c2b2>] system_call_fastpath+0x16/0x1b
[ 1068.038364] Code: 4c 8d a0 60 fe ff ff 49 8b 84 24 a0 01 00 00 0f 18 08 49 8d 84 24 a0 01 00 00 49 39 c7 0f 85 8c 00 00 00 4c 89 f7 e8 8d 8e e6 ff <0f> 0b eb fe 66 0f 1f 84 00 00 00 00 00 49 d1 ed 41 8b 84 24 60
[ 1068.059472] RIP  [<ffffffff81201b23>] __btrfs_reserve_extent+0x213/0x300
[ 1068.066299]  RSP <ffff8800791c77f8>
[ 1068.070292] ---[ end trace ab42ff0a881d9568 ]---


^ permalink raw reply	[flat|nested] 150+ messages in thread

[ 1067.828304] Process dd (pid: 3358, threadinfo ffff8800791c6000, task ffff8800791c1600)
[ 1067.836319] Stack:
[ 1067.838389]  0000000000000000 ffff8800791c7948 0000000000000000 0000000000000000
[ 1067.845792]  0000000000000001 0000000000000000 0000000000000000 0000000000000000
[ 1067.853464]  ffff88007bbe4000 0000000100000000 0000000000000001 ffff8800791c7948
[ 1067.861360] Call Trace:
[ 1067.863863]  [<ffffffff81201e0b>] btrfs_reserve_extent+0x3b/0x70
[ 1067.869984]  [<ffffffff81218feb>] cow_file_range+0x21b/0x3d0
[ 1067.875745]  [<ffffffff8122f519>] ? test_range_bit+0xb9/0x180
[ 1067.881616]  [<ffffffff81219be2>] run_delalloc_range+0x302/0x3b0
[ 1067.887727]  [<ffffffff8122f519>] ? test_range_bit+0xb9/0x180
[ 1067.893583]  [<ffffffff8123352f>] ? find_lock_delalloc_range+0x12f/0x1c0
[ 1067.900396]  [<ffffffff81233c45>] __extent_writepage+0x175/0x990
[ 1067.906502]  [<ffffffff810794a8>] ? mark_held_locks+0x68/0x90
[ 1067.912361]  [<ffffffff810ca581>] ? clear_page_dirty_for_io+0x171/0x190
[ 1067.919080]  [<ffffffff810797fd>] ? trace_hardirqs_on_caller+0x16d/0x1c0
[ 1067.925891]  [<ffffffff812308ce>] extent_write_cache_pages+0x1ee/0x400
[ 1067.932529]  [<ffffffff8122e970>] ? flush_write_bio+0x0/0x40
[ 1067.938288]  [<ffffffff81233ad0>] ? __extent_writepage+0x0/0x990
[ 1067.944404]  [<ffffffff810794a8>] ? mark_held_locks+0x68/0x90
[ 1067.950254]  [<ffffffff810f88e5>] ? kmem_cache_free+0x145/0x260
[ 1067.956287]  [<ffffffff810797fd>] ? trace_hardirqs_on_caller+0x16d/0x1c0
[ 1067.963091]  [<ffffffff81230b22>] extent_writepages+0x42/0x70
[ 1067.968957]  [<ffffffff81217020>] ? btrfs_get_extent+0x0/0x960
[ 1067.974891]  [<ffffffff81216e58>] btrfs_writepages+0x28/0x30
[ 1067.980663]  [<ffffffff8122b940>] btrfs_fdatawrite_range+0x50/0x60
[ 1067.986942]  [<ffffffff8122c2c6>] btrfs_wait_ordered_range+0xb6/0x170
[ 1067.993508]  [<ffffffff8121cce4>] btrfs_truncate+0x74/0x160
[ 1067.999183]  [<ffffffff810dd46d>] vmtruncate+0xad/0x110
[ 1068.004529]  [<ffffffff81117095>] inode_setattr+0x35/0x180
[ 1068.010116]  [<ffffffff8121d3ab>] btrfs_setattr+0x6b/0xd0
[ 1068.015616]  [<ffffffff81117301>] notify_change+0x121/0x330
[ 1068.021298]  [<ffffffff810fd1aa>] do_truncate+0x6a/0x90
[ 1068.026623]  [<ffffffff810fd2c0>] sys_ftruncate+0xf0/0x130
[ 1068.032220]  [<ffffffff8100c2b2>] system_call_fastpath+0x16/0x1b
[ 1068.038364] Code: 4c 8d a0 60 fe ff ff 49 8b 84 24 a0 01 00 00 0f 18 08 49 8d 84 24 a0 01 00 00 49 39 c7 0f 85 8c 00 00 00 4c 89 f7 e8 8d 8e e6 ff <0f> 0b eb fe 66 0f 1f 84 00 00 00 00 00 49 d1 ed 41 8b 84 24 60
[ 1068.059472] RIP  [<ffffffff81201b23>] __btrfs_reserve_extent+0x213/0x300
[ 1068.066299]  RSP <ffff8800791c77f8>
[ 1068.070292] ---[ end trace ab42ff0a881d9568 ]---

^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH] [13/16] POISON: The high level memory error handler in the VM II
  2009-04-29  8:16                 ` Wu Fengguang
@ 2009-04-29  8:36                   ` Andi Kleen
  -1 siblings, 0 replies; 150+ messages in thread
From: Andi Kleen @ 2009-04-29  8:36 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Chris Mason, Andi Kleen, hugh, npiggin, riel, lee.schermerhorn,
	akpm, linux-kernel, linux-mm, x86

> > I'll have to read harder next week, the FS invalidatepage may expect
> > truncate to be the only caller.
> 
> If direct de-dirty is hard for some pages, how about just ignore them?

You mean just ignoring it for the pages where it is hard?
Yes, that is essentially what it is doing right now. But at least
some dirty pages need to be handled, because most user-space
pages tend to be dirty.

> There are the PG_writeback pages anyway. We can inject code to
> intercept them at the last stage of IO request dispatching.

That would require adding error-out code throughout all the file systems,
right?

> 
> Some perceivable problems and solutions are
> 1) the intercepting overheads could be costly => inject code at runtime.
> 2) there are cases that the dirty page could be copied for IO:

Yes, at some point we should probably add poison checks before these
operations. At least for reads it should be the same code path as EIO --
you have to check PG_error anyway (or at least you ought to).
The main difference is that for writes you have to check it too.

>    2.1) jbd2 has two copy-out cases => should be rare. just ignore them?
>      2.1.1) do_get_write_access(): buffer sits in two active commits
>      2.1.2) jbd2_journal_write_metadata_buffer(): buffer happens to start
>             with JBD2_MAGIC_NUMBER
>    2.2) btrfs have to read page for compress/encryption
>      Chris: is btrfs_zlib_compress_pages() a good place for detecting
>      poison pages? Or is it necessary at all for btrfs?(ie. it's
>      already relatively easy to de-dirty btrfs pages.)

I think btrfs' IO error handling is not very good right now. But once
it matures I hope poison pages can be handled in the same way as
regular IO errors.

>    2.3) maybe more cases...

Undoubtedly. The goal is just to handle the common cases that cover a lot
of memory. This will never be 100%.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH] [13/16] POISON: The high level memory error handler in the VM II
  2009-04-29  8:36                   ` Andi Kleen
@ 2009-04-29  9:05                     ` Wu Fengguang
  -1 siblings, 0 replies; 150+ messages in thread
From: Wu Fengguang @ 2009-04-29  9:05 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Chris Mason, hugh, npiggin, riel, lee.schermerhorn, akpm,
	linux-kernel, linux-mm, x86

On Wed, Apr 29, 2009 at 04:36:55PM +0800, Andi Kleen wrote:
> > > I'll have to read harder next week, the FS invalidatepage may expect
> > > truncate to be the only caller.
> > 
> > If direct de-dirty is hard for some pages, how about just ignore them?
> 
> You mean just ignoring it for the pages where it is hard?

Yes.

> Yes that is what it is essentially doing right now. But at least
> some dirty pages need to be handled because most user space
> pages tend to be dirty.

Sure.  There are three types of dirty pages:

A. now dirty, can be de-dirtied by the current code
B. now dirty, cannot be de-dirtied
C. now dirty and under writeback, cannot be de-dirtied

I mean B and C can be handled in one single place: the block layer.

If B pages are hard to de-dirty now, ignore them; they will
eventually go to IO and become C.

> > There are the PG_writeback pages anyway. We can inject code to
> > intercept them at the last stage of IO request dispatching.
> 
> That would require adding error out code through all the file systems,
> right?

Not necessarily. The file systems deal with buffer heads, extent maps
and bios; they normally won't touch the poisoned page content at all.

So it's mostly safe to add one single door-keeper at the low-level
request dispatch queue.

> > 
> > Some perceivable problems and solutions are
> > 1) the intercepting overheads could be costly => inject code at runtime.
> > 2) there are cases that the dirty page could be copied for IO:
> 
> At some point we should probably add poison checks before these operations

Maybe some ext4 developers can give us more hints on these two cases.
We can also add some instrumentation to see how often (2.1.x) happens.

But I guess a simple PagePoison() test is cheap anyway.

> yes. At least for read it should be the same code path as EIO --
> you have to check PG_error anyways  (or at least you ought to)
> The main difference is that for write you have to check it too.

Check which flag on write? You mean the copy-out?

Another copy path is bounced reads/writes... I guess those won't be
common on 64-bit systems though.

> >    2.1) jbd2 has two copy-out cases => should be rare. just ignore them?
> >      2.1.1) do_get_write_access(): buffer sits in two active commits
> >      2.1.2) jbd2_journal_write_metadata_buffer(): buffer happens to start
> >             with JBD2_MAGIC_NUMBER
> >    2.2) btrfs have to read page for compress/encryption
> >      Chris: is btrfs_zlib_compress_pages() a good place for detecting
> >      poison pages? Or is it necessary at all for btrfs?(ie. it's
> >      already relatively easy to de-dirty btrfs pages.)
> 
> I think btrfs' IO error handling is not very great right now. But once
> it matures i hope poison pages can be handled in the same way as
> regular IO errors.

OK.

> >    2.3) maybe more cases...
> 
> Undoubtedly. Goal is just to handle the common cases that cover a lot 
> of memory. This will never be 100%.

Right. We'll discover/cover more cases as time goes by.

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH] [13/16] POISON: The high level memory error handler in the VM II
  2009-04-29  9:05                     ` Wu Fengguang
@ 2009-04-29 11:27                       ` Chris Mason
  -1 siblings, 0 replies; 150+ messages in thread
From: Chris Mason @ 2009-04-29 11:27 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andi Kleen, hugh, npiggin, riel, lee.schermerhorn, akpm,
	linux-kernel, linux-mm, x86

On Wed, 2009-04-29 at 17:05 +0800, Wu Fengguang wrote:
> On Wed, Apr 29, 2009 at 04:36:55PM +0800, Andi Kleen wrote:
> > > > I'll have to read harder next week, the FS invalidatepage may expect
> > > > truncate to be the only caller.
> > > 
> > > If direct de-dirty is hard for some pages, how about just ignore them?
> > 
> > You mean just ignoring it for the pages where it is hard?
> 
> Yes.
> 
> > Yes that is what it is essentially doing right now. But at least
> > some dirty pages need to be handled because most user space
> > pages tend to be dirty.
> 
> Sure.  There are three types of dirty pages:
> 
> A. now dirty, can be de-dirty in the current code
> B. now dirty, cannot be de-dirty
> C. now dirty and writeback, cannot be de-dirty
> 
> I mean B and C can be handled in one single place - the block layer.
> 
> If B is hard to be de-dirtied now, ignore them for now and they will
> eventually be going to IO and become C.
> 
> > > There are the PG_writeback pages anyway. We can inject code to
> > > intercept them at the last stage of IO request dispatching.
> > 
> > That would require adding error out code through all the file systems,
> > right?
> 
> Not necessarily. The file systems deal with buffer head, extend map
> and bios, they normally won't touch the poisoned page content at all.
> 

They often do when zeroing parts of the page that straddle i_size.  At
least for btrfs it's enough to change grab_cache_page and find_get_page
(and friends) to do the poison magic, along with the functions used by
write_cache_pages.

-chris



^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: btrfs BUG on creating huge sparse file
  2009-04-29  8:21                   ` Wu Fengguang
@ 2009-04-29 11:40                     ` Chris Mason
  -1 siblings, 0 replies; 150+ messages in thread
From: Chris Mason @ 2009-04-29 11:40 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andi Kleen, hugh, npiggin, riel, lee.schermerhorn, akpm,
	linux-kernel, linux-mm, x86, linux-btrfs

On Wed, 2009-04-29 at 16:21 +0800, Wu Fengguang wrote:
> On Wed, Apr 29, 2009 at 04:16:16PM +0800, Wu Fengguang wrote:
> > On Thu, Apr 09, 2009 at 10:37:39AM -0400, Chris Mason wrote:
> [snip]
> > > PagePrivate is very common.  try_to_releasepage failing on a clean page
> > > without the writeback bit set and without dirty/locked buffers will be
> > > pretty rare.
> > 
> > Yup. btrfs seems to tag most(if not all) dirty pages with PG_private.
> > While ext4 won't.
> 
> Chris, I run into a btrfs BUG() when doing
> 
>         dd if=/dev/zero of=/b/sparse bs=1k count=1 seek=104857512345
> 
> The half created sparse file is
> 
>         -rw-r--r-- 1 root root 98T 2009-04-29 14:54 /b/sparse
>         Or
>         -rw-r--r-- 1 root root 107374092641280 2009-04-29 14:54 /b/sparse
> 
> Below is the kernel messages. I can test patches you throw at me :-)
> 

How big was the FS you were testing this on?  It works for me...

-chris



^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: btrfs BUG on creating huge sparse file
  2009-04-29 11:40                     ` Chris Mason
@ 2009-04-29 11:45                       ` Wu Fengguang
  -1 siblings, 0 replies; 150+ messages in thread
From: Wu Fengguang @ 2009-04-29 11:45 UTC (permalink / raw)
  To: Chris Mason
  Cc: Andi Kleen, hugh, npiggin, riel, lee.schermerhorn, akpm,
	linux-kernel, linux-mm, x86, linux-btrfs

On Wed, Apr 29, 2009 at 07:40:22PM +0800, Chris Mason wrote:
> On Wed, 2009-04-29 at 16:21 +0800, Wu Fengguang wrote:
> > On Wed, Apr 29, 2009 at 04:16:16PM +0800, Wu Fengguang wrote:
> > > On Thu, Apr 09, 2009 at 10:37:39AM -0400, Chris Mason wrote:
> > [snip]
> > > > PagePrivate is very common.  try_to_releasepage failing on a clean page
> > > > without the writeback bit set and without dirty/locked buffers will be
> > > > pretty rare.
> > > 
> > > Yup. btrfs seems to tag most(if not all) dirty pages with PG_private.
> > > While ext4 won't.
> > 
> > Chris, I run into a btrfs BUG() when doing
> > 
> >         dd if=/dev/zero of=/b/sparse bs=1k count=1 seek=104857512345
> > 
> > The half created sparse file is
> > 
> >         -rw-r--r-- 1 root root 98T 2009-04-29 14:54 /b/sparse
> >         Or
> >         -rw-r--r-- 1 root root 107374092641280 2009-04-29 14:54 /b/sparse
> > 
> > Below is the kernel messages. I can test patches you throw at me :-)
> > 
> 
> How big was the FS you were testing this on?  It works for me...

df says:

/dev/sda3             4.3G   28K  4.3G   1% /b

Too bad, I cannot reproduce it now...

Thanks,
Fengguang


^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH] [0/16] POISON: Intro
  2009-04-07 15:09 ` Andi Kleen
@ 2009-05-26 12:50   ` Hidehiro Kawai
  -1 siblings, 0 replies; 150+ messages in thread
From: Hidehiro Kawai @ 2009-05-26 12:50 UTC (permalink / raw)
  To: Andi Kleen
  Cc: linux-kernel, linux-mm, x86, Satoshi OSHIMA, Taketoshi Sakuraba

Hi all,
(I'm sorry for the very late comment.)

Andi Kleen wrote:

> Upcoming Intel CPUs have support for recovering from some memory errors. This
> requires the OS to declare a page "poisoned", kill the processes associated
> with it and avoid using it in the future. This patchkit implements
> the necessary infrastructure in the VM.

I think this patch set is a great step toward making Linux
reliability close to that of mainframes.  I really appreciate your work.

I believe people concerned with highly reliable systems are expecting
this kind of functionality.
But I wonder why this patch set (including the earlier MCE improvement
patches) has not been merged into any subsystem tree yet.
What is the problem?  Is it the deadlock bug and the reference counter
problem?  Or are we waiting for the 32-bit unification to complete?
If so, I'd like to try to narrow down the problems or review the
patches (although I'm afraid I'm not so skillful).

BTW, I looked over this patch set, and I couldn't
find any problems except for one minor point.  I'll post
a comment about it later.  It is very late, but better than nothing.

Regards,
-- 
Hidehiro Kawai
Hitachi, Systems Development Laboratory
Linux Technology Center


^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH] [7/16] POISON: Add basic support for poisoned pages in fault handler
  2009-04-07 15:10   ` Andi Kleen
@ 2009-05-26 12:55     ` Hidehiro Kawai
  -1 siblings, 0 replies; 150+ messages in thread
From: Hidehiro Kawai @ 2009-05-26 12:55 UTC (permalink / raw)
  To: Andi Kleen
  Cc: linux-kernel, linux-mm, x86, Satoshi OSHIMA, Taketoshi Sakuraba

Andi Kleen wrote:

> - Add a new VM_FAULT_POISON error code to handle_mm_fault. Right now
> architectures have to explicitly enable poison page support, so
> this is forward compatible to all architectures. They only need
> to add it when they enable poison page support.
> - Add poison page handling in swap in fault code
> 
> Signed-off-by: Andi Kleen <ak@linux.intel.com>
> 
> ---
>  include/linux/mm.h |    3 ++-
>  mm/memory.c        |   17 ++++++++++++++---
>  2 files changed, 16 insertions(+), 4 deletions(-)
> 
> Index: linux/mm/memory.c
> ===================================================================
> --- linux.orig/mm/memory.c	2009-04-07 16:39:24.000000000 +0200
> +++ linux/mm/memory.c	2009-04-07 16:43:06.000000000 +0200
> @@ -1315,7 +1315,8 @@
>  				if (ret & VM_FAULT_ERROR) {
>  					if (ret & VM_FAULT_OOM)
>  						return i ? i : -ENOMEM;
> -					else if (ret & VM_FAULT_SIGBUS)
> +					if (ret &
> +					    (VM_FAULT_POISON|VM_FAULT_SIGBUS))
>  						return i ? i : -EFAULT;
>  					BUG();
>  				}
> @@ -2426,8 +2427,15 @@
>  		goto out;
>  
>  	entry = pte_to_swp_entry(orig_pte);
> -	if (is_migration_entry(entry)) {
> -		migration_entry_wait(mm, pmd, address);
> +	if (unlikely(non_swap_entry(entry))) {
> +		if (is_migration_entry(entry)) {
> +			migration_entry_wait(mm, pmd, address);
> +		} else if (is_poison_entry(entry)) {
> +			ret = VM_FAULT_POISON;
> +		} else {
> +			print_bad_pte(vma, address, pte, NULL);
> +			ret = VM_FAULT_OOM;
> +		}
>  		goto out;
>  	}
>  	delayacct_set_flag(DELAYACCT_PF_SWAPIN);
> @@ -2451,6 +2459,9 @@
>  		/* Had to read the page from swap area: Major fault */
>  		ret = VM_FAULT_MAJOR;
>  		count_vm_event(PGMAJFAULT);
> +	} else if (PagePoison(page)) {
> +		ret = VM_FAULT_POISON;

delayacct_clear_flag(DELAYACCT_PF_SWAPIN) would be needed here.

> +		goto out;
>  	}
>  
>  	lock_page(page);

Regards,
-- 
Hidehiro Kawai
Hitachi, Systems Development Laboratory
Linux Technology Center



* Re: [PATCH] [7/16] POISON: Add basic support for poisoned pages in fault handler
  2009-05-26 12:55     ` Hidehiro Kawai
@ 2009-05-26 13:18       ` Andi Kleen
  -1 siblings, 0 replies; 150+ messages in thread
From: Andi Kleen @ 2009-05-26 13:18 UTC (permalink / raw)
  To: Hidehiro Kawai
  Cc: Andi Kleen, linux-kernel, linux-mm, x86, Satoshi OSHIMA,
	Taketoshi Sakuraba

On Tue, May 26, 2009 at 09:55:26PM +0900, Hidehiro Kawai wrote:
> > +			print_bad_pte(vma, address, pte, NULL);
> > +			ret = VM_FAULT_OOM;
> > +		}
> >  		goto out;
> >  	}
> >  	delayacct_set_flag(DELAYACCT_PF_SWAPIN);
> > @@ -2451,6 +2459,9 @@
> >  		/* Had to read the page from swap area: Major fault */
> >  		ret = VM_FAULT_MAJOR;
> >  		count_vm_event(PGMAJFAULT);
> > +	} else if (PagePoison(page)) {
> > +		ret = VM_FAULT_POISON;
> 
> delayacct_clear_flag(DELAYACCT_PF_SWAPIN) would be needed here.

Thanks for the review. Added.

Must have been a forward-port error; I could swear that wasn't there
yet when I wrote this originally :)

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.


* Re: [PATCH] [0/16] POISON: Intro
  2009-05-26 12:50   ` Hidehiro Kawai
@ 2009-05-26 13:29     ` Andi Kleen
  -1 siblings, 0 replies; 150+ messages in thread
From: Andi Kleen @ 2009-05-26 13:29 UTC (permalink / raw)
  To: Hidehiro Kawai
  Cc: Andi Kleen, linux-kernel, linux-mm, x86, Satoshi OSHIMA,
	Taketoshi Sakuraba

On Tue, May 26, 2009 at 09:50:18PM +0900, Hidehiro Kawai wrote:
> I believe people concerned with highly reliable systems are expecting
> this kind of functionality.
> But I wonder why this patch set (including former MCE improvements
> patches) has not been merged into any subsystem trees yet.


> What is the problem?  Because of the deadlock bug and the ref counter

I hadn't asked for an -mm merge of the hwpoison version because
it still needed some work. The mce side has been ready for merge
for some time, although as we do more testing we still find
occasional bugs, which are getting fixed.

There was some recent work on fixing problems found in the
hwpoison code during further review (by me and Fengguang Wu).
I'm hoping to do a repost with all the fixes soon; then it's
an -mm candidate and hopefully ready for merge.

There was also a lot of work (mostly by Ying Huang) on the
mce-test test suite, which covers more and more code,
but of course it could always use more work too.

> problem?  Or are we waiting for 32bit unification to complete?

The 32bit unification is complete, but the x86 maintainers
haven't merged it yet. 

> If so, I'd like to try to narrow down the problems or review
> patches (although I'm afraid I'm not so skillful).

Sure, any review or additional testing is welcome.

I wanted to do full reposts this week anyway, so you
can start from there again.

> BTW, I looked over this patch set, and I couldn't
> find any problems except for one minor point.  I'll post
> a comment about it later.  It is very late, but better than nothing.

Great. Thanks. Can I add your Reviewed-by tags then?

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.


* Re: [PATCH] [0/16] POISON: Intro
  2009-05-26 13:29     ` Andi Kleen
@ 2009-05-28  4:37       ` Hidehiro Kawai
  -1 siblings, 0 replies; 150+ messages in thread
From: Hidehiro Kawai @ 2009-05-28  4:37 UTC (permalink / raw)
  To: Andi Kleen
  Cc: linux-kernel, linux-mm, x86, Satoshi OSHIMA, Taketoshi Sakuraba

Andi Kleen wrote:

> On Tue, May 26, 2009 at 09:50:18PM +0900, Hidehiro Kawai wrote:
> 
>>I believe people concerned with highly reliable systems are expecting
>>this kind of functionality.
>>But I wonder why this patch set (including former MCE improvements
>>patches) has not been merged into any subsystem trees yet.
> 
>>What is the problem?  Because of the deadlock bug and the ref counter

> There was some recent work on fixing problems found in the
> hwpoison code during further review (by me and Fengguang Wu).
> I'm hoping to do a repost with all the fixes soon; then it's
> an -mm candidate and hopefully ready for merge.

Thank you, Andi.

>>If so, I'd like to try to narrow down the problems or review
>>patches (although I'm afraid I'm not so skillful).
> 
> Sure, any review or additional testing is welcome.
> 
> I wanted to do full reposts this week anyway, so you
> can start from there again.

OK, I'll do that.
 
>>BTW, I looked over this patch set, and I couldn't
>>find any problems except for one minor point.  I'll post
>>a comment about it later.  It is very late, but better than nothing.
> 
> Great. Thanks. Can I add your Reviewed-by tags then?

Yes, of course.

Reviewed-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
-- 
Hidehiro Kawai
Hitachi, Systems Development Laboratory
Linux Technology Center



* Re: [PATCH] [0/16] POISON: Intro
  2009-05-28  4:37       ` Hidehiro Kawai
@ 2009-05-28  8:00         ` Andi Kleen
  -1 siblings, 0 replies; 150+ messages in thread
From: Andi Kleen @ 2009-05-28  8:00 UTC (permalink / raw)
  To: Hidehiro Kawai
  Cc: Andi Kleen, linux-kernel, linux-mm, x86, Satoshi OSHIMA,
	Taketoshi Sakuraba

On Thu, May 28, 2009 at 01:37:38PM +0900, Hidehiro Kawai wrote:
> >>BTW, I looked over this patch set, and I couldn't
> >>find any problems except for one minor point.  I'll post
> >>a comment about it later.  It is very late, but better than nothing.
> > 
> > Great. Thanks. Can I add your Reviewed-by tags then?
> 
> Yes, of course.

Sorry, I posted it before seeing your email. If you could take
a look at the updated patchkit too, that would be great.

> Reviewed-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>

I will add that, thanks.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.



Thread overview: 150+ messages
2009-04-07 15:09 [PATCH] [0/16] POISON: Intro Andi Kleen
2009-04-07 15:09 ` [PATCH] [1/16] POISON: Add support for high priority work items Andi Kleen
2009-04-07 15:09 ` [PATCH] [2/16] POISON: Add page flag for poisoned pages Andi Kleen
2009-04-07 21:07   ` Christoph Lameter
2009-04-08  0:29   ` Russ Anderson
2009-04-08  6:26     ` Andi Kleen
2009-04-08  5:14   ` Andrew Morton
2009-04-08  6:24     ` Andi Kleen
2009-04-08  7:00       ` Andrew Morton
2009-04-08  9:38         ` Andi Kleen
2009-04-07 15:09 ` [PATCH] [3/16] POISON: Handle poisoned pages in page free Andi Kleen
2009-04-07 23:21   ` Minchan Kim
2009-04-08  6:51     ` Andi Kleen
2009-04-08  7:39       ` Minchan Kim
2009-04-08  9:41         ` Andi Kleen
2009-04-08 10:05           ` Minchan Kim
2009-04-07 15:10 ` [PATCH] [4/16] POISON: Export some rmap vma locking to outside world Andi Kleen
2009-04-07 15:10 ` [PATCH] [5/16] POISON: Add support for poison swap entries Andi Kleen
2009-04-07 21:11   ` Christoph Lameter
2009-04-07 21:56     ` Andi Kleen
2009-04-07 21:56       ` Christoph Lameter
2009-04-07 22:25         ` Andi Kleen
2009-04-07 15:10 ` [PATCH] [6/16] POISON: Add new SIGBUS error codes for poison signals Andi Kleen
2009-04-07 15:10 ` [PATCH] [7/16] POISON: Add basic support for poisoned pages in fault handler Andi Kleen
2009-05-26 12:55   ` Hidehiro Kawai
2009-05-26 13:18     ` Andi Kleen
2009-04-07 15:10 ` [PATCH] [8/16] POISON: Add various poison checks in mm/memory.c Andi Kleen
2009-04-07 19:03   ` Johannes Weiner
2009-04-07 19:31     ` Andi Kleen
2009-04-07 20:17       ` Johannes Weiner
2009-04-07 20:24         ` Andi Kleen
2009-04-07 20:36           ` Johannes Weiner
2009-04-07 15:10 ` [PATCH] [9/16] POISON: x86: Add VM_FAULT_POISON handling to x86 page fault handler Andi Kleen
2009-04-07 15:10 ` [PATCH] [10/16] POISON: Use bitmask/action code for try_to_unmap behaviour Andi Kleen
2009-04-07 21:19   ` Christoph Lameter
2009-04-07 21:59     ` Andi Kleen
2009-04-07 22:04       ` Christoph Lameter
2009-04-07 22:35         ` Andi Kleen
2009-04-07 15:10 ` [PATCH] [11/16] POISON: Handle poisoned pages in try_to_unmap Andi Kleen
2009-04-07 15:10 ` [PATCH] [12/16] POISON: Handle poisoned pages in set_page_dirty() Andi Kleen
2009-04-07 15:10 ` [PATCH] [13/16] POISON: The high level memory error handler in the VM Andi Kleen
2009-04-07 16:03   ` Rik van Riel
2009-04-07 16:30     ` Andi Kleen
2009-04-07 18:51   ` Johannes Weiner
2009-04-07 19:40     ` Andi Kleen
2009-04-08 17:03   ` Chris Mason
2009-04-09  7:29     ` Andi Kleen
2009-04-09  7:58       ` [PATCH] [13/16] POISON: The high level memory error handler in the VM II Andi Kleen
2009-04-09 13:30         ` Chris Mason
2009-04-09 14:02           ` Andi Kleen
2009-04-09 14:37             ` Chris Mason
2009-04-09 14:57               ` Andi Kleen
2009-04-29  8:16               ` Wu Fengguang
2009-04-29  8:21                 ` btrfs BUG on creating huge sparse file Wu Fengguang
2009-04-29 11:40                   ` Chris Mason
2009-04-29 11:45                     ` Wu Fengguang
2009-04-29  8:36                 ` [PATCH] [13/16] POISON: The high level memory error handler in the VM II Andi Kleen
2009-04-29  9:05                   ` Wu Fengguang
2009-04-29 11:27                     ` Chris Mason
2009-04-07 15:10 ` [PATCH] [14/16] x86: MCE: Rename mce_notify_user to mce_notify_irq Andi Kleen
2009-04-07 15:10 ` [PATCH] [15/16] x86: MCE: Support action-optional machine checks Andi Kleen
2009-04-07 15:10 ` [PATCH] [16/16] POISON: Add madvise() based injector for poisoned data Andi Kleen
2009-04-07 19:13 ` [PATCH] [0/16] POISON: Intro Robin Holt
2009-04-07 19:38   ` Andi Kleen
2009-04-08  5:15 ` Andrew Morton
2009-04-08  6:15   ` Andi Kleen
2009-04-08 17:29     ` Roland Dreier
2009-04-09  7:22       ` Andi Kleen
2009-04-08  5:47 ` Andrew Morton
2009-04-08  6:21   ` Andi Kleen
2009-04-13 13:18   ` Wu Fengguang
2009-05-26 12:50 ` Hidehiro Kawai
2009-05-26 13:29   ` Andi Kleen
2009-05-28  4:37     ` Hidehiro Kawai
2009-05-28  8:00       ` Andi Kleen
