* [PATCH] [0/16] HWPOISON: Intro
@ 2009-05-27 20:12 ` Andi Kleen
  0 siblings, 0 replies; 232+ messages in thread
From: Andi Kleen @ 2009-05-27 20:12 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm, fengguang.wu


This is the latest version of the hwpoison patch. It has
seen many fixes and improvements, and more review and testing,
since the last version.

Many thanks to Fengguang Wu for contributing great
improvements, fixing quite a few problems,
and implementing free page handling.

It's also standalone now, not relying on any
other patchkits. On its own it's only usable through
the debugging injection interfaces, but architectures
can (and do) make use of it.

It's also fairly unobtrusive, as you can see.
It doesn't really change any existing code paths
significantly.

I believe this version is now ready for merging.

Any additional review/comments/etc of course welcome.

Andrew, can you please consider it for merging into -mm
for the 2.6.31 track?

The patchkit is also available in
git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6.git hwpoison

-Andi

---

Upcoming Intel CPUs have support for recovering from some memory errors
("MCA recovery"). This requires the OS to declare a page "poisoned",
kill the processes associated with it, and avoid using it in the future.

This patchkit implements the necessary infrastructure in the VM.

To quote the overview comment:

 * High level machine check handler. Handles pages reported by the
 * hardware as being corrupted, usually due to a 2-bit ECC memory or cache
 * failure.
 *
 * This focusses on pages detected as corrupted in the background.
 * When the current CPU tries to consume corruption the currently
 * running process can just be killed directly instead. This implies
 * that if the error cannot be handled for some reason it's safe to
 * just ignore it because no corruption has been consumed yet. If
 * the corruption is consumed later, another machine check will be raised.
 *
 * Handles page cache pages in various states. The tricky part
 * here is that we can access any page asynchronously to other VM
 * users, because memory failures could happen anytime and anywhere,
 * possibly violating some of their assumptions. This is why this code
 * has to be extremely careful. Generally it tries to use normal locking
 * rules, as in get the standard locks, even if that means the
 * error handling takes potentially a long time.
 *
 * Some of the operations here are somewhat inefficient and have
 * non-linear algorithmic complexity, because the data structures have not
 * been optimized for this case. This is in particular the case
 * for the mapping from a vma to a process. Since this case is expected
 * to be rare we hope we can get away with this.

The code consists of the high-level handler in mm/memory-failure.c,
a new page poison bit, and various checks in the VM to handle poisoned
pages.

The main target right now is KVM guests, but it works for all kinds
of applications.

For the KVM use case a new signal type was needed so that
KVM can inject the machine check into the guest with the proper
address. In theory this also allows other applications to handle
memory failures. The expectation is that nearly all applications
won't do that, but some very specialized ones might.

This is not fully complete yet; in particular there are still
various ways to access poisoned pages (crash dump, /proc/kcore, etc.)
that need to be plugged too.

Also undoubtedly the high level handler still has bugs and cases
it cannot recover from. For example nonlinear mappings deadlock right now
and a few other cases lose references. Huge pages are not supported
yet. Any additional testing, reviewing etc. welcome. 

The patch series requires the earlier x86 MCE feature series for the
x86-specific action optional part. The code can be tested without the
x86-specific part using the injector; this only requires enabling the
Kconfig entry manually in some Kconfig file (by default it is implicitly
enabled by the architecture).

v2: Lots of smaller changes in the series based on review feedback.
Renamed Poison to HWPoison at akpm's request.
Added a new pfn-based injector based on feedback.
Many improvements, mostly from Fengguang Wu.
See comments in the individual patches.


-Andi

^ permalink raw reply	[flat|nested] 232+ messages in thread

* [PATCH] [1/16] HWPOISON: Add page flag for poisoned pages
  2009-05-27 20:12 ` Andi Kleen
@ 2009-05-27 20:12   ` Andi Kleen
  -1 siblings, 0 replies; 232+ messages in thread
From: Andi Kleen @ 2009-05-27 20:12 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm, fengguang.wu


Hardware poisoned pages need special handling in the VM and shouldn't be 
touched again. This requires a new page flag. Define it here.

The page flags wars seem to be over, so it shouldn't be a problem
to get a new one.

v2: Add TestSetHWPoison (suggested by Johannes Weiner)

Acked-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>

---
 include/linux/page-flags.h |   17 ++++++++++++++++-
 1 file changed, 16 insertions(+), 1 deletion(-)

Index: linux/include/linux/page-flags.h
===================================================================
--- linux.orig/include/linux/page-flags.h	2009-05-27 21:13:54.000000000 +0200
+++ linux/include/linux/page-flags.h	2009-05-27 21:14:21.000000000 +0200
@@ -51,6 +51,9 @@
  * PG_buddy is set to indicate that the page is free and in the buddy system
  * (see mm/page_alloc.c).
  *
+ * PG_hwpoison indicates that a page got corrupted in hardware and contains
+ * data with incorrect ECC bits that triggered a machine check. Accessing is
+ * not safe since it may cause another machine check. Don't touch!
  */
 
 /*
@@ -104,6 +107,9 @@
 #ifdef CONFIG_IA64_UNCACHED_ALLOCATOR
 	PG_uncached,		/* Page has been mapped as uncached */
 #endif
+#ifdef CONFIG_MEMORY_FAILURE
+	PG_hwpoison,		/* hardware poisoned page. Don't touch */
+#endif
 	__NR_PAGEFLAGS,
 
 	/* Filesystems */
@@ -273,6 +279,15 @@
 PAGEFLAG_FALSE(Uncached)
 #endif
 
+#ifdef CONFIG_MEMORY_FAILURE
+PAGEFLAG(HWPoison, hwpoison)
+TESTSETFLAG(HWPoison, hwpoison)
+#define __PG_HWPOISON (1UL << PG_hwpoison)
+#else
+PAGEFLAG_FALSE(HWPoison)
+#define __PG_HWPOISON 0
+#endif
+
 static inline int PageUptodate(struct page *page)
 {
 	int ret = test_bit(PG_uptodate, &(page)->flags);
@@ -403,7 +418,7 @@
 	 1 << PG_private | 1 << PG_private_2 | \
 	 1 << PG_buddy	 | 1 << PG_writeback | 1 << PG_reserved | \
 	 1 << PG_slab	 | 1 << PG_swapcache | 1 << PG_active | \
-	 __PG_UNEVICTABLE | __PG_MLOCKED)
+	 __PG_HWPOISON  | __PG_UNEVICTABLE | __PG_MLOCKED)
 
 /*
  * Flags checked when a page is prepped for return by the page allocator.

^ permalink raw reply	[flat|nested] 232+ messages in thread

* [PATCH] [2/16] HWPOISON: Export poison flag in /proc/kpageflags
  2009-05-27 20:12 ` Andi Kleen
@ 2009-05-27 20:12   ` Andi Kleen
  -1 siblings, 0 replies; 232+ messages in thread
From: Andi Kleen @ 2009-05-27 20:12 UTC (permalink / raw)
  To: fengguang.wu, akpm, linux-kernel, linux-mm


From: Fengguang Wu <fengguang.wu@intel.com>

Export the new poison flag in /proc/kpageflags. Poisoned pages are moderately
interesting even for administrators, so export them here. Also useful
for debugging.

AK: I extracted this out of a larger patch from Fengguang Wu.

Signed-off-by: Andi Kleen <ak@linux.intel.com>

---
 fs/proc/page.c |    4 ++++
 1 file changed, 4 insertions(+)

Index: linux/fs/proc/page.c
===================================================================
--- linux.orig/fs/proc/page.c	2009-05-27 21:13:54.000000000 +0200
+++ linux/fs/proc/page.c	2009-05-27 21:14:21.000000000 +0200
@@ -79,6 +79,7 @@
 #define KPF_WRITEBACK  8
 #define KPF_RECLAIM    9
 #define KPF_BUDDY     10
+#define KPF_HWPOISON  11
 
 #define kpf_copy_bit(flags, dstpos, srcpos) (((flags >> srcpos) & 1) << dstpos)
 
@@ -118,6 +119,9 @@
 			kpf_copy_bit(kflags, KPF_WRITEBACK, PG_writeback) |
 			kpf_copy_bit(kflags, KPF_RECLAIM, PG_reclaim) |
 			kpf_copy_bit(kflags, KPF_BUDDY, PG_buddy);
+#ifdef CONFIG_MEMORY_FAILURE
+		uflags |= kpf_copy_bit(kflags, KPF_HWPOISON, PG_hwpoison);
+#endif
 
 		if (put_user(uflags, out++)) {
 			ret = -EFAULT;

^ permalink raw reply	[flat|nested] 232+ messages in thread

* [PATCH] [3/16] HWPOISON: Export some rmap vma locking to outside world
  2009-05-27 20:12 ` Andi Kleen
@ 2009-05-27 20:12   ` Andi Kleen
  -1 siblings, 0 replies; 232+ messages in thread
From: Andi Kleen @ 2009-05-27 20:12 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm, fengguang.wu


Needed for a later patch that walks rmap entries on its own.

This used to be very frowned upon, but memory-failure.c does
some rather specialized rmap walking and rmap has been stable
for quite some time, so I think it's ok now to export it.

Signed-off-by: Andi Kleen <ak@linux.intel.com>

---
 include/linux/rmap.h |    6 ++++++
 mm/rmap.c            |    4 ++--
 2 files changed, 8 insertions(+), 2 deletions(-)

Index: linux/include/linux/rmap.h
===================================================================
--- linux.orig/include/linux/rmap.h	2009-05-27 21:13:54.000000000 +0200
+++ linux/include/linux/rmap.h	2009-05-27 21:19:19.000000000 +0200
@@ -118,6 +118,12 @@
 }
 #endif
 
+/*
+ * Called by memory-failure.c to kill processes.
+ */
+struct anon_vma *page_lock_anon_vma(struct page *page);
+void page_unlock_anon_vma(struct anon_vma *anon_vma);
+
 #else	/* !CONFIG_MMU */
 
 #define anon_vma_init()		do {} while (0)
Index: linux/mm/rmap.c
===================================================================
--- linux.orig/mm/rmap.c	2009-05-26 22:15:37.000000000 +0200
+++ linux/mm/rmap.c	2009-05-27 21:19:19.000000000 +0200
@@ -191,7 +191,7 @@
  * Getting a lock on a stable anon_vma from a page off the LRU is
  * tricky: page_lock_anon_vma rely on RCU to guard against the races.
  */
-static struct anon_vma *page_lock_anon_vma(struct page *page)
+struct anon_vma *page_lock_anon_vma(struct page *page)
 {
 	struct anon_vma *anon_vma;
 	unsigned long anon_mapping;
@@ -211,7 +211,7 @@
 	return NULL;
 }
 
-static void page_unlock_anon_vma(struct anon_vma *anon_vma)
+void page_unlock_anon_vma(struct anon_vma *anon_vma)
 {
 	spin_unlock(&anon_vma->lock);
 	rcu_read_unlock();

^ permalink raw reply	[flat|nested] 232+ messages in thread

* [PATCH] [4/16] HWPOISON: Add support for poison swap entries v2
  2009-05-27 20:12 ` Andi Kleen
@ 2009-05-27 20:12   ` Andi Kleen
  -1 siblings, 0 replies; 232+ messages in thread
From: Andi Kleen @ 2009-05-27 20:12 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm, fengguang.wu


CPU migration uses special swap entry types to trigger special actions on page
faults. Extend this mechanism to also support poisoned swap entries, to trigger
poison handling on page faults. This allows follow-on patches to prevent
processes from faulting in poisoned pages again.

v2: Fix overflow in MAX_SWAPFILES (Fengguang Wu)

Signed-off-by: Andi Kleen <ak@linux.intel.com>

---
 include/linux/swap.h    |   34 ++++++++++++++++++++++++++++------
 include/linux/swapops.h |   38 ++++++++++++++++++++++++++++++++++++++
 mm/swapfile.c           |    4 ++--
 3 files changed, 68 insertions(+), 8 deletions(-)

Index: linux/include/linux/swap.h
===================================================================
--- linux.orig/include/linux/swap.h	2009-05-27 21:13:54.000000000 +0200
+++ linux/include/linux/swap.h	2009-05-27 21:14:21.000000000 +0200
@@ -34,16 +34,38 @@
  * the type/offset into the pte as 5/27 as well.
  */
 #define MAX_SWAPFILES_SHIFT	5
-#ifndef CONFIG_MIGRATION
-#define MAX_SWAPFILES		(1 << MAX_SWAPFILES_SHIFT)
+
+/*
+ * Use some of the swap files numbers for other purposes. This
+ * is a convenient way to hook into the VM to trigger special
+ * actions on faults.
+ */
+
+/*
+ * NUMA node memory migration support
+ */
+#ifdef CONFIG_MIGRATION
+#define SWP_MIGRATION_NUM 2
+#define SWP_MIGRATION_READ	(MAX_SWAPFILES + SWP_HWPOISON_NUM + 1)
+#define SWP_MIGRATION_WRITE	(MAX_SWAPFILES + SWP_HWPOISON_NUM + 2)
 #else
-/* Use last two entries for page migration swap entries */
-#define MAX_SWAPFILES		((1 << MAX_SWAPFILES_SHIFT)-2)
-#define SWP_MIGRATION_READ	MAX_SWAPFILES
-#define SWP_MIGRATION_WRITE	(MAX_SWAPFILES + 1)
+#define SWP_MIGRATION_NUM 0
 #endif
 
 /*
+ * Handling of hardware poisoned pages with memory corruption.
+ */
+#ifdef CONFIG_MEMORY_FAILURE
+#define SWP_HWPOISON_NUM 1
+#define SWP_HWPOISON		(MAX_SWAPFILES + 1)
+#else
+#define SWP_HWPOISON_NUM 0
+#endif
+
+#define MAX_SWAPFILES \
+	((1 << MAX_SWAPFILES_SHIFT) - SWP_MIGRATION_NUM - SWP_HWPOISON_NUM - 1)
+
+/*
  * Magic header for a swap area. The first part of the union is
  * what the swap magic looks like for the old (limited to 128MB)
  * swap area format, the second part of the union adds - in the
Index: linux/include/linux/swapops.h
===================================================================
--- linux.orig/include/linux/swapops.h	2009-05-27 21:13:54.000000000 +0200
+++ linux/include/linux/swapops.h	2009-05-27 21:14:21.000000000 +0200
@@ -131,3 +131,41 @@
 
 #endif
 
+#ifdef CONFIG_MEMORY_FAILURE
+/*
+ * Support for hardware poisoned pages
+ */
+static inline swp_entry_t make_hwpoison_entry(struct page *page)
+{
+	BUG_ON(!PageLocked(page));
+	return swp_entry(SWP_HWPOISON, page_to_pfn(page));
+}
+
+static inline int is_hwpoison_entry(swp_entry_t entry)
+{
+	return swp_type(entry) == SWP_HWPOISON;
+}
+#else
+
+static inline swp_entry_t make_hwpoison_entry(struct page *page)
+{
+	return swp_entry(0, 0);
+}
+
+static inline int is_hwpoison_entry(swp_entry_t swp)
+{
+	return 0;
+}
+#endif
+
+#if defined(CONFIG_MEMORY_FAILURE) || defined(CONFIG_MIGRATION)
+static inline int non_swap_entry(swp_entry_t entry)
+{
+	return swp_type(entry) > MAX_SWAPFILES;
+}
+#else
+static inline int non_swap_entry(swp_entry_t entry)
+{
+	return 0;
+}
+#endif
Index: linux/mm/swapfile.c
===================================================================
--- linux.orig/mm/swapfile.c	2009-05-27 21:13:54.000000000 +0200
+++ linux/mm/swapfile.c	2009-05-27 21:14:21.000000000 +0200
@@ -579,7 +579,7 @@
 	struct swap_info_struct *p;
 	struct page *page = NULL;
 
-	if (is_migration_entry(entry))
+	if (non_swap_entry(entry))
 		return 1;
 
 	p = swap_info_get(entry);
@@ -1949,7 +1949,7 @@
 	unsigned long offset, type;
 	int result = 0;
 
-	if (is_migration_entry(entry))
+	if (non_swap_entry(entry))
 		return 1;
 
 	type = swp_type(entry);

^ permalink raw reply	[flat|nested] 232+ messages in thread

* [PATCH] [5/16] HWPOISON: Add new SIGBUS error codes for hardware poison signals
  2009-05-27 20:12 ` Andi Kleen
@ 2009-05-27 20:12   ` Andi Kleen
  -1 siblings, 0 replies; 232+ messages in thread
From: Andi Kleen @ 2009-05-27 20:12 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm, fengguang.wu


Add new SIGBUS codes for reporting machine checks as signals. When 
the hardware detects an uncorrected ECC error it can trigger these
signals.

This is needed for telling KVM's qemu about machine checks that happen to
guests, so that it can inject them, but it might also be useful for other
programs. I find it useful in my test programs.

This patch merely defines the new types.

- Define two new si_codes for SIGBUS: BUS_MCEERR_AO and BUS_MCEERR_AR
* BUS_MCEERR_AO is for "Action Optional" machine checks, which means that some
corruption has been detected in the background, but nothing has been consumed
so far. The program can ignore those if it wants (but most programs would
already get killed)
* BUS_MCEERR_AR is for "Action Required" machine checks. This happens
when corrupted data is consumed or the application runs into an area
which was already known to be corrupted. These require immediate
action and cannot simply be resumed from. Most programs would kill themselves.
- They report the address of the corruption in the user address space
in si_addr.
- Define a new si_addr_lsb field that reports the extent of the corruption
to user space. That's currently always a (small) page. The user application
cannot tell where in this page the corruption happened.

AK: I plan to write a man page update before anyone asks.

Signed-off-by: Andi Kleen <ak@linux.intel.com>

---
 include/asm-generic/siginfo.h |    8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

Index: linux/include/asm-generic/siginfo.h
===================================================================
--- linux.orig/include/asm-generic/siginfo.h	2009-05-27 21:13:54.000000000 +0200
+++ linux/include/asm-generic/siginfo.h	2009-05-27 21:14:21.000000000 +0200
@@ -82,6 +82,7 @@
 #ifdef __ARCH_SI_TRAPNO
 			int _trapno;	/* TRAP # which caused the signal */
 #endif
+			short _addr_lsb; /* LSB of the reported address */
 		} _sigfault;
 
 		/* SIGPOLL */
@@ -112,6 +113,7 @@
 #ifdef __ARCH_SI_TRAPNO
 #define si_trapno	_sifields._sigfault._trapno
 #endif
+#define si_addr_lsb	_sifields._sigfault._addr_lsb
 #define si_band		_sifields._sigpoll._band
 #define si_fd		_sifields._sigpoll._fd
 
@@ -192,7 +194,11 @@
 #define BUS_ADRALN	(__SI_FAULT|1)	/* invalid address alignment */
 #define BUS_ADRERR	(__SI_FAULT|2)	/* non-existant physical address */
 #define BUS_OBJERR	(__SI_FAULT|3)	/* object specific hardware error */
-#define NSIGBUS		3
+/* hardware memory error consumed on a machine check: action required */
+#define BUS_MCEERR_AR	(__SI_FAULT|4)
+/* hardware memory error detected in process but not consumed: action optional*/
+#define BUS_MCEERR_AO	(__SI_FAULT|5)
+#define NSIGBUS		5
 
 /*
  * SIGTRAP si_codes


* [PATCH] [6/16] HWPOISON: Add basic support for poisoned pages in fault handler v2
  2009-05-27 20:12 ` Andi Kleen
@ 2009-05-27 20:12   ` Andi Kleen
  -1 siblings, 0 replies; 232+ messages in thread
From: Andi Kleen @ 2009-05-27 20:12 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm, fengguang.wu


- Add a new VM_FAULT_HWPOISON error code to handle_mm_fault. Right now
architectures have to explicitly enable poison page support, so
this is forward compatible with all architectures. They only need
to add it when they enable poison page support.
- Add poison page handling to the swap-in fault code.

v2: Add missing delayacct_clear_flag (Hidehiro Kawai)

Signed-off-by: Andi Kleen <ak@linux.intel.com>

---
 include/linux/mm.h |    3 ++-
 mm/memory.c        |   18 +++++++++++++++---
 2 files changed, 17 insertions(+), 4 deletions(-)

Index: linux/mm/memory.c
===================================================================
--- linux.orig/mm/memory.c	2009-05-27 21:13:54.000000000 +0200
+++ linux/mm/memory.c	2009-05-27 21:19:19.000000000 +0200
@@ -1315,7 +1315,8 @@
 				if (ret & VM_FAULT_ERROR) {
 					if (ret & VM_FAULT_OOM)
 						return i ? i : -ENOMEM;
-					else if (ret & VM_FAULT_SIGBUS)
+					if (ret &
+					    (VM_FAULT_HWPOISON|VM_FAULT_SIGBUS))
 						return i ? i : -EFAULT;
 					BUG();
 				}
@@ -2459,8 +2460,15 @@
 		goto out;
 
 	entry = pte_to_swp_entry(orig_pte);
-	if (is_migration_entry(entry)) {
-		migration_entry_wait(mm, pmd, address);
+	if (unlikely(non_swap_entry(entry))) {
+		if (is_migration_entry(entry)) {
+			migration_entry_wait(mm, pmd, address);
+		} else if (is_hwpoison_entry(entry)) {
+			ret = VM_FAULT_HWPOISON;
+		} else {
+			print_bad_pte(vma, address, pte, NULL);
+			ret = VM_FAULT_OOM;
+		}
 		goto out;
 	}
 	delayacct_set_flag(DELAYACCT_PF_SWAPIN);
@@ -2484,6 +2492,10 @@
 		/* Had to read the page from swap area: Major fault */
 		ret = VM_FAULT_MAJOR;
 		count_vm_event(PGMAJFAULT);
+	} else if (PageHWPoison(page)) {
+		ret = VM_FAULT_HWPOISON;
+		delayacct_set_flag(DELAYACCT_PF_SWAPIN);
+		goto out;
 	}
 
 	lock_page(page);
Index: linux/include/linux/mm.h
===================================================================
--- linux.orig/include/linux/mm.h	2009-05-27 21:13:54.000000000 +0200
+++ linux/include/linux/mm.h	2009-05-27 21:19:18.000000000 +0200
@@ -702,11 +702,12 @@
 #define VM_FAULT_SIGBUS	0x0002
 #define VM_FAULT_MAJOR	0x0004
 #define VM_FAULT_WRITE	0x0008	/* Special case for get_user_pages */
+#define VM_FAULT_HWPOISON 0x0010	/* Hit poisoned page */
 
 #define VM_FAULT_NOPAGE	0x0100	/* ->fault installed the pte, not return page */
 #define VM_FAULT_LOCKED	0x0200	/* ->fault locked the returned page */
 
-#define VM_FAULT_ERROR	(VM_FAULT_OOM | VM_FAULT_SIGBUS)
+#define VM_FAULT_ERROR	(VM_FAULT_OOM | VM_FAULT_SIGBUS | VM_FAULT_HWPOISON)
 
 /*
  * Can be called by the pagefault handler when it gets a VM_FAULT_OOM.


* [PATCH] [7/16] HWPOISON: Add various poison checks in mm/memory.c
  2009-05-27 20:12 ` Andi Kleen
@ 2009-05-27 20:12   ` Andi Kleen
  -1 siblings, 0 replies; 232+ messages in thread
From: Andi Kleen @ 2009-05-27 20:12 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm, fengguang.wu


Bail out early when hardware-poisoned pages are found during page fault
handling. Since they are poisoned, they should not be mapped freshly into
processes, because that would cause another (potentially deadly) machine
check.

This is generally handled in the same way as OOM; only a different
error code is returned to the architecture code.

Signed-off-by: Andi Kleen <ak@linux.intel.com>

---
 mm/memory.c |    3 +++
 1 file changed, 3 insertions(+)

Index: linux/mm/memory.c
===================================================================
--- linux.orig/mm/memory.c	2009-05-27 21:14:21.000000000 +0200
+++ linux/mm/memory.c	2009-05-27 21:14:21.000000000 +0200
@@ -2659,6 +2659,9 @@
 	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE)))
 		return ret;
 
+	if (unlikely(PageHWPoison(vmf.page)))
+		return VM_FAULT_HWPOISON;
+
 	/*
 	 * For consistency in subsequent calls, make the faulted page always
 	 * locked.


* [PATCH] [8/16] HWPOISON: x86: Add VM_FAULT_HWPOISON handling to x86 page fault handler
  2009-05-27 20:12 ` Andi Kleen
@ 2009-05-27 20:12   ` Andi Kleen
  -1 siblings, 0 replies; 232+ messages in thread
From: Andi Kleen @ 2009-05-27 20:12 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm, fengguang.wu


Add VM_FAULT_HWPOISON handling to the x86 page fault handler. This is
very similar to VM_FAULT_OOM; the only difference is that a different
si_code is passed to user space and the new si_addr_lsb field is initialized.

Signed-off-by: Andi Kleen <ak@linux.intel.com>

---
 arch/x86/mm/fault.c |   18 ++++++++++++++----
 1 file changed, 14 insertions(+), 4 deletions(-)

Index: linux/arch/x86/mm/fault.c
===================================================================
--- linux.orig/arch/x86/mm/fault.c	2009-05-27 21:13:54.000000000 +0200
+++ linux/arch/x86/mm/fault.c	2009-05-27 21:19:18.000000000 +0200
@@ -189,6 +189,7 @@
 	info.si_errno	= 0;
 	info.si_code	= si_code;
 	info.si_addr	= (void __user *)address;
+	info.si_addr_lsb = si_code == BUS_MCEERR_AR ? PAGE_SHIFT : 0;
 
 	force_sig_info(si_signo, &info, tsk);
 }
@@ -827,10 +828,12 @@
 }
 
 static void
-do_sigbus(struct pt_regs *regs, unsigned long error_code, unsigned long address)
+do_sigbus(struct pt_regs *regs, unsigned long error_code, unsigned long address,
+	  unsigned int fault)
 {
 	struct task_struct *tsk = current;
 	struct mm_struct *mm = tsk->mm;
+	int code = BUS_ADRERR;
 
 	up_read(&mm->mmap_sem);
 
@@ -846,7 +849,14 @@
 	tsk->thread.error_code	= error_code;
 	tsk->thread.trap_no	= 14;
 
-	force_sig_info_fault(SIGBUS, BUS_ADRERR, address, tsk);
+#ifdef CONFIG_MEMORY_FAILURE
+	if (fault & VM_FAULT_HWPOISON) {
+		printk(KERN_ERR "MCE: Killing %s:%d due to hardware memory corruption\n",
+			tsk->comm, tsk->pid);
+		code = BUS_MCEERR_AR;
+	}
+#endif
+	force_sig_info_fault(SIGBUS, code, address, tsk);
 }
 
 static noinline void
@@ -856,8 +866,8 @@
 	if (fault & VM_FAULT_OOM) {
 		out_of_memory(regs, error_code, address);
 	} else {
-		if (fault & VM_FAULT_SIGBUS)
-			do_sigbus(regs, error_code, address);
+		if (fault & (VM_FAULT_SIGBUS|VM_FAULT_HWPOISON))
+			do_sigbus(regs, error_code, address, fault);
 		else
 			BUG();
 	}


* [PATCH] [9/16] HWPOISON: Use bitmask/action code for try_to_unmap behaviour
  2009-05-27 20:12 ` Andi Kleen
@ 2009-05-27 20:12   ` Andi Kleen
  -1 siblings, 0 replies; 232+ messages in thread
From: Andi Kleen @ 2009-05-27 20:12 UTC (permalink / raw)
  To: Lee.Schermerhorn, npiggin, akpm, linux-kernel, linux-mm, fengguang.wu


try_to_unmap currently has multiple modes (migration, munlock, normal unmap)
which are selected by magic flag variables. The logic is not very
straightforward, because each of these flags changes multiple behaviours
(e.g. migration turns off aging, not only sets up migration ptes, etc.),
and the different flags interact in magic ways.

A later patch in this series adds another mode to try_to_unmap, so
this quickly becomes unmanageable.

Replace the different flags with an action code (migration, munlock, munmap)
and some additional flags as modifiers (ignore mlock, ignore aging).
This makes the logic more straightforward and allows easier extension
to new behaviours. Change all the callers to declare what they want
to do.

This patch is supposed to be a nop in behaviour. If anyone can prove
it is not, that would be a bug.

Cc: Lee.Schermerhorn@hp.com
Cc: npiggin@suse.de

Signed-off-by: Andi Kleen <ak@linux.intel.com>

---
 include/linux/rmap.h |   14 +++++++++++++-
 mm/migrate.c         |    2 +-
 mm/rmap.c            |   40 ++++++++++++++++++++++------------------
 mm/vmscan.c          |    2 +-
 4 files changed, 37 insertions(+), 21 deletions(-)

Index: linux/include/linux/rmap.h
===================================================================
--- linux.orig/include/linux/rmap.h	2009-05-27 21:14:21.000000000 +0200
+++ linux/include/linux/rmap.h	2009-05-27 21:19:18.000000000 +0200
@@ -84,7 +84,19 @@
  * Called from mm/vmscan.c to handle paging out
  */
 int page_referenced(struct page *, int is_locked, struct mem_cgroup *cnt);
-int try_to_unmap(struct page *, int ignore_refs);
+
+enum ttu_flags {
+	TTU_UNMAP = 0,			/* unmap mode */
+	TTU_MIGRATION = 1,		/* migration mode */
+	TTU_MUNLOCK = 2,		/* munlock mode */
+	TTU_ACTION_MASK = 0xff,
+
+	TTU_IGNORE_MLOCK = (1 << 8),	/* ignore mlock */
+	TTU_IGNORE_ACCESS = (1 << 9),	/* don't age */
+};
+#define TTU_ACTION(x) ((x) & TTU_ACTION_MASK)
+
+int try_to_unmap(struct page *, enum ttu_flags flags);
 
 /*
  * Called from mm/filemap_xip.c to unmap empty zero page
Index: linux/mm/rmap.c
===================================================================
--- linux.orig/mm/rmap.c	2009-05-27 21:14:21.000000000 +0200
+++ linux/mm/rmap.c	2009-05-27 21:19:18.000000000 +0200
@@ -755,7 +755,7 @@
  * repeatedly from either try_to_unmap_anon or try_to_unmap_file.
  */
 static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
-				int migration)
+				enum ttu_flags flags)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	unsigned long address;
@@ -777,11 +777,13 @@
 	 * If it's recently referenced (perhaps page_referenced
 	 * skipped over this mm) then we should reactivate it.
 	 */
-	if (!migration) {
+	if (!(flags & TTU_IGNORE_MLOCK)) {
 		if (vma->vm_flags & VM_LOCKED) {
 			ret = SWAP_MLOCK;
 			goto out_unmap;
 		}
+	}
+	if (!(flags & TTU_IGNORE_ACCESS)) {
 		if (ptep_clear_flush_young_notify(vma, address, pte)) {
 			ret = SWAP_FAIL;
 			goto out_unmap;
@@ -821,12 +823,12 @@
 			 * pte. do_swap_page() will wait until the migration
 			 * pte is removed and then restart fault handling.
 			 */
-			BUG_ON(!migration);
+			BUG_ON(TTU_ACTION(flags) != TTU_MIGRATION);
 			entry = make_migration_entry(page, pte_write(pteval));
 		}
 		set_pte_at(mm, address, pte, swp_entry_to_pte(entry));
 		BUG_ON(pte_file(*pte));
-	} else if (PAGE_MIGRATION && migration) {
+	} else if (PAGE_MIGRATION && (TTU_ACTION(flags) == TTU_MIGRATION)) {
 		/* Establish migration entry for a file page */
 		swp_entry_t entry;
 		entry = make_migration_entry(page, pte_write(pteval));
@@ -995,12 +997,13 @@
  * vm_flags for that VMA.  That should be OK, because that vma shouldn't be
  * 'LOCKED.
  */
-static int try_to_unmap_anon(struct page *page, int unlock, int migration)
+static int try_to_unmap_anon(struct page *page, enum ttu_flags flags)
 {
 	struct anon_vma *anon_vma;
 	struct vm_area_struct *vma;
 	unsigned int mlocked = 0;
 	int ret = SWAP_AGAIN;
+	int unlock = TTU_ACTION(flags) == TTU_MUNLOCK;
 
 	if (MLOCK_PAGES && unlikely(unlock))
 		ret = SWAP_SUCCESS;	/* default for try_to_munlock() */
@@ -1016,7 +1019,7 @@
 				continue;  /* must visit all unlocked vmas */
 			ret = SWAP_MLOCK;  /* saw at least one mlocked vma */
 		} else {
-			ret = try_to_unmap_one(page, vma, migration);
+			ret = try_to_unmap_one(page, vma, flags);
 			if (ret == SWAP_FAIL || !page_mapped(page))
 				break;
 		}
@@ -1040,8 +1043,7 @@
 /**
  * try_to_unmap_file - unmap/unlock file page using the object-based rmap method
  * @page: the page to unmap/unlock
- * @unlock:  request for unlock rather than unmap [unlikely]
- * @migration:  unmapping for migration - ignored if @unlock
+ * @flags: action and flags
  *
  * Find all the mappings of a page using the mapping pointer and the vma chains
  * contained in the address_space struct it points to.
@@ -1053,7 +1055,7 @@
  * vm_flags for that VMA.  That should be OK, because that vma shouldn't be
  * 'LOCKED.
  */
-static int try_to_unmap_file(struct page *page, int unlock, int migration)
+static int try_to_unmap_file(struct page *page, enum ttu_flags flags)
 {
 	struct address_space *mapping = page->mapping;
 	pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
@@ -1065,6 +1067,7 @@
 	unsigned long max_nl_size = 0;
 	unsigned int mapcount;
 	unsigned int mlocked = 0;
+	int unlock = TTU_ACTION(flags) == TTU_MUNLOCK;
 
 	if (MLOCK_PAGES && unlikely(unlock))
 		ret = SWAP_SUCCESS;	/* default for try_to_munlock() */
@@ -1077,7 +1080,7 @@
 				continue;	/* must visit all vmas */
 			ret = SWAP_MLOCK;
 		} else {
-			ret = try_to_unmap_one(page, vma, migration);
+			ret = try_to_unmap_one(page, vma, flags);
 			if (ret == SWAP_FAIL || !page_mapped(page))
 				goto out;
 		}
@@ -1102,7 +1105,8 @@
 			ret = SWAP_MLOCK;	/* leave mlocked == 0 */
 			goto out;		/* no need to look further */
 		}
-		if (!MLOCK_PAGES && !migration && (vma->vm_flags & VM_LOCKED))
+		if (!MLOCK_PAGES && !(flags & TTU_IGNORE_MLOCK) &&
+			(vma->vm_flags & VM_LOCKED))
 			continue;
 		cursor = (unsigned long) vma->vm_private_data;
 		if (cursor > max_nl_cursor)
@@ -1136,7 +1140,7 @@
 	do {
 		list_for_each_entry(vma, &mapping->i_mmap_nonlinear,
 						shared.vm_set.list) {
-			if (!MLOCK_PAGES && !migration &&
+			if (!MLOCK_PAGES && !(flags & TTU_IGNORE_MLOCK) &&
 			    (vma->vm_flags & VM_LOCKED))
 				continue;
 			cursor = (unsigned long) vma->vm_private_data;
@@ -1176,7 +1180,7 @@
 /**
  * try_to_unmap - try to remove all page table mappings to a page
  * @page: the page to get unmapped
- * @migration: migration flag
+ * @flags: action and flags
  *
  * Tries to remove all the page table entries which are mapping this
  * page, used in the pageout path.  Caller must hold the page lock.
@@ -1187,16 +1191,16 @@
  * SWAP_FAIL	- the page is unswappable
  * SWAP_MLOCK	- page is mlocked.
  */
-int try_to_unmap(struct page *page, int migration)
+int try_to_unmap(struct page *page, enum ttu_flags flags)
 {
 	int ret;
 
 	BUG_ON(!PageLocked(page));
 
 	if (PageAnon(page))
-		ret = try_to_unmap_anon(page, 0, migration);
+		ret = try_to_unmap_anon(page, flags);
 	else
-		ret = try_to_unmap_file(page, 0, migration);
+		ret = try_to_unmap_file(page, flags);
 	if (ret != SWAP_MLOCK && !page_mapped(page))
 		ret = SWAP_SUCCESS;
 	return ret;
@@ -1222,8 +1226,8 @@
 	VM_BUG_ON(!PageLocked(page) || PageLRU(page));
 
 	if (PageAnon(page))
-		return try_to_unmap_anon(page, 1, 0);
+		return try_to_unmap_anon(page, TTU_MUNLOCK);
 	else
-		return try_to_unmap_file(page, 1, 0);
+		return try_to_unmap_file(page, TTU_MUNLOCK);
 }
 #endif
Index: linux/mm/vmscan.c
===================================================================
--- linux.orig/mm/vmscan.c	2009-05-27 21:13:54.000000000 +0200
+++ linux/mm/vmscan.c	2009-05-27 21:14:21.000000000 +0200
@@ -666,7 +666,7 @@
 		 * processes. Try to unmap it here.
 		 */
 		if (page_mapped(page) && mapping) {
-			switch (try_to_unmap(page, 0)) {
+			switch (try_to_unmap(page, TTU_UNMAP)) {
 			case SWAP_FAIL:
 				goto activate_locked;
 			case SWAP_AGAIN:
Index: linux/mm/migrate.c
===================================================================
--- linux.orig/mm/migrate.c	2009-05-27 21:13:54.000000000 +0200
+++ linux/mm/migrate.c	2009-05-27 21:14:21.000000000 +0200
@@ -669,7 +669,7 @@
 	}
 
 	/* Establish migration ptes or remove ptes */
-	try_to_unmap(page, 1);
+	try_to_unmap(page, TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS);
 
 	if (!page_mapped(page))
 		rc = move_to_new_page(newpage, page);

@@ -1187,16 +1191,16 @@
  * SWAP_FAIL	- the page is unswappable
  * SWAP_MLOCK	- page is mlocked.
  */
-int try_to_unmap(struct page *page, int migration)
+int try_to_unmap(struct page *page, enum ttu_flags flags)
 {
 	int ret;
 
 	BUG_ON(!PageLocked(page));
 
 	if (PageAnon(page))
-		ret = try_to_unmap_anon(page, 0, migration);
+		ret = try_to_unmap_anon(page, flags);
 	else
-		ret = try_to_unmap_file(page, 0, migration);
+		ret = try_to_unmap_file(page, flags);
 	if (ret != SWAP_MLOCK && !page_mapped(page))
 		ret = SWAP_SUCCESS;
 	return ret;
@@ -1222,8 +1226,8 @@
 	VM_BUG_ON(!PageLocked(page) || PageLRU(page));
 
 	if (PageAnon(page))
-		return try_to_unmap_anon(page, 1, 0);
+		return try_to_unmap_anon(page, TTU_MUNLOCK);
 	else
-		return try_to_unmap_file(page, 1, 0);
+		return try_to_unmap_file(page, TTU_MUNLOCK);
 }
 #endif
Index: linux/mm/vmscan.c
===================================================================
--- linux.orig/mm/vmscan.c	2009-05-27 21:13:54.000000000 +0200
+++ linux/mm/vmscan.c	2009-05-27 21:14:21.000000000 +0200
@@ -666,7 +666,7 @@
 		 * processes. Try to unmap it here.
 		 */
 		if (page_mapped(page) && mapping) {
-			switch (try_to_unmap(page, 0)) {
+			switch (try_to_unmap(page, TTU_UNMAP)) {
 			case SWAP_FAIL:
 				goto activate_locked;
 			case SWAP_AGAIN:
Index: linux/mm/migrate.c
===================================================================
--- linux.orig/mm/migrate.c	2009-05-27 21:13:54.000000000 +0200
+++ linux/mm/migrate.c	2009-05-27 21:14:21.000000000 +0200
@@ -669,7 +669,7 @@
 	}
 
 	/* Establish migration ptes or remove ptes */
-	try_to_unmap(page, 1);
+	try_to_unmap(page, TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS);
 
 	if (!page_mapped(page))
 		rc = move_to_new_page(newpage, page);

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 232+ messages in thread

* [PATCH] [10/16] HWPOISON: Handle hardware poisoned pages in try_to_unmap
  2009-05-27 20:12 ` Andi Kleen
@ 2009-05-27 20:12   ` Andi Kleen
  -1 siblings, 0 replies; 232+ messages in thread
From: Andi Kleen @ 2009-05-27 20:12 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm, fengguang.wu


When a page has the poison bit set, replace the PTE with a poison entry.
This causes the right error handling to be done later, when a process runs
into it.

Also add a new flag to skip that (needed for the memory-failure handler
later).

Signed-off-by: Andi Kleen <ak@linux.intel.com>

---
 include/linux/rmap.h |    1 +
 mm/rmap.c            |    9 ++++++++-
 2 files changed, 9 insertions(+), 1 deletion(-)

Index: linux/mm/rmap.c
===================================================================
--- linux.orig/mm/rmap.c	2009-05-27 21:14:21.000000000 +0200
+++ linux/mm/rmap.c	2009-05-27 21:14:21.000000000 +0200
@@ -801,7 +801,14 @@
 	/* Update high watermark before we lower rss */
 	update_hiwater_rss(mm);
 
-	if (PageAnon(page)) {
+	if (PageHWPoison(page) && !(flags & TTU_IGNORE_HWPOISON)) {
+		if (PageAnon(page))
+			dec_mm_counter(mm, anon_rss);
+		else if (!is_migration_entry(pte_to_swp_entry(*pte)))
+			dec_mm_counter(mm, file_rss);
+		set_pte_at(mm, address, pte,
+				swp_entry_to_pte(make_hwpoison_entry(page)));
+	} else if (PageAnon(page)) {
 		swp_entry_t entry = { .val = page_private(page) };
 
 		if (PageSwapCache(page)) {
Index: linux/include/linux/rmap.h
===================================================================
--- linux.orig/include/linux/rmap.h	2009-05-27 21:14:21.000000000 +0200
+++ linux/include/linux/rmap.h	2009-05-27 21:14:21.000000000 +0200
@@ -93,6 +93,7 @@
 
 	TTU_IGNORE_MLOCK = (1 << 8),	/* ignore mlock */
 	TTU_IGNORE_ACCESS = (1 << 9),	/* don't age */
+	TTU_IGNORE_HWPOISON = (1 << 10),/* corrupted page is recoverable */
 };
 #define TTU_ACTION(x) ((x) & TTU_ACTION_MASK)
 


* [PATCH] [11/16] HWPOISON: Handle poisoned pages in set_page_dirty()
  2009-05-27 20:12 ` Andi Kleen
@ 2009-05-27 20:12   ` Andi Kleen
  -1 siblings, 0 replies; 232+ messages in thread
From: Andi Kleen @ 2009-05-27 20:12 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm, fengguang.wu


Bail out early in set_page_dirty() for poisoned pages. We don't want any
of the dirty accounting done or file system writeback started, because
the page will just be thrown away.

Signed-off-by: Andi Kleen <ak@linux.intel.com>

---
 mm/page-writeback.c |    4 ++++
 1 file changed, 4 insertions(+)

Index: linux/mm/page-writeback.c
===================================================================
--- linux.orig/mm/page-writeback.c	2009-05-26 22:15:37.000000000 +0200
+++ linux/mm/page-writeback.c	2009-05-27 21:14:21.000000000 +0200
@@ -1277,6 +1277,10 @@
 {
 	struct address_space *mapping = page_mapping(page);
 
+	if (unlikely(PageHWPoison(page))) {
+		SetPageDirty(page);
+		return 0;
+	}
 	if (likely(mapping)) {
 		int (*spd)(struct page *) = mapping->a_ops->set_page_dirty;
 #ifdef CONFIG_BLOCK


* [PATCH] [12/16] HWPOISON: check and isolate corrupted free pages
  2009-05-27 20:12 ` Andi Kleen
@ 2009-05-27 20:12   ` Andi Kleen
  -1 siblings, 0 replies; 232+ messages in thread
From: Andi Kleen @ 2009-05-27 20:12 UTC (permalink / raw)
  To: fengguang.wu, akpm, linux-kernel, linux-mm


From: Wu Fengguang <fengguang.wu@intel.com>

If memory corruption hits a free buddy page, we can safely ignore it.
No one will access it until page allocation time, when prep_new_page()
will automatically check and isolate the PG_hwpoison page for us (for 0-order
allocations).

This patch expands prep_new_page() to check every component page in a high
order page allocation, in order to completely stop PG_hwpoison pages from
being recirculated.

Note that the common case, allocating only a single page, doesn't
do any more work than before. Allocating order > 0 does a bit more work,
but that's relatively uncommon.

This simple implementation may drop some innocent neighbor pages; hopefully
that is not a big problem, because the event should be rare enough.

This patch adds some runtime cost for high order page users.

[AK: Improved description]
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>

---
 mm/page_alloc.c |   22 ++++++++++++++++------
 1 file changed, 16 insertions(+), 6 deletions(-)

Index: linux/mm/page_alloc.c
===================================================================
--- linux.orig/mm/page_alloc.c	2009-05-27 21:13:54.000000000 +0200
+++ linux/mm/page_alloc.c	2009-05-27 21:14:21.000000000 +0200
@@ -633,12 +633,22 @@
  */
 static int prep_new_page(struct page *page, int order, gfp_t gfp_flags)
 {
-	if (unlikely(page_mapcount(page) |
-		(page->mapping != NULL)  |
-		(page_count(page) != 0)  |
-		(page->flags & PAGE_FLAGS_CHECK_AT_PREP))) {
-		bad_page(page);
-		return 1;
+	int i;
+
+	for (i = 0; i < (1 << order); i++) {
+		struct page *p = page + i;
+
+		if (unlikely(page_mapcount(p) |
+			(p->mapping != NULL)  |
+			(page_count(p) != 0)  |
+			(p->flags & PAGE_FLAGS_CHECK_AT_PREP))) {
+			/*
+			 * The whole array of pages will be dropped,
+			 * hopefully this is a rare and abnormal event.
+			 */
+			bad_page(p);
+			return 1;
+		}
 	}
 
 	set_page_private(page, 0);


* [PATCH] [13/16] HWPOISON: The high level memory error handler in the VM v3
  2009-05-27 20:12 ` Andi Kleen
@ 2009-05-27 20:12   ` Andi Kleen
  -1 siblings, 0 replies; 232+ messages in thread
From: Andi Kleen @ 2009-05-27 20:12 UTC (permalink / raw)
  To: hugh, npiggin, riel, akpm, chris.mason, linux-kernel,
	linux-mm, fengguang.wu


This patch adds the high level memory error handler that poisons pages
that got corrupted by hardware (typically by a bit flip in a DIMM
or a cache) on the Linux level. Linux then avoids accessing these
pages in the future.

It is portable code and lives in mm/memory-failure.c

To quote the overview comment:

 * High level machine check handler. Handles pages reported by the
 * hardware as being corrupted usually due to a 2bit ECC memory or cache
 * failure.
 *
 * This focusses on pages detected as corrupted in the background.
 * When the current CPU tries to consume corruption the currently
 * running process can just be killed directly instead. This implies
 * that if the error cannot be handled for some reason it's safe to
 * just ignore it because no corruption has been consumed yet. Instead
 * when that happens another machine check will happen.
 *
 * Handles page cache pages in various states. The tricky part
 * here is that we can access any page asynchronous to other VM
 * users, because memory failures could happen anytime and anywhere,
 * possibly violating some of their assumptions. This is why this code
 * has to be extremely careful. Generally it tries to use normal locking
 * rules, as in get the standard locks, even if that means the
 * error handling takes potentially a long time.
 *
 * Some of the operations here are somewhat inefficient and have non
 * linear algorithmic complexity, because the data structures have not
 * been optimized for this case. This is in particular the case
 * for the mapping from a vma to a process. Since this case is expected
 * to be rare we hope we can get away with this.

There are in principle two strategies to kill processes on poison:
- just unmap the data and wait for an actual reference before killing
- kill as soon as corruption is detected.
Both have advantages and disadvantages and should be used
in different situations. Right now both are implemented and can
be switched with a new sysctl, vm.memory_failure_early_kill.
The default is early kill.

The patch does some rmap data structure walking on its own to collect
processes to kill. This is unusual because normally all rmap data structure
knowledge is in rmap.c only. I put it here for now to keep
everything together, and rmap knowledge has been seeping out anyway.

v2: Fix anon vma unlock crash (noticed by Johannes Weiner <hannes@cmpxchg.org>)
Handle pages on free list correctly (also noticed by Johannes)
Fix inverted try_to_release_page check (found by Chris Mason)
Add documentation for the new sysctl.
Various other cleanups/comment fixes.
v3: Use blockable signal for AO SIGBUS for better qemu handling.
Numerous fixes from Fengguang Wu: 
New code layout for the table (redone by AK)
Move the hwpoison bit setting before the lock (Fengguang Wu)
Some code cleanups (Fengguang Wu, AK)
Add missing lru_drain (Fengguang Wu)
Do more checks for valid mappings (inspired by patch from Fengguang)
Handle free pages and fixes for clean pages (Fengguang)
Removed swap cache handling for now, needs more work
Better mapping checks to avoid races (Fengguang)
Fix swapcache (Fengguang)
Handle private2 pages too (Fengguang)

Cc: hugh@veritas.com
Cc: npiggin@suse.de
Cc: riel@redhat.com
Cc: akpm@linux-foundation.org
Cc: chris.mason@oracle.com
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Rik van Riel <riel@redhat.com>

---
 Documentation/sysctl/vm.txt |   21 +
 arch/x86/mm/fault.c         |    5 
 fs/proc/meminfo.c           |    9 
 include/linux/mm.h          |    4 
 kernel/sysctl.c             |   14 
 mm/Kconfig                  |    3 
 mm/Makefile                 |    1 
 mm/memory-failure.c         |  677 ++++++++++++++++++++++++++++++++++++++++++++
 8 files changed, 730 insertions(+), 4 deletions(-)

Index: linux/mm/Makefile
===================================================================
--- linux.orig/mm/Makefile	2009-05-27 21:23:18.000000000 +0200
+++ linux/mm/Makefile	2009-05-27 21:24:39.000000000 +0200
@@ -38,3 +38,4 @@
 endif
 obj-$(CONFIG_QUICKLIST) += quicklist.o
 obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o page_cgroup.o
+obj-$(CONFIG_MEMORY_FAILURE) += memory-failure.o
Index: linux/mm/memory-failure.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux/mm/memory-failure.c	2009-05-27 21:28:19.000000000 +0200
@@ -0,0 +1,677 @@
+/*
+ * Copyright (C) 2008, 2009 Intel Corporation
+ * Author: Andi Kleen
+ *
+ * This software may be redistributed and/or modified under the terms of
+ * the GNU General Public License ("GPL") version 2 only as published by the
+ * Free Software Foundation.
+ *
+ * High level machine check handler. Handles pages reported by the
+ * hardware as being corrupted usually due to a 2bit ECC memory or cache
+ * failure.
+ *
+ * This focuses on pages detected as corrupted in the background.
+ * When the current CPU tries to consume corruption the currently
+ * running process can just be killed directly instead. This implies
+ * that if the error cannot be handled for some reason it's safe to
+ * just ignore it because no corruption has been consumed yet. Instead
+ * when that happens another machine check will happen.
+ *
+ * Handles page cache pages in various states.	The tricky part
+ * here is that we can access any page asynchronous to other VM
+ * users, because memory failures could happen anytime and anywhere,
+ * possibly violating some of their assumptions. This is why this code
+ * has to be extremely careful. Generally it tries to use normal locking
+ * rules, as in get the standard locks, even if that means the
+ * error handling takes potentially a long time.
+ *
+ * The operation to map back from RMAP chains to processes has to walk
+ * the complete process list and has non linear complexity with the number of
+ * mappings. In short it can be quite slow. But since memory corruptions
+ * are rare we hope to get away with this.
+ */
+
+/*
+ * Notebook:
+ * - hugetlb needs more code
+ * - nonlinear
+ * - remap races
+ * - anonymous (tinject):
+ *   + left over references when process catches signal?
+ * - kcore/oldmem/vmcore/mem/kmem check for hwpoison pages
+ * - pass bad pages to kdump next kernel
+ */
+#include <linux/kernel.h>
+#include <linux/mm.h>
+#include <linux/page-flags.h>
+#include <linux/sched.h>
+#include <linux/rmap.h>
+#include <linux/pagemap.h>
+#include <linux/swap.h>
+#include <linux/backing-dev.h>
+#include "internal.h"
+
+#define Dprintk(x...) printk(x)
+
+int sysctl_memory_failure_early_kill __read_mostly = 1;
+
+atomic_long_t mce_bad_pages __read_mostly = ATOMIC_LONG_INIT(0);
+
+/*
+ * Send all the processes who have the page mapped an ``action optional''
+ * signal.
+ */
+static int kill_proc_ao(struct task_struct *t, unsigned long addr, int trapno,
+			unsigned long pfn)
+{
+	struct siginfo si;
+	int ret;
+
+	printk(KERN_ERR
+		"MCE %#lx: Killing %s:%d due to hardware memory corruption\n",
+		pfn, t->comm, t->pid);
+	si.si_signo = SIGBUS;
+	si.si_errno = 0;
+	si.si_code = BUS_MCEERR_AO;
+	si.si_addr = (void *)addr;
+#ifdef __ARCH_SI_TRAPNO
+	si.si_trapno = trapno;
+#endif
+	si.si_addr_lsb = PAGE_SHIFT;
+	/*
+	 * Don't use force here, it's convenient if the signal
+	 * can be temporarily blocked.
+	 * This could cause a loop when the user sets SIGBUS
+	 * to SIG_IGN, but hopefully no one will do that?
+	 */
+	ret = send_sig_info(SIGBUS, &si, t);  /* synchronous? */
+	if (ret < 0)
+		printk(KERN_INFO "MCE: Error sending signal to %s:%d: %d\n",
+		       t->comm, t->pid, ret);
+	return ret;
+}
+
+/*
+ * Kill all processes that have a poisoned page mapped and then isolate
+ * the page.
+ *
+ * General strategy:
+ * Find all processes having the page mapped and kill them.
+ * But we keep a page reference around so that the page is not
+ * actually freed yet.
+ * Then stash the page away
+ *
+ * There's no convenient way to get back to mapped processes
+ * from the VMAs. So do a brute-force search over all
+ * running processes.
+ *
+ * Remember that machine checks are not common (or rather
+ * if they are common you have other problems), so this shouldn't
+ * be a performance issue.
+ *
+ * Also there are some races possible while we get from the
+ * error detection to actually handle it.
+ */
+
+struct to_kill {
+	struct list_head nd;
+	struct task_struct *tsk;
+	unsigned long addr;
+};
+
+/*
+ * Failure handling: if we can't find or can't kill a process there's
+ * not much we can do.	We just print a message and ignore otherwise.
+ */
+
+/*
+ * Schedule a process for later kill.
+ * Uses GFP_ATOMIC allocations to avoid potential recursions in the VM.
+ * TBD would GFP_NOIO be enough?
+ */
+static void add_to_kill(struct task_struct *tsk, struct page *p,
+		       struct vm_area_struct *vma,
+		       struct list_head *to_kill,
+		       struct to_kill **tkc)
+{
+	int fail = 0;
+	struct to_kill *tk;
+
+	if (*tkc) {
+		tk = *tkc;
+		*tkc = NULL;
+	} else {
+		tk = kmalloc(sizeof(struct to_kill), GFP_ATOMIC);
+		if (!tk) {
+			printk(KERN_ERR "MCE: Out of memory while machine check handling\n");
+			return;
+		}
+	}
+	tk->addr = page_address_in_vma(p, vma);
+	if (tk->addr == -EFAULT) {
+		printk(KERN_INFO "MCE: Failed to get address in VMA\n");
+		tk->addr = 0;
+		fail = 1;
+	}
+	get_task_struct(tsk);
+	tk->tsk = tsk;
+	list_add_tail(&tk->nd, to_kill);
+}
+
+/*
+ * Kill the processes that have been collected earlier.
+ */
+static void kill_procs_ao(struct list_head *to_kill, int doit, int trapno,
+			  int fail, unsigned long pfn)
+{
+	struct to_kill *tk, *next;
+
+	list_for_each_entry_safe (tk, next, to_kill, nd) {
+		if (doit) {
+			/*
+			 * In case something went wrong with munmaping
+			 * make sure the process doesn't catch the
+			 * signal and then access the memory. So reset
+			 * the signal handlers
+			 */
+			if (fail)
+				flush_signal_handlers(tk->tsk, 1);
+
+			/*
+			 * In theory the process could have mapped
+			 * something else on the address in-between. We could
+			 * check for that, but we need to tell the
+			 * process anyways.
+			 */
+			if (kill_proc_ao(tk->tsk, tk->addr, trapno, pfn) < 0)
+				printk(KERN_ERR
+		"MCE %#lx: Cannot send advisory machine check signal to %s:%d\n",
+					pfn, tk->tsk->comm, tk->tsk->pid);
+		}
+		put_task_struct(tk->tsk);
+		kfree(tk);
+	}
+}
+
+/*
+ * Collect processes when the error hit an anonymous page.
+ */
+static void collect_procs_anon(struct page *page, struct list_head *to_kill,
+			      struct to_kill **tkc)
+{
+	struct vm_area_struct *vma;
+	struct task_struct *tsk;
+	struct anon_vma *av = page_lock_anon_vma(page);
+
+	if (av == NULL)	/* Not actually mapped anymore */
+		return;
+
+	read_lock(&tasklist_lock);
+	for_each_process (tsk) {
+		if (!tsk->mm)
+			continue;
+		list_for_each_entry (vma, &av->head, anon_vma_node) {
+			if (vma->vm_mm == tsk->mm)
+				add_to_kill(tsk, page, vma, to_kill, tkc);
+		}
+	}
+	page_unlock_anon_vma(av);
+	read_unlock(&tasklist_lock);
+}
+
+/*
+ * Collect processes when the error hit a file mapped page.
+ */
+static void collect_procs_file(struct page *page, struct list_head *to_kill,
+			      struct to_kill **tkc)
+{
+	struct vm_area_struct *vma;
+	struct task_struct *tsk;
+	struct prio_tree_iter iter;
+	struct address_space *mapping = page_mapping(page);
+
+	read_lock(&tasklist_lock);
+	spin_lock(&mapping->i_mmap_lock);
+	for_each_process(tsk) {
+		pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
+
+		if (!tsk->mm)
+			continue;
+
+		vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff,
+				      pgoff)
+			if (vma->vm_mm == tsk->mm)
+				add_to_kill(tsk, page, vma, to_kill, tkc);
+	}
+	spin_unlock(&mapping->i_mmap_lock);
+	read_unlock(&tasklist_lock);
+}
+
+/*
+ * Collect the processes who have the corrupted page mapped to kill.
+ * This is done in two steps for locking reasons.
+ * First preallocate one tokill structure outside the spin locks,
+ * so that we can kill at least one process reasonably reliably.
+ */
+static void collect_procs(struct page *page, struct list_head *tokill)
+{
+	struct to_kill *tk;
+
+	tk = kmalloc(sizeof(struct to_kill), GFP_KERNEL);
+	/* memory allocation failure is implicitly handled */
+	if (PageAnon(page))
+		collect_procs_anon(page, tokill, &tk);
+	else
+		collect_procs_file(page, tokill, &tk);
+	kfree(tk);
+}
+
+/*
+ * Error handlers for various types of pages.
+ */
+
+enum outcome {
+	FAILED,
+	DELAYED,
+	IGNORED,
+	RECOVERED,
+};
+
+static const char *action_name[] = {
+	[FAILED] = "Failed",
+	[DELAYED] = "Delayed",
+	[IGNORED] = "Ignored",
+	[RECOVERED] = "Recovered",
+};
+
+/*
+ * Error hit kernel page.
+ * Do nothing, try to be lucky and not touch this instead. For a few cases we
+ * could be more sophisticated.
+ */
+static int me_kernel(struct page *p)
+{
+	return DELAYED;
+}
+
+/*
+ * Already poisoned page.
+ */
+static int me_ignore(struct page *p)
+{
+	return IGNORED;
+}
+
+/*
+ * Page in unknown state. Do nothing.
+ */
+static int me_unknown(struct page *p)
+{
+	printk(KERN_ERR "MCE %#lx: Unknown page state\n", page_to_pfn(p));
+	return FAILED;
+}
+
+/*
+ * Free memory
+ */
+static int me_free(struct page *p)
+{
+	return DELAYED;
+}
+
+/*
+ * Clean (or cleaned) page cache page.
+ */
+static int me_pagecache_clean(struct page *p)
+{
+	if (!isolate_lru_page(p))
+		page_cache_release(p);
+
+	if (page_has_private(p))
+		do_invalidatepage(p, 0);
+	if (page_has_private(p) && !try_to_release_page(p, GFP_NOIO))
+		Dprintk(KERN_ERR "MCE %#lx: failed to release buffers\n",
+			page_to_pfn(p));
+
+	/*
+	 * remove_from_page_cache assumes (mapping && !mapped)
+	 */
+	if (page_mapping(p) && !page_mapped(p)) {
+		remove_from_page_cache(p);
+		page_cache_release(p);
+	}
+
+	return RECOVERED;
+}
+
+/*
+ * Dirty page cache page.
+ * Issues: when the error hits a hole page the error is not properly
+ * propagated.
+ */
+static int me_pagecache_dirty(struct page *p)
+{
+	struct address_space *mapping = page_mapping(p);
+
+	SetPageError(p);
+	/* TBD: print more information about the file. */
+	printk(KERN_ERR "MCE %#lx: Hardware memory corruption on dirty file page: write error\n",
+			page_to_pfn(p));
+	if (mapping) {
+		/*
+		 * Truncate does the same, but we're not quite the same
+		 * as truncate. Needs more checking, but keep it for now.
+		 */
+		cancel_dirty_page(p, PAGE_CACHE_SIZE);
+
+		/*
+		 * IO error will be reported by write(), fsync(), etc.
+		 * who check the mapping.
+		 */
+		mapping_set_error(mapping, EIO);
+	}
+
+	me_pagecache_clean(p);
+
+	/*
+	 * Did the earlier release work?
+	 */
+	if (page_has_private(p) && !try_to_release_page(p, GFP_NOIO))
+		return FAILED;
+
+	return RECOVERED;
+}
+
+/*
+ * Clean and dirty swap cache.
+ */
+static int me_swapcache_dirty(struct page *p)
+{
+	ClearPageDirty(p);
+
+	if (!isolate_lru_page(p))
+		page_cache_release(p);
+
+	return DELAYED;
+}
+
+static int me_swapcache_clean(struct page *p)
+{
+	ClearPageUptodate(p);
+
+	if (!isolate_lru_page(p))
+		page_cache_release(p);
+
+	delete_from_swap_cache(p);
+
+	return RECOVERED;
+}
+
+/*
+ * Huge pages. Needs work.
+ * Issues:
+ * No rmap support so we cannot find the original mapper. In theory could walk
+ * all MMs and look for the mappings, but that would be non atomic and racy.
+ * Need rmap for hugepages for this. Alternatively we could employ a heuristic,
+ * like just walking the current process and hoping it has it mapped (that
+ * should be usually true for the common "shared database cache" case)
+ * Should handle free huge pages and dequeue them too, but this needs to
+ * handle huge page accounting correctly.
+ */
+static int me_huge_page(struct page *p)
+{
+	return FAILED;
+}
+
+/*
+ * Various page states we can handle.
+ *
+ * A page state is defined by its current page->flags bits.
+ * The table matches them in order and calls the right handler.
+ *
+ * This is quite tricky because we can access page at any time
+ * in its life cycle, so all accesses have to be extremely careful.
+ *
+ * This is not complete. More states could be added.
+ * For any missing state don't attempt recovery.
+ */
+
+#define dirty		(1UL << PG_dirty)
+#define swapcache	(1UL << PG_swapcache)
+#define unevict		(1UL << PG_unevictable)
+#define mlocked		(1UL << PG_mlocked)
+#define writeback	(1UL << PG_writeback)
+#define lru		(1UL << PG_lru)
+#define swapbacked	(1UL << PG_swapbacked)
+#define head		(1UL << PG_head)
+#define tail		(1UL << PG_tail)
+#define compound	(1UL << PG_compound)
+#define slab		(1UL << PG_slab)
+#define buddy		(1UL << PG_buddy)
+#define reserved	(1UL << PG_reserved)
+
+/*
+ * The table is > 80 columns because all the alternatives were much worse.
+ */
+
+static struct page_state {
+	unsigned long mask;
+	unsigned long res;
+	char *msg;
+	int (*action)(struct page *p);
+} error_states[] = {
+	{ reserved,	reserved,	"reserved kernel",	me_ignore },
+	{ buddy,	buddy,		"free kernel",		me_free },
+
+	/*
+	 * Could in theory check if slab page is free or if we can drop
+	 * currently unused objects without touching them. But just
+	 * treat it as standard kernel for now.
+	 */
+	{ slab,			slab,		"kernel slab",		me_kernel },
+
+#ifdef CONFIG_PAGEFLAGS_EXTENDED
+	{ head,			head,		"hugetlb",		me_huge_page },
+	{ tail,			tail,		"hugetlb",		me_huge_page },
+#else
+	{ compound,		compound,	"hugetlb",		me_huge_page },
+#endif
+
+	{ swapcache|dirty,	swapcache|dirty,"dirty swapcache",	me_swapcache_dirty },
+	{ swapcache|dirty,	swapcache,	"clean swapcache",	me_swapcache_clean },
+
+#ifdef CONFIG_UNEVICTABLE_LRU
+	{ unevict|dirty,	unevict|dirty,	"unevictable dirty lru", me_pagecache_dirty },
+	{ unevict,		unevict,	"unevictable lru",	me_pagecache_clean },
+#endif
+
+#ifdef CONFIG_HAVE_MLOCKED_PAGE_BIT
+	{ mlocked|dirty,	mlocked|dirty,	"mlocked dirty lru",	me_pagecache_dirty },
+	{ mlocked,		mlocked,	"mlocked lru",		me_pagecache_clean },
+#endif
+
+	{ lru|dirty,		lru|dirty,	"dirty lru",		me_pagecache_dirty },
+	{ lru|dirty,		lru,		"clean lru",		me_pagecache_clean },
+	{ swapbacked,		swapbacked,	"anonymous",		me_pagecache_clean },
+
+	/*
+	 * Add more states here.
+	 */
+
+	/*
+	 * Catchall entry: must be at end.
+	 */
+	{ 0,			0,		"unknown page state",	me_unknown },
+};
+
+static void page_action(char *msg, struct page *p, int (*action)(struct page *),
+			unsigned long pfn)
+{
+	int ret;
+
+	printk(KERN_ERR "MCE %#lx: %s page recovery: starting\n", pfn, msg);
+	ret = action(p);
+	printk(KERN_ERR "MCE %#lx: %s page recovery: %s\n",
+	       pfn, msg, action_name[ret]);
+	if (page_count(p) != 1)
+		printk(KERN_ERR
+		       "MCE %#lx: %s page still referenced by %d users\n",
+		       pfn, msg, page_count(p) - 1);
+
+	/* Could do more checks here if page looks ok */
+	atomic_long_add(1, &mce_bad_pages);
+
+	/*
+	 * Could adjust zone counters here to correct for the missing page.
+	 */
+}
+
+#define N_UNMAP_TRIES 5
+
+static void hwpoison_page_prepare(struct page *p, unsigned long pfn,
+				  int trapno)
+{
+	enum ttu_flags ttu = TTU_UNMAP | TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS;
+	int kill = sysctl_memory_failure_early_kill;
+	struct address_space *mapping;
+	LIST_HEAD(tokill);
+	int ret;
+	int i;
+
+	if (PageReserved(p) || PageCompound(p) || PageSlab(p))
+		return;
+
+	if (!PageLRU(p))
+		lru_add_drain();
+
+	/*
+	 * This check implies we don't early-kill processes if their
+	 * pages are in the swap cache. Those are always late kills.
+	 */
+	if (!page_mapped(p))
+		return;
+
+	if (PageSwapCache(p)) {
+		printk(KERN_ERR
+		       "MCE %#lx: keeping poisoned page in swap cache\n", pfn);
+		ttu |= TTU_IGNORE_HWPOISON;
+	}
+
+	/*
+	 * Poisoned clean file pages are harmless, the
+	 * data can be restored by regular page faults.
+	 */
+	mapping = page_mapping(p);
+	if (!PageDirty(p) && !PageWriteback(p) &&
+	    !PageAnon(p) && !PageSwapBacked(p) &&
+	    mapping && mapping_cap_account_dirty(mapping)) {
+		if (page_mkclean(p))
+			SetPageDirty(p);
+		else {
+			kill = 0;
+			ttu |= TTU_IGNORE_HWPOISON;
+		}
+	}
+
+	/*
+	 * First collect all the processes that have the page
+	 * mapped.  This has to be done before try_to_unmap,
+	 * because ttu takes the rmap data structures down.
+	 *
+	 * This also has the side effect of propagating the dirty
+	 * bit from the PTEs into the struct page. This is needed
+	 * to actually decide if something needs to be killed
+	 * or errored, or if it's ok to just drop the page.
+	 *
+	 * Error handling: We ignore errors here because
+	 * there's nothing that can be done.
+	 *
+	 * RED-PEN some cases in process exit seem to deadlock
+	 * on the page lock. drop it or add poison checks?
+	 */
+	if (kill)
+		collect_procs(p, &tokill);
+
+	/*
+	 * try_to_unmap can fail temporarily due to races.
+	 * Try a few times (RED-PEN better strategy?)
+	 */
+	for (i = 0; i < N_UNMAP_TRIES; i++) {
+		ret = try_to_unmap(p, ttu);
+		if (ret == SWAP_SUCCESS)
+			break;
+		Dprintk("MCE %#lx: try_to_unmap retry needed %d\n", pfn, ret);
+	}
+
+	/*
+	 * Now that the dirty bit has been propagated to the
+	 * struct page and all unmaps done we can decide if
+	 * killing is needed or not.  Only kill when the page
+	 * was dirty, otherwise the tokill list is merely
+	 * freed.  When there was a problem unmapping earlier
+	 * use a more forceful uncatchable kill to prevent
+	 * any accesses to the poisoned memory.
+	 */
+	kill_procs_ao(&tokill, !!PageDirty(p), trapno,
+		      ret != SWAP_SUCCESS, pfn);
+}
+
+/**
+ * memory_failure - Handle memory failure of a page.
+ * @pfn: page frame number of the corrupted page, @trapno: trap number
+ */
+void memory_failure(unsigned long pfn, int trapno)
+{
+	struct page_state *ps;
+	struct page *p;
+
+	if (!pfn_valid(pfn)) {
+		printk(KERN_ERR
+   "MCE %#lx: Hardware memory corruption in memory outside kernel control\n",
+		       pfn);
+		return;
+	}
+
+
+	p = pfn_to_page(pfn);
+	if (TestSetPageHWPoison(p)) {
+		printk(KERN_ERR "MCE %#lx: Error for already hardware poisoned page\n", pfn);
+		return;
+	}
+
+	/*
+	 * We need/can do nothing about count=0 pages.
+	 * 1) it's a free page, and therefore in safe hands:
+	 *    prep_new_page() will be the gate keeper.
+	 * 2) it's part of a non-compound high order page.
+	 *    Implies some kernel user: cannot stop them from
+	 *    R/W the page; let's pray that the page has been
+	 *    used and will be freed some time later.
+	 * In fact it's dangerous to directly bump up page count from 0,
+	 * that may make page_freeze_refs()/page_unfreeze_refs() mismatch.
+	 */
+	if (!get_page_unless_zero(compound_head(p))) {
+		printk(KERN_ERR
+		       "MCE %#lx: ignoring free or high order page\n", pfn);
+		return;
+	}
+
+	lock_page_nosync(p);
+	hwpoison_page_prepare(p, pfn, trapno);
+
+	/* Torn down by someone else? */
+	if (PageLRU(p) && !PageSwapCache(p) && p->mapping == NULL) {
+		printk(KERN_ERR
+		       "MCE %#lx: ignoring NULL mapping LRU page\n", pfn);
+		goto out;
+	}
+
+	for (ps = error_states;; ps++) {
+		if ((p->flags & ps->mask) == ps->res) {
+			page_action(ps->msg, p, ps->action, pfn);
+			break;
+		}
+	}
+out:
+	unlock_page(p);
+}
Index: linux/include/linux/mm.h
===================================================================
--- linux.orig/include/linux/mm.h	2009-05-27 21:24:39.000000000 +0200
+++ linux/include/linux/mm.h	2009-05-27 21:24:39.000000000 +0200
@@ -1322,6 +1322,10 @@
 
 extern void *alloc_locked_buffer(size_t size);
 extern void free_locked_buffer(void *buffer, size_t size);
+
+extern void memory_failure(unsigned long pfn, int trapno);
+extern int sysctl_memory_failure_early_kill;
+extern atomic_long_t mce_bad_pages;
 extern void release_locked_buffer(void *buffer, size_t size);
 #endif /* __KERNEL__ */
 #endif /* _LINUX_MM_H */
Index: linux/kernel/sysctl.c
===================================================================
--- linux.orig/kernel/sysctl.c	2009-05-27 21:23:18.000000000 +0200
+++ linux/kernel/sysctl.c	2009-05-27 21:24:39.000000000 +0200
@@ -1282,6 +1282,20 @@
 		.proc_handler	= &scan_unevictable_handler,
 	},
 #endif
+#ifdef CONFIG_MEMORY_FAILURE
+	{
+		.ctl_name	= CTL_UNNUMBERED,
+		.procname	= "memory_failure_early_kill",
+		.data		= &sysctl_memory_failure_early_kill,
+		.maxlen		= sizeof(sysctl_memory_failure_early_kill),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec_minmax,
+		.strategy	= &sysctl_intvec,
+		.extra1		= &zero,
+		.extra2		= &one,
+	},
+#endif
+
 /*
  * NOTE: do not add new entries to this table unless you have read
  * Documentation/sysctl/ctl_unnumbered.txt
Index: linux/fs/proc/meminfo.c
===================================================================
--- linux.orig/fs/proc/meminfo.c	2009-05-27 21:23:18.000000000 +0200
+++ linux/fs/proc/meminfo.c	2009-05-27 21:24:39.000000000 +0200
@@ -97,7 +97,11 @@
 		"Committed_AS:   %8lu kB\n"
 		"VmallocTotal:   %8lu kB\n"
 		"VmallocUsed:    %8lu kB\n"
-		"VmallocChunk:   %8lu kB\n",
+		"VmallocChunk:   %8lu kB\n"
+#ifdef CONFIG_MEMORY_FAILURE
+		"BadPages:       %8lu kB\n"
+#endif
+		,
 		K(i.totalram),
 		K(i.freeram),
 		K(i.bufferram),
@@ -144,6 +148,9 @@
 		(unsigned long)VMALLOC_TOTAL >> 10,
 		vmi.used >> 10,
 		vmi.largest_chunk >> 10
+#ifdef CONFIG_MEMORY_FAILURE
+		,atomic_long_read(&mce_bad_pages) << (PAGE_SHIFT - 10)
+#endif
 		);
 
 	hugetlb_report_meminfo(m);
Index: linux/mm/Kconfig
===================================================================
--- linux.orig/mm/Kconfig	2009-05-27 21:23:18.000000000 +0200
+++ linux/mm/Kconfig	2009-05-27 21:24:39.000000000 +0200
@@ -226,6 +226,9 @@
 config MMU_NOTIFIER
 	bool
 
+config MEMORY_FAILURE
+	bool
+
 config NOMMU_INITIAL_TRIM_EXCESS
 	int "Turn on mmap() excess space trimming before booting"
 	depends on !MMU
Index: linux/Documentation/sysctl/vm.txt
===================================================================
--- linux.orig/Documentation/sysctl/vm.txt	2009-05-27 21:23:18.000000000 +0200
+++ linux/Documentation/sysctl/vm.txt	2009-05-27 21:24:39.000000000 +0200
@@ -32,6 +32,7 @@
 - legacy_va_layout
 - lowmem_reserve_ratio
 - max_map_count
+- memory_failure_early_kill
 - min_free_kbytes
 - min_slab_ratio
 - min_unmapped_ratio
@@ -53,7 +54,6 @@
 - vfs_cache_pressure
 - zone_reclaim_mode
 
-
 ==============================================================
 
 block_dump
@@ -275,6 +275,25 @@
 
 The default value is 65536.
 
+=============================================================
+
+memory_failure_early_kill:
+
+Control how processes are killed when an uncorrected memory error (typically
+a 2-bit error in a memory module) is detected in the background by hardware.
+
+1: Kill all processes that have the corrupted page mapped as soon as the
+corruption is detected.
+
+0: Only unmap the corrupted page from all processes and kill only a
+process that actually tries to access it.
+
+The kill is done using a catchable SIGBUS, so processes can handle this
+if they want to.
+
+This is only active on architectures/platforms with advanced machine
+check handling and depends on the hardware capabilities.
+
 ==============================================================
 
 min_free_kbytes:
Index: linux/arch/x86/mm/fault.c
===================================================================
--- linux.orig/arch/x86/mm/fault.c	2009-05-27 21:24:39.000000000 +0200
+++ linux/arch/x86/mm/fault.c	2009-05-27 21:24:39.000000000 +0200
@@ -851,8 +851,9 @@
 
 #ifdef CONFIG_MEMORY_FAILURE
 	if (fault & VM_FAULT_HWPOISON) {
-		printk(KERN_ERR "MCE: Killing %s:%d due to hardware memory corruption\n",
-			tsk->comm, tsk->pid);
+		printk(KERN_ERR
+       "MCE: Killing %s:%d for accessing hardware corrupted memory at %#lx\n",
+			tsk->comm, tsk->pid, address);
 		code = BUS_MCEERR_AR;
 	}
 #endif


* [PATCH] [13/16] HWPOISON: The high level memory error handler in the VM v3
@ 2009-05-27 20:12   ` Andi Kleen
  0 siblings, 0 replies; 232+ messages in thread
From: Andi Kleen @ 2009-05-27 20:12 UTC (permalink / raw)
  To: hugh, npiggin, riel, akpm, chris.mason


This patch adds the high level memory handler that poisons pages
that got corrupted by hardware (typically by a bit flip in a DIMM
or a cache) on the Linux level, so that Linux avoids accessing these
pages in the future.

It is portable code and lives in mm/memory-failure.c

To quote the overview comment:

 * High level machine check handler. Handles pages reported by the
 * hardware as being corrupted usually due to a 2bit ECC memory or cache
 * failure.
 *
 * This focuses on pages detected as corrupted in the background.
 * When the current CPU tries to consume corruption the currently
 * running process can just be killed directly instead. This implies
 * that if the error cannot be handled for some reason it's safe to
 * just ignore it because no corruption has been consumed yet. Instead
 * when that happens another machine check will happen.
 *
 * Handles page cache pages in various states. The tricky part
 * here is that we can access any page asynchronous to other VM
 * users, because memory failures could happen anytime and anywhere,
 * possibly violating some of their assumptions. This is why this code
 * has to be extremely careful. Generally it tries to use normal locking
 * rules, as in get the standard locks, even if that means the
 * error handling takes potentially a long time.
 *
 * Some of the operations here are somewhat inefficient and have
 * non-linear algorithmic complexity, because the data structures have not
 * been optimized for this case. This is in particular the case
 * for the mapping from a vma to a process. Since this case is expected
 * to be rare we hope we can get away with this.

There are in principle two strategies to kill processes on poison:
- just unmap the data and wait for an actual reference before
killing
- kill as soon as corruption is detected.
Both have advantages and disadvantages and should be used
in different situations. Right now both are implemented and can
be switched with a new sysctl vm.memory_failure_early_kill.
The default is early kill.

The patch does some rmap data structure walking on its own to collect
processes to kill. This is unusual because normally all rmap data structure
knowledge is in rmap.c only. I put it here for now to keep 
everything together, and rmap knowledge has been seeping out anyway.

v2: Fix anon vma unlock crash (noticed by Johannes Weiner <hannes@cmpxchg.org>)
Handle pages on free list correctly (also noticed by Johannes)
Fix inverted try_to_release_page check (found by Chris Mason)
Add documentation for the new sysctl.
Various other cleanups/comment fixes.
v3: Use blockable signal for AO SIGBUS for better qemu handling.
Numerous fixes from Fengguang Wu: 
New code layout for the table (redone by AK)
Move the hwpoison bit setting before the lock (Fengguang Wu)
Some code cleanups (Fengguang Wu, AK)
Add missing lru_drain (Fengguang Wu)
Do more checks for valid mappings (inspired by patch from Fengguang)
Handle free pages and fixes for clean pages (Fengguang)
Removed swap cache handling for now, needs more work
Better mapping checks to avoid races (Fengguang)
Fix swapcache (Fengguang)
Handle private2 pages too (Fengguang)

Cc: hugh@veritas.com
Cc: npiggin@suse.de
Cc: riel@redhat.com
Cc: akpm@linux-foundation.org
Cc: chris.mason@oracle.com
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Rik van Riel <riel@redhat.com>

---
 Documentation/sysctl/vm.txt |   21 +
 arch/x86/mm/fault.c         |    5 
 fs/proc/meminfo.c           |    9 
 include/linux/mm.h          |    4 
 kernel/sysctl.c             |   14 
 mm/Kconfig                  |    3 
 mm/Makefile                 |    1 
 mm/memory-failure.c         |  677 ++++++++++++++++++++++++++++++++++++++++++++
 8 files changed, 730 insertions(+), 4 deletions(-)

Index: linux/mm/Makefile
===================================================================
--- linux.orig/mm/Makefile	2009-05-27 21:23:18.000000000 +0200
+++ linux/mm/Makefile	2009-05-27 21:24:39.000000000 +0200
@@ -38,3 +38,4 @@
 endif
 obj-$(CONFIG_QUICKLIST) += quicklist.o
 obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o page_cgroup.o
+obj-$(CONFIG_MEMORY_FAILURE) += memory-failure.o
Index: linux/mm/memory-failure.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux/mm/memory-failure.c	2009-05-27 21:28:19.000000000 +0200
@@ -0,0 +1,677 @@
+/*
+ * Copyright (C) 2008, 2009 Intel Corporation
+ * Author: Andi Kleen
+ *
+ * This software may be redistributed and/or modified under the terms of
+ * the GNU General Public License ("GPL") version 2 only as published by the
+ * Free Software Foundation.
+ *
+ * High level machine check handler. Handles pages reported by the
+ * hardware as being corrupted usually due to a 2bit ECC memory or cache
+ * failure.
+ *
+ * This focuses on pages detected as corrupted in the background.
+ * When the current CPU tries to consume corruption the currently
+ * running process can just be killed directly instead. This implies
+ * that if the error cannot be handled for some reason it's safe to
+ * just ignore it because no corruption has been consumed yet. Instead
+ * when that happens another machine check will happen.
+ *
+ * Handles page cache pages in various states.	The tricky part
+ * here is that we can access any page asynchronous to other VM
+ * users, because memory failures could happen anytime and anywhere,
+ * possibly violating some of their assumptions. This is why this code
+ * has to be extremely careful. Generally it tries to use normal locking
+ * rules, as in get the standard locks, even if that means the
+ * error handling takes potentially a long time.
+ *
+ * The operation to map back from RMAP chains to processes has to walk
+ * the complete process list and has non-linear complexity with the number
+ * of mappings. In short it can be quite slow. But since memory corruptions
+ * are rare we hope to get away with this.
+ */
+
+/*
+ * Notebook:
+ * - hugetlb needs more code
+ * - nonlinear
+ * - remap races
+ * - anonymous (tinject):
+ *   + left over references when process catches signal?
+ * - kcore/oldmem/vmcore/mem/kmem check for hwpoison pages
+ * - pass bad pages to kdump next kernel
+ */
+#include <linux/kernel.h>
+#include <linux/mm.h>
+#include <linux/page-flags.h>
+#include <linux/sched.h>
+#include <linux/rmap.h>
+#include <linux/pagemap.h>
+#include <linux/swap.h>
+#include <linux/backing-dev.h>
+#include "internal.h"
+
+#define Dprintk(x...) printk(x)
+
+int sysctl_memory_failure_early_kill __read_mostly = 1;
+
+atomic_long_t mce_bad_pages __read_mostly = ATOMIC_LONG_INIT(0);
+
+/*
+ * Send all the processes that have the page mapped an ``action optional''
+ * signal.
+ */
+static int kill_proc_ao(struct task_struct *t, unsigned long addr, int trapno,
+			unsigned long pfn)
+{
+	struct siginfo si;
+	int ret;
+
+	printk(KERN_ERR
+		"MCE %#lx: Killing %s:%d due to hardware memory corruption\n",
+		pfn, t->comm, t->pid);
+	si.si_signo = SIGBUS;
+	si.si_errno = 0;
+	si.si_code = BUS_MCEERR_AO;
+	si.si_addr = (void *)addr;
+#ifdef __ARCH_SI_TRAPNO
+	si.si_trapno = trapno;
+#endif
+	si.si_addr_lsb = PAGE_SHIFT;
+	/*
+	 * Don't use force here, it's convenient if the signal
+	 * can be temporarily blocked.
+	 * This could cause a loop when the user sets SIGBUS
+	 * to SIG_IGN, but hopefully no one will do that?
+	 */
+	ret = send_sig_info(SIGBUS, &si, t);  /* synchronous? */
+	if (ret < 0)
+		printk(KERN_INFO "MCE: Error sending signal to %s:%d: %d\n",
+		       t->comm, t->pid, ret);
+	return ret;
+}
+
+/*
+ * Kill all processes that have a poisoned page mapped and then isolate
+ * the page.
+ *
+ * General strategy:
+ * Find all processes having the page mapped and kill them.
+ * But we keep a page reference around so that the page is not
+ * actually freed yet.
+ * Then stash the page away
+ *
+ * There's no convenient way to get back to mapped processes
+ * from the VMAs. So do a brute-force search over all
+ * running processes.
+ *
+ * Remember that machine checks are not common (or rather
+ * if they are common you have other problems), so this shouldn't
+ * be a performance issue.
+ *
+ * Also there are some races possible while we get from the
+ * error detection to actually handle it.
+ */
+
+struct to_kill {
+	struct list_head nd;
+	struct task_struct *tsk;
+	unsigned long addr;
+};
+
+/*
+ * Failure handling: if we can't find or can't kill a process there's
+ * not much we can do.	We just print a message and ignore otherwise.
+ */
+
+/*
+ * Schedule a process for later kill.
+ * Uses GFP_ATOMIC allocations to avoid potential recursions in the VM.
+ * TBD would GFP_NOIO be enough?
+ */
+static void add_to_kill(struct task_struct *tsk, struct page *p,
+		       struct vm_area_struct *vma,
+		       struct list_head *to_kill,
+		       struct to_kill **tkc)
+{
+	int fail = 0;
+	struct to_kill *tk;
+
+	if (*tkc) {
+		tk = *tkc;
+		*tkc = NULL;
+	} else {
+		tk = kmalloc(sizeof(struct to_kill), GFP_ATOMIC);
+		if (!tk) {
+			printk(KERN_ERR "MCE: Out of memory during machine check handling\n");
+			return;
+		}
+	}
+	tk->addr = page_address_in_vma(p, vma);
+	if (tk->addr == -EFAULT) {
+		printk(KERN_INFO "MCE: Failed to get address in VMA\n");
+		tk->addr = 0;
+		fail = 1;
+	}
+	get_task_struct(tsk);
+	tk->tsk = tsk;
+	list_add_tail(&tk->nd, to_kill);
+}
+
+/*
+ * Kill the processes that have been collected earlier.
+ */
+static void kill_procs_ao(struct list_head *to_kill, int doit, int trapno,
+			  int fail, unsigned long pfn)
+{
+	struct to_kill *tk, *next;
+
+	list_for_each_entry_safe (tk, next, to_kill, nd) {
+		if (doit) {
+			/*
+			 * In case something went wrong with unmapping,
+			 * make sure the process doesn't catch the
+			 * signal and then access the memory. So reset
+			 * the signal handlers.
+			 */
+			if (fail)
+				flush_signal_handlers(tk->tsk, 1);
+
+			/*
+			 * In theory the process could have mapped
+			 * something else on the address in-between. We could
+			 * check for that, but we need to tell the
+			 * process anyways.
+			 */
+			if (kill_proc_ao(tk->tsk, tk->addr, trapno, pfn) < 0)
+				printk(KERN_ERR
+		"MCE %#lx: Cannot send advisory machine check signal to %s:%d\n",
+					pfn, tk->tsk->comm, tk->tsk->pid);
+		}
+		put_task_struct(tk->tsk);
+		kfree(tk);
+	}
+}
+
+/*
+ * Collect processes when the error hit an anonymous page.
+ */
+static void collect_procs_anon(struct page *page, struct list_head *to_kill,
+			      struct to_kill **tkc)
+{
+	struct vm_area_struct *vma;
+	struct task_struct *tsk;
+	struct anon_vma *av = page_lock_anon_vma(page);
+
+	if (av == NULL)	/* Not actually mapped anymore */
+		return;
+
+	read_lock(&tasklist_lock);
+	for_each_process (tsk) {
+		if (!tsk->mm)
+			continue;
+		list_for_each_entry (vma, &av->head, anon_vma_node) {
+			if (vma->vm_mm == tsk->mm)
+				add_to_kill(tsk, page, vma, to_kill, tkc);
+		}
+	}
+	page_unlock_anon_vma(av);
+	read_unlock(&tasklist_lock);
+}
+
+/*
+ * Collect processes when the error hit a file mapped page.
+ */
+static void collect_procs_file(struct page *page, struct list_head *to_kill,
+			      struct to_kill **tkc)
+{
+	struct vm_area_struct *vma;
+	struct task_struct *tsk;
+	struct prio_tree_iter iter;
+	struct address_space *mapping = page_mapping(page);
+
+	read_lock(&tasklist_lock);
+	spin_lock(&mapping->i_mmap_lock);
+	for_each_process(tsk) {
+		pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
+
+		if (!tsk->mm)
+			continue;
+
+		vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff,
+				      pgoff)
+			if (vma->vm_mm == tsk->mm)
+				add_to_kill(tsk, page, vma, to_kill, tkc);
+	}
+	spin_unlock(&mapping->i_mmap_lock);
+	read_unlock(&tasklist_lock);
+}
+
+/*
+ * Collect the processes who have the corrupted page mapped to kill.
+ * This is done in two steps for locking reasons.
+ * First preallocate one tokill structure outside the spin locks,
+ * so that we can kill at least one process reasonably reliably.
+ */
+static void collect_procs(struct page *page, struct list_head *tokill)
+{
+	struct to_kill *tk;
+
+	tk = kmalloc(sizeof(struct to_kill), GFP_KERNEL);
+	/* memory allocation failure is implicitly handled */
+	if (PageAnon(page))
+		collect_procs_anon(page, tokill, &tk);
+	else
+		collect_procs_file(page, tokill, &tk);
+	kfree(tk);
+}
+
+/*
+ * Error handlers for various types of pages.
+ */
+
+enum outcome {
+	FAILED,
+	DELAYED,
+	IGNORED,
+	RECOVERED,
+};
+
+static const char *action_name[] = {
+	[FAILED] = "Failed",
+	[DELAYED] = "Delayed",
+	[IGNORED] = "Ignored",
+	[RECOVERED] = "Recovered",
+};
+
+/*
+ * Error hit a kernel page.
+ * Do nothing, try to be lucky and not touch this instead. For a few cases we
+ * could be more sophisticated.
+ */
+static int me_kernel(struct page *p)
+{
+	return DELAYED;
+}
+
+/*
+ * Already poisoned page.
+ */
+static int me_ignore(struct page *p)
+{
+	return IGNORED;
+}
+
+/*
+ * Page in unknown state. Do nothing.
+ */
+static int me_unknown(struct page *p)
+{
+	printk(KERN_ERR "MCE %#lx: Unknown page state\n", page_to_pfn(p));
+	return FAILED;
+}
+
+/*
+ * Free memory
+ */
+static int me_free(struct page *p)
+{
+	return DELAYED;
+}
+
+/*
+ * Clean (or cleaned) page cache page.
+ */
+static int me_pagecache_clean(struct page *p)
+{
+	if (!isolate_lru_page(p))
+		page_cache_release(p);
+
+	if (page_has_private(p))
+		do_invalidatepage(p, 0);
+	if (page_has_private(p) && !try_to_release_page(p, GFP_NOIO))
+		Dprintk(KERN_ERR "MCE %#lx: failed to release buffers\n",
+			page_to_pfn(p));
+
+	/*
+	 * remove_from_page_cache assumes (mapping && !mapped)
+	 */
+	if (page_mapping(p) && !page_mapped(p)) {
+		remove_from_page_cache(p);
+		page_cache_release(p);
+	}
+
+	return RECOVERED;
+}
+
+/*
+ * Dirty page cache page.
+ * Issues: when the error hits a hole page, the error is not properly
+ * propagated.
+ */
+static int me_pagecache_dirty(struct page *p)
+{
+	struct address_space *mapping = page_mapping(p);
+
+	SetPageError(p);
+	/* TBD: print more information about the file. */
+	printk(KERN_ERR "MCE %#lx: Hardware memory corruption on dirty file page: write error\n",
+			page_to_pfn(p));
+	if (mapping) {
+		/*
+		 * Truncate does the same, but we're not quite the same
+		 * as truncate. Needs more checking, but keep it for now.
+		 */
+		cancel_dirty_page(p, PAGE_CACHE_SIZE);
+
+		/*
+	 * The IO error will be reported by write(), fsync(), etc.,
+	 * which check the mapping.
+		 */
+		mapping_set_error(mapping, EIO);
+	}
+
+	me_pagecache_clean(p);
+
+	/*
+	 * Did the earlier release work?
+	 */
+	if (page_has_private(p) && !try_to_release_page(p, GFP_NOIO))
+		return FAILED;
+
+	return RECOVERED;
+}
+
+/*
+ * Clean and dirty swap cache.
+ */
+static int me_swapcache_dirty(struct page *p)
+{
+	ClearPageDirty(p);
+
+	if (!isolate_lru_page(p))
+		page_cache_release(p);
+
+	return DELAYED;
+}
+
+static int me_swapcache_clean(struct page *p)
+{
+	ClearPageUptodate(p);
+
+	if (!isolate_lru_page(p))
+		page_cache_release(p);
+
+	delete_from_swap_cache(p);
+
+	return RECOVERED;
+}
+
+/*
+ * Huge pages. Needs work.
+ * Issues:
+ * No rmap support so we cannot find the original mapper. In theory could walk
+ * all MMs and look for the mappings, but that would be non atomic and racy.
+ * Need rmap for hugepages for this. Alternatively we could employ a heuristic,
+ * like just walking the current process and hoping it has it mapped (that
+ * should usually be true for the common "shared database cache" case).
+ * Should handle free huge pages and dequeue them too, but this needs to
+ * handle huge page accounting correctly.
+ */
+static int me_huge_page(struct page *p)
+{
+	return FAILED;
+}
+
+/*
+ * Various page states we can handle.
+ *
+ * A page state is defined by its current page->flags bits.
+ * The table matches them in order and calls the right handler.
+ *
+ * This is quite tricky because we can access a page at any time
+ * in its life cycle, so all accesses have to be extremely careful.
+ *
+ * This is not complete. More states could be added.
+ * For any missing state don't attempt recovery.
+ */
+
+#define dirty		(1UL << PG_dirty)
+#define swapcache	(1UL << PG_swapcache)
+#define unevict		(1UL << PG_unevictable)
+#define mlocked		(1UL << PG_mlocked)
+#define writeback	(1UL << PG_writeback)
+#define lru		(1UL << PG_lru)
+#define swapbacked	(1UL << PG_swapbacked)
+#define head		(1UL << PG_head)
+#define tail		(1UL << PG_tail)
+#define compound	(1UL << PG_compound)
+#define slab		(1UL << PG_slab)
+#define buddy		(1UL << PG_buddy)
+#define reserved	(1UL << PG_reserved)
+
+/*
+ * The table is > 80 columns because all the alternatives were much worse.
+ */
+
+static struct page_state {
+	unsigned long mask;
+	unsigned long res;
+	char *msg;
+	int (*action)(struct page *p);
+} error_states[] = {
+	{ reserved,	reserved,	"reserved kernel",	me_ignore },
+	{ buddy,	buddy,		"free kernel",		me_free },
+
+	/*
+	 * Could in theory check if slab page is free or if we can drop
+	 * currently unused objects without touching them. But just
+	 * treat it as standard kernel for now.
+	 */
+	{ slab,			slab,		"kernel slab",		me_kernel },
+
+#ifdef CONFIG_PAGEFLAGS_EXTENDED
+	{ head,			head,		"hugetlb",		me_huge_page },
+	{ tail,			tail,		"hugetlb",		me_huge_page },
+#else
+	{ compound,		compound,	"hugetlb",		me_huge_page },
+#endif
+
+	{ swapcache|dirty,	swapcache|dirty,"dirty swapcache",	me_swapcache_dirty },
+	{ swapcache|dirty,	swapcache,	"clean swapcache",	me_swapcache_clean },
+
+#ifdef CONFIG_UNEVICTABLE_LRU
+	{ unevict|dirty,	unevict|dirty,	"unevictable dirty lru", me_pagecache_dirty },
+	{ unevict,		unevict,	"unevictable lru",	me_pagecache_clean },
+#endif
+
+#ifdef CONFIG_HAVE_MLOCKED_PAGE_BIT
+	{ mlocked|dirty,	mlocked|dirty,	"mlocked dirty lru",	me_pagecache_dirty },
+	{ mlocked,		mlocked,	"mlocked lru",		me_pagecache_clean },
+#endif
+
+	{ lru|dirty,		lru|dirty,	"dirty lru",		me_pagecache_dirty },
+	{ lru|dirty,		lru,		"clean lru",		me_pagecache_clean },
+	{ swapbacked,		swapbacked,	"anonymous",		me_pagecache_clean },
+
+	/*
+	 * Add more states here.
+	 */
+
+	/*
+	 * Catchall entry: must be at end.
+	 */
+	{ 0,			0,		"unknown page state",	me_unknown },
+};
+
+static void page_action(char *msg, struct page *p, int (*action)(struct page *),
+			unsigned long pfn)
+{
+	int ret;
+
+	printk(KERN_ERR "MCE %#lx: %s page recovery: starting\n", pfn, msg);
+	ret = action(p);
+	printk(KERN_ERR "MCE %#lx: %s page recovery: %s\n",
+	       pfn, msg, action_name[ret]);
+	if (page_count(p) != 1)
+		printk(KERN_ERR
+		       "MCE %#lx: %s page still referenced by %d users\n",
+		       pfn, msg, page_count(p) - 1);
+
+	/* Could do more checks here if page looks ok */
+	atomic_long_add(1, &mce_bad_pages);
+
+	/*
+	 * Could adjust zone counters here to correct for the missing page.
+	 */
+}
+
+#define N_UNMAP_TRIES 5
+
+static void hwpoison_page_prepare(struct page *p, unsigned long pfn,
+				  int trapno)
+{
+	enum ttu_flags ttu = TTU_UNMAP | TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS;
+	int kill = sysctl_memory_failure_early_kill;
+	struct address_space *mapping;
+	LIST_HEAD(tokill);
+	int ret;
+	int i;
+
+	if (PageReserved(p) || PageCompound(p) || PageSlab(p))
+		return;
+
+	if (!PageLRU(p))
+		lru_add_drain();
+
+	/*
+	 * This check implies we don't early-kill processes if their
+	 * pages are in the swap cache. Those are always late kills.
+	 */
+	if (!page_mapped(p))
+		return;
+
+	if (PageSwapCache(p)) {
+		printk(KERN_ERR
+		       "MCE %#lx: keeping poisoned page in swap cache\n", pfn);
+		ttu |= TTU_IGNORE_HWPOISON;
+	}
+
+	/*
+	 * Poisoned clean file pages are harmless, the
+	 * data can be restored by regular page faults.
+	 */
+	mapping = page_mapping(p);
+	if (!PageDirty(p) && !PageWriteback(p) &&
+	    !PageAnon(p) && !PageSwapBacked(p) &&
+	    mapping && mapping_cap_account_dirty(mapping)) {
+		if (page_mkclean(p))
+			SetPageDirty(p);
+		else {
+			kill = 0;
+			ttu |= TTU_IGNORE_HWPOISON;
+		}
+	}
+
+	/*
+	 * First collect all the processes that have the page
+	 * mapped.  This has to be done before try_to_unmap,
+	 * because ttu takes the rmap data structures down.
+	 *
+	 * This also has the side effect of propagating the dirty
+	 * bit from the PTEs into the struct page. This is needed
+	 * to actually decide if something needs to be killed
+	 * or errored, or if it's ok to just drop the page.
+	 *
+	 * Error handling: We ignore errors here because
+	 * there's nothing that can be done.
+	 *
+	 * RED-PEN some cases in process exit seem to deadlock
+	 * on the page lock. drop it or add poison checks?
+	 */
+	if (kill)
+		collect_procs(p, &tokill);
+
+	/*
+	 * try_to_unmap can fail temporarily due to races.
+	 * Try a few times (RED-PEN better strategy?)
+	 */
+	for (i = 0; i < N_UNMAP_TRIES; i++) {
+		ret = try_to_unmap(p, ttu);
+		if (ret == SWAP_SUCCESS)
+			break;
+		Dprintk("MCE %#lx: try_to_unmap retry needed %d\n", pfn,  ret);
+	}
+
+	/*
+	 * Now that the dirty bit has been propagated to the
+	 * struct page and all unmaps done we can decide if
+	 * killing is needed or not.  Only kill when the page
+	 * was dirty, otherwise the tokill list is merely
+	 * freed.  When there was a problem unmapping earlier,
+	 * use a more forceful, uncatchable kill to prevent
+	 * any accesses to the poisoned memory.
+	 */
+	kill_procs_ao(&tokill, !!PageDirty(p), trapno,
+		      ret != SWAP_SUCCESS, pfn);
+}
+
+/**
+ * memory_failure - Handle the memory failure of a page.
+ * @pfn: page frame number of the corrupted page
+ * @trapno: trap number passed on to the process's signal handler
+ */
+void memory_failure(unsigned long pfn, int trapno)
+{
+	struct page_state *ps;
+	struct page *p;
+
+	if (!pfn_valid(pfn)) {
+		printk(KERN_ERR
+   "MCE %#lx: Hardware memory corruption in memory outside kernel control\n",
+		       pfn);
+		return;
+	}
+
+	p = pfn_to_page(pfn);
+	if (TestSetPageHWPoison(p)) {
+		printk(KERN_ERR "MCE %#lx: Error for already hardware poisoned page\n", pfn);
+		return;
+	}
+
+	/*
+	 * We can do nothing about count=0 pages, and need not:
+	 * 1) it's a free page, and therefore in safe hands:
+	 *    prep_new_page() will be the gate keeper.
+	 * 2) it's part of a non-compound high order page.
+	 *    That implies some kernel user: we cannot stop them
+	 *    from R/W-ing the page; let's hope it gets freed
+	 *    some time later.
+	 * In fact it's dangerous to bump up the page count directly from 0,
+	 * as that may make page_freeze_refs()/page_unfreeze_refs() mismatch.
+	 */
+	if (!get_page_unless_zero(compound_head(p))) {
+		printk(KERN_ERR
+		       "MCE %#lx: ignoring free or high order page\n", pfn);
+		return;
+	}
+
+	lock_page_nosync(p);
+	hwpoison_page_prepare(p, pfn, trapno);
+
+	/* Torn down by someone else? */
+	if (PageLRU(p) && !PageSwapCache(p) && p->mapping == NULL) {
+		printk(KERN_ERR
+		       "MCE %#lx: ignoring NULL mapping LRU page\n", pfn);
+		goto out;
+	}
+
+	for (ps = error_states;; ps++) {
+		if ((p->flags & ps->mask) == ps->res) {
+			page_action(ps->msg, p, ps->action, pfn);
+			break;
+		}
+	}
+out:
+	unlock_page(p);
+}
Index: linux/include/linux/mm.h
===================================================================
--- linux.orig/include/linux/mm.h	2009-05-27 21:24:39.000000000 +0200
+++ linux/include/linux/mm.h	2009-05-27 21:24:39.000000000 +0200
@@ -1322,6 +1322,10 @@
 
 extern void *alloc_locked_buffer(size_t size);
 extern void free_locked_buffer(void *buffer, size_t size);
+
+extern void memory_failure(unsigned long pfn, int trapno);
+extern int sysctl_memory_failure_early_kill;
+extern atomic_long_t mce_bad_pages;
 extern void release_locked_buffer(void *buffer, size_t size);
 #endif /* __KERNEL__ */
 #endif /* _LINUX_MM_H */
Index: linux/kernel/sysctl.c
===================================================================
--- linux.orig/kernel/sysctl.c	2009-05-27 21:23:18.000000000 +0200
+++ linux/kernel/sysctl.c	2009-05-27 21:24:39.000000000 +0200
@@ -1282,6 +1282,20 @@
 		.proc_handler	= &scan_unevictable_handler,
 	},
 #endif
+#ifdef CONFIG_MEMORY_FAILURE
+	{
+		.ctl_name	= CTL_UNNUMBERED,
+		.procname	= "memory_failure_early_kill",
+		.data		= &sysctl_memory_failure_early_kill,
+		.maxlen		= sizeof(sysctl_memory_failure_early_kill),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec_minmax,
+		.strategy	= &sysctl_intvec,
+		.extra1		= &zero,
+		.extra2		= &one,
+	},
+#endif
+
 /*
  * NOTE: do not add new entries to this table unless you have read
  * Documentation/sysctl/ctl_unnumbered.txt
Index: linux/fs/proc/meminfo.c
===================================================================
--- linux.orig/fs/proc/meminfo.c	2009-05-27 21:23:18.000000000 +0200
+++ linux/fs/proc/meminfo.c	2009-05-27 21:24:39.000000000 +0200
@@ -97,7 +97,11 @@
 		"Committed_AS:   %8lu kB\n"
 		"VmallocTotal:   %8lu kB\n"
 		"VmallocUsed:    %8lu kB\n"
-		"VmallocChunk:   %8lu kB\n",
+		"VmallocChunk:   %8lu kB\n"
+#ifdef CONFIG_MEMORY_FAILURE
+		"BadPages:       %8lu kB\n"
+#endif
+		,
 		K(i.totalram),
 		K(i.freeram),
 		K(i.bufferram),
@@ -144,6 +148,9 @@
 		(unsigned long)VMALLOC_TOTAL >> 10,
 		vmi.used >> 10,
 		vmi.largest_chunk >> 10
+#ifdef CONFIG_MEMORY_FAILURE
+		,atomic_long_read(&mce_bad_pages) << (PAGE_SHIFT - 10)
+#endif
 		);
 
 	hugetlb_report_meminfo(m);
Index: linux/mm/Kconfig
===================================================================
--- linux.orig/mm/Kconfig	2009-05-27 21:23:18.000000000 +0200
+++ linux/mm/Kconfig	2009-05-27 21:24:39.000000000 +0200
@@ -226,6 +226,9 @@
 config MMU_NOTIFIER
 	bool
 
+config MEMORY_FAILURE
+	bool
+
 config NOMMU_INITIAL_TRIM_EXCESS
 	int "Turn on mmap() excess space trimming before booting"
 	depends on !MMU
Index: linux/Documentation/sysctl/vm.txt
===================================================================
--- linux.orig/Documentation/sysctl/vm.txt	2009-05-27 21:23:18.000000000 +0200
+++ linux/Documentation/sysctl/vm.txt	2009-05-27 21:24:39.000000000 +0200
@@ -32,6 +32,7 @@
 - legacy_va_layout
 - lowmem_reserve_ratio
 - max_map_count
+- memory_failure_early_kill
 - min_free_kbytes
 - min_slab_ratio
 - min_unmapped_ratio
@@ -53,7 +54,6 @@
 - vfs_cache_pressure
 - zone_reclaim_mode
 
-
 ==============================================================
 
 block_dump
@@ -275,6 +275,25 @@
 
 The default value is 65536.
 
+=============================================================
+
+memory_failure_early_kill:
+
+Control how processes are killed when an uncorrected memory error (typically
+a 2-bit error in a memory module) is detected in the background by hardware.
+
+1: Kill all processes that have the corrupted page mapped as soon as the
+corruption is detected.
+
+0: Only unmap the corrupted page from all processes and kill only
+those processes that try to access it.
+
+The kill is done using a catchable SIGBUS, so processes can handle this
+if they want to.
+
+This is only active on architectures/platforms with advanced machine
+check handling and depends on the hardware capabilities.
+
 ==============================================================
 
 min_free_kbytes:
Index: linux/arch/x86/mm/fault.c
===================================================================
--- linux.orig/arch/x86/mm/fault.c	2009-05-27 21:24:39.000000000 +0200
+++ linux/arch/x86/mm/fault.c	2009-05-27 21:24:39.000000000 +0200
@@ -851,8 +851,9 @@
 
 #ifdef CONFIG_MEMORY_FAILURE
 	if (fault & VM_FAULT_HWPOISON) {
-		printk(KERN_ERR "MCE: Killing %s:%d due to hardware memory corruption\n",
-			tsk->comm, tsk->pid);
+		printk(KERN_ERR
+       "MCE: Killing %s:%d for accessing hardware corrupted memory at %#lx\n",
+			tsk->comm, tsk->pid, address);
 		code = BUS_MCEERR_AR;
 	}
 #endif

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 232+ messages in thread

* [PATCH] [14/16] HWPOISON: FOR TESTING: Enable memory failure code unconditionally
  2009-05-27 20:12 ` Andi Kleen
@ 2009-05-27 20:12   ` Andi Kleen
  -1 siblings, 0 replies; 232+ messages in thread
From: Andi Kleen @ 2009-05-27 20:12 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm, fengguang.wu


Normally the memory-failure.c code is enabled by the architecture, but for
easier testing, independent of architecture changes, enable it unconditionally.

This should not be merged into mainline.

Signed-off-by: Andi Kleen <ak@linux.intel.com>

---
 mm/Kconfig |    2 ++
 1 file changed, 2 insertions(+)

Index: linux/mm/Kconfig
===================================================================
--- linux.orig/mm/Kconfig	2009-05-27 21:14:21.000000000 +0200
+++ linux/mm/Kconfig	2009-05-27 21:19:16.000000000 +0200
@@ -228,6 +228,8 @@
 
 config MEMORY_FAILURE
 	bool
+	default y
+	depends on MMU
 
 config NOMMU_INITIAL_TRIM_EXCESS
 	int "Turn on mmap() excess space trimming before booting"


* [PATCH] [15/16] HWPOISON: Add madvise() based injector for hardware poisoned pages v3
  2009-05-27 20:12 ` Andi Kleen
@ 2009-05-27 20:12   ` Andi Kleen
  -1 siblings, 0 replies; 232+ messages in thread
From: Andi Kleen @ 2009-05-27 20:12 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm, fengguang.wu


Impact: optional, useful for debugging

Add a new madvise() subcommand to inject poison into some pages
in a process's address space.  This is useful for testing the
poisoned-page handling.

Open issues:

- This patch allows root to tie up arbitrary amounts of memory.
Should this be disabled inside containers?
- There's a small race window between getting the page and injecting.
The patch drops the reference count because otherwise memory_failure
complains about dangling references. In theory, with a multi-threaded
injector, one could inject poison for a foreign process's page this way.
Not a serious issue right now.

v2: Use write flag for get_user_pages to make sure to always get
a fresh page
v3: Don't request write mapping (Fengguang Wu)

Signed-off-by: Andi Kleen <ak@linux.intel.com>

---
 include/asm-generic/mman.h |    1 +
 mm/madvise.c               |   37 +++++++++++++++++++++++++++++++++++++
 2 files changed, 38 insertions(+)

Index: linux/mm/madvise.c
===================================================================
--- linux.orig/mm/madvise.c	2009-05-27 21:13:54.000000000 +0200
+++ linux/mm/madvise.c	2009-05-27 21:14:21.000000000 +0200
@@ -208,6 +208,38 @@
 	return error;
 }
 
+#ifdef CONFIG_MEMORY_FAILURE
+/*
+ * Error injection support for memory error handling.
+ */
+static int madvise_hwpoison(unsigned long start, unsigned long end)
+{
+	/*
+	 * RED-PEN
+	 * This allows tying up arbitrary amounts of memory.
+	 * Might be a good idea to disable it inside containers, even for root.
+	 */
+	if (!capable(CAP_SYS_ADMIN))
+		return -EPERM;
+	for (; start < end; start += PAGE_SIZE) {
+		struct page *p;
+		int ret = get_user_pages(current, current->mm, start, 1,
+						0, 0, &p, NULL);
+		if (ret != 1)
+			return ret;
+		put_page(p);
+		/*
+		 * RED-PEN: the page can be reused here, but otherwise
+		 * we'd have to fight with the refcount.
+		 */
+		printk(KERN_INFO "Injecting memory failure for page %lx at %lx\n",
+		       page_to_pfn(p), start);
+		memory_failure(page_to_pfn(p), 0);
+	}
+	return 0;
+}
+#endif
+
 static long
 madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev,
 		unsigned long start, unsigned long end, int behavior)
@@ -290,6 +322,11 @@
 	int write;
 	size_t len;
 
+#ifdef CONFIG_MEMORY_FAILURE
+	if (behavior == MADV_HWPOISON)
+		return madvise_hwpoison(start, start+len_in);
+#endif
+
 	write = madvise_need_mmap_write(behavior);
 	if (write)
 		down_write(&current->mm->mmap_sem);
Index: linux/include/asm-generic/mman.h
===================================================================
--- linux.orig/include/asm-generic/mman.h	2009-05-27 21:13:54.000000000 +0200
+++ linux/include/asm-generic/mman.h	2009-05-27 21:14:21.000000000 +0200
@@ -34,6 +34,7 @@
 #define MADV_REMOVE	9		/* remove these pages & resources */
 #define MADV_DONTFORK	10		/* don't inherit across fork */
 #define MADV_DOFORK	11		/* do inherit across fork */
+#define MADV_HWPOISON	12		/* hw poison the page (root only) */
 
 /* compatibility flags */
 #define MAP_FILE	0


* [PATCH] [16/16] HWPOISON: Add simple debugfs interface to inject hwpoison on arbitrary PFNs
  2009-05-27 20:12 ` Andi Kleen
@ 2009-05-27 20:12   ` Andi Kleen
  -1 siblings, 0 replies; 232+ messages in thread
From: Andi Kleen @ 2009-05-27 20:12 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm, fengguang.wu


Useful for some testing scenarios, although targeted testing is often
done better through MADV_HWPOISON.

This can also be done with the x86-level MCE injector, but this interface
allows doing it independently of low-level x86 changes.

Open issues: 

Should be disabled for cgroups.

Signed-off-by: Andi Kleen <ak@linux.intel.com>

---
 mm/Kconfig           |    4 ++++
 mm/Makefile          |    1 +
 mm/hwpoison-inject.c |   41 +++++++++++++++++++++++++++++++++++++++++
 3 files changed, 46 insertions(+)

Index: linux/mm/hwpoison-inject.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux/mm/hwpoison-inject.c	2009-05-27 21:14:21.000000000 +0200
@@ -0,0 +1,41 @@
+/* Inject a hwpoison memory failure on an arbitrary pfn */
+#include <linux/module.h>
+#include <linux/debugfs.h>
+#include <linux/kernel.h>
+#include <linux/mm.h>
+
+static struct dentry *hwpoison_dir, *corrupt_pfn;
+
+static int hwpoison_inject(void *data, u64 val)
+{
+	if (!capable(CAP_SYS_ADMIN))
+		return -EPERM;
+	printk(KERN_INFO "Injecting memory failure at pfn %Lx\n", val);
+	memory_failure(val, 18);
+	return 0;
+}
+
+DEFINE_SIMPLE_ATTRIBUTE(hwpoison_fops, NULL, hwpoison_inject, "%lli\n");
+
+static void pfn_inject_exit(void)
+{
+	if (hwpoison_dir)
+		debugfs_remove_recursive(hwpoison_dir);
+}
+
+static int pfn_inject_init(void)
+{
+	hwpoison_dir = debugfs_create_dir("hwpoison", NULL);
+	if (hwpoison_dir == NULL)
+		return -ENOMEM;
+	corrupt_pfn = debugfs_create_file("corrupt-pfn", 0600, hwpoison_dir, NULL,
+			&hwpoison_fops);
+	if (corrupt_pfn == NULL) {
+		pfn_inject_exit();
+		return -ENOMEM;
+	}
+	return 0;
+}
+
+module_init(pfn_inject_init);
+module_exit(pfn_inject_exit);
Index: linux/mm/Kconfig
===================================================================
--- linux.orig/mm/Kconfig	2009-05-27 21:14:21.000000000 +0200
+++ linux/mm/Kconfig	2009-05-27 21:14:21.000000000 +0200
@@ -231,6 +231,10 @@
 	default y
 	depends on MMU
 
+config HWPOISON_INJECT
+	tristate "Poison pages injector"
+	depends on MEMORY_FAILURE && DEBUG_KERNEL
+
 config NOMMU_INITIAL_TRIM_EXCESS
 	int "Turn on mmap() excess space trimming before booting"
 	depends on !MMU
Index: linux/mm/Makefile
===================================================================
--- linux.orig/mm/Makefile	2009-05-27 21:14:21.000000000 +0200
+++ linux/mm/Makefile	2009-05-27 21:14:21.000000000 +0200
@@ -39,3 +39,4 @@
 obj-$(CONFIG_QUICKLIST) += quicklist.o
 obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o page_cgroup.o
 obj-$(CONFIG_MEMORY_FAILURE) += memory-failure.o
+obj-$(CONFIG_HWPOISON_INJECT) += hwpoison-inject.o


* Re: [PATCH] [1/16] HWPOISON: Add page flag for poisoned pages
  2009-05-27 20:12   ` Andi Kleen
@ 2009-05-27 20:35     ` Larry H.
  -1 siblings, 0 replies; 232+ messages in thread
From: Larry H. @ 2009-05-27 20:35 UTC (permalink / raw)
  To: Andi Kleen; +Cc: akpm, linux-kernel, linux-mm, fengguang.wu

On 22:12 Wed 27 May     , Andi Kleen wrote:
> 
> Hardware poisoned pages need special handling in the VM and shouldn't be 
> touched again. This requires a new page flag. Define it here.
> 
> The page flags wars seem to be over, so it shouldn't be a problem
> to get a new one.

I took a look at your patchset, and this is yet another case in which the
only way to truly control the allocation/release behavior at a low level
(without intrusive approaches) is indeed to use a page flag.

If this gets merged I would like to ask Andrew and Christopher to look
at my recent memory sanitization patches. It seems the opinion about
adding new page flags isn't the same for everyone here.

	Larry



* Re: [PATCH] [1/16] HWPOISON: Add page flag for poisoned pages
  2009-05-27 20:12   ` Andi Kleen
@ 2009-05-27 21:15     ` Alan Cox
  -1 siblings, 0 replies; 232+ messages in thread
From: Alan Cox @ 2009-05-27 21:15 UTC (permalink / raw)
  To: Andi Kleen; +Cc: akpm, linux-kernel, linux-mm, fengguang.wu

On Wed, 27 May 2009 22:12:26 +0200 (CEST)
Andi Kleen <andi@firstfloor.org> wrote:

> 
> Hardware poisoned pages need special handling in the VM and shouldn't be 
> touched again. This requires a new page flag. Define it here.

Why can't you use PG_reserved? That already indicates the page may not
even be present (which is effectively your situation at that point).
Given that lots of other hardware platforms we support bus-error,
machine-check, explode, or do random undefined fun things when you touch
pages that don't exist, I'm not sure I see why poisoned is different here?

Alan


* Re: [PATCH] [9/16] HWPOISON: Use bitmask/action code for try_to_unmap behaviour
  2009-05-27 20:12   ` Andi Kleen
@ 2009-05-28  7:27     ` Nick Piggin
  -1 siblings, 0 replies; 232+ messages in thread
From: Nick Piggin @ 2009-05-28  7:27 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Lee.Schermerhorn, akpm, linux-kernel, linux-mm, fengguang.wu

On Wed, May 27, 2009 at 10:12:35PM +0200, Andi Kleen wrote:
> 
> try_to_unmap currently has multiple modes (migration, munlock, normal unmap)
> which are selected by magic flag variables. The logic is not very
> straightforward, because each of these flags changes multiple behaviours
> (e.g. migration turns off aging, not only sets up migration ptes, etc.).
> The different flags also interact in magic ways.
> 
> A later patch in this series adds another mode to try_to_unmap, so
> this quickly becomes unmanageable.
> 
> Replace the different flags with an action code (migration, munlock, munmap)
> and some additional flags as modifiers (ignore mlock, ignore aging).
> This makes the logic more straightforward and allows easier extension
> to new behaviours. Change all the callers to declare what they want
> to do.
> 
> This patch is supposed to be a nop in behaviour. If anyone can prove
> that it is not, that would be a bug.

Not a bad idea, but I would prefer to have a set of flags which tell
try_to_unmap what to do, and then combine them with #defines for
callers. Like gfp flags.

And just use regular bitops rather than this TTU_ACTION macro.

> 
> Cc: Lee.Schermerhorn@hp.com
> Cc: npiggin@suse.de
> 
> Signed-off-by: Andi Kleen <ak@linux.intel.com>
> 
> ---
>  include/linux/rmap.h |   14 +++++++++++++-
>  mm/migrate.c         |    2 +-
>  mm/rmap.c            |   40 ++++++++++++++++++++++------------------
>  mm/vmscan.c          |    2 +-
>  4 files changed, 37 insertions(+), 21 deletions(-)
> 
> Index: linux/include/linux/rmap.h
> ===================================================================
> --- linux.orig/include/linux/rmap.h	2009-05-27 21:14:21.000000000 +0200
> +++ linux/include/linux/rmap.h	2009-05-27 21:19:18.000000000 +0200
> @@ -84,7 +84,19 @@
>   * Called from mm/vmscan.c to handle paging out
>   */
>  int page_referenced(struct page *, int is_locked, struct mem_cgroup *cnt);
> -int try_to_unmap(struct page *, int ignore_refs);
> +
> +enum ttu_flags {
> +	TTU_UNMAP = 0,			/* unmap mode */
> +	TTU_MIGRATION = 1,		/* migration mode */
> +	TTU_MUNLOCK = 2,		/* munlock mode */
> +	TTU_ACTION_MASK = 0xff,
> +
> +	TTU_IGNORE_MLOCK = (1 << 8),	/* ignore mlock */
> +	TTU_IGNORE_ACCESS = (1 << 9),	/* don't age */
> +};
> +#define TTU_ACTION(x) ((x) & TTU_ACTION_MASK)
> +
> +int try_to_unmap(struct page *, enum ttu_flags flags);
>  
>  /*
>   * Called from mm/filemap_xip.c to unmap empty zero page
> Index: linux/mm/rmap.c
> ===================================================================
> --- linux.orig/mm/rmap.c	2009-05-27 21:14:21.000000000 +0200
> +++ linux/mm/rmap.c	2009-05-27 21:19:18.000000000 +0200
> @@ -755,7 +755,7 @@
>   * repeatedly from either try_to_unmap_anon or try_to_unmap_file.
>   */
>  static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> -				int migration)
> +				enum ttu_flags flags)
>  {
>  	struct mm_struct *mm = vma->vm_mm;
>  	unsigned long address;
> @@ -777,11 +777,13 @@
>  	 * If it's recently referenced (perhaps page_referenced
>  	 * skipped over this mm) then we should reactivate it.
>  	 */
> -	if (!migration) {
> +	if (!(flags & TTU_IGNORE_MLOCK)) {
>  		if (vma->vm_flags & VM_LOCKED) {
>  			ret = SWAP_MLOCK;
>  			goto out_unmap;
>  		}
> +	}
> +	if (!(flags & TTU_IGNORE_ACCESS)) {
>  		if (ptep_clear_flush_young_notify(vma, address, pte)) {
>  			ret = SWAP_FAIL;
>  			goto out_unmap;
> @@ -821,12 +823,12 @@
>  			 * pte. do_swap_page() will wait until the migration
>  			 * pte is removed and then restart fault handling.
>  			 */
> -			BUG_ON(!migration);
> +			BUG_ON(TTU_ACTION(flags) != TTU_MIGRATION);
>  			entry = make_migration_entry(page, pte_write(pteval));
>  		}
>  		set_pte_at(mm, address, pte, swp_entry_to_pte(entry));
>  		BUG_ON(pte_file(*pte));
> -	} else if (PAGE_MIGRATION && migration) {
> +	} else if (PAGE_MIGRATION && (TTU_ACTION(flags) == TTU_MIGRATION)) {
>  		/* Establish migration entry for a file page */
>  		swp_entry_t entry;
>  		entry = make_migration_entry(page, pte_write(pteval));
> @@ -995,12 +997,13 @@
>   * vm_flags for that VMA.  That should be OK, because that vma shouldn't be
>   * 'LOCKED.
>   */
> -static int try_to_unmap_anon(struct page *page, int unlock, int migration)
> +static int try_to_unmap_anon(struct page *page, enum ttu_flags flags)
>  {
>  	struct anon_vma *anon_vma;
>  	struct vm_area_struct *vma;
>  	unsigned int mlocked = 0;
>  	int ret = SWAP_AGAIN;
> +	int unlock = TTU_ACTION(flags) == TTU_MUNLOCK;
>  
>  	if (MLOCK_PAGES && unlikely(unlock))
>  		ret = SWAP_SUCCESS;	/* default for try_to_munlock() */
> @@ -1016,7 +1019,7 @@
>  				continue;  /* must visit all unlocked vmas */
>  			ret = SWAP_MLOCK;  /* saw at least one mlocked vma */
>  		} else {
> -			ret = try_to_unmap_one(page, vma, migration);
> +			ret = try_to_unmap_one(page, vma, flags);
>  			if (ret == SWAP_FAIL || !page_mapped(page))
>  				break;
>  		}
> @@ -1040,8 +1043,7 @@
>  /**
>   * try_to_unmap_file - unmap/unlock file page using the object-based rmap method
>   * @page: the page to unmap/unlock
> - * @unlock:  request for unlock rather than unmap [unlikely]
> - * @migration:  unmapping for migration - ignored if @unlock
> + * @flags: action and flags
>   *
>   * Find all the mappings of a page using the mapping pointer and the vma chains
>   * contained in the address_space struct it points to.
> @@ -1053,7 +1055,7 @@
>   * vm_flags for that VMA.  That should be OK, because that vma shouldn't be
>   * 'LOCKED.
>   */
> -static int try_to_unmap_file(struct page *page, int unlock, int migration)
> +static int try_to_unmap_file(struct page *page, enum ttu_flags flags)
>  {
>  	struct address_space *mapping = page->mapping;
>  	pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
> @@ -1065,6 +1067,7 @@
>  	unsigned long max_nl_size = 0;
>  	unsigned int mapcount;
>  	unsigned int mlocked = 0;
> +	int unlock = TTU_ACTION(flags) == TTU_MUNLOCK;
>  
>  	if (MLOCK_PAGES && unlikely(unlock))
>  		ret = SWAP_SUCCESS;	/* default for try_to_munlock() */
> @@ -1077,7 +1080,7 @@
>  				continue;	/* must visit all vmas */
>  			ret = SWAP_MLOCK;
>  		} else {
> -			ret = try_to_unmap_one(page, vma, migration);
> +			ret = try_to_unmap_one(page, vma, flags);
>  			if (ret == SWAP_FAIL || !page_mapped(page))
>  				goto out;
>  		}
> @@ -1102,7 +1105,8 @@
>  			ret = SWAP_MLOCK;	/* leave mlocked == 0 */
>  			goto out;		/* no need to look further */
>  		}
> -		if (!MLOCK_PAGES && !migration && (vma->vm_flags & VM_LOCKED))
> +		if (!MLOCK_PAGES && !(flags & TTU_IGNORE_MLOCK) &&
> +			(vma->vm_flags & VM_LOCKED))
>  			continue;
>  		cursor = (unsigned long) vma->vm_private_data;
>  		if (cursor > max_nl_cursor)
> @@ -1136,7 +1140,7 @@
>  	do {
>  		list_for_each_entry(vma, &mapping->i_mmap_nonlinear,
>  						shared.vm_set.list) {
> -			if (!MLOCK_PAGES && !migration &&
> +			if (!MLOCK_PAGES && !(flags & TTU_IGNORE_MLOCK) &&
>  			    (vma->vm_flags & VM_LOCKED))
>  				continue;
>  			cursor = (unsigned long) vma->vm_private_data;
> @@ -1176,7 +1180,7 @@
>  /**
>   * try_to_unmap - try to remove all page table mappings to a page
>   * @page: the page to get unmapped
> - * @migration: migration flag
> + * @flags: action and flags
>   *
>   * Tries to remove all the page table entries which are mapping this
>   * page, used in the pageout path.  Caller must hold the page lock.
> @@ -1187,16 +1191,16 @@
>   * SWAP_FAIL	- the page is unswappable
>   * SWAP_MLOCK	- page is mlocked.
>   */
> -int try_to_unmap(struct page *page, int migration)
> +int try_to_unmap(struct page *page, enum ttu_flags flags)
>  {
>  	int ret;
>  
>  	BUG_ON(!PageLocked(page));
>  
>  	if (PageAnon(page))
> -		ret = try_to_unmap_anon(page, 0, migration);
> +		ret = try_to_unmap_anon(page, flags);
>  	else
> -		ret = try_to_unmap_file(page, 0, migration);
> +		ret = try_to_unmap_file(page, flags);
>  	if (ret != SWAP_MLOCK && !page_mapped(page))
>  		ret = SWAP_SUCCESS;
>  	return ret;
> @@ -1222,8 +1226,8 @@
>  	VM_BUG_ON(!PageLocked(page) || PageLRU(page));
>  
>  	if (PageAnon(page))
> -		return try_to_unmap_anon(page, 1, 0);
> +		return try_to_unmap_anon(page, TTU_MUNLOCK);
>  	else
> -		return try_to_unmap_file(page, 1, 0);
> +		return try_to_unmap_file(page, TTU_MUNLOCK);
>  }
>  #endif
> Index: linux/mm/vmscan.c
> ===================================================================
> --- linux.orig/mm/vmscan.c	2009-05-27 21:13:54.000000000 +0200
> +++ linux/mm/vmscan.c	2009-05-27 21:14:21.000000000 +0200
> @@ -666,7 +666,7 @@
>  		 * processes. Try to unmap it here.
>  		 */
>  		if (page_mapped(page) && mapping) {
> -			switch (try_to_unmap(page, 0)) {
> +			switch (try_to_unmap(page, TTU_UNMAP)) {
>  			case SWAP_FAIL:
>  				goto activate_locked;
>  			case SWAP_AGAIN:
> Index: linux/mm/migrate.c
> ===================================================================
> --- linux.orig/mm/migrate.c	2009-05-27 21:13:54.000000000 +0200
> +++ linux/mm/migrate.c	2009-05-27 21:14:21.000000000 +0200
> @@ -669,7 +669,7 @@
>  	}
>  
>  	/* Establish migration ptes or remove ptes */
> -	try_to_unmap(page, 1);
> +	try_to_unmap(page, TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS);
>  
>  	if (!page_mapped(page))
>  		rc = move_to_new_page(newpage, page);

^ permalink raw reply	[flat|nested] 232+ messages in thread

* Re: [PATCH] [9/16] HWPOISON: Use bitmask/action code for try_to_unmap behaviour
@ 2009-05-28  7:27     ` Nick Piggin
  0 siblings, 0 replies; 232+ messages in thread
From: Nick Piggin @ 2009-05-28  7:27 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Lee.Schermerhorn, akpm, linux-kernel, linux-mm, fengguang.wu

On Wed, May 27, 2009 at 10:12:35PM +0200, Andi Kleen wrote:
> 
> try_to_unmap currently has multiple modes (migration, munlock, normal unmap)
> which are selected by magic flag variables. The logic is not very
> straightforward, because each of these flags changes multiple behaviours
> (e.g. migration not only sets up migration ptes but also turns off aging).
> Also the different flags interact in magic ways.
> 
> A later patch in this series adds another mode to try_to_unmap, so 
> this becomes quickly unmanageable.
> 
> Replace the different flags with an action code (migration, munlock, munmap)
> and some additional flags as modifiers (ignore mlock, ignore aging).
> This makes the logic more straightforward and allows easier extension
> to new behaviours. Change all the callers to declare what they want to
> do.
> 
> This patch is supposed to be a no-op in behaviour. If anyone can prove
> it is not, that would be a bug.

Not a bad idea, but I would prefer to have a set of flags which tell
try_to_unmap what to do, and then combine them with #defines for
callers. Like gfp flags.

And just use regular bitops rather than this TTU_ACTION macro.

> 
> Cc: Lee.Schermerhorn@hp.com
> Cc: npiggin@suse.de
> 
> Signed-off-by: Andi Kleen <ak@linux.intel.com>
> 
> ---
>  include/linux/rmap.h |   14 +++++++++++++-
>  mm/migrate.c         |    2 +-
>  mm/rmap.c            |   40 ++++++++++++++++++++++------------------
>  mm/vmscan.c          |    2 +-
>  4 files changed, 37 insertions(+), 21 deletions(-)
> 
> Index: linux/include/linux/rmap.h
> ===================================================================
> --- linux.orig/include/linux/rmap.h	2009-05-27 21:14:21.000000000 +0200
> +++ linux/include/linux/rmap.h	2009-05-27 21:19:18.000000000 +0200
> @@ -84,7 +84,19 @@
>   * Called from mm/vmscan.c to handle paging out
>   */
>  int page_referenced(struct page *, int is_locked, struct mem_cgroup *cnt);
> -int try_to_unmap(struct page *, int ignore_refs);
> +
> +enum ttu_flags {
> +	TTU_UNMAP = 0,			/* unmap mode */
> +	TTU_MIGRATION = 1,		/* migration mode */
> +	TTU_MUNLOCK = 2,		/* munlock mode */
> +	TTU_ACTION_MASK = 0xff,
> +
> +	TTU_IGNORE_MLOCK = (1 << 8),	/* ignore mlock */
> +	TTU_IGNORE_ACCESS = (1 << 9),	/* don't age */
> +};
> +#define TTU_ACTION(x) ((x) & TTU_ACTION_MASK)
> +
> +int try_to_unmap(struct page *, enum ttu_flags flags);
>  
>  /*
>   * Called from mm/filemap_xip.c to unmap empty zero page
> Index: linux/mm/rmap.c
> ===================================================================
> --- linux.orig/mm/rmap.c	2009-05-27 21:14:21.000000000 +0200
> +++ linux/mm/rmap.c	2009-05-27 21:19:18.000000000 +0200
> @@ -755,7 +755,7 @@
>   * repeatedly from either try_to_unmap_anon or try_to_unmap_file.
>   */
>  static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> -				int migration)
> +				enum ttu_flags flags)
>  {
>  	struct mm_struct *mm = vma->vm_mm;
>  	unsigned long address;
> @@ -777,11 +777,13 @@
>  	 * If it's recently referenced (perhaps page_referenced
>  	 * skipped over this mm) then we should reactivate it.
>  	 */
> -	if (!migration) {
> +	if (!(flags & TTU_IGNORE_MLOCK)) {
>  		if (vma->vm_flags & VM_LOCKED) {
>  			ret = SWAP_MLOCK;
>  			goto out_unmap;
>  		}
> +	}
> +	if (!(flags & TTU_IGNORE_ACCESS)) {
>  		if (ptep_clear_flush_young_notify(vma, address, pte)) {
>  			ret = SWAP_FAIL;
>  			goto out_unmap;
> @@ -821,12 +823,12 @@
>  			 * pte. do_swap_page() will wait until the migration
>  			 * pte is removed and then restart fault handling.
>  			 */
> -			BUG_ON(!migration);
> +			BUG_ON(TTU_ACTION(flags) != TTU_MIGRATION);
>  			entry = make_migration_entry(page, pte_write(pteval));
>  		}
>  		set_pte_at(mm, address, pte, swp_entry_to_pte(entry));
>  		BUG_ON(pte_file(*pte));
> -	} else if (PAGE_MIGRATION && migration) {
> +	} else if (PAGE_MIGRATION && (TTU_ACTION(flags) == TTU_MIGRATION)) {
>  		/* Establish migration entry for a file page */
>  		swp_entry_t entry;
>  		entry = make_migration_entry(page, pte_write(pteval));
> @@ -995,12 +997,13 @@
>   * vm_flags for that VMA.  That should be OK, because that vma shouldn't be
>   * 'LOCKED.
>   */
> -static int try_to_unmap_anon(struct page *page, int unlock, int migration)
> +static int try_to_unmap_anon(struct page *page, enum ttu_flags flags)
>  {
>  	struct anon_vma *anon_vma;
>  	struct vm_area_struct *vma;
>  	unsigned int mlocked = 0;
>  	int ret = SWAP_AGAIN;
> +	int unlock = TTU_ACTION(flags) == TTU_MUNLOCK;
>  
>  	if (MLOCK_PAGES && unlikely(unlock))
>  		ret = SWAP_SUCCESS;	/* default for try_to_munlock() */
> @@ -1016,7 +1019,7 @@
>  				continue;  /* must visit all unlocked vmas */
>  			ret = SWAP_MLOCK;  /* saw at least one mlocked vma */
>  		} else {
> -			ret = try_to_unmap_one(page, vma, migration);
> +			ret = try_to_unmap_one(page, vma, flags);
>  			if (ret == SWAP_FAIL || !page_mapped(page))
>  				break;
>  		}
> @@ -1040,8 +1043,7 @@
>  /**
>   * try_to_unmap_file - unmap/unlock file page using the object-based rmap method
>   * @page: the page to unmap/unlock
> - * @unlock:  request for unlock rather than unmap [unlikely]
> - * @migration:  unmapping for migration - ignored if @unlock
> + * @flags: action and flags
>   *
>   * Find all the mappings of a page using the mapping pointer and the vma chains
>   * contained in the address_space struct it points to.
> @@ -1053,7 +1055,7 @@
>   * vm_flags for that VMA.  That should be OK, because that vma shouldn't be
>   * 'LOCKED.
>   */
> -static int try_to_unmap_file(struct page *page, int unlock, int migration)
> +static int try_to_unmap_file(struct page *page, enum ttu_flags flags)
>  {
>  	struct address_space *mapping = page->mapping;
>  	pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
> @@ -1065,6 +1067,7 @@
>  	unsigned long max_nl_size = 0;
>  	unsigned int mapcount;
>  	unsigned int mlocked = 0;
> +	int unlock = TTU_ACTION(flags) == TTU_MUNLOCK;
>  
>  	if (MLOCK_PAGES && unlikely(unlock))
>  		ret = SWAP_SUCCESS;	/* default for try_to_munlock() */
> @@ -1077,7 +1080,7 @@
>  				continue;	/* must visit all vmas */
>  			ret = SWAP_MLOCK;
>  		} else {
> -			ret = try_to_unmap_one(page, vma, migration);
> +			ret = try_to_unmap_one(page, vma, flags);
>  			if (ret == SWAP_FAIL || !page_mapped(page))
>  				goto out;
>  		}
> @@ -1102,7 +1105,8 @@
>  			ret = SWAP_MLOCK;	/* leave mlocked == 0 */
>  			goto out;		/* no need to look further */
>  		}
> -		if (!MLOCK_PAGES && !migration && (vma->vm_flags & VM_LOCKED))
> +		if (!MLOCK_PAGES && !(flags & TTU_IGNORE_MLOCK) &&
> +			(vma->vm_flags & VM_LOCKED))
>  			continue;
>  		cursor = (unsigned long) vma->vm_private_data;
>  		if (cursor > max_nl_cursor)
> @@ -1136,7 +1140,7 @@
>  	do {
>  		list_for_each_entry(vma, &mapping->i_mmap_nonlinear,
>  						shared.vm_set.list) {
> -			if (!MLOCK_PAGES && !migration &&
> +			if (!MLOCK_PAGES && !(flags & TTU_IGNORE_MLOCK) &&
>  			    (vma->vm_flags & VM_LOCKED))
>  				continue;
>  			cursor = (unsigned long) vma->vm_private_data;
> @@ -1176,7 +1180,7 @@
>  /**
>   * try_to_unmap - try to remove all page table mappings to a page
>   * @page: the page to get unmapped
> - * @migration: migration flag
> + * @flags: action and flags
>   *
>   * Tries to remove all the page table entries which are mapping this
>   * page, used in the pageout path.  Caller must hold the page lock.
> @@ -1187,16 +1191,16 @@
>   * SWAP_FAIL	- the page is unswappable
>   * SWAP_MLOCK	- page is mlocked.
>   */
> -int try_to_unmap(struct page *page, int migration)
> +int try_to_unmap(struct page *page, enum ttu_flags flags)
>  {
>  	int ret;
>  
>  	BUG_ON(!PageLocked(page));
>  
>  	if (PageAnon(page))
> -		ret = try_to_unmap_anon(page, 0, migration);
> +		ret = try_to_unmap_anon(page, flags);
>  	else
> -		ret = try_to_unmap_file(page, 0, migration);
> +		ret = try_to_unmap_file(page, flags);
>  	if (ret != SWAP_MLOCK && !page_mapped(page))
>  		ret = SWAP_SUCCESS;
>  	return ret;
> @@ -1222,8 +1226,8 @@
>  	VM_BUG_ON(!PageLocked(page) || PageLRU(page));
>  
>  	if (PageAnon(page))
> -		return try_to_unmap_anon(page, 1, 0);
> +		return try_to_unmap_anon(page, TTU_MUNLOCK);
>  	else
> -		return try_to_unmap_file(page, 1, 0);
> +		return try_to_unmap_file(page, TTU_MUNLOCK);
>  }
>  #endif
> Index: linux/mm/vmscan.c
> ===================================================================
> --- linux.orig/mm/vmscan.c	2009-05-27 21:13:54.000000000 +0200
> +++ linux/mm/vmscan.c	2009-05-27 21:14:21.000000000 +0200
> @@ -666,7 +666,7 @@
>  		 * processes. Try to unmap it here.
>  		 */
>  		if (page_mapped(page) && mapping) {
> -			switch (try_to_unmap(page, 0)) {
> +			switch (try_to_unmap(page, TTU_UNMAP)) {
>  			case SWAP_FAIL:
>  				goto activate_locked;
>  			case SWAP_AGAIN:
> Index: linux/mm/migrate.c
> ===================================================================
> --- linux.orig/mm/migrate.c	2009-05-27 21:13:54.000000000 +0200
> +++ linux/mm/migrate.c	2009-05-27 21:14:21.000000000 +0200
> @@ -669,7 +669,7 @@
>  	}
>  
>  	/* Establish migration ptes or remove ptes */
> -	try_to_unmap(page, 1);
> +	try_to_unmap(page, TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS);
>  
>  	if (!page_mapped(page))
>  		rc = move_to_new_page(newpage, page);

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <dont@kvack.org>

^ permalink raw reply	[flat|nested] 232+ messages in thread

* Re: [PATCH] [1/16] HWPOISON: Add page flag for poisoned pages
  2009-05-27 21:15     ` Alan Cox
@ 2009-05-28  7:54       ` Andi Kleen
  -1 siblings, 0 replies; 232+ messages in thread
From: Andi Kleen @ 2009-05-28  7:54 UTC (permalink / raw)
  To: Alan Cox; +Cc: Andi Kleen, akpm, linux-kernel, linux-mm, fengguang.wu

On Wed, May 27, 2009 at 10:15:10PM +0100, Alan Cox wrote:
> On Wed, 27 May 2009 22:12:26 +0200 (CEST)
> Andi Kleen <andi@firstfloor.org> wrote:
> 
> > 
> > Hardware poisoned pages need special handling in the VM and shouldn't be 
> > touched again. This requires a new page flag. Define it here.
> 
> Why can't you use PG_reserved ? That already indicates the page may not
> even be present (which is effectively your situation at that point).

Right now a page must be present with PG_reserved, otherwise /dev/mem,
/proc/kcore and lots of other things will explode.

> Given lots of other hardware platforms we support bus error, machine
> check, explode or do random undefined fun things when you touch pages
> that don't exist I'm not sure I see why poisoned is different here ?

It's really a special case for lots of things and mixing it up with
PG_reserved is not very useful I think. Also page flags are not 
that tight a resource anymore anyways. I think it's better to have
it separated.

However I would expect that other architectures would use poisoned
pages too for their own similar issues. It's not really a x86 specific
concept.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 232+ messages in thread

* Re: [PATCH] [9/16] HWPOISON: Use bitmask/action code for try_to_unmap behaviour
  2009-05-28  7:27     ` Nick Piggin
@ 2009-05-28  8:03       ` Andi Kleen
  -1 siblings, 0 replies; 232+ messages in thread
From: Andi Kleen @ 2009-05-28  8:03 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andi Kleen, Lee.Schermerhorn, akpm, linux-kernel, linux-mm, fengguang.wu

On Thu, May 28, 2009 at 09:27:03AM +0200, Nick Piggin wrote:
> Not a bad idea, but I would prefer to have a set of flags which tell
> try_to_unmap what to do, and then combine them with #defines for
> callers. Like gfp flags.

That's exactly what the patch does?

It just has actions and flags because the actions can be contradictory.

> And just use regular bitops rather than this TTU_ACTION macro.

TTU_ACTION does mask against multiple bits. None of the regular
bitops do that. 

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 232+ messages in thread

* Re: [PATCH] [13/16] HWPOISON: The high level memory error handler in the VM v3
  2009-05-27 20:12   ` Andi Kleen
@ 2009-05-28  8:26     ` Nick Piggin
  -1 siblings, 0 replies; 232+ messages in thread
From: Nick Piggin @ 2009-05-28  8:26 UTC (permalink / raw)
  To: Andi Kleen
  Cc: hugh, riel, akpm, chris.mason, linux-kernel, linux-mm, fengguang.wu

On Wed, May 27, 2009 at 10:12:39PM +0200, Andi Kleen wrote:
> 
> This patch adds the high level memory handler that poisons pages
> that got corrupted by hardware (typically by a bit flip in a DIMM
> or a cache) on the Linux level, so that Linux avoids accessing
> these pages in the future.

Quick review.

> Index: linux/mm/Makefile
> ===================================================================
> --- linux.orig/mm/Makefile	2009-05-27 21:23:18.000000000 +0200
> +++ linux/mm/Makefile	2009-05-27 21:24:39.000000000 +0200
> @@ -38,3 +38,4 @@
>  endif
>  obj-$(CONFIG_QUICKLIST) += quicklist.o
>  obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o page_cgroup.o
> +obj-$(CONFIG_MEMORY_FAILURE) += memory-failure.o
> Index: linux/mm/memory-failure.c
> ===================================================================
> --- /dev/null	1970-01-01 00:00:00.000000000 +0000
> +++ linux/mm/memory-failure.c	2009-05-27 21:28:19.000000000 +0200
> @@ -0,0 +1,677 @@
> +/*
> + * Copyright (C) 2008, 2009 Intel Corporation
> + * Author: Andi Kleen
> + *
> + * This software may be redistributed and/or modified under the terms of
> + * the GNU General Public License ("GPL") version 2 only as published by the
> + * Free Software Foundation.
> + *
> + * High level machine check handler. Handles pages reported by the
> + * hardware as being corrupted usually due to a 2bit ECC memory or cache
> + * failure.
> + *
> + * This focuses on pages detected as corrupted in the background.
> + * When the current CPU tries to consume corruption the currently
> + * running process can just be killed directly instead. This implies
> + * that if the error cannot be handled for some reason it's safe to
> + * just ignore it because no corruption has been consumed yet. Instead
> + * when that happens another machine check will happen.
> + *
> + * Handles page cache pages in various states.	The tricky part
> + * here is that we can access any page asynchronous to other VM
> + * users, because memory failures could happen anytime and anywhere,
> + * possibly violating some of their assumptions. This is why this code
> + * has to be extremely careful. Generally it tries to use normal locking
> + * rules, as in get the standard locks, even if that means the
> + * error handling takes potentially a long time.
> + *
> + * The operation to map back from RMAP chains to processes has to walk
> + * the complete process list and has non linear complexity with the number
> + * of mappings. In short it can be quite slow. But since memory corruptions
> + * are rare we hope to get away with this.
> + */
> +
> +/*
> + * Notebook:
> + * - hugetlb needs more code
> + * - nonlinear
> + * - remap races
> + * - anonymous (tinject):
> + *   + left over references when process catches signal?
> + * - kcore/oldmem/vmcore/mem/kmem check for hwpoison pages
> + * - pass bad pages to kdump next kernel
> + */
> +#include <linux/kernel.h>
> +#include <linux/mm.h>
> +#include <linux/page-flags.h>
> +#include <linux/sched.h>
> +#include <linux/rmap.h>
> +#include <linux/pagemap.h>
> +#include <linux/swap.h>
> +#include <linux/backing-dev.h>
> +#include "internal.h"
> +
> +#define Dprintk(x...) printk(x)
> +
> +int sysctl_memory_failure_early_kill __read_mostly = 1;
> +
> +atomic_long_t mce_bad_pages __read_mostly = ATOMIC_LONG_INIT(0);
> +
> +/*
> + * Send all the processes who have the page mapped an ``action optional''
> + * signal.
> + */
> +static int kill_proc_ao(struct task_struct *t, unsigned long addr, int trapno,
> +			unsigned long pfn)
> +{
> +	struct siginfo si;
> +	int ret;
> +
> +	printk(KERN_ERR
> +		"MCE %#lx: Killing %s:%d due to hardware memory corruption\n",
> +		pfn, t->comm, t->pid);
> +	si.si_signo = SIGBUS;
> +	si.si_errno = 0;
> +	si.si_code = BUS_MCEERR_AO;
> +	si.si_addr = (void *)addr;
> +#ifdef __ARCH_SI_TRAPNO
> +	si.si_trapno = trapno;
> +#endif
> +	si.si_addr_lsb = PAGE_SHIFT;
> +	/*
> +	 * Don't use force here, it's convenient if the signal
> +	 * can be temporarily blocked.
> +	 * This could cause a loop when the user sets SIGBUS
> +	 * to SIG_IGN, but hopefully noone will do that?
> +	 */
> +	ret = send_sig_info(SIGBUS, &si, t);  /* synchronous? */
> +	if (ret < 0)
> +		printk(KERN_INFO "MCE: Error sending signal to %s:%d: %d\n",
> +		       t->comm, t->pid, ret);
> +	return ret;
> +}
> +
> +/*
> + * Kill all processes that have a poisoned page mapped and then isolate
> + * the page.
> + *
> + * General strategy:
> + * Find all processes having the page mapped and kill them.
> + * But we keep a page reference around so that the page is not
> + * actually freed yet.
> + * Then stash the page away
> + *
> + * There's no convenient way to get back to mapped processes
> + * from the VMAs. So do a brute-force search over all
> + * running processes.
> + *
> + * Remember that machine checks are not common (or rather
> + * if they are common you have other problems), so this shouldn't
> + * be a performance issue.
> + *
> + * Also there are some races possible while we get from the
> + * error detection to actually handle it.
> + */
> +
> +struct to_kill {
> +	struct list_head nd;
> +	struct task_struct *tsk;
> +	unsigned long addr;
> +};

It would be kinda nice to have a field in task_struct that is usable
say for anyone holding the tasklist lock for write. Then you could
make a list with them. But I guess it isn't worthwhile unless there
are other users.

> +
> +/*
> + * Failure handling: if we can't find or can't kill a process there's
> + * not much we can do.	We just print a message and ignore otherwise.
> + */
> +
> +/*
> + * Schedule a process for later kill.
> + * Uses GFP_ATOMIC allocations to avoid potential recursions in the VM.
> + * TBD would GFP_NOIO be enough?
> + */
> +static void add_to_kill(struct task_struct *tsk, struct page *p,
> +		       struct vm_area_struct *vma,
> +		       struct list_head *to_kill,
> +		       struct to_kill **tkc)
> +{
> +	int fail = 0;
> +	struct to_kill *tk;
> +
> +	if (*tkc) {
> +		tk = *tkc;
> +		*tkc = NULL;
> +	} else {
> +		tk = kmalloc(sizeof(struct to_kill), GFP_ATOMIC);
> +		if (!tk) {
> +			printk(KERN_ERR "MCE: Out of memory while machine check handling\n");
> +			return;
> +		}
> +	}
> +	tk->addr = page_address_in_vma(p, vma);
> +	if (tk->addr == -EFAULT) {
> +		printk(KERN_INFO "MCE: Failed to get address in VMA\n");

I don't know if this is a very helpful message. It could legitimately happen
and there's nothing anybody can do about it...

> +		tk->addr = 0;
> +		fail = 1;

Fail doesn't seem to be used anywhere.


> +	}
> +	get_task_struct(tsk);
> +	tk->tsk = tsk;
> +	list_add_tail(&tk->nd, to_kill);
> +}
> +
> +/*
> + * Kill the processes that have been collected earlier.
> + */
> +static void kill_procs_ao(struct list_head *to_kill, int doit, int trapno,
> +			  int fail, unsigned long pfn)

I guess "doit" etc is obvious once reading the code and caller, but maybe a
quick comment in the header to describe?

> +{
> +	struct to_kill *tk, *next;
> +
> +	list_for_each_entry_safe (tk, next, to_kill, nd) {
> +		if (doit) {
> +			/*
> +			 * In case something went wrong with munmaping
> +			 * make sure the process doesn't catch the
> +			 * signal and then access the memory. So reset
> +			 * the signal handlers
> +			 */
> +			if (fail)
> +				flush_signal_handlers(tk->tsk, 1);

Is this a legitimate thing to do? Is it racy? Why would you not send a
sigkill or something if you want them to die right now?


> +
> +			/*
> +			 * In theory the process could have mapped
> +			 * something else on the address in-between. We could
> +			 * check for that, but we need to tell the
> +			 * process anyways.
> +			 */
> +			if (kill_proc_ao(tk->tsk, tk->addr, trapno, pfn) < 0)
> +				printk(KERN_ERR
> +		"MCE %#lx: Cannot send advisory machine check signal to %s:%d\n",
> +					pfn, tk->tsk->comm, tk->tsk->pid);
> +		}
> +		put_task_struct(tk->tsk);
> +		kfree(tk);
> +	}
> +}
> +
> +/*
> + * Collect processes when the error hit an anonymous page.
> + */
> +static void collect_procs_anon(struct page *page, struct list_head *to_kill,
> +			      struct to_kill **tkc)
> +{
> +	struct vm_area_struct *vma;
> +	struct task_struct *tsk;
> +	struct anon_vma *av = page_lock_anon_vma(page);
> +
> +	if (av == NULL)	/* Not actually mapped anymore */
> +		return;
> +
> +	read_lock(&tasklist_lock);
> +	for_each_process (tsk) {
> +		if (!tsk->mm)
> +			continue;
> +		list_for_each_entry (vma, &av->head, anon_vma_node) {
> +			if (vma->vm_mm == tsk->mm)
> +				add_to_kill(tsk, page, vma, to_kill, tkc);
> +		}
> +	}
> +	page_unlock_anon_vma(av);
> +	read_unlock(&tasklist_lock);
> +}
> +
> +/*
> + * Collect processes when the error hit a file mapped page.
> + */
> +static void collect_procs_file(struct page *page, struct list_head *to_kill,
> +			      struct to_kill **tkc)
> +{
> +	struct vm_area_struct *vma;
> +	struct task_struct *tsk;
> +	struct prio_tree_iter iter;
> +	struct address_space *mapping = page_mapping(page);
> +
> +	read_lock(&tasklist_lock);
> +	spin_lock(&mapping->i_mmap_lock);

You have tasklist_lock(R) nesting outside i_mmap_lock, and inside anon_vma
lock. And anon_vma lock nests inside i_mmap_lock.

This seems fragile. If rwlocks ever become FIFO or tasklist_lock changes
type (maybe -rt kernels do it), then you could have a task holding
anon_vma lock and waiting for tasklist_lock, and another holding tasklist
lock and waiting for i_mmap_lock, and another holding i_mmap_lock and
waiting for anon_vma lock.

I think nesting either inside or outside these locks consistently is less
fragile. Do we already have a dependency?... I don't know of one, but you
should document this in mm/rmap.c and mm/filemap.c.


> +	for_each_process(tsk) {
> +		pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
> +
> +		if (!tsk->mm)
> +			continue;
> +
> +		vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff,
> +				      pgoff)
> +			if (vma->vm_mm == tsk->mm)
> +				add_to_kill(tsk, page, vma, to_kill, tkc);
> +	}
> +	spin_unlock(&mapping->i_mmap_lock);
> +	read_unlock(&tasklist_lock);
> +}
> +
> +/*
> + * Collect the processes who have the corrupted page mapped to kill.
> + * This is done in two steps for locking reasons.
> + * First preallocate one tokill structure outside the spin locks,
> + * so that we can kill at least one process reasonably reliable.
> + */
> +static void collect_procs(struct page *page, struct list_head *tokill)
> +{
> +	struct to_kill *tk;
> +
> +	tk = kmalloc(sizeof(struct to_kill), GFP_KERNEL);
> +	/* memory allocation failure is implicitly handled */

Well... it's explicitly handled... in the callee ;)


> +	if (PageAnon(page))
> +		collect_procs_anon(page, tokill, &tk);
> +	else
> +		collect_procs_file(page, tokill, &tk);
> +	kfree(tk);
> +}
> +
> +/*
> + * Error handlers for various types of pages.
> + */
> +
> +enum outcome {
> +	FAILED,
> +	DELAYED,
> +	IGNORED,
> +	RECOVERED,
> +};
> +
> +static const char *action_name[] = {
> +	[FAILED] = "Failed",
> +	[DELAYED] = "Delayed",
> +	[IGNORED] = "Ignored",

How is delayed different to ignored (or failed, for that matter)?


> +	[RECOVERED] = "Recovered",

And what does recovered mean? The processes were killed and the page
taken out of circulation, but the machine is still in some unknown state
of corruption henceforth, right?


> +};
> +
> +/*
> + * Error hit kernel page.
> + * Do nothing, try to be lucky and not touch this instead. For a few cases we
> + * could be more sophisticated.
> + */
> +static int me_kernel(struct page *p)
> +{
> +	return DELAYED;
> +}
> +
> +/*
> + * Already poisoned page.
> + */
> +static int me_ignore(struct page *p)
> +{
> +	return IGNORED;
> +}
> +
> +/*
> + * Page in unknown state. Do nothing.
> + */
> +static int me_unknown(struct page *p)
> +{
> +	printk(KERN_ERR "MCE %#lx: Unknown page state\n", page_to_pfn(p));
> +	return FAILED;
> +}
> +
> +/*
> + * Free memory
> + */
> +static int me_free(struct page *p)
> +{
> +	return DELAYED;
> +}
> +
> +/*
> + * Clean (or cleaned) page cache page.
> + */
> +static int me_pagecache_clean(struct page *p)
> +{
> +	if (!isolate_lru_page(p))
> +		page_cache_release(p);
> +
> +	if (page_has_private(p))
> +		do_invalidatepage(p, 0);
> +	if (page_has_private(p) && !try_to_release_page(p, GFP_NOIO))
> +		Dprintk(KERN_ERR "MCE %#lx: failed to release buffers\n",
> +			page_to_pfn(p));
> +
> +	/*
> +	 * remove_from_page_cache assumes (mapping && !mapped)
> +	 */
> +	if (page_mapping(p) && !page_mapped(p)) {
> +		remove_from_page_cache(p);
> +		page_cache_release(p);
> +	}

remove_mapping would probably be a better idea. Otherwise you can
probably introduce pagecache removal vs page fault races which
will make the kernel BUG.


> +
> +	return RECOVERED;
> +}
> +
> +/*
> + * Dirty page cache page
> + * Issues: when the error hit a hole page the error is not properly
> + * propagated.
> + */
> +static int me_pagecache_dirty(struct page *p)
> +{
> +	struct address_space *mapping = page_mapping(p);
> +
> +	SetPageError(p);
> +	/* TBD: print more information about the file. */
> +	printk(KERN_ERR "MCE %#lx: Hardware memory corruption on dirty file page: write error\n",
> +			page_to_pfn(p));
> +	if (mapping) {
> +		/*
> +		 * Truncate does the same, but we're not quite the same
> +		 * as truncate. Needs more checking, but keep it for now.
> +		 */

What's different about truncate? It would be good to reuse as much as possible.


> +		cancel_dirty_page(p, PAGE_CACHE_SIZE);
> +
> +		/*
> +		 * IO error will be reported by write(), fsync(), etc.
> +		 * who check the mapping.
> +		 */
> +		mapping_set_error(mapping, EIO);

Interesting. It's not *exactly* an IO error (well, not like one we're usually
used to).


> +	}
> +
> +	me_pagecache_clean(p);
> +
> +	/*
> +	 * Did the earlier release work?
> +	 */
> +	if (page_has_private(p) && !try_to_release_page(p, GFP_NOIO))
> +		return FAILED;
> +
> +	return RECOVERED;
> +}
> +
> +/*
> + * Clean and dirty swap cache.
> + */
> +static int me_swapcache_dirty(struct page *p)
> +{
> +	ClearPageDirty(p);
> +
> +	if (!isolate_lru_page(p))
> +		page_cache_release(p);
> +
> +	return DELAYED;
> +}
> +
> +static int me_swapcache_clean(struct page *p)
> +{
> +	ClearPageUptodate(p);
> +
> +	if (!isolate_lru_page(p))
> +		page_cache_release(p);
> +
> +	delete_from_swap_cache(p);
> +
> +	return RECOVERED;
> +}

All these handlers are quite interesting in that they need to
know about most of the mm. It would be a good idea to say what
you are trying to do in each of them, and they should probably
go into their appropriate files instead of all being here
(eg. the swapcache stuff should go in mm/swap_state.c, for example).

You haven't waited on writeback here AFAIKS, and have you
*really* verified it is safe to call delete_from_swap_cache?



> +/*
> + * Huge pages. Needs work.
> + * Issues:
> + * No rmap support so we cannot find the original mapper. In theory could walk
> + * all MMs and look for the mappings, but that would be non atomic and racy.
> + * Need rmap for hugepages for this. Alternatively we could employ a heuristic,
> + * like just walking the current process and hoping it has it mapped (that
> + * should be usually true for the common "shared database cache" case)
> + * Should handle free huge pages and dequeue them too, but this needs to
> + * handle huge page accounting correctly.
> + */
> +static int me_huge_page(struct page *p)
> +{
> +	return FAILED;
> +}
> +
> +/*
> + * Various page states we can handle.
> + *
> + * A page state is defined by its current page->flags bits.
> + * The table matches them in order and calls the right handler.
> + *
> + * This is quite tricky because we can access the page at any time
> + * in its life cycle, so all accesses have to be extremely careful.
> + *
> + * This is not complete. More states could be added.
> + * For any missing state don't attempt recovery.
> + */
> +
> +#define dirty		(1UL << PG_dirty)
> +#define swapcache	(1UL << PG_swapcache)
> +#define unevict		(1UL << PG_unevictable)
> +#define mlocked		(1UL << PG_mlocked)
> +#define writeback	(1UL << PG_writeback)
> +#define lru		(1UL << PG_lru)
> +#define swapbacked	(1UL << PG_swapbacked)
> +#define head		(1UL << PG_head)
> +#define tail		(1UL << PG_tail)
> +#define compound	(1UL << PG_compound)
> +#define slab		(1UL << PG_slab)
> +#define buddy		(1UL << PG_buddy)
> +#define reserved	(1UL << PG_reserved)

This looks like more work than just putting 1UL << (...) in each entry
in your table. Hmm, does this whole table thing even buy you much
(versus a much simpler switch statement)?

And seeing as you are doing a lot of checking for various page flags
anyway (eg. in your prepare function), it just seems like needless
complexity.

> +
> +/*
> + * The table is > 80 columns because all the alternatives were much worse.
> + */
> +
> +static struct page_state {
> +	unsigned long mask;
> +	unsigned long res;
> +	char *msg;
> +	int (*action)(struct page *p);
> +} error_states[] = {
> +	{ reserved,	reserved,	"reserved kernel",	me_ignore },
> +	{ buddy,	buddy,		"free kernel",		me_free },
> +
> +	/*
> +	 * Could in theory check if slab page is free or if we can drop
> +	 * currently unused objects without touching them. But just
> +	 * treat it as standard kernel for now.
> +	 */
> +	{ slab,			slab,		"kernel slab",		me_kernel },
> +
> +#ifdef CONFIG_PAGEFLAGS_EXTENDED
> +	{ head,			head,		"hugetlb",		me_huge_page },
> +	{ tail,			tail,		"hugetlb",		me_huge_page },
> +#else
> +	{ compound,		compound,	"hugetlb",		me_huge_page },
> +#endif
> +
> +	{ swapcache|dirty,	swapcache|dirty,"dirty swapcache",	me_swapcache_dirty },
> +	{ swapcache|dirty,	swapcache,	"clean swapcache",	me_swapcache_clean },
> +
> +#ifdef CONFIG_UNEVICTABLE_LRU
> +	{ unevict|dirty,	unevict|dirty,	"unevictable dirty lru", me_pagecache_dirty },
> +	{ unevict,		unevict,	"unevictable lru",	me_pagecache_clean },
> +#endif
> +
> +#ifdef CONFIG_HAVE_MLOCKED_PAGE_BIT
> +	{ mlocked|dirty,	mlocked|dirty,	"mlocked dirty lru",	me_pagecache_dirty },
> +	{ mlocked,		mlocked,	"mlocked lru",		me_pagecache_clean },
> +#endif
> +
> +	{ lru|dirty,		lru|dirty,	"dirty lru",		me_pagecache_dirty },
> +	{ lru|dirty,		lru,		"clean lru",		me_pagecache_clean },
> +	{ swapbacked,		swapbacked,	"anonymous",		me_pagecache_clean },
> +
> +	/*
> +	 * Add more states here.
> +	 */
> +
> +	/*
> +	 * Catchall entry: must be at end.
> +	 */
> +	{ 0,			0,		"unknown page state",	me_unknown },
> +};
> +
> +static void page_action(char *msg, struct page *p, int (*action)(struct page *),
> +			unsigned long pfn)
> +{
> +	int ret;
> +
> +	printk(KERN_ERR "MCE %#lx: %s page recovery: starting\n", pfn, msg);
> +	ret = action(p);
> +	printk(KERN_ERR "MCE %#lx: %s page recovery: %s\n",
> +	       pfn, msg, action_name[ret]);
> +	if (page_count(p) != 1)
> +		printk(KERN_ERR
> +		       "MCE %#lx: %s page still referenced by %d users\n",
> +		       pfn, msg, page_count(p) - 1);
> +
> +	/* Could do more checks here if page looks ok */
> +	atomic_long_add(1, &mce_bad_pages);
> +
> +	/*
> +	 * Could adjust zone counters here to correct for the missing page.
> +	 */
> +}
> +
> +#define N_UNMAP_TRIES 5
> +
> +static void hwpoison_page_prepare(struct page *p, unsigned long pfn,
> +				  int trapno)
> +{
> +	enum ttu_flags ttu = TTU_UNMAP| TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS;
> +	int kill = sysctl_memory_failure_early_kill;
> +	struct address_space *mapping;
> +	LIST_HEAD(tokill);
> +	int ret;
> +	int i;
> +
> +	if (PageReserved(p) || PageCompound(p) || PageSlab(p))
> +		return;
> +
> +	if (!PageLRU(p))
> +		lru_add_drain();
> +
> +	/*
> +	 * This check implies we don't kill processes if their pages
> +	 * are in the swap cache early. Those are always late kills.
> +	 */
> +	if (!page_mapped(p))
> +		return;
> +
> +	if (PageSwapCache(p)) {
> +		printk(KERN_ERR
> +		       "MCE %#lx: keeping poisoned page in swap cache\n", pfn);
> +		ttu |= TTU_IGNORE_HWPOISON;
> +	}
> +
> +	/*
> +	 * Poisoned clean file pages are harmless, the
> +	 * data can be restored by regular page faults.
> +	 */
> +	mapping = page_mapping(p);
> +	if (!PageDirty(p) && !PageWriteback(p) &&
> +	    !PageAnon(p) && !PageSwapBacked(p) &&
> +	    mapping && mapping_cap_account_dirty(mapping)) {
> +		if (page_mkclean(p))
> +			SetPageDirty(p);
> +		else {
> +			kill = 0;
> +			ttu |= TTU_IGNORE_HWPOISON;
> +		}
> +	}
> +
> +	/*
> +	 * First collect all the processes that have the page
> +	 * mapped.  This has to be done before try_to_unmap,
> +	 * because ttu takes the rmap data structures down.
> +	 *
> +	 * This also has the side effect to propagate the dirty
> +	 * bit from PTEs into the struct page. This is needed
> +	 * to actually decide if something needs to be killed
> +	 * or errored, or if it's ok to just drop the page.
> +	 *
> +	 * Error handling: We ignore errors here because
> +	 * there's nothing that can be done.
> +	 *
> +	 * RED-PEN some cases in process exit seem to deadlock
> +	 * on the page lock. drop it or add poison checks?
> +	 */
> +	if (kill)
> +		collect_procs(p, &tokill);
> +
> +	/*
> +	 * try_to_unmap can fail temporarily due to races.
> +	 * Try a few times (RED-PEN better strategy?)
> +	 */
> +	for (i = 0; i < N_UNMAP_TRIES; i++) {
> +		ret = try_to_unmap(p, ttu);
> +		if (ret == SWAP_SUCCESS)
> +			break;
> +		Dprintk("MCE %#lx: try_to_unmap retry needed %d\n", pfn,  ret);
> +	}
> +
> +	/*
> +	 * Now that the dirty bit has been propagated to the
> +	 * struct page and all unmaps done we can decide if
> +	 * killing is needed or not.  Only kill when the page
> +	 * was dirty, otherwise the tokill list is merely
> +	 * freed.  When there was a problem unmapping earlier
> +	 * use a more forceful uncatchable kill to prevent
> +	 * any accesses to the poisoned memory.
> +	 */
> +	kill_procs_ao(&tokill, !!PageDirty(p), trapno,
> +		      ret != SWAP_SUCCESS, pfn);
> +}
> +
> +/**
> + * memory_failure - Handle memory failure of a page.
> + *
> + */
> +void memory_failure(unsigned long pfn, int trapno)
> +{
> +	struct page_state *ps;
> +	struct page *p;
> +
> +	if (!pfn_valid(pfn)) {
> +		printk(KERN_ERR
> +   "MCE %#lx: Hardware memory corruption in memory outside kernel control\n",
> +		       pfn);
> +		return;
> +	}
> +
> +
> +	p = pfn_to_page(pfn);
> +	if (TestSetPageHWPoison(p)) {
> +		printk(KERN_ERR "MCE %#lx: Error for already hardware poisoned page\n", pfn);
> +		return;
> +	}
> +
> +	/*
> +	 * We need/can do nothing about count=0 pages.
> +	 * 1) it's a free page, and therefore in safe hand:
> +	 *    prep_new_page() will be the gate keeper.
> +	 * 2) it's part of a non-compound high order page.
> +	 *    Implies some kernel user: cannot stop them from
> +	 *    R/W the page; let's pray that the page has been
> +	 *    used and will be freed some time later.
> +	 * In fact it's dangerous to directly bump up page count from 0,
> +	 * that may make page_freeze_refs()/page_unfreeze_refs() mismatch.
> +	 */
> +	if (!get_page_unless_zero(compound_head(p))) {
> +		printk(KERN_ERR
> +		       "MCE 0x%lx: ignoring free or high order page\n", pfn);
> +		return;
> +	}
> +
> +	lock_page_nosync(p);
> +	hwpoison_page_prepare(p, pfn, trapno);
> +
> +	/* Torn down by someone else? */
> +	if (PageLRU(p) && !PageSwapCache(p) && p->mapping == NULL) {
> +		printk(KERN_ERR
> +		       "MCE %#lx: ignoring NULL mapping LRU page\n", pfn);
> +		goto out;
> +	}
> +
> +	for (ps = error_states;; ps++) {
> +		if ((p->flags & ps->mask) == ps->res) {
> +			page_action(ps->msg, p, ps->action, pfn);
> +			break;
> +		}
> +	}
> +out:
> +	unlock_page(p);
> +}
> Index: linux/include/linux/mm.h
> ===================================================================
> --- linux.orig/include/linux/mm.h	2009-05-27 21:24:39.000000000 +0200
> +++ linux/include/linux/mm.h	2009-05-27 21:24:39.000000000 +0200
> @@ -1322,6 +1322,10 @@
>  
>  extern void *alloc_locked_buffer(size_t size);
>  extern void free_locked_buffer(void *buffer, size_t size);
> +
> +extern void memory_failure(unsigned long pfn, int trapno);
> +extern int sysctl_memory_failure_early_kill;
> +extern atomic_long_t mce_bad_pages;
>  extern void release_locked_buffer(void *buffer, size_t size);
>  #endif /* __KERNEL__ */
>  #endif /* _LINUX_MM_H */
> Index: linux/kernel/sysctl.c
> ===================================================================
> --- linux.orig/kernel/sysctl.c	2009-05-27 21:23:18.000000000 +0200
> +++ linux/kernel/sysctl.c	2009-05-27 21:24:39.000000000 +0200
> @@ -1282,6 +1282,20 @@
>  		.proc_handler	= &scan_unevictable_handler,
>  	},
>  #endif
> +#ifdef CONFIG_MEMORY_FAILURE
> +	{
> +		.ctl_name	= CTL_UNNUMBERED,
> +		.procname	= "memory_failure_early_kill",
> +		.data		= &sysctl_memory_failure_early_kill,
> +		.maxlen		= sizeof(vm_highmem_is_dirtyable),
> +		.mode		= 0644,
> +		.proc_handler	= &proc_dointvec_minmax,
> +		.strategy	= &sysctl_intvec,
> +		.extra1		= &zero,
> +		.extra2		= &one,
> +	},
> +#endif
> +
>  /*
>   * NOTE: do not add new entries to this table unless you have read
>   * Documentation/sysctl/ctl_unnumbered.txt
> Index: linux/fs/proc/meminfo.c
> ===================================================================
> --- linux.orig/fs/proc/meminfo.c	2009-05-27 21:23:18.000000000 +0200
> +++ linux/fs/proc/meminfo.c	2009-05-27 21:24:39.000000000 +0200
> @@ -97,7 +97,11 @@
>  		"Committed_AS:   %8lu kB\n"
>  		"VmallocTotal:   %8lu kB\n"
>  		"VmallocUsed:    %8lu kB\n"
> -		"VmallocChunk:   %8lu kB\n",
> +		"VmallocChunk:   %8lu kB\n"
> +#ifdef CONFIG_MEMORY_FAILURE
> +		"BadPages:       %8lu kB\n"
> +#endif
> +		,
>  		K(i.totalram),
>  		K(i.freeram),
>  		K(i.bufferram),
> @@ -144,6 +148,9 @@
>  		(unsigned long)VMALLOC_TOTAL >> 10,
>  		vmi.used >> 10,
>  		vmi.largest_chunk >> 10
> +#ifdef CONFIG_MEMORY_FAILURE
> +		,atomic_long_read(&mce_bad_pages) << (PAGE_SHIFT - 10)
> +#endif
>  		);
>  
>  	hugetlb_report_meminfo(m);
> Index: linux/mm/Kconfig
> ===================================================================
> --- linux.orig/mm/Kconfig	2009-05-27 21:23:18.000000000 +0200
> +++ linux/mm/Kconfig	2009-05-27 21:24:39.000000000 +0200
> @@ -226,6 +226,9 @@
>  config MMU_NOTIFIER
>  	bool
>  
> +config MEMORY_FAILURE
> +	bool
> +
>  config NOMMU_INITIAL_TRIM_EXCESS
>  	int "Turn on mmap() excess space trimming before booting"
>  	depends on !MMU
> Index: linux/Documentation/sysctl/vm.txt
> ===================================================================
> --- linux.orig/Documentation/sysctl/vm.txt	2009-05-27 21:23:18.000000000 +0200
> +++ linux/Documentation/sysctl/vm.txt	2009-05-27 21:24:39.000000000 +0200
> @@ -32,6 +32,7 @@
>  - legacy_va_layout
>  - lowmem_reserve_ratio
>  - max_map_count
> +- memory_failure_early_kill
>  - min_free_kbytes
>  - min_slab_ratio
>  - min_unmapped_ratio
> @@ -53,7 +54,6 @@
>  - vfs_cache_pressure
>  - zone_reclaim_mode
>  
> -
>  ==============================================================
>  
>  block_dump
> @@ -275,6 +275,25 @@
>  
>  The default value is 65536.
>  
> +=============================================================
> +
> +memory_failure_early_kill:
> +
> +Control how to kill processes when uncorrected memory error (typically
> +a 2bit error in a memory module) is detected in the background by hardware.
> +
> +1: Kill all processes that have the corrupted page mapped as soon as the
> +corruption is detected.
> +
> +0: Only unmap the page from all processes and only kill a process
> +who tries to access it.
> +
> +The kill is done using a catchable SIGBUS, so processes can handle this
> +if they want to.
> +
> +This is only active on architectures/platforms with advanced machine
> +check handling and depends on the hardware capabilities.
> +
>  ==============================================================
>  
>  min_free_kbytes:
> Index: linux/arch/x86/mm/fault.c
> ===================================================================
> --- linux.orig/arch/x86/mm/fault.c	2009-05-27 21:24:39.000000000 +0200
> +++ linux/arch/x86/mm/fault.c	2009-05-27 21:24:39.000000000 +0200
> @@ -851,8 +851,9 @@
>  
>  #ifdef CONFIG_MEMORY_FAILURE
>  	if (fault & VM_FAULT_HWPOISON) {
> -		printk(KERN_ERR "MCE: Killing %s:%d due to hardware memory corruption\n",
> -			tsk->comm, tsk->pid);
> +		printk(KERN_ERR
> +       "MCE: Killing %s:%d for accessing hardware corrupted memory at %#lx\n",
> +			tsk->comm, tsk->pid, address);
>  		code = BUS_MCEERR_AR;
>  	}
>  #endif

^ permalink raw reply	[flat|nested] 232+ messages in thread

* Re: [PATCH] [13/16] HWPOISON: The high level memory error handler in the VM v3
@ 2009-05-28  8:26     ` Nick Piggin
  0 siblings, 0 replies; 232+ messages in thread
From: Nick Piggin @ 2009-05-28  8:26 UTC (permalink / raw)
  To: Andi Kleen
  Cc: hugh, riel, akpm, chris.mason, linux-kernel, linux-mm, fengguang.wu

On Wed, May 27, 2009 at 10:12:39PM +0200, Andi Kleen wrote:
> 
> This patch adds the high level memory handler that poisons pages
> that got corrupted by hardware (typically by a bit flip in a DIMM
> or a cache) at the Linux level. Linux then tries to avoid accessing
> these pages in the future.

Quick review.

> Index: linux/mm/Makefile
> ===================================================================
> --- linux.orig/mm/Makefile	2009-05-27 21:23:18.000000000 +0200
> +++ linux/mm/Makefile	2009-05-27 21:24:39.000000000 +0200
> @@ -38,3 +38,4 @@
>  endif
>  obj-$(CONFIG_QUICKLIST) += quicklist.o
>  obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o page_cgroup.o
> +obj-$(CONFIG_MEMORY_FAILURE) += memory-failure.o
> Index: linux/mm/memory-failure.c
> ===================================================================
> --- /dev/null	1970-01-01 00:00:00.000000000 +0000
> +++ linux/mm/memory-failure.c	2009-05-27 21:28:19.000000000 +0200
> @@ -0,0 +1,677 @@
> +/*
> + * Copyright (C) 2008, 2009 Intel Corporation
> + * Author: Andi Kleen
> + *
> + * This software may be redistributed and/or modified under the terms of
> + * the GNU General Public License ("GPL") version 2 only as published by the
> + * Free Software Foundation.
> + *
> + * High level machine check handler. Handles pages reported by the
> + * hardware as being corrupted usually due to a 2bit ECC memory or cache
> + * failure.
> + *
> + * This focuses on pages detected as corrupted in the background.
> + * When the current CPU tries to consume corruption the currently
> + * running process can just be killed directly instead. This implies
> + * that if the error cannot be handled for some reason it's safe to
> + * just ignore it because no corruption has been consumed yet. Instead
> + * when that happens another machine check will happen.
> + *
> + * Handles page cache pages in various states.	The tricky part
> + * here is that we can access any page asynchronous to other VM
> + * users, because memory failures could happen anytime and anywhere,
> + * possibly violating some of their assumptions. This is why this code
> + * has to be extremely careful. Generally it tries to use normal locking
> + * rules, as in get the standard locks, even if that means the
> + * error handling takes potentially a long time.
> + *
> + * The operation to map back from RMAP chains to processes has to walk
> + * the complete process list and has nonlinear complexity with the number
> + * of mappings. In short it can be quite slow. But since memory corruptions
> + * are rare we hope to get away with this.
> + */
> +
> +/*
> + * Notebook:
> + * - hugetlb needs more code
> + * - nonlinear
> + * - remap races
> + * - anonymous (tinject):
> + *   + left over references when process catches signal?
> + * - kcore/oldmem/vmcore/mem/kmem check for hwpoison pages
> + * - pass bad pages to kdump next kernel
> + */
> +#include <linux/kernel.h>
> +#include <linux/mm.h>
> +#include <linux/page-flags.h>
> +#include <linux/sched.h>
> +#include <linux/rmap.h>
> +#include <linux/pagemap.h>
> +#include <linux/swap.h>
> +#include <linux/backing-dev.h>
> +#include "internal.h"
> +
> +#define Dprintk(x...) printk(x)
> +
> +int sysctl_memory_failure_early_kill __read_mostly = 1;
> +
> +atomic_long_t mce_bad_pages __read_mostly = ATOMIC_LONG_INIT(0);
> +
> +/*
> + * Send all the processes who have the page mapped an ``action optional''
> + * signal.
> + */
> +static int kill_proc_ao(struct task_struct *t, unsigned long addr, int trapno,
> +			unsigned long pfn)
> +{
> +	struct siginfo si;
> +	int ret;
> +
> +	printk(KERN_ERR
> +		"MCE %#lx: Killing %s:%d due to hardware memory corruption\n",
> +		pfn, t->comm, t->pid);
> +	si.si_signo = SIGBUS;
> +	si.si_errno = 0;
> +	si.si_code = BUS_MCEERR_AO;
> +	si.si_addr = (void *)addr;
> +#ifdef __ARCH_SI_TRAPNO
> +	si.si_trapno = trapno;
> +#endif
> +	si.si_addr_lsb = PAGE_SHIFT;
> +	/*
> +	 * Don't use force here, it's convenient if the signal
> +	 * can be temporarily blocked.
> +	 * This could cause a loop when the user sets SIGBUS
> +	 * to SIG_IGN, but hopefully no one will do that?
> +	 */
> +	ret = send_sig_info(SIGBUS, &si, t);  /* synchronous? */
> +	if (ret < 0)
> +		printk(KERN_INFO "MCE: Error sending signal to %s:%d: %d\n",
> +		       t->comm, t->pid, ret);
> +	return ret;
> +}
> +
> +/*
> + * Kill all processes that have a poisoned page mapped and then isolate
> + * the page.
> + *
> + * General strategy:
> + * Find all processes having the page mapped and kill them.
> + * But we keep a page reference around so that the page is not
> + * actually freed yet.
> + * Then stash the page away
> + *
> + * There's no convenient way to get back to mapped processes
> + * from the VMAs. So do a brute-force search over all
> + * running processes.
> + *
> + * Remember that machine checks are not common (or rather
> + * if they are common you have other problems), so this shouldn't
> + * be a performance issue.
> + *
> + * Also there are some races possible while we get from the
> + * error detection to actually handle it.
> + */
> +
> +struct to_kill {
> +	struct list_head nd;
> +	struct task_struct *tsk;
> +	unsigned long addr;
> +};

It would be kinda nice to have a field in task_struct that is usable
say for anyone holding the tasklist lock for write. Then you could
make a list with them. But I guess it isn't worthwhile unless there
are other users.

> +
> +/*
> + * Failure handling: if we can't find or can't kill a process there's
> + * not much we can do.	We just print a message and ignore otherwise.
> + */
> +
> +/*
> + * Schedule a process for later kill.
> + * Uses GFP_ATOMIC allocations to avoid potential recursions in the VM.
> + * TBD would GFP_NOIO be enough?
> + */
> +static void add_to_kill(struct task_struct *tsk, struct page *p,
> +		       struct vm_area_struct *vma,
> +		       struct list_head *to_kill,
> +		       struct to_kill **tkc)
> +{
> +	int fail = 0;
> +	struct to_kill *tk;
> +
> +	if (*tkc) {
> +		tk = *tkc;
> +		*tkc = NULL;
> +	} else {
> +		tk = kmalloc(sizeof(struct to_kill), GFP_ATOMIC);
> +		if (!tk) {
> +			printk(KERN_ERR "MCE: Out of memory while machine check handling\n");
> +			return;
> +		}
> +	}
> +	tk->addr = page_address_in_vma(p, vma);
> +	if (tk->addr == -EFAULT) {
> +		printk(KERN_INFO "MCE: Failed to get address in VMA\n");

I don't know if this is a very helpful message. It could legitimately
happen and there's nothing anybody can do about it...

> +		tk->addr = 0;
> +		fail = 1;

Fail doesn't seem to be used anywhere.


> +	}
> +	get_task_struct(tsk);
> +	tk->tsk = tsk;
> +	list_add_tail(&tk->nd, to_kill);
> +}
> +
> +/*
> + * Kill the processes that have been collected earlier.
> + */
> +static void kill_procs_ao(struct list_head *to_kill, int doit, int trapno,
> +			  int fail, unsigned long pfn)

I guess "doit" etc. is obvious once you read the code and the caller,
but maybe add a quick comment in the header to describe them?

> +{
> +	struct to_kill *tk, *next;
> +
> +	list_for_each_entry_safe (tk, next, to_kill, nd) {
> +		if (doit) {
> +			/*
> +			 * In case something went wrong with munmaping
> +			 * make sure the process doesn't catch the
> +			 * signal and then access the memory. So reset
> +			 * the signal handlers
> +			 */
> +			if (fail)
> +				flush_signal_handlers(tk->tsk, 1);

Is this a legitimate thing to do? Is it racy? Why would you not send a
SIGKILL or something if you want them to die right now?


> +
> +			/*
> +			 * In theory the process could have mapped
> +			 * something else on the address in-between. We could
> +			 * check for that, but we need to tell the
> +			 * process anyways.
> +			 */
> +			if (kill_proc_ao(tk->tsk, tk->addr, trapno, pfn) < 0)
> +				printk(KERN_ERR
> +		"MCE %#lx: Cannot send advisory machine check signal to %s:%d\n",
> +					pfn, tk->tsk->comm, tk->tsk->pid);
> +		}
> +		put_task_struct(tk->tsk);
> +		kfree(tk);
> +	}
> +}
> +
> +/*
> + * Collect processes when the error hit an anonymous page.
> + */
> +static void collect_procs_anon(struct page *page, struct list_head *to_kill,
> +			      struct to_kill **tkc)
> +{
> +	struct vm_area_struct *vma;
> +	struct task_struct *tsk;
> +	struct anon_vma *av = page_lock_anon_vma(page);
> +
> +	if (av == NULL)	/* Not actually mapped anymore */
> +		return;
> +
> +	read_lock(&tasklist_lock);
> +	for_each_process (tsk) {
> +		if (!tsk->mm)
> +			continue;
> +		list_for_each_entry (vma, &av->head, anon_vma_node) {
> +			if (vma->vm_mm == tsk->mm)
> +				add_to_kill(tsk, page, vma, to_kill, tkc);
> +		}
> +	}
> +	page_unlock_anon_vma(av);
> +	read_unlock(&tasklist_lock);
> +}
> +
> +/*
> + * Collect processes when the error hit a file mapped page.
> + */
> +static void collect_procs_file(struct page *page, struct list_head *to_kill,
> +			      struct to_kill **tkc)
> +{
> +	struct vm_area_struct *vma;
> +	struct task_struct *tsk;
> +	struct prio_tree_iter iter;
> +	struct address_space *mapping = page_mapping(page);
> +
> +	read_lock(&tasklist_lock);
> +	spin_lock(&mapping->i_mmap_lock);

You have tasklist_lock(R) nesting outside i_mmap_lock, and inside anon_vma
lock. And anon_vma lock nests inside i_mmap_lock.

This seems fragile. If rwlocks ever become FIFO or tasklist_lock changes
type (maybe -rt kernels do it), then you could have a task holding
anon_vma lock and waiting for tasklist_lock, and another holding tasklist
lock and waiting for i_mmap_lock, and another holding i_mmap_lock and
waiting for anon_vma lock.

I think nesting either inside or outside these locks consistently is less
fragile. Do we already have a dependency?... I don't know of one, but you
should document this in mm/rmap.c and mm/filemap.c.


> +	for_each_process(tsk) {
> +		pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
> +
> +		if (!tsk->mm)
> +			continue;
> +
> +		vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff,
> +				      pgoff)
> +			if (vma->vm_mm == tsk->mm)
> +				add_to_kill(tsk, page, vma, to_kill, tkc);
> +	}
> +	spin_unlock(&mapping->i_mmap_lock);
> +	read_unlock(&tasklist_lock);
> +}
> +
> +/*
> + * Collect the processes who have the corrupted page mapped to kill.
> + * This is done in two steps for locking reasons.
> + * First preallocate one tokill structure outside the spin locks,
> + * so that we can kill at least one process reasonably reliably.
> + */
> +static void collect_procs(struct page *page, struct list_head *tokill)
> +{
> +	struct to_kill *tk;
> +
> +	tk = kmalloc(sizeof(struct to_kill), GFP_KERNEL);
> +	/* memory allocation failure is implicitly handled */

Well... it's explicitly handled... in the callee ;)


> +	if (PageAnon(page))
> +		collect_procs_anon(page, tokill, &tk);
> +	else
> +		collect_procs_file(page, tokill, &tk);
> +	kfree(tk);
> +}
> +
> +/*
> + * Error handlers for various types of pages.
> + */
> +
> +enum outcome {
> +	FAILED,
> +	DELAYED,
> +	IGNORED,
> +	RECOVERED,
> +};
> +
> +static const char *action_name[] = {
> +	[FAILED] = "Failed",
> +	[DELAYED] = "Delayed",
> +	[IGNORED] = "Ignored",

How is delayed different to ignored (or failed, for that matter)?


> +	[RECOVERED] = "Recovered",

And what does recovered mean? The processes were killed and the page
taken out of circulation, but the machine is still in some unknown state
of corruption henceforth, right?


> +};
> +
> +/*
> + * Error hit kernel page.
> + * Do nothing, try to be lucky and not touch this instead. For a few cases we
> + * could be more sophisticated.
> + */
> +static int me_kernel(struct page *p)
> +{
> +	return DELAYED;
> +}
> +
> +/*
> + * Already poisoned page.
> + */
> +static int me_ignore(struct page *p)
> +{
> +	return IGNORED;
> +}
> +
> +/*
> + * Page in unknown state. Do nothing.
> + */
> +static int me_unknown(struct page *p)
> +{
> +	printk(KERN_ERR "MCE %#lx: Unknown page state\n", page_to_pfn(p));
> +	return FAILED;
> +}
> +
> +/*
> + * Free memory
> + */
> +static int me_free(struct page *p)
> +{
> +	return DELAYED;
> +}
> +
> +/*
> + * Clean (or cleaned) page cache page.
> + */
> +static int me_pagecache_clean(struct page *p)
> +{
> +	if (!isolate_lru_page(p))
> +		page_cache_release(p);
> +
> +	if (page_has_private(p))
> +		do_invalidatepage(p, 0);
> +	if (page_has_private(p) && !try_to_release_page(p, GFP_NOIO))
> +		Dprintk(KERN_ERR "MCE %#lx: failed to release buffers\n",
> +			page_to_pfn(p));
> +
> +	/*
> +	 * remove_from_page_cache assumes (mapping && !mapped)
> +	 */
> +	if (page_mapping(p) && !page_mapped(p)) {
> +		remove_from_page_cache(p);
> +		page_cache_release(p);
> +	}

remove_mapping would probably be a better idea. Otherwise you can
probably introduce pagecache removal vs page fault races which
will make the kernel BUG.


> +
> +	return RECOVERED;
> +}
> +
> +/*
> + * Dirty page cache page
> + * Issues: when the error hit a hole page the error is not properly
> + * propagated.
> + */
> +static int me_pagecache_dirty(struct page *p)
> +{
> +	struct address_space *mapping = page_mapping(p);
> +
> +	SetPageError(p);
> +	/* TBD: print more information about the file. */
> +	printk(KERN_ERR "MCE %#lx: Hardware memory corruption on dirty file page: write error\n",
> +			page_to_pfn(p));
> +	if (mapping) {
> +		/*
> +		 * Truncate does the same, but we're not quite the same
> +		 * as truncate. Needs more checking, but keep it for now.
> +		 */

What's different about truncate? It would be good to reuse as much as possible.


> +		cancel_dirty_page(p, PAGE_CACHE_SIZE);
> +
> +		/*
> +		 * IO error will be reported by write(), fsync(), etc.
> +		 * who check the mapping.
> +		 */
> +		mapping_set_error(mapping, EIO);

Interesting. It's not *exactly* an IO error (well, not like one we're usually
used to).


> +	}
> +
> +	me_pagecache_clean(p);
> +
> +	/*
> +	 * Did the earlier release work?
> +	 */
> +	if (page_has_private(p) && !try_to_release_page(p, GFP_NOIO))
> +		return FAILED;
> +
> +	return RECOVERED;
> +}
> +
> +/*
> + * Clean and dirty swap cache.
> + */
> +static int me_swapcache_dirty(struct page *p)
> +{
> +	ClearPageDirty(p);
> +
> +	if (!isolate_lru_page(p))
> +		page_cache_release(p);
> +
> +	return DELAYED;
> +}
> +
> +static int me_swapcache_clean(struct page *p)
> +{
> +	ClearPageUptodate(p);
> +
> +	if (!isolate_lru_page(p))
> +		page_cache_release(p);
> +
> +	delete_from_swap_cache(p);
> +
> +	return RECOVERED;
> +}

All these handlers are quite interesting in that they need to
know about most of the mm. It would be a good idea to say what
you are trying to do in each of them, and they should probably
go into their appropriate files instead of all being here
(eg. the swapcache stuff should go in mm/swap_state.c, for example).

You haven't waited on writeback here AFAIKS, and have you
*really* verified it is safe to call delete_from_swap_cache?



> +/*
> + * Huge pages. Needs work.
> + * Issues:
> + * No rmap support so we cannot find the original mapper. In theory could walk
> + * all MMs and look for the mappings, but that would be non atomic and racy.
> + * Need rmap for hugepages for this. Alternatively we could employ a heuristic,
> + * like just walking the current process and hoping it has it mapped (that
> + * should be usually true for the common "shared database cache" case)
> + * Should handle free huge pages and dequeue them too, but this needs to
> + * handle huge page accounting correctly.
> + */
> +static int me_huge_page(struct page *p)
> +{
> +	return FAILED;
> +}
> +
> +/*
> + * Various page states we can handle.
> + *
> + * A page state is defined by its current page->flags bits.
> + * The table matches them in order and calls the right handler.
> + *
> + * This is quite tricky because we can access the page at any time
> + * in its life cycle, so all accesses have to be extremely careful.
> + *
> + * This is not complete. More states could be added.
> + * For any missing state don't attempt recovery.
> + */
> +
> +#define dirty		(1UL << PG_dirty)
> +#define swapcache	(1UL << PG_swapcache)
> +#define unevict		(1UL << PG_unevictable)
> +#define mlocked		(1UL << PG_mlocked)
> +#define writeback	(1UL << PG_writeback)
> +#define lru		(1UL << PG_lru)
> +#define swapbacked	(1UL << PG_swapbacked)
> +#define head		(1UL << PG_head)
> +#define tail		(1UL << PG_tail)
> +#define compound	(1UL << PG_compound)
> +#define slab		(1UL << PG_slab)
> +#define buddy		(1UL << PG_buddy)
> +#define reserved	(1UL << PG_reserved)

This looks like more work than just putting 1UL << (...) in each entry
in your table. Hmm, does this whole table thing even buy you much
(versus a much simpler switch statement)?

And seeing as you are doing a lot of checking for various page flags anyway,
(eg. in your prepare function). Just seems like needless complexity.

> +
> +/*
> + * The table is > 80 columns because all the alternatives were much worse.
> + */
> +
> +static struct page_state {
> +	unsigned long mask;
> +	unsigned long res;
> +	char *msg;
> +	int (*action)(struct page *p);
> +} error_states[] = {
> +	{ reserved,	reserved,	"reserved kernel",	me_ignore },
> +	{ buddy,	buddy,		"free kernel",		me_free },
> +
> +	/*
> +	 * Could in theory check if slab page is free or if we can drop
> +	 * currently unused objects without touching them. But just
> +	 * treat it as standard kernel for now.
> +	 */
> +	{ slab,			slab,		"kernel slab",		me_kernel },
> +
> +#ifdef CONFIG_PAGEFLAGS_EXTENDED
> +	{ head,			head,		"hugetlb",		me_huge_page },
> +	{ tail,			tail,		"hugetlb",		me_huge_page },
> +#else
> +	{ compound,		compound,	"hugetlb",		me_huge_page },
> +#endif
> +
> +	{ swapcache|dirty,	swapcache|dirty,"dirty swapcache",	me_swapcache_dirty },
> +	{ swapcache|dirty,	swapcache,	"clean swapcache",	me_swapcache_clean },
> +
> +#ifdef CONFIG_UNEVICTABLE_LRU
> +	{ unevict|dirty,	unevict|dirty,	"unevictable dirty lru", me_pagecache_dirty },
> +	{ unevict,		unevict,	"unevictable lru",	me_pagecache_clean },
> +#endif
> +
> +#ifdef CONFIG_HAVE_MLOCKED_PAGE_BIT
> +	{ mlocked|dirty,	mlocked|dirty,	"mlocked dirty lru",	me_pagecache_dirty },
> +	{ mlocked,		mlocked,	"mlocked lru",		me_pagecache_clean },
> +#endif
> +
> +	{ lru|dirty,		lru|dirty,	"dirty lru",		me_pagecache_dirty },
> +	{ lru|dirty,		lru,		"clean lru",		me_pagecache_clean },
> +	{ swapbacked,		swapbacked,	"anonymous",		me_pagecache_clean },
> +
> +	/*
> +	 * Add more states here.
> +	 */
> +
> +	/*
> +	 * Catchall entry: must be at end.
> +	 */
> +	{ 0,			0,		"unknown page state",	me_unknown },
> +};
> +
> +static void page_action(char *msg, struct page *p, int (*action)(struct page *),
> +			unsigned long pfn)
> +{
> +	int ret;
> +
> +	printk(KERN_ERR "MCE %#lx: %s page recovery: starting\n", pfn, msg);
> +	ret = action(p);
> +	printk(KERN_ERR "MCE %#lx: %s page recovery: %s\n",
> +	       pfn, msg, action_name[ret]);
> +	if (page_count(p) != 1)
> +		printk(KERN_ERR
> +		       "MCE %#lx: %s page still referenced by %d users\n",
> +		       pfn, msg, page_count(p) - 1);
> +
> +	/* Could do more checks here if page looks ok */
> +	atomic_long_add(1, &mce_bad_pages);
> +
> +	/*
> +	 * Could adjust zone counters here to correct for the missing page.
> +	 */
> +}
> +
> +#define N_UNMAP_TRIES 5
> +
> +static void hwpoison_page_prepare(struct page *p, unsigned long pfn,
> +				  int trapno)
> +{
> +	enum ttu_flags ttu = TTU_UNMAP| TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS;
> +	int kill = sysctl_memory_failure_early_kill;
> +	struct address_space *mapping;
> +	LIST_HEAD(tokill);
> +	int ret;
> +	int i;
> +
> +	if (PageReserved(p) || PageCompound(p) || PageSlab(p))
> +		return;
> +
> +	if (!PageLRU(p))
> +		lru_add_drain();
> +
> +	/*
> +	 * This check implies we don't kill processes if their pages
> +	 * are in the swap cache early. Those are always late kills.
> +	 */
> +	if (!page_mapped(p))
> +		return;
> +
> +	if (PageSwapCache(p)) {
> +		printk(KERN_ERR
> +		       "MCE %#lx: keeping poisoned page in swap cache\n", pfn);
> +		ttu |= TTU_IGNORE_HWPOISON;
> +	}
> +
> +	/*
> +	 * Poisoned clean file pages are harmless, the
> +	 * data can be restored by regular page faults.
> +	 */
> +	mapping = page_mapping(p);
> +	if (!PageDirty(p) && !PageWriteback(p) &&
> +	    !PageAnon(p) && !PageSwapBacked(p) &&
> +	    mapping && mapping_cap_account_dirty(mapping)) {
> +		if (page_mkclean(p))
> +			SetPageDirty(p);
> +		else {
> +			kill = 0;
> +			ttu |= TTU_IGNORE_HWPOISON;
> +		}
> +	}
> +
> +	/*
> +	 * First collect all the processes that have the page
> +	 * mapped.  This has to be done before try_to_unmap,
> +	 * because ttu takes the rmap data structures down.
> +	 *
> +	 * This also has the side effect to propagate the dirty
> +	 * bit from PTEs into the struct page. This is needed
> +	 * to actually decide if something needs to be killed
> +	 * or errored, or if it's ok to just drop the page.
> +	 *
> +	 * Error handling: We ignore errors here because
> +	 * there's nothing that can be done.
> +	 *
> +	 * RED-PEN some cases in process exit seem to deadlock
> +	 * on the page lock. drop it or add poison checks?
> +	 */
> +	if (kill)
> +		collect_procs(p, &tokill);
> +
> +	/*
> +	 * try_to_unmap can fail temporarily due to races.
> +	 * Try a few times (RED-PEN better strategy?)
> +	 */
> +	for (i = 0; i < N_UNMAP_TRIES; i++) {
> +		ret = try_to_unmap(p, ttu);
> +		if (ret == SWAP_SUCCESS)
> +			break;
> +		Dprintk("MCE %#lx: try_to_unmap retry needed %d\n", pfn,  ret);
> +	}
> +
> +	/*
> +	 * Now that the dirty bit has been propagated to the
> +	 * struct page and all unmaps done we can decide if
> +	 * killing is needed or not.  Only kill when the page
> +	 * was dirty, otherwise the tokill list is merely
> +	 * freed.  When there was a problem unmapping earlier
> +	 * use a more forceful uncatchable kill to prevent
> +	 * any accesses to the poisoned memory.
> +	 */
> +	kill_procs_ao(&tokill, !!PageDirty(p), trapno,
> +		      ret != SWAP_SUCCESS, pfn);
> +}
> +
> +/**
> + * memory_failure - Handle memory failure of a page.
> + *
> + */
> +void memory_failure(unsigned long pfn, int trapno)
> +{
> +	struct page_state *ps;
> +	struct page *p;
> +
> +	if (!pfn_valid(pfn)) {
> +		printk(KERN_ERR
> +   "MCE %#lx: Hardware memory corruption in memory outside kernel control\n",
> +		       pfn);
> +		return;
> +	}
> +
> +
> +	p = pfn_to_page(pfn);
> +	if (TestSetPageHWPoison(p)) {
> +		printk(KERN_ERR "MCE %#lx: Error for already hardware poisoned page\n", pfn);
> +		return;
> +	}
> +
> +	/*
> +	 * We need/can do nothing about count=0 pages.
> +	 * 1) it's a free page, and therefore in safe hand:
> +	 *    prep_new_page() will be the gate keeper.
> +	 * 2) it's part of a non-compound high order page.
> +	 *    Implies some kernel user: cannot stop them from
> +	 *    R/W the page; let's pray that the page has been
> +	 *    used and will be freed some time later.
> +	 * In fact it's dangerous to directly bump up page count from 0,
> +	 * that may make page_freeze_refs()/page_unfreeze_refs() mismatch.
> +	 */
> +	if (!get_page_unless_zero(compound_head(p))) {
> +		printk(KERN_ERR
> +		       "MCE 0x%lx: ignoring free or high order page\n", pfn);
> +		return;
> +	}
> +
> +	lock_page_nosync(p);
> +	hwpoison_page_prepare(p, pfn, trapno);
> +
> +	/* Torn down by someone else? */
> +	if (PageLRU(p) && !PageSwapCache(p) && p->mapping == NULL) {
> +		printk(KERN_ERR
> +		       "MCE %#lx: ignoring NULL mapping LRU page\n", pfn);
> +		goto out;
> +	}
> +
> +	for (ps = error_states;; ps++) {
> +		if ((p->flags & ps->mask) == ps->res) {
> +			page_action(ps->msg, p, ps->action, pfn);
> +			break;
> +		}
> +	}
> +out:
> +	unlock_page(p);
> +}
> Index: linux/include/linux/mm.h
> ===================================================================
> --- linux.orig/include/linux/mm.h	2009-05-27 21:24:39.000000000 +0200
> +++ linux/include/linux/mm.h	2009-05-27 21:24:39.000000000 +0200
> @@ -1322,6 +1322,10 @@
>  
>  extern void *alloc_locked_buffer(size_t size);
>  extern void free_locked_buffer(void *buffer, size_t size);
> +
> +extern void memory_failure(unsigned long pfn, int trapno);
> +extern int sysctl_memory_failure_early_kill;
> +extern atomic_long_t mce_bad_pages;
>  extern void release_locked_buffer(void *buffer, size_t size);
>  #endif /* __KERNEL__ */
>  #endif /* _LINUX_MM_H */
> Index: linux/kernel/sysctl.c
> ===================================================================
> --- linux.orig/kernel/sysctl.c	2009-05-27 21:23:18.000000000 +0200
> +++ linux/kernel/sysctl.c	2009-05-27 21:24:39.000000000 +0200
> @@ -1282,6 +1282,20 @@
>  		.proc_handler	= &scan_unevictable_handler,
>  	},
>  #endif
> +#ifdef CONFIG_MEMORY_FAILURE
> +	{
> +		.ctl_name	= CTL_UNNUMBERED,
> +		.procname	= "memory_failure_early_kill",
> +		.data		= &sysctl_memory_failure_early_kill,
> +		.maxlen		= sizeof(sysctl_memory_failure_early_kill),
> +		.mode		= 0644,
> +		.proc_handler	= &proc_dointvec_minmax,
> +		.strategy	= &sysctl_intvec,
> +		.extra1		= &zero,
> +		.extra2		= &one,
> +	},
> +#endif
> +
>  /*
>   * NOTE: do not add new entries to this table unless you have read
>   * Documentation/sysctl/ctl_unnumbered.txt
> Index: linux/fs/proc/meminfo.c
> ===================================================================
> --- linux.orig/fs/proc/meminfo.c	2009-05-27 21:23:18.000000000 +0200
> +++ linux/fs/proc/meminfo.c	2009-05-27 21:24:39.000000000 +0200
> @@ -97,7 +97,11 @@
>  		"Committed_AS:   %8lu kB\n"
>  		"VmallocTotal:   %8lu kB\n"
>  		"VmallocUsed:    %8lu kB\n"
> -		"VmallocChunk:   %8lu kB\n",
> +		"VmallocChunk:   %8lu kB\n"
> +#ifdef CONFIG_MEMORY_FAILURE
> +		"BadPages:       %8lu kB\n"
> +#endif
> +		,
>  		K(i.totalram),
>  		K(i.freeram),
>  		K(i.bufferram),
> @@ -144,6 +148,9 @@
>  		(unsigned long)VMALLOC_TOTAL >> 10,
>  		vmi.used >> 10,
>  		vmi.largest_chunk >> 10
> +#ifdef CONFIG_MEMORY_FAILURE
> +		,atomic_long_read(&mce_bad_pages) << (PAGE_SHIFT - 10)
> +#endif
>  		);
>  
>  	hugetlb_report_meminfo(m);
> Index: linux/mm/Kconfig
> ===================================================================
> --- linux.orig/mm/Kconfig	2009-05-27 21:23:18.000000000 +0200
> +++ linux/mm/Kconfig	2009-05-27 21:24:39.000000000 +0200
> @@ -226,6 +226,9 @@
>  config MMU_NOTIFIER
>  	bool
>  
> +config MEMORY_FAILURE
> +	bool
> +
>  config NOMMU_INITIAL_TRIM_EXCESS
>  	int "Turn on mmap() excess space trimming before booting"
>  	depends on !MMU
> Index: linux/Documentation/sysctl/vm.txt
> ===================================================================
> --- linux.orig/Documentation/sysctl/vm.txt	2009-05-27 21:23:18.000000000 +0200
> +++ linux/Documentation/sysctl/vm.txt	2009-05-27 21:24:39.000000000 +0200
> @@ -32,6 +32,7 @@
>  - legacy_va_layout
>  - lowmem_reserve_ratio
>  - max_map_count
> +- memory_failure_early_kill
>  - min_free_kbytes
>  - min_slab_ratio
>  - min_unmapped_ratio
> @@ -53,7 +54,6 @@
>  - vfs_cache_pressure
>  - zone_reclaim_mode
>  
> -
>  ==============================================================
>  
>  block_dump
> @@ -275,6 +275,25 @@
>  
>  The default value is 65536.
>  
> +=============================================================
> +
> +memory_failure_early_kill:
> +
> +Control how to kill processes when an uncorrected memory error (typically
> +a 2-bit error in a memory module) is detected in the background by hardware.
> +
> +1: Kill all processes that have the corrupted page mapped as soon as the
> +corruption is detected.
> +
> +0: Only unmap the page from all processes and only kill a process
> +that tries to access it.
> +
> +The kill is done using a catchable SIGBUS, so processes can handle this
> +if they want to.
> +
> +This is only active on architectures/platforms with advanced machine
> +check handling and depends on the hardware capabilities.
> +
>  ==============================================================
>  
>  min_free_kbytes:
> Index: linux/arch/x86/mm/fault.c
> ===================================================================
> --- linux.orig/arch/x86/mm/fault.c	2009-05-27 21:24:39.000000000 +0200
> +++ linux/arch/x86/mm/fault.c	2009-05-27 21:24:39.000000000 +0200
> @@ -851,8 +851,9 @@
>  
>  #ifdef CONFIG_MEMORY_FAILURE
>  	if (fault & VM_FAULT_HWPOISON) {
> -		printk(KERN_ERR "MCE: Killing %s:%d due to hardware memory corruption\n",
> -			tsk->comm, tsk->pid);
> +		printk(KERN_ERR
> +       "MCE: Killing %s:%d for accessing hardware corrupted memory at %#lx\n",
> +			tsk->comm, tsk->pid, address);
>  		code = BUS_MCEERR_AR;
>  	}
>  #endif

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href="mailto:dont@kvack.org">email@kvack.org</a>


* Re: [PATCH] [9/16] HWPOISON: Use bitmask/action code for try_to_unmap behaviour
  2009-05-28  8:03       ` Andi Kleen
@ 2009-05-28  8:28         ` Nick Piggin
  -1 siblings, 0 replies; 232+ messages in thread
From: Nick Piggin @ 2009-05-28  8:28 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Lee.Schermerhorn, akpm, linux-kernel, linux-mm, fengguang.wu

On Thu, May 28, 2009 at 10:03:19AM +0200, Andi Kleen wrote:
> On Thu, May 28, 2009 at 09:27:03AM +0200, Nick Piggin wrote:
> > Not a bad idea, but I would prefer to have a set of flags which tell
> > try_to_unmap what to do, and then combine them with #defines for
> > callers. Like gfp flags.
> 
> That's exactly what the patch does?

There is a set of "actions", which correspond to the callers, and a
set of modifiers. Just make it all modifiers, and the callers can
use combinations |'d together.

 
> It just has actions and flags because the actions can be contradictory.
> 
> > And just use regular bitops rather than this TTU_ACTION macro.
> 
> TTU_ACTION does mask against multiple bits. None of the regular
> bitops do that. 

&, |  ?


* Re: [PATCH] [4/16] HWPOISON: Add support for poison swap entries v2
  2009-05-27 20:12   ` Andi Kleen
@ 2009-05-28  8:46     ` Hidehiro Kawai
  -1 siblings, 0 replies; 232+ messages in thread
From: Hidehiro Kawai @ 2009-05-28  8:46 UTC (permalink / raw)
  To: Andi Kleen
  Cc: akpm, linux-kernel, linux-mm, fengguang.wu, Satoshi OSHIMA,
	Taketoshi Sakuraba

Andi Kleen wrote:

> CPU migration uses special swap entry types to trigger special actions on page
> faults. Extend this mechanism to also support poisoned swap entries, to trigger
> poison handling on page faults. This allows follow-on patches to prevent
> processes from faulting in poisoned pages again.
> 
> v2: Fix overflow in MAX_SWAPFILES (Fengguang Wu)
> 
> Signed-off-by: Andi Kleen <ak@linux.intel.com>
> 
> ---
>  include/linux/swap.h    |   34 ++++++++++++++++++++++++++++------
>  include/linux/swapops.h |   38 ++++++++++++++++++++++++++++++++++++++
>  mm/swapfile.c           |    4 ++--
>  3 files changed, 68 insertions(+), 8 deletions(-)
> 
> Index: linux/include/linux/swap.h
> ===================================================================
> --- linux.orig/include/linux/swap.h	2009-05-27 21:13:54.000000000 +0200
> +++ linux/include/linux/swap.h	2009-05-27 21:14:21.000000000 +0200
> @@ -34,16 +34,38 @@
>   * the type/offset into the pte as 5/27 as well.
>   */
>  #define MAX_SWAPFILES_SHIFT	5
> -#ifndef CONFIG_MIGRATION
> -#define MAX_SWAPFILES		(1 << MAX_SWAPFILES_SHIFT)
> +
> +/*
> + * Use some of the swap files numbers for other purposes. This
> + * is a convenient way to hook into the VM to trigger special
> + * actions on faults.
> + */
> +
> +/*
> + * NUMA node memory migration support
> + */
> +#ifdef CONFIG_MIGRATION
> +#define SWP_MIGRATION_NUM 2
> +#define SWP_MIGRATION_READ	(MAX_SWAPFILES + SWP_HWPOISON_NUM + 1)
> +#define SWP_MIGRATION_WRITE	(MAX_SWAPFILES + SWP_HWPOISON_NUM + 2)
>  #else
> -/* Use last two entries for page migration swap entries */
> -#define MAX_SWAPFILES		((1 << MAX_SWAPFILES_SHIFT)-2)
> -#define SWP_MIGRATION_READ	MAX_SWAPFILES
> -#define SWP_MIGRATION_WRITE	(MAX_SWAPFILES + 1)
> +#define SWP_MIGRATION_NUM 0
>  #endif
>  
>  /*
> + * Handling of hardware poisoned pages with memory corruption.
> + */
> +#ifdef CONFIG_MEMORY_FAILURE
> +#define SWP_HWPOISON_NUM 1
> +#define SWP_HWPOISON		(MAX_SWAPFILES + 1)
> +#else
> +#define SWP_HWPOISON_NUM 0
> +#endif
> +
> +#define MAX_SWAPFILES \
> +	((1 << MAX_SWAPFILES_SHIFT) - SWP_MIGRATION_NUM - SWP_HWPOISON_NUM - 1)

I'd prefer a different fix for the overflow issue.
For example, if both CONFIG_MIGRATION and CONFIG_MEMORY_FAILURE are
undefined, MAX_SWAPFILES is defined as 31.  But we should be able to
use up to 32 swap files/devices!

So instead, we should do:

#define SWP_MIGRATION_READ	(MAX_SWAPFILES + SWP_HWPOISON_NUM)
#define SWP_MIGRATION_WRITE	(MAX_SWAPFILES + SWP_HWPOISON_NUM + 1)

#define SWP_HWPOISON		MAX_SWAPFILES

#define MAX_SWAPFILES \
	((1 << MAX_SWAPFILES_SHIFT) - SWP_MIGRATION_NUM - SWP_HWPOISON_NUM)

and:

static inline int non_swap_entry(swp_entry_t entry)
{
	return swp_type(entry) >= MAX_SWAPFILES;
}


Regards,
-- 
Hidehiro Kawai
Hitachi, Systems Development Laboratory
Linux Technology Center



* Re: [PATCH] [9/16] HWPOISON: Use bitmask/action code for try_to_unmap behaviour
  2009-05-28  8:28         ` Nick Piggin
@ 2009-05-28  9:02           ` Andi Kleen
  -1 siblings, 0 replies; 232+ messages in thread
From: Andi Kleen @ 2009-05-28  9:02 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Lee.Schermerhorn, akpm, linux-kernel, linux-mm, fengguang.wu

Nick Piggin <npiggin@suse.de> writes:

> There are a set of "actions" which is what the callers are, then a
> set of modifiers. Just make it all modifiers and the callers can
> use things that are | together.

The actions are typically contradictory in some way, that is why
I made them "actions". The modifiers are all things that could
be made into flags in a straightforward way.

Probably it could be all turned into flags, but that would
make the patch much more intrusive for rmap.c than it currently is,
with some restructuring needed, which I didn't want to do.

Hwpoison in general is designed to not be intrusive.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.


* Re: [PATCH] [4/16] HWPOISON: Add support for poison swap entries v2
  2009-05-28  8:46     ` Hidehiro Kawai
@ 2009-05-28  9:11       ` Wu Fengguang
  -1 siblings, 0 replies; 232+ messages in thread
From: Wu Fengguang @ 2009-05-28  9:11 UTC (permalink / raw)
  To: Hidehiro Kawai
  Cc: Andi Kleen, akpm, linux-kernel, linux-mm, Satoshi OSHIMA,
	Taketoshi Sakuraba

On Thu, May 28, 2009 at 04:46:56PM +0800, Hidehiro Kawai wrote:
> Andi Kleen wrote:
> 
> > CPU migration uses special swap entry types to trigger special actions on page
> > faults. Extend this mechanism to also support poisoned swap entries, to trigger
> > poison handling on page faults. This allows follow-on patches to prevent
> > processes from faulting in poisoned pages again.
> > 
> > v2: Fix overflow in MAX_SWAPFILES (Fengguang Wu)
> > 
> > Signed-off-by: Andi Kleen <ak@linux.intel.com>
> > 
> > ---
> >  include/linux/swap.h    |   34 ++++++++++++++++++++++++++++------
> >  include/linux/swapops.h |   38 ++++++++++++++++++++++++++++++++++++++
> >  mm/swapfile.c           |    4 ++--
> >  3 files changed, 68 insertions(+), 8 deletions(-)
> > 
> > Index: linux/include/linux/swap.h
> > ===================================================================
> > --- linux.orig/include/linux/swap.h	2009-05-27 21:13:54.000000000 +0200
> > +++ linux/include/linux/swap.h	2009-05-27 21:14:21.000000000 +0200
> > @@ -34,16 +34,38 @@
> >   * the type/offset into the pte as 5/27 as well.
> >   */
> >  #define MAX_SWAPFILES_SHIFT	5
> > -#ifndef CONFIG_MIGRATION
> > -#define MAX_SWAPFILES		(1 << MAX_SWAPFILES_SHIFT)
> > +
> > +/*
> > + * Use some of the swap files numbers for other purposes. This
> > + * is a convenient way to hook into the VM to trigger special
> > + * actions on faults.
> > + */
> > +
> > +/*
> > + * NUMA node memory migration support
> > + */
> > +#ifdef CONFIG_MIGRATION
> > +#define SWP_MIGRATION_NUM 2
> > +#define SWP_MIGRATION_READ	(MAX_SWAPFILES + SWP_HWPOISON_NUM + 1)
> > +#define SWP_MIGRATION_WRITE	(MAX_SWAPFILES + SWP_HWPOISON_NUM + 2)
> >  #else
> > -/* Use last two entries for page migration swap entries */
> > -#define MAX_SWAPFILES		((1 << MAX_SWAPFILES_SHIFT)-2)
> > -#define SWP_MIGRATION_READ	MAX_SWAPFILES
> > -#define SWP_MIGRATION_WRITE	(MAX_SWAPFILES + 1)
> > +#define SWP_MIGRATION_NUM 0
> >  #endif
> >  
> >  /*
> > + * Handling of hardware poisoned pages with memory corruption.
> > + */
> > +#ifdef CONFIG_MEMORY_FAILURE
> > +#define SWP_HWPOISON_NUM 1
> > +#define SWP_HWPOISON		(MAX_SWAPFILES + 1)
> > +#else
> > +#define SWP_HWPOISON_NUM 0
> > +#endif
> > +
> > +#define MAX_SWAPFILES \
> > +	((1 << MAX_SWAPFILES_SHIFT) - SWP_MIGRATION_NUM - SWP_HWPOISON_NUM - 1)
> 
> I don't prefer this fix against the overflow issue.
> For example, if both CONFIG_MIGRATION and CONFIG_MEMORY_FAILURE are
> undefined, MAX_SWAPFILES is defined as 31.  But we should be able to
> use up to 32 swap files/devices!
> 
> So instead, we should do:
> 
> #define SWP_MIGRATION_READ	(MAX_SWAPFILES + SWP_HWPOISON_NUM)
> #define SWP_MIGRATION_WRITE	(MAX_SWAPFILES + SWP_HWPOISON_NUM + 1)
> 
> #define SWP_HWPOISON		MAX_SWAPFILES
> 
> #define MAX_SWAPFILES \
> 	((1 << MAX_SWAPFILES_SHIFT) - SWP_MIGRATION_NUM - SWP_HWPOISON_NUM)
> 
> and:
> 
> static inline int non_swap_entry(swp_entry_t entry)
> {
> 	return swp_type(entry) >= MAX_SWAPFILES;
> }

Yes, this is a better way to fix the overflow problem: when
SWP_HWPOISON is 32 and it is shifted up by SWP_TYPE_SHIFT and then
shifted back down, we get 0 (it overflowed).

Andi, this patch does what Hidehiro describes.

---
 include/linux/swap.h    |    8 ++++----
 include/linux/swapops.h |    2 +-
 2 files changed, 5 insertions(+), 5 deletions(-)

--- linux.orig/include/linux/swap.h
+++ linux/include/linux/swap.h
@@ -46,8 +46,8 @@ static inline int current_is_kswapd(void
  */
 #ifdef CONFIG_MIGRATION
 #define SWP_MIGRATION_NUM 2
-#define SWP_MIGRATION_READ	(MAX_SWAPFILES + SWP_HWPOISON_NUM + 1)
-#define SWP_MIGRATION_WRITE	(MAX_SWAPFILES + SWP_HWPOISON_NUM + 2)
+#define SWP_MIGRATION_READ	(MAX_SWAPFILES + SWP_HWPOISON_NUM)
+#define SWP_MIGRATION_WRITE	(MAX_SWAPFILES + SWP_HWPOISON_NUM + 1)
 #else
 #define SWP_MIGRATION_NUM 0
 #endif
@@ -57,13 +57,13 @@ static inline int current_is_kswapd(void
  */
 #ifdef CONFIG_MEMORY_FAILURE
 #define SWP_HWPOISON_NUM 1
-#define SWP_HWPOISON 		(MAX_SWAPFILES + 1)
+#define SWP_HWPOISON 		MAX_SWAPFILES
 #else
 #define SWP_HWPOISON_NUM 0
 #endif
 
 #define MAX_SWAPFILES \
-	((1 << MAX_SWAPFILES_SHIFT) - SWP_MIGRATION_NUM - SWP_HWPOISON_NUM - 1)
+	((1 << MAX_SWAPFILES_SHIFT) - SWP_MIGRATION_NUM - SWP_HWPOISON_NUM)
 
 /*
  * Magic header for a swap area. The first part of the union is
--- linux.orig/include/linux/swapops.h
+++ linux/include/linux/swapops.h
@@ -161,7 +161,7 @@ static inline int is_hwpoison_entry(swp_
 #if defined(CONFIG_MEMORY_FAILURE) || defined(CONFIG_MIGRATION)
 static inline int non_swap_entry(swp_entry_t entry)
 {
-	return swp_type(entry) > MAX_SWAPFILES;
+	return swp_type(entry) >= MAX_SWAPFILES;
 }
 #else
 static inline int non_swap_entry(swp_entry_t entry)

> #define SWP_MIGRATION_WRITE	(MAX_SWAPFILES + SWP_HWPOISON_NUM + 1)
> 
> #define SWP_HWPOISON		MAX_SWAPFILES
> 
> #define MAX_SWAPFILES \
> 	((1 << MAX_SWAPFILES_SHIFT) - SWP_MIGRATION_NUM - SWP_HWPOISON_NUM)
> 
> and:
> 
> static inline int non_swap_entry(swp_entry_t entry)
> {
> 	return swp_type(entry) >= MAX_SWAPFILES;
> }

Yes, this is a better way to fix the overflow problem: when
SWP_HWPOISON = 32 is shifted up by SWP_TYPE_SHIFT and then shifted
back down, we get 0 (it overflowed).

Andi, this patch does what Hidehiro describes.

---
 include/linux/swap.h    |    8 ++++----
 include/linux/swapops.h |    2 +-
 2 files changed, 5 insertions(+), 5 deletions(-)

--- linux.orig/include/linux/swap.h
+++ linux/include/linux/swap.h
@@ -46,8 +46,8 @@ static inline int current_is_kswapd(void
  */
 #ifdef CONFIG_MIGRATION
 #define SWP_MIGRATION_NUM 2
-#define SWP_MIGRATION_READ	(MAX_SWAPFILES + SWP_HWPOISON_NUM + 1)
-#define SWP_MIGRATION_WRITE	(MAX_SWAPFILES + SWP_HWPOISON_NUM + 2)
+#define SWP_MIGRATION_READ	(MAX_SWAPFILES + SWP_HWPOISON_NUM)
+#define SWP_MIGRATION_WRITE	(MAX_SWAPFILES + SWP_HWPOISON_NUM + 1)
 #else
 #define SWP_MIGRATION_NUM 0
 #endif
@@ -57,13 +57,13 @@ static inline int current_is_kswapd(void
  */
 #ifdef CONFIG_MEMORY_FAILURE
 #define SWP_HWPOISON_NUM 1
-#define SWP_HWPOISON 		(MAX_SWAPFILES + 1)
+#define SWP_HWPOISON 		MAX_SWAPFILES
 #else
 #define SWP_HWPOISON_NUM 0
 #endif
 
 #define MAX_SWAPFILES \
-	((1 << MAX_SWAPFILES_SHIFT) - SWP_MIGRATION_NUM - SWP_HWPOISON_NUM - 1)
+	((1 << MAX_SWAPFILES_SHIFT) - SWP_MIGRATION_NUM - SWP_HWPOISON_NUM)
 
 /*
  * Magic header for a swap area. The first part of the union is
--- linux.orig/include/linux/swapops.h
+++ linux/include/linux/swapops.h
@@ -161,7 +161,7 @@ static inline int is_hwpoison_entry(swp_
 #if defined(CONFIG_MEMORY_FAILURE) || defined(CONFIG_MIGRATION)
 static inline int non_swap_entry(swp_entry_t entry)
 {
-	return swp_type(entry) > MAX_SWAPFILES;
+	return swp_type(entry) >= MAX_SWAPFILES;
 }
 #else
 static inline int non_swap_entry(swp_entry_t entry)

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <dont@kvack.org>

^ permalink raw reply	[flat|nested] 232+ messages in thread

* Re: [PATCH] [13/16] HWPOISON: The high level memory error handler in the VM v3
  2009-05-28  8:26     ` Nick Piggin
@ 2009-05-28  9:31       ` Andi Kleen
  -1 siblings, 0 replies; 232+ messages in thread
From: Andi Kleen @ 2009-05-28  9:31 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andi Kleen, hugh, riel, akpm, chris.mason, linux-kernel,
	linux-mm, fengguang.wu

On Thu, May 28, 2009 at 10:26:16AM +0200, Nick Piggin wrote:

Thanks for the review.

> > + *
> > + * Also there are some races possible while we get from the
> > + * error detection to actually handle it.
> > + */
> > +
> > +struct to_kill {
> > +	struct list_head nd;
> > +	struct task_struct *tsk;
> > +	unsigned long addr;
> > +};
> 
> It would be kinda nice to have a field in task_struct that is usable
> say for anyone holding the tasklist lock for write. Then you could

I don't want to hold the tasklist lock for writing all the time; memory
failure handling can sleep.

> make a list with them. But I guess it isn't worthwhile unless there
> are other users.

It would need to be reserved for this, which definitely doesn't make it
worth it. Also I need the address too; a list head alone wouldn't be enough.



> > +			printk(KERN_ERR "MCE: Out of memory while machine check handling\n");
> > +			return;
> > +		}
> > +	}
> > +	tk->addr = page_address_in_vma(p, vma);
> > +	if (tk->addr == -EFAULT) {
> > +		printk(KERN_INFO "MCE: Failed to get address in VMA\n");
> 
> I don't know if this is a very helpful message. It could legitimately happen and
> nothing anybody can do about it...

Can you suggest a better message?

> 
> > +		tk->addr = 0;
> > +		fail = 1;
> 
> Fail doesn't seem to be used anywhere.

Ah yes, that was a remnant of an error checking scheme I discarded later.
I'll remove it, thanks.

> > +	list_add_tail(&tk->nd, to_kill);
> > +}
> > +
> > +/*
> > + * Kill the processes that have been collected earlier.
> > + */
> > +static void kill_procs_ao(struct list_head *to_kill, int doit, int trapno,
> > +			  int fail, unsigned long pfn)
> 
> I guess "doit" etc is obvious once reading the code and caller, but maybe a
> quick comment in the header to describe?

Ok.

> 
> > +{
> > +	struct to_kill *tk, *next;
> > +
> > +	list_for_each_entry_safe (tk, next, to_kill, nd) {
> > +		if (doit) {
> > +			/*
> > +			 * In case something went wrong with munmaping
> > +			 * make sure the process doesn't catch the
> > +			 * signal and then access the memory. So reset
> > +			 * the signal handlers
> > +			 */
> > +			if (fail)
> > +				flush_signal_handlers(tk->tsk, 1);
> 
> Is this a legitimate thing to do? Is it racy? Why would you not send a
> sigkill or something if you want them to die right now?

That's a very unlikely case; it could probably just be removed. It only
happens when something during unmapping fails (mostly out of memory).

It's more paranoia than real need.

Yes, SIGKILL would probably be better.

> > + */
> > +static void collect_procs_file(struct page *page, struct list_head *to_kill,
> > +			      struct to_kill **tkc)
> > +{
> > +	struct vm_area_struct *vma;
> > +	struct task_struct *tsk;
> > +	struct prio_tree_iter iter;
> > +	struct address_space *mapping = page_mapping(page);
> > +
> > +	read_lock(&tasklist_lock);
> > +	spin_lock(&mapping->i_mmap_lock);
> 
> You have tasklist_lock(R) nesting outside i_mmap_lock, and inside anon_vma
> lock. And anon_vma lock nests inside i_mmap_lock.
> 
> This seems fragile. If rwlocks ever become FIFO or tasklist_lock changes
> type (maybe -rt kernels do it), then you could have a task holding
> anon_vma lock and waiting for tasklist_lock, and another holding tasklist
> lock and waiting for i_mmap_lock, and another holding i_mmap_lock and
> waiting for anon_vma lock.

So you're saying I should change the order?

> 
> I think nesting either inside or outside these locks consistently is less
> fragile. Do we already have a dependency?... I don't know of one, but you
> should document this in mm/rmap.c and mm/filemap.c.

Ok.

> > +	DELAYED,
> > +	IGNORED,
> > +	RECOVERED,
> > +};
> > +
> > +static const char *action_name[] = {
> > +	[FAILED] = "Failed",
> > +	[DELAYED] = "Delayed",
> > +	[IGNORED] = "Ignored",
> 
> How is delayed different to ignored (or failed, for that matter)?

Part of it is documentation.

DELAYED means it's handled somewhere else (e.g. in the case of free pages)

> 
> 
> > +	[RECOVERED] = "Recovered",
> 
> And what does recovered mean? The processes were killed and the page taken

Not necessarily killed, it might have been a clean page or so.

> out of circulation, but the machine is still in some unknown state of corruption
> henceforth, right?

It's in a known state of corruption -- there was this error on that page,
and otherwise it's fine (or at least no errors are known at this point).
The CPU generally tells you when it's in an unknown state, and in that
case this code is not executed; we just panic directly.


> > +
> > +	/*
> > +	 * remove_from_page_cache assumes (mapping && !mapped)
> > +	 */
> > +	if (page_mapping(p) && !page_mapped(p)) {
> > +		remove_from_page_cache(p);
> > +		page_cache_release(p);
> > +	}
> 
> remove_mapping would probably be a better idea. Otherwise you can
> probably introduce pagecache removal vs page fault races which
> will make the kernel bug.

Can you be more specific about the problems?

> > +			page_to_pfn(p));
> > +	if (mapping) {
> > +		/*
> > +		 * Truncate does the same, but we're not quite the same
> > +		 * as truncate. Needs more checking, but keep it for now.
> > +		 */
> 
> What's different about truncate? It would be good to reuse as much as possible.

Truncating removes the blocks on disk (we don't). Truncating shrinks
the end of the file (we don't). It's more of a "temporal hole punch".
From the VM point of view it's probably very similar, but it's
not the same.

> 
> 
> > +		cancel_dirty_page(p, PAGE_CACHE_SIZE);
> > +
> > +		/*
> > +		 * IO error will be reported by write(), fsync(), etc.
> > +		 * who check the mapping.
> > +		 */
> > +		mapping_set_error(mapping, EIO);
> 
> Interesting. It's not *exactly* an IO error (well, not like one we're usually
> used to).

It's a new kind, but conceptually it's the same. Dirty IO data got corrupted.

We actually had a lot of grief with the error reporting; a lot of
code does "report error once then clear from mapping", which
broke all the tests for that in the test suite. IMHO that's a shady
area in the kernel.

Right now these are "expected but incorrect failures" in the tester.


> > +
> > +	delete_from_swap_cache(p);
> > +
> > +	return RECOVERED;
> > +}
> 
> All these handlers are quite interesting in that they need to
> know about most of the mm. What are you trying to do in each
> of them would be a good idea to say, and probably they should
> rather go into their appropriate files instead of all here
> (eg. swapcache stuff should go in mm/swap_state for example).

Hmm. I think I would prefer to merge first before thinking about
such things. But they could be moved at some point.

I suspect people first need to get more used to the idea of poisoned pages
before we can force it on them directly like this.

> 
> You haven't waited on writeback here AFAIKS, and have you
> *really* verified it is safe to call delete_from_swap_cache?

Verified in what way? Fengguang and I went over the code.
The original attempt at doing this was quite broken, but this
one should be better (it's the third iteration or so).


> > +
> > +#define dirty		(1UL << PG_dirty)
> > +#define swapcache	(1UL << PG_swapcache)
> > +#define unevict		(1UL << PG_unevictable)
> > +#define mlocked		(1UL << PG_mlocked)
> > +#define writeback	(1UL << PG_writeback)
> > +#define lru		(1UL << PG_lru)
> > +#define swapbacked	(1UL << PG_swapbacked)
> > +#define head		(1UL << PG_head)
> > +#define tail		(1UL << PG_tail)
> > +#define compound	(1UL << PG_compound)
> > +#define slab		(1UL << PG_slab)
> > +#define buddy		(1UL << PG_buddy)
> > +#define reserved	(1UL << PG_reserved)
> 
> This looks like more work than just putting 1UL << (...) in each entry

I had this originally, but it looked rather ugly.

> in your table. Hmm, does this whole table thing even buy you much (versus a
> much simpler switch statement?)

I don't think the switch would be particularly simple. Also I like
tables.

> 
> And seeing as you are doing a lot of checking for various page flags anyway,
> (eg. in your prepare function). Just seems like needless complexity.

Yes that grew over time unfortunately. Originally there was very little
explicit flag checking.

I still think the table is a good approach. 

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.


* Re: [PATCH] [13/16] HWPOISON: The high level memory error handler in the VM v3
  2009-05-28  8:26     ` Nick Piggin
@ 2009-05-28  9:59       ` Wu Fengguang
  -1 siblings, 0 replies; 232+ messages in thread
From: Wu Fengguang @ 2009-05-28  9:59 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andi Kleen, hugh, riel, akpm, chris.mason, linux-kernel, linux-mm

Hi Nick,

On Thu, May 28, 2009 at 04:26:16PM +0800, Nick Piggin wrote:
> On Wed, May 27, 2009 at 10:12:39PM +0200, Andi Kleen wrote:
> >
> > This patch adds the high level memory handler that poisons pages
> > that got corrupted by hardware (typically by a bit flip in a DIMM
> > or a cache) on the Linux level. Linux tries to access these
> > pages in the future then.

[snip]

> > +/*
> > + * Clean (or cleaned) page cache page.
> > + */
> > +static int me_pagecache_clean(struct page *p)
> > +{
> > +     if (!isolate_lru_page(p))
> > +             page_cache_release(p);
> > +
> > +     if (page_has_private(p))
> > +             do_invalidatepage(p, 0);
> > +     if (page_has_private(p) && !try_to_release_page(p, GFP_NOIO))
> > +             Dprintk(KERN_ERR "MCE %#lx: failed to release buffers\n",
> > +                     page_to_pfn(p));
> > +
> > +     /*
> > +      * remove_from_page_cache assumes (mapping && !mapped)
> > +      */
> > +     if (page_mapping(p) && !page_mapped(p)) {
> > +             remove_from_page_cache(p);
> > +             page_cache_release(p);
> > +     }
> 
> remove_mapping would probably be a better idea. Otherwise you can
> probably introduce pagecache removal vs page fault races which
> will make the kernel bug.

We used remove_mapping() at first, then discovered that it makes a
strong assumption of page_count == 2.

I guess it is safe from races since we are locking the page?

> 
> > +     }
> > +
> > +     me_pagecache_clean(p);
> > +
> > +     /*
> > +      * Did the earlier release work?
> > +      */
> > +     if (page_has_private(p) && !try_to_release_page(p, GFP_NOIO))
> > +             return FAILED;
> > +
> > +     return RECOVERED;
> > +}
> > +
> > +/*
> > + * Clean and dirty swap cache.
> > + */
> > +static int me_swapcache_dirty(struct page *p)
> > +{
> > +     ClearPageDirty(p);
> > +
> > +     if (!isolate_lru_page(p))
> > +             page_cache_release(p);
> > +
> > +     return DELAYED;
> > +}
> > +
> > +static int me_swapcache_clean(struct page *p)
> > +{
> > +     ClearPageUptodate(p);
> > +
> > +     if (!isolate_lru_page(p))
> > +             page_cache_release(p);
> > +
> > +     delete_from_swap_cache(p);
> > +
> > +     return RECOVERED;
> > +}
> 
> All these handlers are quite interesting in that they need to
> know about most of the mm. What are you trying to do in each
> of them would be a good idea to say, and probably they should
> rather go into their appropriate files instead of all here
> (eg. swapcache stuff should go in mm/swap_state for example).

Yup, they at least need more careful comments.

A dirty swap cache page is tricky to handle. The page could live in both the
page cache and the swap cache (i.e. the page is freshly swapped in). So it
could be referenced concurrently by two types of PTEs: a normal PTE and a
swap PTE. We try to handle them consistently by calling
try_to_unmap(TTU_IGNORE_HWPOISON) to convert the normal PTEs to swap PTEs,
and then
        - clear dirty bit to prevent IO
        - remove from LRU
        - but keep in the swap cache, so that when we return to it on
          a later page fault, we know the application is accessing
          corrupted data and shall be killed (we installed simple
          interception code in do_swap_page to catch it).

Clean swap cache pages can be directly isolated. A later page fault will bring
in the known good data from disk.

> You haven't waited on writeback here AFAIKS, and have you
> *really* verified it is safe to call delete_from_swap_cache?

Good catch. I'll soon submit patches for handling pages under
read/write IO. In this patchset they are simply ignored.

Thanks,
Fengguang



* Re: [PATCH] [13/16] HWPOISON: The high level memory error handler in the VM v3
  2009-05-28  9:59       ` Wu Fengguang
@ 2009-05-28 10:11         ` Andi Kleen
  -1 siblings, 0 replies; 232+ messages in thread
From: Andi Kleen @ 2009-05-28 10:11 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Nick Piggin, Andi Kleen, hugh, riel, akpm, chris.mason,
	linux-kernel, linux-mm

On Thu, May 28, 2009 at 05:59:34PM +0800, Wu Fengguang wrote:
> Dirty swap cache page is tricky to handle. The page could live both in page
> cache and swap cache(ie. page is freshly swapped in). So it could be referenced
> concurrently by 2 types of PTEs: one normal PTE and another swap PTE. We try to
> handle them consistently by calling try_to_unmap(TTU_IGNORE_HWPOISON) to convert
> the normal PTEs to swap PTEs, and then
>         - clear dirty bit to prevent IO
>         - remove from LRU
>         - but keep in the swap cache, so that when we return to it on
>           a later page fault, we know the application is accessing
>           corrupted data and shall be killed (we installed simple
>           interception code in do_swap_page to catch it).

That's a good description. I'll add it as a comment to the code.

> > You haven't waited on writeback here AFAIKS, and have you
> > *really* verified it is safe to call delete_from_swap_cache?
> 
> Good catch. I'll soon submit patches for handling the under
> read/write IO pages. In this patchset they are simply ignored.

Yes, we assume the IO device does something sensible with the poisoned
cache lines and aborts. Later we can likely abort IO requests at an early
stage on the Linux side, but that's more advanced.

The question is if we need to wait on writeback for correctness? 

We still don't want to crash if we take away a page that is currently
under writeback.

My original assumption was that taking the page lock would take
care of that. Is that not true?

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 232+ messages in thread

* Re: [PATCH] [13/16] HWPOISON: The high level memory error handler in the VM v3
  2009-05-28 10:11         ` Andi Kleen
@ 2009-05-28 10:33           ` Wu Fengguang
  -1 siblings, 0 replies; 232+ messages in thread
From: Wu Fengguang @ 2009-05-28 10:33 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Nick Piggin, hugh, riel, akpm, chris.mason, linux-kernel, linux-mm

On Thu, May 28, 2009 at 06:11:11PM +0800, Andi Kleen wrote:
> On Thu, May 28, 2009 at 05:59:34PM +0800, Wu Fengguang wrote:
> > Dirty swap cache page is tricky to handle. The page could live both in page
> > cache and swap cache(ie. page is freshly swapped in). So it could be referenced
> > concurrently by 2 types of PTEs: one normal PTE and another swap PTE. We try to
> > handle them consistently by calling try_to_unmap(TTU_IGNORE_HWPOISON) to convert
> > the normal PTEs to swap PTEs, and then
> >         - clear dirty bit to prevent IO
> >         - remove from LRU
> >         - but keep in the swap cache, so that when we return to it on
> >           a later page fault, we know the application is accessing
> >           corrupted data and shall be killed (we installed simple
> >           interception code in do_swap_page to catch it).
> 
> That's a good description. I'll add it as a comment to the code.

OK, thanks.

> > > You haven't waited on writeback here AFAIKS, and have you
> > > *really* verified it is safe to call delete_from_swap_cache?
> > 
> > Good catch. I'll soon submit patches for handling the under
> > read/write IO pages. In this patchset they are simply ignored.
> 
> Yes, we assume the IO device does something sensible with the poisoned
> cache lines and aborts. Later we can likely abort IO requests in a early
> stage on the Linux, but that's more advanced.
> 
> The question is if we need to wait on writeback for correctness? 

Not necessary, because I'm going to add a me_writeback() handler.

Then writeback pages simply won't reach this code. And a page won't
magically go into writeback state, since it has been locked.

Thanks,
Fengguang

> We still don't want to crash if we take a page away that is currently
> writebacked.
> 
> My original assumption was that taking the page lock would take
> care of that. Is that not true?
> 
> -Andi
> -- 
> ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 232+ messages in thread

* Re: [PATCH] [4/16] HWPOISON: Add support for poison swap entries v2
  2009-05-28  8:46     ` Hidehiro Kawai
@ 2009-05-28 10:42       ` Andi Kleen
  -1 siblings, 0 replies; 232+ messages in thread
From: Andi Kleen @ 2009-05-28 10:42 UTC (permalink / raw)
  To: Hidehiro Kawai
  Cc: Andi Kleen, akpm, linux-kernel, linux-mm, fengguang.wu,
	Satoshi OSHIMA, Taketoshi Sakuraba

On Thu, May 28, 2009 at 05:46:56PM +0900, Hidehiro Kawai wrote:
> > + */
> > +#ifdef CONFIG_MEMORY_FAILURE
> > +#define SWP_HWPOISON_NUM 1
> > +#define SWP_HWPOISON		(MAX_SWAPFILES + 1)
> > +#else
> > +#define SWP_HWPOISON_NUM 0
> > +#endif
> > +
> > +#define MAX_SWAPFILES \
> > +	((1 << MAX_SWAPFILES_SHIFT) - SWP_MIGRATION_NUM - SWP_HWPOISON_NUM - 1)
> 
> I don't prefer this fix against the overflow issue.
> For example, if both CONFIG_MIGRATION and CONFIG_MEMORY_FAILURE are
> undefined, MAX_SWAPFILES is defined as 31.  But we should be able to
> use up to 32 swap files/devices!

Ok. Applied thanks. 

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 232+ messages in thread

* Re: [PATCH] [13/16] HWPOISON: The high level memory error handler in the VM v3
  2009-05-28 10:33           ` Wu Fengguang
@ 2009-05-28 10:51             ` Andi Kleen
  -1 siblings, 0 replies; 232+ messages in thread
From: Andi Kleen @ 2009-05-28 10:51 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andi Kleen, Nick Piggin, hugh, riel, akpm, chris.mason,
	linux-kernel, linux-mm

On Thu, May 28, 2009 at 06:33:00PM +0800, Wu Fengguang wrote:
> > > > You haven't waited on writeback here AFAIKS, and have you
> > > > *really* verified it is safe to call delete_from_swap_cache?
> > > 
> > > Good catch. I'll soon submit patches for handling the under
> > > read/write IO pages. In this patchset they are simply ignored.
> > 
> > Yes, we assume the IO device does something sensible with the poisoned
> > cache lines and aborts. Later we can likely abort IO requests in a early
> > stage on the Linux, but that's more advanced.
> > 
> > The question is if we need to wait on writeback for correctness? 
> 
> Not necessary. Because I'm going to add a me_writeback() handler.

OK, but consider this version without it. Let's assume me_writeback() is still in the future.

I'm mainly interested in correctness (as in not crashing) of this
version now.

Also writeback seems to be only used by nfs/afs/nilfs2, not in
the normal case, unless I'm misreading the code. 

The nilfs2 case seems weird, I haven't completely read that.

> Then the writeback pages simply won't reach here. And it won't
> magically go into writeback state, since the page has been locked.

But since we take the page lock, they should not be in writeback anyway,
no?

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 232+ messages in thread

* Re: [PATCH] [13/16] HWPOISON: The high level memory error handler in the VM v3
  2009-05-28 10:51             ` Andi Kleen
@ 2009-05-28 11:03               ` Wu Fengguang
  -1 siblings, 0 replies; 232+ messages in thread
From: Wu Fengguang @ 2009-05-28 11:03 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Nick Piggin, hugh, riel, akpm, chris.mason, linux-kernel, linux-mm

On Thu, May 28, 2009 at 06:51:03PM +0800, Andi Kleen wrote:
> On Thu, May 28, 2009 at 06:33:00PM +0800, Wu Fengguang wrote:
> > > > > You haven't waited on writeback here AFAIKS, and have you
> > > > > *really* verified it is safe to call delete_from_swap_cache?
> > > > 
> > > > Good catch. I'll soon submit patches for handling the under
> > > > read/write IO pages. In this patchset they are simply ignored.
> > > 
> > > Yes, we assume the IO device does something sensible with the poisoned
> > > cache lines and aborts. Later we can likely abort IO requests in a early
> > > stage on the Linux, but that's more advanced.
> > > 
> > > The question is if we need to wait on writeback for correctness? 
> > 
> > Not necessary. Because I'm going to add a me_writeback() handler.
> 
> Ok but without it. Let's assume me_writeback() is in the future.
> 
> I'm mainly interested in correctness (as in not crashing) of this
> version now.

OK. But I suspect that, as far as "don't crash" goes, it doesn't matter
whether we wait on the writeback page.

> Also writeback seems to be only used by nfs/afs/nilfs2, not in
> the normal case, unless I'm misreading the code. 
> 
> The nilfs2 case seems weird, I haven't completely read that.

I'll take a look at that :)

> > Then the writeback pages simply won't reach here. And it won't
> > magically go into writeback state, since the page has been locked.
> 
> But since we take the page lock they should not be in writeback anyways,
> no?

No, writeback pages are normally not locked. But the dirty => writeback
transition happens under the page lock.

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 232+ messages in thread

* Re: [PATCH] [13/16] HWPOISON: The high level memory error handler in the VM v3
  2009-05-28  9:31       ` Andi Kleen
@ 2009-05-28 12:08         ` Nick Piggin
  -1 siblings, 0 replies; 232+ messages in thread
From: Nick Piggin @ 2009-05-28 12:08 UTC (permalink / raw)
  To: Andi Kleen
  Cc: hugh, riel, akpm, chris.mason, linux-kernel, linux-mm, fengguang.wu

On Thu, May 28, 2009 at 11:31:41AM +0200, Andi Kleen wrote:
> On Thu, May 28, 2009 at 10:26:16AM +0200, Nick Piggin wrote:
> 
> Thanks for the review.
> 
> > > + *
> > > + * Also there are some races possible while we get from the
> > > + * error detection to actually handle it.
> > > + */
> > > +
> > > +struct to_kill {
> > > +	struct list_head nd;
> > > +	struct task_struct *tsk;
> > > +	unsigned long addr;
> > > +};
> > 
> > It would be kinda nice to have a field in task_struct that is usable
> > say for anyone holding the tasklist lock for write. Then you could
> 
> I don't want to hold the tasklist lock for writing all the time, memory
> failure handling can sleep.
> 
> > make a list with them. But I guess it isn't worthwhile unless there
> > are other users.
> 
> It would need to be reserved for this, which definitely doesn't make it
> worth it. Also I need the address too; a list head alone wouldn't be enough.

Right, it was just an idea. It would not have to be reserved for that
so long as it was synchronized with the right lock. But not a big deal,
forget it.


> > > +			printk(KERN_ERR "MCE: Out of memory while machine check handling\n");
> > > +			return;
> > > +		}
> > > +	}
> > > +	tk->addr = page_address_in_vma(p, vma);
> > > +	if (tk->addr == -EFAULT) {
> > > +		printk(KERN_INFO "MCE: Failed to get address in VMA\n");
> > 
> > I don't know if this is very helpful message. I could legitimately happen and
> > nothing anybody can do about it...
> 
> Can you suggest a better message?

Well, for userspace, nothing? At the very least ratelimit it, and preferably
give a more high-level description of what the problem and its consequences are.


> > > +		tk->addr = 0;
> > > +		fail = 1;
> > 
> > Fail doesn't seem to be used anywhere.
> 
> Ah yes, that was a remnant of an error checking scheme I discarded later.
> I'll remove it thanks.
> 
> > > +	list_add_tail(&tk->nd, to_kill);
> > > +}
> > > +
> > > +/*
> > > + * Kill the processes that have been collected earlier.
> > > + */
> > > +static void kill_procs_ao(struct list_head *to_kill, int doit, int trapno,
> > > +			  int fail, unsigned long pfn)
> > 
> > I guess "doit" etc is obvious once reading the code and caller, but maybe a
> > quick comment in the header to describe?
> 
> Ok.
> 
> > 
> > > +{
> > > +	struct to_kill *tk, *next;
> > > +
> > > +	list_for_each_entry_safe (tk, next, to_kill, nd) {
> > > +		if (doit) {
> > > +			/*
> > > +			 * In case something went wrong with munmaping
> > > +			 * make sure the process doesn't catch the
> > > +			 * signal and then access the memory. So reset
> > > +			 * the signal handlers
> > > +			 */
> > > +			if (fail)
> > > +				flush_signal_handlers(tk->tsk, 1);
> > 
> > Is this a legitimate thing to do? Is it racy? Why would you not send a
> > sigkill or something if you want them to die right now?
> 
> That's a very unlikely case; it could probably just be removed. It only
> happens when something during unmapping fails (mostly out of memory).
> 
> It's more paranoia than real need.
> 
> Yes SIGKILL would be probably better.

OK, maybe just remove it? (keep simple first?)


> > > + */
> > > +static void collect_procs_file(struct page *page, struct list_head *to_kill,
> > > +			      struct to_kill **tkc)
> > > +{
> > > +	struct vm_area_struct *vma;
> > > +	struct task_struct *tsk;
> > > +	struct prio_tree_iter iter;
> > > +	struct address_space *mapping = page_mapping(page);
> > > +
> > > +	read_lock(&tasklist_lock);
> > > +	spin_lock(&mapping->i_mmap_lock);
> > 
> > You have tasklist_lock(R) nesting outside i_mmap_lock, and inside anon_vma
> > lock. And anon_vma lock nests inside i_mmap_lock.
> > 
> > This seems fragile. If rwlocks ever become FIFO or tasklist_lock changes
> > type (maybe -rt kernels do it), then you could have a task holding
> > anon_vma lock and waiting for tasklist_lock, and another holding tasklist
> > lock and waiting for i_mmap_lock, and another holding i_mmap_lock and
> > waiting for anon_vma lock.
> 
> So you're saying I should change the order?

Well I don't _think_ we have a dependency already. Yes I would just change
the order to be either outside both VM locks or inside both. Maybe with
a note that it does not really matter which order (in case another user
comes up who needs the opposite ordering).


> > I think nesting either inside or outside these locks consistently is less
> > fragile. Do we already have a dependency?... I don't know of one, but you
> > should document this in mm/rmap.c and mm/filemap.c.
> 
> Ok.
> 
> > > +	DELAYED,
> > > +	IGNORED,
> > > +	RECOVERED,
> > > +};
> > > +
> > > +static const char *action_name[] = {
> > > +	[FAILED] = "Failed",
> > > +	[DELAYED] = "Delayed",
> > > +	[IGNORED] = "Ignored",
> > 
> > How is delayed different to ignored (or failed, for that matter)?
> 
> Part of it is documentation.
> 
> DELAYED means it's handled somewhere else (e.g. in the case of free pages)
> 
> > 
> > 
> > > +	[RECOVERED] = "Recovered",
> > 
> > And what does recovered mean? THe processes were killed and the page taken
> 
> Not necessarily killed, it might have been a clean page or so.
> 
> > out of circulation, but the machine is still in some unknown state of corruption
> > henceforth, right?
> 
> It's in a known state of corruption -- there was this error on that page,
> and otherwise it's fine (or at least no errors are known at this point).
> The CPU generally tells you when it's in an unknown state, and in that
> case this code is not executed; we just panic directly.

Then the data cannot have been consumed, by DMA or otherwise? What
about transient kernel references to the (pagecache/anonymous) page
(such as find_get_page for read(2), or get_user_pages callers)?


> > > +
> > > +	/*
> > > +	 * remove_from_page_cache assumes (mapping && !mapped)
> > > +	 */
> > > +	if (page_mapping(p) && !page_mapped(p)) {
> > > +		remove_from_page_cache(p);
> > > +		page_cache_release(p);
> > > +	}
> > 
> > remove_mapping would probably be a better idea. Otherwise you can
> > probably introduce pagecache removal vs page fault races whi
> > will make the kernel bug.
> 
> Can you be more specific about the problems?

Hmm, actually now that we hold the page lock over __do_fault (at least
for pagecache pages), this may not be able to trigger the race I was
thinking of (the page becoming mapped). But I still think it is better
to use remove_mapping, which is the standard way to remove such a page.

BTW. I don't know if you are checking for PG_writeback often enough?
You can't remove a PG_writeback page from pagecache. The normal
pattern is lock_page(page); wait_on_page_writeback(page); which I
think would be safest (then you never have to bother with the writeback
bit again).
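
The locking pattern Nick refers to, as a sketch (illustrative kernel-style
code, not part of the patchkit):

```c
/*
 * Illustrative sketch: stabilize a pagecache page before tearing it
 * down.  Holding the page lock blocks new dirty -> writeback
 * transitions; waiting drains any writeback already in flight.
 */
lock_page(page);
wait_on_page_writeback(page);
/*
 * From here until unlock_page(), the page can neither be under
 * writeback nor enter it, so PG_writeback never needs rechecking.
 * ... do the pagecache removal / error handling here ...
 */
unlock_page(page);
```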


> > > +			page_to_pfn(p));
> > > +	if (mapping) {
> > > +		/*
> > > +		 * Truncate does the same, but we're not quite the same
> > > +		 * as truncate. Needs more checking, but keep it for now.
> > > +		 */
> > 
> > What's different about truncate? It would be good to reuse as much as possible.
> 
> Truncating removes the block on disk (we don't). Truncating shrinks
> the end of the file (we don't). It's more "temporal hole punch"
> Probably from the VM point of view it's very similar, but it's
> not the same.

Right, I just mean the pagecache side of the truncate. So you should
use truncate_inode_pages_range here.
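
What that might look like for a single poisoned page, as a sketch (the
index arithmetic below is an assumption for illustration, not code from
the patch):

```c
/* Illustrative sketch: punch just the poisoned page out of the
 * pagecache, the way truncation would, without touching on-disk
 * blocks. */
loff_t start = (loff_t)page->index << PAGE_CACHE_SHIFT;

truncate_inode_pages_range(mapping, start,
			   start + PAGE_CACHE_SIZE - 1);
```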


> > > +		cancel_dirty_page(p, PAGE_CACHE_SIZE);
> > > +
> > > +		/*
> > > +		 * IO error will be reported by write(), fsync(), etc.
> > > +		 * who check the mapping.
> > > +		 */
> > > +		mapping_set_error(mapping, EIO);
> > 
> > Interesting. It's not *exactly* an IO error (well, not like one we're usually
> > used to).
> 
> It's a new kind, but conceptually it's the same. Dirty IO data got corrupted.

Well, until now dirty data was never corrupted (i.e. the data in
pagecache was OK); it just could not make it back to backing store.
So a program could retry the write/fsync/etc. or try to write the
data somewhere else.

It kind of wants a new error code, but I can't imagine the difficulty
in doing that...


> We actually had a lot of grief with the error reporting; a lot of
> code does "report error once then clear from mapping", which
> broke all the tests for that in the test suite. IMHO that's a shady
> area in the kernel.

Yeah, it's annoying. I ran over this problem when auditing some
data integrity problems in the kernel recently. IIRC even a
misplaced sync from another process can come and suck up the IO
error that your DBMS was expecting to get from fsync.

We kind of need another type of syscall to tell the kernel whether
it can discard errors for a given file/range. But this is not the
problem for your patch.


> Right now these are "expected but incorrect failures" in the tester.
> 
> 
> > > +
> > > +	delete_from_swap_cache(p);
> > > +
> > > +	return RECOVERED;
> > > +}
> > 
> > All these handlers are quite interesting in that they need to
> > know about most of the mm. What are you trying to do in each
> > of them would be a good idea to say, and probably they should
> > rather go into their appropriate files instead of all here
> > (eg. swapcache stuff should go in mm/swap_state for example).
> 
> Hmm. I think I would prefer to first merge before
> thinking about such things. But they could be moved at some 
> point.
> 
> I suspect people first need to get more used to the idea of poisoned pages
> before we can force it to them directly like this.

Well these are messing with the internals of those subsystems, so
maintainers etc do need to think about such things and get used
to the idea.

I think it is actually a good idea thinking about it more. Basically
each subsystem will just have calls in response to handle errors in
their pages.


> > You haven't waited on writeback here AFAIKS, and have you
> > *really* verified it is safe to call delete_from_swap_cache?
> 
> Verified in what way? Fengguang and I went over the code.
> The original attempt at doing this was quite broken, but this
> one should be better (it's the third iteration or so)

I was thinking maybe it can be PG_writeback at that point which
would go BUG there. Maybe I missed somewhere where you filter
that out.


> > > +
> > > +#define dirty		(1UL << PG_dirty)
> > > +#define swapcache	(1UL << PG_swapcache)
> > > +#define unevict		(1UL << PG_unevictable)
> > > +#define mlocked		(1UL << PG_mlocked)
> > > +#define writeback	(1UL << PG_writeback)
> > > +#define lru		(1UL << PG_lru)
> > > +#define swapbacked	(1UL << PG_swapbacked)
> > > +#define head		(1UL << PG_head)
> > > +#define tail		(1UL << PG_tail)
> > > +#define compound	(1UL << PG_compound)
> > > +#define slab		(1UL << PG_slab)
> > > +#define buddy		(1UL << PG_buddy)
> > > +#define reserved	(1UL << PG_reserved)
> > 
> > This looks like more work than just putting 1UL << (...) in each entry
> 
> I had this originally, but it looked rather ugly.
> 
> > in your table. Hmm, does this whole table thing even buy you much (versus a
> > much simpler switch statement?)
> 
> I don't think the switch would be particularly simple. Also I like
> tables.
> 
> > 
> > And seeing as you are doing a lot of checking for various page flags anyway,
> > (eg. in your prepare function). Just seems like needless complexity.
> 
> Yes that grew over time unfortunately. Originally there was very little
> explicit flag checking.
> 
> I still think the table is a good approach. 

Just seems overengineered. We could rewrite any if/switch statement like
that (and actually the compiler probably will if it is beneficial).


^ permalink raw reply	[flat|nested] 232+ messages in thread

* Re: [PATCH] [13/16] HWPOISON: The high level memory error handler in the VM v3
@ 2009-05-28 12:08         ` Nick Piggin
  0 siblings, 0 replies; 232+ messages in thread
From: Nick Piggin @ 2009-05-28 12:08 UTC (permalink / raw)
  To: Andi Kleen
  Cc: hugh, riel, akpm, chris.mason, linux-kernel, linux-mm, fengguang.wu

On Thu, May 28, 2009 at 11:31:41AM +0200, Andi Kleen wrote:
> On Thu, May 28, 2009 at 10:26:16AM +0200, Nick Piggin wrote:
> 
> Thanks for the review.
> 
> > > + *
> > > + * Also there are some races possible while we get from the
> > > + * error detection to actually handle it.
> > > + */
> > > +
> > > +struct to_kill {
> > > +	struct list_head nd;
> > > +	struct task_struct *tsk;
> > > +	unsigned long addr;
> > > +};
> > 
> > It would be kinda nice to have a field in task_struct that is usable
> > say for anyone holding the tasklist lock for write. Then you could
> 
> I don't want to hold the tasklist lock for writing all the time, memory
> failure handling can sleep.
> 
> > make a list with them. But I guess it isn't worthwhile unless there
> > are other users.
> 
> It would need to be reserved for this, which definitely doesn't make
> worth it. Also I need the  address too, a list head alone wouldn't be enough.

Right, it was just an idea. It would not have to be reserved for that
so long as it was synchronized with the right lock. But not a big deal,
forget it.


> > > +			printk(KERN_ERR "MCE: Out of memory while machine check handling\n");
> > > +			return;
> > > +		}
> > > +	}
> > > +	tk->addr = page_address_in_vma(p, vma);
> > > +	if (tk->addr == -EFAULT) {
> > > +		printk(KERN_INFO "MCE: Failed to get address in VMA\n");
> > 
> > I don't know if this is very helpful message. I could legitimately happen and
> > nothing anybody can do about it...
> 
> Can you suggest a better message?

Well, for userspace, nothing? At the very least ratelimited, and preferably
telling a more high level of what the problem and consequences are.


> > > +		tk->addr = 0;
> > > +		fail = 1;
> > 
> > Fail doesn't seem to be used anywhere.
> 
> Ah yes that was a remnant of a error checking scheme I discard later.
> I'll remove it thanks.
> 
> > > +	list_add_tail(&tk->nd, to_kill);
> > > +}
> > > +
> > > +/*
> > > + * Kill the processes that have been collected earlier.
> > > + */
> > > +static void kill_procs_ao(struct list_head *to_kill, int doit, int trapno,
> > > +			  int fail, unsigned long pfn)
> > 
> > I guess "doit" etc is obvious once reading the code and caller, but maybe a
> > quick comment in the header to describe?
> 
> Ok.
> 
> > 
> > > +{
> > > +	struct to_kill *tk, *next;
> > > +
> > > +	list_for_each_entry_safe (tk, next, to_kill, nd) {
> > > +		if (doit) {
> > > +			/*
> > > +			 * In case something went wrong with munmaping
> > > +			 * make sure the process doesn't catch the
> > > +			 * signal and then access the memory. So reset
> > > +			 * the signal handlers
> > > +			 */
> > > +			if (fail)
> > > +				flush_signal_handlers(tk->tsk, 1);
> > 
> > Is this a legitimate thing to do? Is it racy? Why would you not send a
> > sigkill or something if you want them to die right now?
> 
> That's a very unlikely case; it could probably just be removed. It only
> happens when something during unmapping fails (mostly out of memory)
> 
> It's more paranoia than real need.
> 
> Yes SIGKILL would be probably better.

OK, maybe just remove it? (keep simple first?)


> > > + */
> > > +static void collect_procs_file(struct page *page, struct list_head *to_kill,
> > > +			      struct to_kill **tkc)
> > > +{
> > > +	struct vm_area_struct *vma;
> > > +	struct task_struct *tsk;
> > > +	struct prio_tree_iter iter;
> > > +	struct address_space *mapping = page_mapping(page);
> > > +
> > > +	read_lock(&tasklist_lock);
> > > +	spin_lock(&mapping->i_mmap_lock);
> > 
> > You have tasklist_lock(R) nesting outside i_mmap_lock, and inside anon_vma
> > lock. And anon_vma lock nests inside i_mmap_lock.
> > 
> > This seems fragile. If rwlocks ever become FIFO or tasklist_lock changes
> > type (maybe -rt kernels do it), then you could have a task holding
> > anon_vma lock and waiting for tasklist_lock, and another holding tasklist
> > lock and waiting for i_mmap_lock, and another holding i_mmap_lock and
> > waiting for anon_vma lock.
> 
> So you're saying I should change the order?

Well I don't _think_ we have a dependency already. Yes I would just change
the order to be either outside both VM locks or inside both. Maybe with
a note that it does not really matter which order (in case another user
comes up who needs the opposite ordering).


> > I think nesting either inside or outside these locks consistently is less
> > fragile. Do we already have a dependency?... I don't know of one, but you
> > should document this in mm/rmap.c and mm/filemap.c.
> 
> Ok.
> 
> > > +	DELAYED,
> > > +	IGNORED,
> > > +	RECOVERED,
> > > +};
> > > +
> > > +static const char *action_name[] = {
> > > +	[FAILED] = "Failed",
> > > +	[DELAYED] = "Delayed",
> > > +	[IGNORED] = "Ignored",
> > 
> > How is delayed different to ignored (or failed, for that matter)?
> 
> Part of it is documentation.
> 
> DELAYED means it's handled somewhere else (e.g. in the case of free pages)
> 
> > 
> > 
> > > +	[RECOVERED] = "Recovered",
> > 
> > And what does recovered mean? THe processes were killed and the page taken
> 
> Not necessarily killed, it might have been a clean page or so.
> 
> > out of circulation, but the machine is still in some unknown state of corruption
> > henceforth, right?
> 
> It's in a known state of corruption -- there was this error on that page
> and otherwise it's fine (or at least no errors known at this point)
> The CPU generally tells you when it's in an unknown state, and in that case this 
> code is not executed; the kernel just panics directly.

Then the data can not have been consumed, by DMA or otherwise? What
about transient kernel references to the (pagecache/anonymous) page
(such as, find_get_page for read(2), or get_user_pages callers).


> > > +
> > > +	/*
> > > +	 * remove_from_page_cache assumes (mapping && !mapped)
> > > +	 */
> > > +	if (page_mapping(p) && !page_mapped(p)) {
> > > +		remove_from_page_cache(p);
> > > +		page_cache_release(p);
> > > +	}
> > 
> > remove_mapping would probably be a better idea. Otherwise you can
> > probably introduce pagecache removal vs page fault races which
> > will make the kernel bug.
> 
> Can you be more specific about the problems?

Hmm, actually now that we hold the page lock over __do_fault (at least
for pagecache pages), this may not be able to trigger the race I was
thinking of (page becoming mapped). But I think still it is better
to use remove_mapping which is the standard way to remove such a page.

BTW. I don't know if you are checking for PG_writeback often enough?
You can't remove a PG_writeback page from pagecache. The normal
pattern is lock_page(page); wait_on_page_writeback(page); which I
think would be safest (then you never have to bother with the writeback
bit again).


> > > +			page_to_pfn(p));
> > > +	if (mapping) {
> > > +		/*
> > > +		 * Truncate does the same, but we're not quite the same
> > > +		 * as truncate. Needs more checking, but keep it for now.
> > > +		 */
> > 
> > What's different about truncate? It would be good to reuse as much as possible.
> 
> Truncating removes the block on disk (we don't). Truncating shrinks
> the end of the file (we don't). It's more a temporary "hole punch".
> Probably from the VM point of view it's very similar, but it's
> not the same.

Right, I just mean the pagecache side of the truncate. So you should
use truncate_inode_pages_range here.


> > > +		cancel_dirty_page(p, PAGE_CACHE_SIZE);
> > > +
> > > +		/*
> > > +		 * IO error will be reported by write(), fsync(), etc.
> > > +		 * who check the mapping.
> > > +		 */
> > > +		mapping_set_error(mapping, EIO);
> > 
> > Interesting. It's not *exactly* an IO error (well, not like one we're usually
> > used to).
> 
> It's a new kind, but conceptually it's the same. Dirty IO data got corrupted.

Well, the dirty data has never been corrupted before (ie. the data
in pagecache has been OK). It was just unable to make it back to
backing store. So a program could retry the write/fsync/etc or
try to write the data somewhere else.

It kind of wants a new error code, but I can't imagine the difficulty
in doing that...


> We actually had a lot of grief with the error reporting; a lot of
> code does "report error once then clear from mapping", which
> broke all the tests for that in the test suite. IMHO that's a shady
> area in the kernel.

Yeah, it's annoying. I ran over this problem when auditing some
data integrity problems in the kernel recently. IIRC even a
misplaced sync from another process can come and suck up the IO
error that your DBMS was expecting to get from fsync.

We kind of need another type of syscall to tell the kernel whether
it can discard errors for a given file/range. But this is not the
problem for your patch.


> Right now these are "expected but incorrect failures" in the tester.
> 
> 
> > > +
> > > +	delete_from_swap_cache(p);
> > > +
> > > +	return RECOVERED;
> > > +}
> > 
> > All these handlers are quite interesting in that they need to
> > know about most of the mm. What are you trying to do in each
> > of them would be a good idea to say, and probably they should
> > rather go into their appropriate files instead of all here
> > (eg. swapcache stuff should go in mm/swap_state for example).
> 
> Hmm. I think I would prefer to first merge before
> thinking about such things. But they could be moved at some 
> point.
> 
> I suspect people first need to get more used to the idea of poisoned pages
> before we can force it to them directly like this.

Well these are messing with the internals of those subsystems, so
maintainers etc do need to think about such things and get used
to the idea.

I think it is actually a good idea thinking about it more. Basically
each subsystem will just have calls in response to handle errors in
their pages.


> > You haven't waited on writeback here AFAIKS, and have you
> > *really* verified it is safe to call delete_from_swap_cache?
> 
> Verified in what way? me and Fengguang went over the code.
> The original attempt at doing this was quite broken, but this
> one should be better (it's the third iteration or so)

I was thinking maybe it can be PG_writeback at that point which
would go BUG there. Maybe I missed somewhere where you filter
that out.


> > > +
> > > +#define dirty		(1UL << PG_dirty)
> > > +#define swapcache	(1UL << PG_swapcache)
> > > +#define unevict		(1UL << PG_unevictable)
> > > +#define mlocked		(1UL << PG_mlocked)
> > > +#define writeback	(1UL << PG_writeback)
> > > +#define lru		(1UL << PG_lru)
> > > +#define swapbacked	(1UL << PG_swapbacked)
> > > +#define head		(1UL << PG_head)
> > > +#define tail		(1UL << PG_tail)
> > > +#define compound	(1UL << PG_compound)
> > > +#define slab		(1UL << PG_slab)
> > > +#define buddy		(1UL << PG_buddy)
> > > +#define reserved	(1UL << PG_reserved)
> > 
> > This looks like more work than just putting 1UL << (...) in each entry
> 
> I had this originally, but it looked rather ugly.
> 
> > in your table. Hmm, does this whole table thing even buy you much (versus a
> > much simpler switch statement?)
> 
> I don't think the switch would be particularly simple. Also I like
> tables.
> 
> > 
> > And seeing as you are doing a lot of checking for various page flags anyway,
> > (eg. in your prepare function). Just seems like needless complexity.
> 
> Yes that grew over time unfortunately. Originally there was very little
> explicit flag checking.
> 
> I still think the table is a good approach. 

Just seems overengineered. We could rewrite any if/switch statement like
that (and actually the compiler probably will if it is beneficial).

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 232+ messages in thread

* Re: [PATCH] [13/16] HWPOISON: The high level memory error handler in the VM v3
  2009-05-28 10:51             ` Andi Kleen
@ 2009-05-28 12:15               ` Nick Piggin
  -1 siblings, 0 replies; 232+ messages in thread
From: Nick Piggin @ 2009-05-28 12:15 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Wu Fengguang, hugh, riel, akpm, chris.mason, linux-kernel, linux-mm

On Thu, May 28, 2009 at 12:51:03PM +0200, Andi Kleen wrote:
> On Thu, May 28, 2009 at 06:33:00PM +0800, Wu Fengguang wrote:
> > > > > You haven't waited on writeback here AFAIKS, and have you
> > > > > *really* verified it is safe to call delete_from_swap_cache?
> > > > 
> > > > Good catch. I'll soon submit patches for handling the under
> > > > read/write IO pages. In this patchset they are simply ignored.
> > > 
> > > Yes, we assume the IO device does something sensible with the poisoned
> > > cache lines and aborts. Later we can likely abort IO requests at an early
> > > stage in Linux, but that's more advanced.
> > > 
> > > The question is if we need to wait on writeback for correctness? 
> > 
> > Not necessary. Because I'm going to add a me_writeback() handler.
> 
> Ok but without it. Let's assume me_writeback() is in the future.

For correctness for what? You can't remove a page from swapcache or
pagecache under writeback because then the mm thinks that location
is not being used.

 
> I'm mainly interested in correctness (as in not crashing) of this
> version now.
> 
> Also writeback seems to be only used by nfs/afs/nilfs2, not in
> the normal case, unless I'm misreading the code. 

I don't follow. What writeback are you talking about?

> 
> The nilfs2 case seems weird, I haven't completely read that.
> 
> > Then the writeback pages simply won't reach here. And it won't
> > magically go into writeback state, since the page has been locked.
> 
> But since we take the page lock they should not be in writeback anyways,
> no?

No. PG_writeback was introduced so as to reduce page lock hold
times (most of writeback runs without page lock held).



* Re: [PATCH] [13/16] HWPOISON: The high level memory error handler in the VM v3
  2009-05-28  9:59       ` Wu Fengguang
@ 2009-05-28 12:23         ` Nick Piggin
  -1 siblings, 0 replies; 232+ messages in thread
From: Nick Piggin @ 2009-05-28 12:23 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andi Kleen, hugh, riel, akpm, chris.mason, linux-kernel, linux-mm

On Thu, May 28, 2009 at 05:59:34PM +0800, Wu Fengguang wrote:
> Hi Nick,
> 
> > > +     /*
> > > +      * remove_from_page_cache assumes (mapping && !mapped)
> > > +      */
> > > +     if (page_mapping(p) && !page_mapped(p)) {
> > > +             remove_from_page_cache(p);
> > > +             page_cache_release(p);
> > > +     }
> > 
> > remove_mapping would probably be a better idea. Otherwise you can
> > probably introduce pagecache removal vs page fault races which
> > will make the kernel bug.
> 
> We use remove_mapping() at first, then discovered that it made strong
> assumption on page_count=2.
> 
> I guess it is safe from races since we are locking the page?

Yes, it probably should be (although you will lose get_user_pages data, but
I guess that's the aim anyway).

But I just don't like this one file having all that required knowledge
(and few comments) of all the files in mm/. If you want to get rid
of the page and don't care what it's count or dirtyness is, then
truncate_inode_pages_range is the correct API to use.

(or you could extract out some of it so you can call it directly on
individual locked pages, if that helps).


> > > +     }
> > > +
> > > +     me_pagecache_clean(p);
> > > +
> > > +     /*
> > > +      * Did the earlier release work?
> > > +      */
> > > +     if (page_has_private(p) && !try_to_release_page(p, GFP_NOIO))
> > > +             return FAILED;
> > > +
> > > +     return RECOVERED;
> > > +}
> > > +
> > > +/*
> > > + * Clean and dirty swap cache.
> > > + */
> > > +static int me_swapcache_dirty(struct page *p)
> > > +{
> > > +     ClearPageDirty(p);
> > > +
> > > +     if (!isolate_lru_page(p))
> > > +             page_cache_release(p);
> > > +
> > > +     return DELAYED;
> > > +}
> > > +
> > > +static int me_swapcache_clean(struct page *p)
> > > +{
> > > +     ClearPageUptodate(p);
> > > +
> > > +     if (!isolate_lru_page(p))
> > > +             page_cache_release(p);
> > > +
> > > +     delete_from_swap_cache(p);
> > > +
> > > +     return RECOVERED;
> > > +}
> > 
> > All these handlers are quite interesting in that they need to
> > know about most of the mm. What are you trying to do in each
> > of them would be a good idea to say, and probably they should
> > rather go into their appropriate files instead of all here
> > (eg. swapcache stuff should go in mm/swap_state for example).
> 
> Yup, they at least need more careful comments.
> 
> Dirty swap cache page is tricky to handle. The page could live both in page
> cache and swap cache(ie. page is freshly swapped in). So it could be referenced
> concurrently by 2 types of PTEs: one normal PTE and another swap PTE. We try to
> handle them consistently by calling try_to_unmap(TTU_IGNORE_HWPOISON) to convert
> the normal PTEs to swap PTEs, and then
>         - clear dirty bit to prevent IO
>         - remove from LRU
>         - but keep in the swap cache, so that when we return to it on
>           a later page fault, we know the application is accessing
>           corrupted data and shall be killed (we installed simple
>           interception code in do_swap_page to catch it).

OK this is the point I was missing.

Should all be commented and put into mm/swap_state.c (or somewhere that
Hugh prefers).
 

> Clean swap cache pages can be directly isolated. A later page fault will bring
> in the known good data from disk.

OK, but why do you ClearPageUptodate if it is just to be deleted from
swapcache anyway?

 
> > You haven't waited on writeback here AFAIKS, and have you
> > *really* verified it is safe to call delete_from_swap_cache?
> 
> Good catch. I'll soon submit patches for handling the under
> read/write IO pages. In this patchset they are simply ignored.

Well that's quite important ;) I would suggest you just wait_on_page_writeback.
It is simple and should work. _Unless_ you can show it is a big problem that
needs an equivalently big mess to fix ;)



* Re: [PATCH] [9/16] HWPOISON: Use bitmask/action code for try_to_unmap behaviour
  2009-05-28  9:02           ` Andi Kleen
@ 2009-05-28 12:26             ` Nick Piggin
  -1 siblings, 0 replies; 232+ messages in thread
From: Nick Piggin @ 2009-05-28 12:26 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Lee.Schermerhorn, akpm, linux-kernel, linux-mm, fengguang.wu

On Thu, May 28, 2009 at 11:02:41AM +0200, Andi Kleen wrote:
> Nick Piggin <npiggin@suse.de> writes:
> 
> > There are a set of "actions" which is what the callers are, then a
> > set of modifiers. Just make it all modifiers and the callers can
> > use things that are | together.
> 
> The actions are typically contradictory in some way, that is why
> I made them "actions". The modifiers are all things that could
> be made into flags in a straightforward way.
> 
> Probably it could be all turned into flags, but that would
> make the patch much more intrusive for rmap.c than it currently is,
> with some restructuring needed, which I didn't want to do.

I don't think that's a problem. It's ugly as-is.

 
> Hwpoison in general is designed to not be intrusive.

Cosmetic changes and code restructuring are the least of hwpoison's
intrusiveness. It is very intrusive, not in lines added or changed,
but in how it interacts with the mm.



* Re: [PATCH] [13/16] HWPOISON: The high level memory error handler in the VM v3
  2009-05-28 12:08         ` Nick Piggin
@ 2009-05-28 13:45           ` Andi Kleen
  -1 siblings, 0 replies; 232+ messages in thread
From: Andi Kleen @ 2009-05-28 13:45 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andi Kleen, hugh, riel, akpm, chris.mason, linux-kernel,
	linux-mm, fengguang.wu

On Thu, May 28, 2009 at 02:08:54PM +0200, Nick Piggin wrote:
> > > > +			printk(KERN_ERR "MCE: Out of memory while machine check handling\n");
> > > > +			return;
> > > > +		}
> > > > +	}
> > > > +	tk->addr = page_address_in_vma(p, vma);
> > > > +	if (tk->addr == -EFAULT) {
> > > > +		printk(KERN_INFO "MCE: Failed to get address in VMA\n");
> > > 
> > > I don't know if this is a very helpful message. It could legitimately happen and
> > > nothing anybody can do about it...
> > 
> > Can you suggest a better message?
> 
> Well, for userspace, nothing? At the very least ratelimited, and preferably
> giving a higher-level description of what the problem and its consequences are.

I changed it to 

 "MCE: Unable to determine user space address during error handling\n")

Still not perfect, but hopefully better.


> > > > +				flush_signal_handlers(tk->tsk, 1);
> > > 
> > > Is this a legitimate thing to do? Is it racy? Why would you not send a
> > > sigkill or something if you want them to die right now?
> > 
> > That's a very unlikely case; it could probably just be removed. It only
> > happens when something during unmapping fails (mostly out of memory)
> > 
> > It's more paranoia than real need.
> > 
> > Yes SIGKILL would be probably better.
> 
> OK, maybe just remove it? (keep simple first?)

I changed it to always do a SIGKILL

> > > You have tasklist_lock(R) nesting outside i_mmap_lock, and inside anon_vma
> > > lock. And anon_vma lock nests inside i_mmap_lock.
> > > 
> > > This seems fragile. If rwlocks ever become FIFO or tasklist_lock changes
> > > type (maybe -rt kernels do it), then you could have a task holding
> > > anon_vma lock and waiting for tasklist_lock, and another holding tasklist
> > > lock and waiting for i_mmap_lock, and another holding i_mmap_lock and
> > > waiting for anon_vma lock.
> > 
> > So you're saying I should change the order?
> 
> Well I don't _think_ we have a dependency already. Yes I would just change
> the order to be either outside both VM locks or inside both. Maybe with
> a note that it does not really matter which order (in case another user
> comes up who needs the opposite ordering).

Ok. I can add a comment.

> > > > +	[RECOVERED] = "Recovered",
> > > 
> > > And what does recovered mean? THe processes were killed and the page taken
> > 
> > Not necessarily killed, it might have been a clean page or so.
> > 
> > > out of circulation, but the machine is still in some unknown state of corruption
> > > henceforth, right?
> > 
> > It's in a known state of corruption -- there was this error on that page
> > and otherwise it's fine (or at least no errors known at this point)
> > The CPU generally tells you when it's in an unknown state, and in that case this 
> > code is not executed; the kernel just panics directly.
> 
> Then the data can not have been consumed, by DMA or otherwise? What

When the data has been consumed we get a different machine check
(or a different error if it was consumed by an IO device).

This code right now just handles the case of "the CPU detected that a page
is broken, but hasn't consumed it yet".

> about transient kernel references to the (pagecache/anonymous) page
> (such as, find_get_page for read(2), or get_user_pages callers).	

There are always races; after all, the CPU could be just about
to consume it right now. If we lose the race there will be just
another machine check that stops the consumption of bad data.
The hardware takes care of that.

The code here doesn't try to be a 100% coverage of all
cases (that's obviously impossible), just to handle
common page types. I also originally had ideas for more handlers,
but found out how hard it is to test, so I buried a lot of fancy
* Re: [PATCH] [13/16] HWPOISON: The high level memory error handler in the VM v3
@ 2009-05-28 13:45           ` Andi Kleen
  0 siblings, 0 replies; 232+ messages in thread
From: Andi Kleen @ 2009-05-28 13:45 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andi Kleen, hugh, riel, akpm, chris.mason, linux-kernel,
	linux-mm, fengguang.wu

On Thu, May 28, 2009 at 02:08:54PM +0200, Nick Piggin wrote:
> > > > +			printk(KERN_ERR "MCE: Out of memory while machine check handling\n");
> > > > +			return;
> > > > +		}
> > > > +	}
> > > > +	tk->addr = page_address_in_vma(p, vma);
> > > > +	if (tk->addr == -EFAULT) {
> > > > +		printk(KERN_INFO "MCE: Failed to get address in VMA\n");
> > > 
> > > I don't know if this is a very helpful message. It could legitimately happen and
> > > nothing anybody can do about it...
> > 
> > Can you suggest a better message?
> 
> Well, for userspace, nothing? At the very least ratelimited, and preferably
> telling at a higher level what the problem and consequences are.

I changed it to 

 "MCE: Unable to determine user space address during error handling\n")

Still not perfect, but hopefully better.


> > > > +				flush_signal_handlers(tk->tsk, 1);
> > > 
> > > Is this a legitimate thing to do? Is it racy? Why would you not send a
> > > sigkill or something if you want them to die right now?
> > 
> > That's a very unlikely case; it could probably just be removed. It only
> > happens when something during unmapping fails (mostly out of memory).
> > 
> > It's more paranoia than real need.
> > 
> > Yes SIGKILL would be probably better.
> 
> OK, maybe just remove it? (keep simple first?)

I changed it to always do a SIGKILL.

> > > You have tasklist_lock(R) nesting outside i_mmap_lock, and inside anon_vma
> > > lock. And anon_vma lock nests inside i_mmap_lock.
> > > 
> > > This seems fragile. If rwlocks ever become FIFO or tasklist_lock changes
> > > type (maybe -rt kernels do it), then you could have a task holding
> > > anon_vma lock and waiting for tasklist_lock, and another holding tasklist
> > > lock and waiting for i_mmap_lock, and another holding i_mmap_lock and
> > > waiting for anon_vma lock.
> > 
> > So you're saying I should change the order?
> 
> Well I don't _think_ we have a dependency already. Yes I would just change
> the order to be either outside both VM locks or inside both. Maybe with
> a note that it does not really matter which order (in case another user
> comes up who needs the opposite ordering).

Ok. I can add a comment.

> > > > +	[RECOVERED] = "Recovered",
> > > 
> > > And what does recovered mean? THe processes were killed and the page taken
> > 
> > Not necessarily killed, it might have been a clean page or so.
> > 
> > > out of circulation, but the machine is still in some unknown state of corruption
> > > henceforth, right?
> > 
> > It's in a known state of corruption -- there was this error on that page
> > and otherwise it's fine (or at least no errors are known at this point).
> > The CPU generally tells you when it's in an unknown state, and in that case
> > this code is not executed; we just panic directly.
> 
> Then the data can not have been consumed, by DMA or otherwise? What

When the data is consumed we get a different machine check
(or a different error if it was consumed by an IO device).

This code right now just handles the case of "the CPU detected that a page
is broken, but hasn't consumed it yet".

> about transient kernel references to the (pagecache/anonymous) page
> (such as, find_get_page for read(2), or get_user_pages callers).	

There are always races; after all, the CPU could be just about to
consume the data right now. If we lose the race there will just be
another machine check that stops the consumption of bad data.
The hardware takes care of that.

The code here doesn't try to provide 100% coverage of all
cases (that's obviously impossible), just to handle
common page types. I also originally had ideas for more handlers,
but found out how hard they are to test, so I buried a lot of fancy
ideas :-)

If there are leftover references, we at least complain.

> > > > +	/*
> > > > +	 * remove_from_page_cache assumes (mapping && !mapped)
> > > > +	 */
> > > > +	if (page_mapping(p) && !page_mapped(p)) {
> > > > +		remove_from_page_cache(p);
> > > > +		page_cache_release(p);
> > > > +	}
> > > 
> > > remove_mapping would probably be a better idea. Otherwise you can
> > > probably introduce pagecache removal vs page fault races which
> > > will make the kernel bug.
> > 
> > Can you be more specific about the problems?
> 
> Hmm, actually now that we hold the page lock over __do_fault (at least
> for pagecache pages), this may not be able to trigger the race I was
> thinking of (page becoming mapped). But I think still it is better
> to use remove_mapping which is the standard way to remove such a page.

I had this originally, but Fengguang redid it because there was
trouble with the reference count. remove_mapping always expects it to
be 2, which we cannot guarantee.
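The refcount problem can be shown with a toy model. Here `struct toy_page` and `ref_freeze` are made-up stand-ins for illustration, not kernel API: the point is that an "expects exactly 2" style removal bails out as soon as any transient extra reference exists, which is exactly what cannot be guaranteed here:

```c
#include <assert.h>

/* Toy model of why a remove_mapping()-style helper is hard to use here:
 * it only succeeds when the refcount is exactly what the page cache
 * expects (2: one for the cache, one for the caller).  Any extra
 * transient reference makes it give up. */
struct toy_page {
	int refcount;
};

static int ref_freeze(struct toy_page *p, int expected)
{
	if (p->refcount != expected)
		return 0;	/* someone else holds a reference: bail out */
	p->refcount = 0;	/* safe to take the page out of circulation */
	return 1;
}
```

A page with only the expected references freezes fine; one with a transient extra reference (say, a concurrent get_page() somewhere) cannot be frozen, so the removal path has to cope with failure.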

> 
> BTW. I don't know if you are checking for PG_writeback often enough?
> You can't remove a PG_writeback page from pagecache. The normal
> pattern is lock_page(page); wait_on_page_writeback(page); which I

So pages can be in writeback without being locked? I still
wasn't able to find such a case (in fact, unless I'm misreading
the code badly, the writeback bit is only used by NFS and a few
obscure cases).

> think would be safest 

Okay. I'll just add it after the page lock.

> (then you never have to bother with the writeback bit again)

Until Fengguang does something fancy with it.

> > > > +	if (mapping) {
> > > > +		/*
> > > > +		 * Truncate does the same, but we're not quite the same
> > > > +		 * as truncate. Needs more checking, but keep it for now.
> > > > +		 */
> > > 
> > > What's different about truncate? It would be good to reuse as much as possible.
> > 
> > Truncating removes the block on disk (we don't). Truncating shrinks
> > the end of the file (we don't). It's more a "temporary hole punch".
> > Probably from the VM point of view it's very similar, but it's
> > not the same.
> 
> Right, I just mean the pagecache side of the truncate. So you should
> use truncate_inode_pages_range here.

Why?  I remember I was trying to use that function very early on but
there was some problem.  For one, it does its own locking, which
would conflict with ours.

Also we already do a lot of the stuff it does (like unmapping).

Is there anything concretely wrong with the current code?

> > > > +		cancel_dirty_page(p, PAGE_CACHE_SIZE);
> > > > +
> > > > +		/*
> > > > +		 * IO error will be reported by write(), fsync(), etc.
> > > > +		 * who check the mapping.
> > > > +		 */
> > > > +		mapping_set_error(mapping, EIO);
> > > 
> > > Interesting. It's not *exactly* an IO error (well, not like one we're usually
> > > used to).
> > 
> > It's a new kind, but conceptually it's the same. Dirty IO data got corrupted.
> 
> Well, the dirty data has never been corrupted before (ie. the data
> in pagecache has been OK). It was just unable to make it back to
> backing store. So a program could retry the write/fsync/etc or
> try to write the data somewhere else.

In theory it could, but in practice it is very unlikely it would.

> It kind of wants a new error code, but I can't imagine the difficulty
> in doing that...

I don't think it's a good idea to change programs for this normally;
most wouldn't anyway. Even kernel programmers have trouble with
memory suddenly going bad; for user space programmers it would
be pure voodoo.

The only special case is the new fancy SIGBUS, that was mainly done
for forwarding proper machine checks to KVM guests. In theory clever
programs could take advantage of that.

> > We actually had a lot of grief with the error reporting; a lot of
> > code does "report error once then clear from mapping", which
> > broke all the tests for that in the test suite. IMHO that's a shady
> > area in the kernel.
> 
> Yeah, it's annoying. I ran over this problem when auditing some
> data integrity problems in the kernel recently. IIRC even a
> misplaced sync from another process can come and suck up the IO
> error that your DBMS was expecting to get from fsync.
> 
> We kind of need another type of syscall to tell the kernel whether
> it can discard errors for a given file/range. But this is not the
> problem for your patch.

Yes, I decided to not try to address that.

> > I don't think the switch would be particularly simple. Also I like
> > tables.
> > 
> > > 
> > > And seeing as you are doing a lot of checking for various page flags anyway,
> > > (eg. in your prepare function). Just seems like needless complexity.
> > 
> > Yes that grew over time unfortunately. Originally there was very little
> > explicit flag checking.
> > 
> > I still think the table is a good approach. 
> 
> Just seems overengineered. We could rewrite any if/switch statement like
> that (and actually the compiler probably will if it is beneficial).

The reason I like it is that it separates the functions cleanly;
without it there would be a dispatcher from hell. Yes, it's a bit
ugly that there is a lot of manual bit checking around now too,
but as you get into all the corner cases, originally clean code
always tends to get uglier (and this is a really ugly problem).
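The table approach can be illustrated with a small userspace model. The flag names, outcome codes, and handlers below are simplified stand-ins for the patch's table, not the real kernel types: each entry names the bits that matter (mask), their required values (res), and a handler; the first matching entry wins, so specific states come before generic ones:

```c
#include <assert.h>

/* Userspace model of a table-driven page-state dispatcher.  All names
 * and handlers are simplified stand-ins for illustration only. */

enum { PG_dirty = 1 << 0, PG_swapbacked = 1 << 1 };
enum outcome { FAILED, DELAYED, RECOVERED };

struct page { unsigned long flags; };

static enum outcome me_pagecache_clean(struct page *p) { (void)p; return RECOVERED; }
static enum outcome me_pagecache_dirty(struct page *p) { (void)p; return RECOVERED; }
static enum outcome me_swapcache_dirty(struct page *p) { (void)p; return DELAYED; }

/* First matching (mask, res) pair wins: specific states before generic. */
static const struct error_state {
	unsigned long mask;
	unsigned long res;
	enum outcome (*action)(struct page *);
} error_states[] = {
	{ PG_swapbacked | PG_dirty, PG_swapbacked | PG_dirty, me_swapcache_dirty },
	{ PG_dirty,                 PG_dirty,                 me_pagecache_dirty },
	{ 0,                        0,                        me_pagecache_clean }, /* catch-all */
};

static enum outcome handle_page(struct page *p)
{
	int n = sizeof(error_states) / sizeof(error_states[0]);

	for (int i = 0; i < n; i++)
		if ((p->flags & error_states[i].mask) == error_states[i].res)
			return error_states[i].action(p);
	return FAILED;
}
```

Adding a new page type is one table row plus one handler; the dispatch loop never changes, which is the separation argued for above.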

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>


* Re: [PATCH] [13/16] HWPOISON: The high level memory error handler in the VM v3
  2009-05-28 12:15               ` Nick Piggin
@ 2009-05-28 13:48                 ` Andi Kleen
  -1 siblings, 0 replies; 232+ messages in thread
From: Andi Kleen @ 2009-05-28 13:48 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andi Kleen, Wu Fengguang, hugh, riel, akpm, chris.mason,
	linux-kernel, linux-mm

On Thu, May 28, 2009 at 02:15:41PM +0200, Nick Piggin wrote:
> For correctness for what? You can't remove a page from swapcache or
> pagecache under writeback because then the mm thinks that location
> is not being used.

I'm adding wait_on_page_writeback() to memory_failure(), so
it will hopefully be out of the picture.

> 
>  
> > I'm mainly interested in correctness (as in not crashing) of this
> > version now.
> > 
> > Also writeback seems to be only used by nfs/afs/nilfs2, not in
> > the normal case, unless I'm misreading the code. 
> 
> I don't follow. What writeback are you talking about?

Sorry, I misread the code; it's indeed used more commonly.


> > > Then the writeback pages simply won't reach here. And it won't
> > > magically go into writeback state, since the page has been locked.
> > 
> > But since we take the page lock they should not be in writeback anyways,
> > no?
> 
> No. PG_writeback was introduced so as to reduce page lock hold
> times (most of writeback runs without page lock held).

Ok. Then the wait_on_page_writeback() will take care of that.

Thanks for the feedback,

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.


* Re: [PATCH] [13/16] HWPOISON: The high level memory error handler in the VM v3
  2009-05-28 12:23         ` Nick Piggin
@ 2009-05-28 13:54           ` Wu Fengguang
  -1 siblings, 0 replies; 232+ messages in thread
From: Wu Fengguang @ 2009-05-28 13:54 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andi Kleen, hugh, riel, akpm, chris.mason, linux-kernel, linux-mm

On Thu, May 28, 2009 at 08:23:57PM +0800, Nick Piggin wrote:
> On Thu, May 28, 2009 at 05:59:34PM +0800, Wu Fengguang wrote:
> > Hi Nick,
> > 
> > > > +     /*
> > > > +      * remove_from_page_cache assumes (mapping && !mapped)
> > > > +      */
> > > > +     if (page_mapping(p) && !page_mapped(p)) {
> > > > +             remove_from_page_cache(p);
> > > > +             page_cache_release(p);
> > > > +     }
> > > 
> > > remove_mapping would probably be a better idea. Otherwise you can
> > > probably introduce pagecache removal vs page fault races which
> > > will make the kernel bug.
> > 
> > We used remove_mapping() at first, then discovered that it makes a strong
> > assumption on page_count=2.
> > 
> > I guess it is safe from races since we are locking the page?
> 
> Yes it probably should (although you will lose get_user_pages data, but
> I guess that's the aim anyway).

Yes. We (and truncate) rely heavily on this logic:

        retry:
                lock_page(page);
                if (page->mapping == NULL) {
                        unlock_page(page);
                        goto retry;
                }
                /* do something on page */
                unlock_page(page);

So that we can steal/isolate a page under its page lock.
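A self-contained sketch of that steal-under-lock protocol (toy_page and the helpers are made-up stand-ins; a plain flag plays the role of the page lock in this single-threaded model): ->mapping is only cleared and tested while the page is locked, so whoever clears it first wins and everyone else can tell the page is gone:

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for struct page: 'locked' plays the role of PG_locked,
 * and a non-NULL mapping means the page is still in the page cache. */
struct toy_page {
	int locked;
	void *mapping;
};

static void toy_lock_page(struct toy_page *p)
{
	assert(!p->locked);	/* single-threaded model: must not be held */
	p->locked = 1;
}

static void toy_unlock_page(struct toy_page *p)
{
	p->locked = 0;
}

/* Isolate the page: clear ->mapping under the page lock.  Anyone who
 * locks the page afterwards sees mapping == NULL and knows it was
 * stolen.  Returns 1 if we did the steal, 0 if someone beat us to it. */
static int steal_page(struct toy_page *p)
{
	int stolen = 0;

	toy_lock_page(p);
	if (p->mapping != NULL) {
		p->mapping = NULL;
		stolen = 1;
	}
	toy_unlock_page(p);
	return stolen;
}
```

Exactly one caller observes a non-NULL mapping and does the steal; all later lockers see NULL and back off, which is what makes the retry loop above safe.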

The truncate code does wait on writeback pages, but we would like to
isolate the page ASAP, so as to avoid someone finding it in the page
cache (or swap cache) and then accessing its content.

I see no obvious problems with isolating a writeback page from the page
cache or swap cache. But I'm also not sure it won't break some assumption
in some corner of the kernel.

> But I just don't like this one file having all that required knowledge

Yes that's a big problem.

One major complexity involves classifying the page into different known
types by testing page flags, page_mapping, page_mapped, etc. This
is not avoidable.

Another major complexity is calling the isolation routines to
remove references from
        - PTE
        - page cache
        - swap cache
        - LRU list
They more or less make some assumptions about their operating environment
that we have to take care of.  Unfortunately these complexities are
also not easily resolvable.

> (and few comments) of all the files in mm/. If you want to get rid

I promise I'll add more comments :)

> of the page and don't care what it's count or dirtyness is, then
> truncate_inode_pages_range is the correct API to use.
>
> (or you could extract out some of it so you can call it directly on
> individual locked pages, if that helps).
 
The patch to move over to truncate_complete_page() would look like this.
It's not a big win indeed.

---
 mm/memory-failure.c |   14 ++++++--------
 1 file changed, 6 insertions(+), 8 deletions(-)

--- sound-2.6.orig/mm/memory-failure.c
+++ sound-2.6/mm/memory-failure.c
@@ -327,20 +327,18 @@ static int me_pagecache_clean(struct pag
 	if (!isolate_lru_page(p))
 		page_cache_release(p);
 
-	if (page_has_private(p))
-		do_invalidatepage(p, 0);
-	if (page_has_private(p) && !try_to_release_page(p, GFP_NOIO))
-		Dprintk(KERN_ERR "MCE %#lx: failed to release buffers\n",
-			page_to_pfn(p));
-
 	/*
 	 * remove_from_page_cache assumes (mapping && !mapped)
 	 */
 	if (page_mapping(p) && !page_mapped(p)) {
-		remove_from_page_cache(p);
-		page_cache_release(p);
+		ClearPageMlocked(p);
+		truncate_complete_page(p->mapping, p);
 	}
 
+	if (page_has_private(p) && !try_to_release_page(p, GFP_NOIO))
+		Dprintk(KERN_ERR "MCE %#lx: failed to release buffers\n",
+			page_to_pfn(p));
+
 	return RECOVERED;
 }
 

> > > > +     }
> > > > +
> > > > +     me_pagecache_clean(p);
> > > > +
> > > > +     /*
> > > > +      * Did the earlier release work?
> > > > +      */
> > > > +     if (page_has_private(p) && !try_to_release_page(p, GFP_NOIO))
> > > > +             return FAILED;
> > > > +
> > > > +     return RECOVERED;
> > > > +}
> > > > +
> > > > +/*
> > > > + * Clean and dirty swap cache.
> > > > + */
> > > > +static int me_swapcache_dirty(struct page *p)
> > > > +{
> > > > +     ClearPageDirty(p);
> > > > +
> > > > +     if (!isolate_lru_page(p))
> > > > +             page_cache_release(p);
> > > > +
> > > > +     return DELAYED;
> > > > +}
> > > > +
> > > > +static int me_swapcache_clean(struct page *p)
> > > > +{
> > > > +     ClearPageUptodate(p);
> > > > +
> > > > +     if (!isolate_lru_page(p))
> > > > +             page_cache_release(p);
> > > > +
> > > > +     delete_from_swap_cache(p);
> > > > +
> > > > +     return RECOVERED;
> > > > +}
> > > 
> > > All these handlers are quite interesting in that they need to
> > > know about most of the mm. What are you trying to do in each
> > > of them would be a good idea to say, and probably they should
> > > rather go into their appropriate files instead of all here
> > > (eg. swapcache stuff should go in mm/swap_state for example).
> > 
> > Yup, they at least need more careful comments.
> > 
> > Dirty swap cache page is tricky to handle. The page could live both in page
> > cache and swap cache(ie. page is freshly swapped in). So it could be referenced
> > concurrently by 2 types of PTEs: one normal PTE and another swap PTE. We try to
> > handle them consistently by calling try_to_unmap(TTU_IGNORE_HWPOISON) to convert
> > the normal PTEs to swap PTEs, and then
> >         - clear dirty bit to prevent IO
> >         - remove from LRU
> >         - but keep in the swap cache, so that when we return to it on
> >           a later page fault, we know the application is accessing
> >           corrupted data and shall be killed (we installed simple
> >           interception code in do_swap_page to catch it).
> 
> OK this is the point I was missing.
> 
> Should all be commented and put into mm/swap_state.c (or somewhere that
> Hugh prefers).

But I doubt Hugh will welcome moving those bits into swap*.c ;)

> 
> > Clean swap cache pages can be directly isolated. A later page fault will bring
> > in the known good data from disk.
> 
> OK, but why do you ClearPageUptodate if it is just to be deleted from
> swapcache anyway?

The ClearPageUptodate() is kind of a careless addition, in the hope
that it will stop some random readers. It needs more investigation.

> > > You haven't waited on writeback here AFAIKS, and have you
> > > *really* verified it is safe to call delete_from_swap_cache?
> > 
> > Good catch. I'll soon submit patches for handling pages under
> > read/write IO. In this patchset they are simply ignored.
> 
> Well that's quite important ;) I would suggest you just wait_on_page_writeback.
> It is simple and should work. _Unless_ you can show it is a big problem that
> needs equivalently big mes to fix ;)

Yes, we could do wait_on_page_writeback() if necessary. The downside is
that keeping a writeback page in the page cache opens a small time window
for someone to access the page.

Thanks,
Fengguang


* Re: [PATCH] [13/16] HWPOISON: The high level memory error handler in the VM v3
  2009-05-28 13:45           ` Andi Kleen
@ 2009-05-28 14:50             ` Wu Fengguang
  -1 siblings, 0 replies; 232+ messages in thread
From: Wu Fengguang @ 2009-05-28 14:50 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Nick Piggin, hugh, riel, akpm, chris.mason, linux-kernel, linux-mm

On Thu, May 28, 2009 at 09:45:20PM +0800, Andi Kleen wrote:
> On Thu, May 28, 2009 at 02:08:54PM +0200, Nick Piggin wrote:

[snip]

> > 
> > BTW. I don't know if you are checking for PG_writeback often enough?
> > You can't remove a PG_writeback page from pagecache. The normal
> > pattern is lock_page(page); wait_on_page_writeback(page); which I
> 
> So pages can be in writeback without being locked? I still
> wasn't able to find such a case (in fact unless I'm misreading
> the code badly the writeback bit is only used by NFS and a few  
> obscure cases)

Yes, a writeback page is typically not locked. Only read IO requires
the page to be exclusive. Read IO is in fact a page *writer*, while
writeback IO is a page *reader* :-)

The writeback bit is _widely_ used.  test_set_page_writeback() is
directly used by NFS/AFS etc. But its main user is in fact
set_page_writeback(), which is called in 26 places.

> > think would be safest 
> 
> Okay. I'll just add it after the page lock.
> 
> > (then you never have to bother with the writeback bit again)
> 
> Until Fengguang does something fancy with it.

Yes I'm going to do it without wait_on_page_writeback().

The reason truncate_inode_pages_range() has to wait on writeback pages
is to ensure data integrity. Otherwise, suppose two events occur:
        truncate page A at offset X
        populate page B at offset X
If A and B are both writeback pages, then B can hit the disk first and
then be overwritten by A, which corrupts the data at offset X from the
user's POV.

But for hwpoison, there are no such worries. If A is poisoned, we do
our best to isolate it as well as intercept its IO. If the interception
fails, it will trigger another machine check before hitting the disk.

After all, a poisoned A means the data at offset X is already corrupted.
It doesn't matter if another page B comes along.

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 232+ messages in thread

* Re: [PATCH] [13/16] HWPOISON: The high level memory error handler in the VM v3
  2009-05-28 13:45           ` Andi Kleen
@ 2009-05-28 16:56             ` Russ Anderson
  -1 siblings, 0 replies; 232+ messages in thread
From: Russ Anderson @ 2009-05-28 16:56 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Nick Piggin, hugh, riel, akpm, chris.mason, linux-kernel,
	linux-mm, fengguang.wu, rja

On Thu, May 28, 2009 at 03:45:20PM +0200, Andi Kleen wrote:
> On Thu, May 28, 2009 at 02:08:54PM +0200, Nick Piggin wrote:
> > > > > +			printk(KERN_ERR "MCE: Out of memory while machine check handling\n");
> > > > > +			return;
> > > > > +		}
> > > > > +	}
> > > > > +	tk->addr = page_address_in_vma(p, vma);
> > > > > +	if (tk->addr == -EFAULT) {
> > > > > +		printk(KERN_INFO "MCE: Failed to get address in VMA\n");
> > > > 
> > > > I don't know if this is a very helpful message. It could legitimately happen and
> > > > there's nothing anybody can do about it...
> > > 
> > > Can you suggest a better message?
> > 
> > Well, for userspace, nothing? At the very least ratelimited, and preferably
> > telling at a higher level what the problem and consequences are.
> 
> I changed it to 
> 
>  "MCE: Unable to determine user space address during error handling\n")
> 
> Still not perfect, but hopefully better.

Is it even worth having a message at all?  Does the fact that page_address_in_vma()
failed change the behavior in any way?  (Does tk->addr == 0 matter?)  From
a quick scan of the code I do not believe it does.

If the message is for developers/debugging, it would be nice to have more
information, such as why page_address_in_vma() returned -EFAULT.  If
that is important, page_address_in_vma() should return a different failure
status for each of the three failing conditions.  But that would only
be needed if the code were (potentially) going to do some additional handling.


Thanks,
-- 
Russ Anderson, OS RAS/Partitioning Project Lead  
SGI - Silicon Graphics Inc          rja@sgi.com

^ permalink raw reply	[flat|nested] 232+ messages in thread

* Re: [PATCH] [6/16] HWPOISON: Add basic support for poisoned pages in fault handler v2
  2009-05-27 20:12   ` Andi Kleen
@ 2009-05-29  4:15     ` Hidehiro Kawai
  -1 siblings, 0 replies; 232+ messages in thread
From: Hidehiro Kawai @ 2009-05-29  4:15 UTC (permalink / raw)
  To: Andi Kleen
  Cc: akpm, linux-kernel, linux-mm, fengguang.wu, Satoshi OSHIMA,
	Taketoshi Sakuraba

Andi Kleen wrote:

> - Add a new VM_FAULT_HWPOISON error code to handle_mm_fault. Right now
> architectures have to explicitly enable poison page support, so
> this is forward compatible with all architectures. They only need
> to add it when they enable poison page support.
> - Add poison page handling in swap in fault code
> 
> v2: Add missing delayacct_clear_flag (Hidehiro Kawai)

[snip]

>  		goto out;
>  	}
>  	delayacct_set_flag(DELAYACCT_PF_SWAPIN);
> @@ -2484,6 +2492,10 @@
>  		/* Had to read the page from swap area: Major fault */
>  		ret = VM_FAULT_MAJOR;
>  		count_vm_event(PGMAJFAULT);
> +	} else if (PageHWPoison(page)) {
> +		ret = VM_FAULT_HWPOISON;
> +		delayacct_set_flag(DELAYACCT_PF_SWAPIN);
> +		goto out;

Is this delayacct_clear_flag()? :-p

Regards,
-- 
Hidehiro Kawai
Hitachi, Systems Development Laboratory
Linux Technology Center


^ permalink raw reply	[flat|nested] 232+ messages in thread

* Re: [PATCH] [6/16] HWPOISON: Add basic support for poisoned pages in fault handler v2
  2009-05-29  4:15     ` Hidehiro Kawai
@ 2009-05-29  6:28       ` Andi Kleen
  -1 siblings, 0 replies; 232+ messages in thread
From: Andi Kleen @ 2009-05-29  6:28 UTC (permalink / raw)
  To: Hidehiro Kawai
  Cc: Andi Kleen, akpm, linux-kernel, linux-mm, fengguang.wu,
	Satoshi OSHIMA, Taketoshi Sakuraba

> Is this delayacct_clear_flag()? :-p

Hmpf.... Thanks. Fixed. Sorry about this.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 232+ messages in thread

* Re: [PATCH] [1/16] HWPOISON: Add page flag for poisoned pages
  2009-05-28  7:54       ` Andi Kleen
@ 2009-05-29 16:10         ` Rik van Riel
  -1 siblings, 0 replies; 232+ messages in thread
From: Rik van Riel @ 2009-05-29 16:10 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Alan Cox, akpm, linux-kernel, linux-mm, fengguang.wu

Andi Kleen wrote:
> On Wed, May 27, 2009 at 10:15:10PM +0100, Alan Cox wrote:
>> On Wed, 27 May 2009 22:12:26 +0200 (CEST)
>> Andi Kleen <andi@firstfloor.org> wrote:
>>
>>> Hardware poisoned pages need special handling in the VM and shouldn't be 
>>> touched again. This requires a new page flag. Define it here.
>> Why can't you use PG_reserved ? That already indicates the page may not
>> even be present (which is effectively your situation at that point).
> 
> Right now a page must be present with PG_reserved, otherwise /dev/mem, /proc/kcore
> and lots of other things will explode.

Could we use a combination of, say PG_reserved and
PG_writeback to keep /dev/mem and /proc/kcore from
exploding ?

-- 
All rights reversed.

^ permalink raw reply	[flat|nested] 232+ messages in thread

* Re: [PATCH] [1/16] HWPOISON: Add page flag for poisoned pages
  2009-05-29 16:37           ` Andi Kleen
@ 2009-05-29 16:34             ` Rik van Riel
  -1 siblings, 0 replies; 232+ messages in thread
From: Rik van Riel @ 2009-05-29 16:34 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Alan Cox, akpm, linux-kernel, linux-mm, fengguang.wu

Andi Kleen wrote:
> On Fri, May 29, 2009 at 12:10:24PM -0400, Rik van Riel wrote:
>> Andi Kleen wrote:
>>> On Wed, May 27, 2009 at 10:15:10PM +0100, Alan Cox wrote:
>>>> On Wed, 27 May 2009 22:12:26 +0200 (CEST)
>>>> Andi Kleen <andi@firstfloor.org> wrote:
>>>>
>>>>> Hardware poisoned pages need special handling in the VM and shouldn't be 
>>>>> touched again. This requires a new page flag. Define it here.
>>>> Why can't you use PG_reserved ? That already indicates the page may not
>>>> even be present (which is effectively your situation at that point).
>>> Right now a page must be present with PG_reserved, otherwise /dev/mem, 
>>> /proc/kcore
>>> lots of other things will explode.
>> Could we use a combination of, say PG_reserved and
>> PG_writeback to keep /dev/mem and /proc/kcore from
>> exploding ?
> 
> They should just check for poisoned pages. 

#define PagePoisoned(page) (PageReserved(page) && PageWriteback(page))

-- 
All rights reversed.

^ permalink raw reply	[flat|nested] 232+ messages in thread

* Re: [PATCH] [2/16] HWPOISON: Export poison flag in /proc/kpageflags
  2009-05-27 20:12   ` Andi Kleen
@ 2009-05-29 16:37     ` Rik van Riel
  -1 siblings, 0 replies; 232+ messages in thread
From: Rik van Riel @ 2009-05-29 16:37 UTC (permalink / raw)
  To: Andi Kleen; +Cc: fengguang.wu, akpm, linux-kernel, linux-mm

Andi Kleen wrote:
> From: Fengguang Wu <fengguang.wu@intel.com>
> 
> Export the new poison flag in /proc/kpageflags. Poisoned pages are moderately
> interesting even for administrators, so export them here. Also useful
> for debugging.
> 
> AK: I extracted this out of a larger patch from Fengguang Wu.
> 
> Signed-off-by: Andi Kleen <ak@linux.intel.com>

OK, this could be a good reason for the use of the PG_poisoned page
flag in patch 1/16.

Acked-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed.

^ permalink raw reply	[flat|nested] 232+ messages in thread

* Re: [PATCH] [1/16] HWPOISON: Add page flag for poisoned pages
  2009-05-29 16:10         ` Rik van Riel
@ 2009-05-29 16:37           ` Andi Kleen
  -1 siblings, 0 replies; 232+ messages in thread
From: Andi Kleen @ 2009-05-29 16:37 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Andi Kleen, Alan Cox, akpm, linux-kernel, linux-mm, fengguang.wu

On Fri, May 29, 2009 at 12:10:24PM -0400, Rik van Riel wrote:
> Andi Kleen wrote:
> >On Wed, May 27, 2009 at 10:15:10PM +0100, Alan Cox wrote:
> >>On Wed, 27 May 2009 22:12:26 +0200 (CEST)
> >>Andi Kleen <andi@firstfloor.org> wrote:
> >>
> >>>Hardware poisoned pages need special handling in the VM and shouldn't be 
> >>>touched again. This requires a new page flag. Define it here.
> >>Why can't you use PG_reserved ? That already indicates the page may not
> >>even be present (which is effectively your situation at that point).
> >
> >Right now a page must be present with PG_reserved, otherwise /dev/mem, 
> >/proc/kcore
> >lots of other things will explode.
> 
> Could we use a combination of, say PG_reserved and
> PG_writeback to keep /dev/mem and /proc/kcore from
> exploding ?

They should just check for poisoned pages. Fengguang has some patches
to add checks, but they need more work before they can be merged.

The interesting part is also fixing vmcore, as in memory outside
your memory.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 232+ messages in thread

* Re: [PATCH] [1/16] HWPOISON: Add page flag for poisoned pages
  2009-05-29 16:34             ` Rik van Riel
@ 2009-05-29 18:24               ` Andi Kleen
  -1 siblings, 0 replies; 232+ messages in thread
From: Andi Kleen @ 2009-05-29 18:24 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Andi Kleen, Alan Cox, akpm, linux-kernel, linux-mm, fengguang.wu

On Fri, May 29, 2009 at 12:34:32PM -0400, Rik van Riel wrote:
> >They should just check for poisoned pages. 
> 
> #define PagePoisoned(page) (PageReserved(page) && PageWriteback(page))

I don't know what the point of that would be. An exercise in code
obfuscation?

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 232+ messages in thread

* Re: [PATCH] [1/16] HWPOISON: Add page flag for poisoned pages
  2009-05-29 18:24               ` Andi Kleen
@ 2009-05-29 18:26                 ` Rik van Riel
  -1 siblings, 0 replies; 232+ messages in thread
From: Rik van Riel @ 2009-05-29 18:26 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Alan Cox, akpm, linux-kernel, linux-mm, fengguang.wu

Andi Kleen wrote:
> On Fri, May 29, 2009 at 12:34:32PM -0400, Rik van Riel wrote:
>>> They should just check for poisoned pages. 
>> #define PagePoisoned(page) (PageReserved(page) && PageWriteback(page))
> 
> I don't know what the point of that would be. An exercise in code
> obfuscation?

Saving a page flag.

-- 
All rights reversed.

^ permalink raw reply	[flat|nested] 232+ messages in thread

* Re: [PATCH] [1/16] HWPOISON: Add page flag for poisoned pages
  2009-05-29 18:26                 ` Rik van Riel
@ 2009-05-29 18:42                   ` Andi Kleen
  -1 siblings, 0 replies; 232+ messages in thread
From: Andi Kleen @ 2009-05-29 18:42 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Andi Kleen, Alan Cox, akpm, linux-kernel, linux-mm, fengguang.wu

On Fri, May 29, 2009 at 02:26:05PM -0400, Rik van Riel wrote:
> Andi Kleen wrote:
> >On Fri, May 29, 2009 at 12:34:32PM -0400, Rik van Riel wrote:
> >>>They should just check for poisoned pages. 
> >>#define PagePoisoned(page) (PageReserved(page) && PageWriteback(page))
> >
> >I don't know what the point of that would be. An exercise in code
> >obfuscation?
> 
> Saving a page flag.

It seems pointless to me. 64bit has enough space, and 32bit just puts
fewer node bits into ->flags.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 232+ messages in thread

* Re: [PATCH] [13/16] HWPOISON: The high level memory error handler in the VM v3
  2009-05-28 16:56             ` Russ Anderson
@ 2009-05-30  6:42               ` Andi Kleen
  -1 siblings, 0 replies; 232+ messages in thread
From: Andi Kleen @ 2009-05-30  6:42 UTC (permalink / raw)
  To: Russ Anderson
  Cc: Andi Kleen, Nick Piggin, hugh, riel, akpm, chris.mason,
	linux-kernel, linux-mm, fengguang.wu

Sorry for the late answer, the email slipped out earlier.

On Thu, May 28, 2009 at 11:56:25AM -0500, Russ Anderson wrote:
> > I changed it to 
> > 
> >  "MCE: Unable to determine user space address during error handling\n")
> > 
> > Still not perfect, but hopefully better.
> 
> Is it even worth having a message at all?  Does the fact that page_address_in_vma()

I like having a message so that I can see when it happens.

> failed change the behavior in any way?  (Does tk->addr == 0 matter?)  From

It just doesn't report an address to the user (or rather reports 0).

> If the message is for developers/debugging, it would be nice to have more

It's not really for debugging only, it's a legitimate case. Typically
it happens when the process unmaps or remaps in parallel. Of course, when
it is currently unmapping you could argue it doesn't need the data anymore
and doesn't need to be killed (that's true), but that doesn't work for
mremap()ing. I considered at some point looping, but that would risk
livelock. So it just prints and reports nothing.

The only ugly part is the ambiguity of reporting a 0 address (in theory
there could be real memory at virtual address 0), but that didn't seem to
be enough of an issue to fix.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 232+ messages in thread

* Re: [PATCH] [13/16] HWPOISON: The high level memory error handler in the VM v3
  2009-05-30  6:42               ` Andi Kleen
@ 2009-06-01 11:39                 ` Nick Piggin
  -1 siblings, 0 replies; 232+ messages in thread
From: Nick Piggin @ 2009-06-01 11:39 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Russ Anderson, hugh, riel, akpm, chris.mason, linux-kernel,
	linux-mm, fengguang.wu

On Sat, May 30, 2009 at 08:42:44AM +0200, Andi Kleen wrote:
> Sorry for late answer, email slipped out earlier.
> 
> On Thu, May 28, 2009 at 11:56:25AM -0500, Russ Anderson wrote:
> > > I changed it to 
> > > 
> > >  "MCE: Unable to determine user space address during error handling\n")
> > > 
> > > Still not perfect, but hopefully better.
> > 
> > Is it even worth having a message at all?  Does the fact that page_address_in_vma()
> 
> I like having a message so that I can see when it happens.
> 
> > failed change the behavior in any way?  (Does tk->addr == 0 matter?)  From
> 
> It just doesn't report an address to the user (or rather 0)
> 
> > If the message is for developers/debugging, it would be nice to have more
> 
> It's not really for debugging only, it's a legitimate case. Typically
> when the process unmaps or remaps in parallel. Of course when it currently
> unmaps you could argue it doesn't need the data anymore and doesn't
> need to be killed (that's true), but that doesn't work for mremap()ing.
> I considered at some point to loop, but that would risk live lock.
> So it just prints and reports nothing.
> 
> The only ugly part is the ambiguity of reporting a 0 address (in theory
> there could be real memory 0 on virtual 0), but that didn't seem to be
> enough an issue to fix.

This just isn't something we typically print a dmesg message for.

Surely you can test these cases with debugging code or printks
and then take them out of the production code?


^ permalink raw reply	[flat|nested] 232+ messages in thread

* Re: [PATCH] [13/16] HWPOISON: The high level memory error handler in the VM v3
  2009-05-28 13:54           ` Wu Fengguang
@ 2009-06-01 11:50             ` Nick Piggin
  -1 siblings, 0 replies; 232+ messages in thread
From: Nick Piggin @ 2009-06-01 11:50 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andi Kleen, hugh, riel, akpm, chris.mason, linux-kernel, linux-mm

On Thu, May 28, 2009 at 09:54:28PM +0800, Wu Fengguang wrote:
> On Thu, May 28, 2009 at 08:23:57PM +0800, Nick Piggin wrote:
> > On Thu, May 28, 2009 at 05:59:34PM +0800, Wu Fengguang wrote:
> > > Hi Nick,
> > > 
> > > > > +     /*
> > > > > +      * remove_from_page_cache assumes (mapping && !mapped)
> > > > > +      */
> > > > > +     if (page_mapping(p) && !page_mapped(p)) {
> > > > > +             remove_from_page_cache(p);
> > > > > +             page_cache_release(p);
> > > > > +     }
> > > > 
> > > > remove_mapping would probably be a better idea. Otherwise you can
> > > > probably introduce pagecache removal vs page fault races which
> > > > will make the kernel bug.
> > > 
> > > We used remove_mapping() at first, then discovered that it makes a strong
> > > assumption of page_count == 2.
> > > 
> > > I guess it is safe from races since we are locking the page?
> > 
> > Yes it probably should (although you will lose get_user_pages data, but
> > I guess that's the aim anyway).
> 
> Yes. We (and truncate) rely heavily on this logic:
> 
>         retry:
>                 lock_page(page);
>                 if (page->mapping == NULL)
>                         goto retry;
>                 // do something on page
>                 unlock_page(page);
> 
> So that we can steal/isolate a page under its page lock.
> 
> The truncate code does wait on a writeback page, but we would like to
> isolate the page ASAP, so as to avoid someone finding it in the page
> cache (or swap cache) and then accessing its content.
> 
> I see no obvious problem with isolating a writeback page from the page
> cache or swap cache. But I'm also not sure it won't break some assumption
> in some corner of the kernel.

The problem is that then you have lost synchronization in the
pagecache. Nothing then prevents a new page from being put
in there and trying to do IO to or from the same device as the
currently running writeback.

 
> > But I just don't like this one file having all that required knowledge
> 
> Yes that's a big problem.
> 
> One major complexity involves classifying the page into the different known
> types, by testing page flags, page_mapping, page_mapped, etc. This
> is not avoidable.

No.

 
> Another major complexity is calling the isolation routines to
> remove references from
>         - PTE
>         - page cache
>         - swap cache
>         - LRU list
> They more or less make some assumptions about their operating environment
> that we have to take care of.  Unfortunately these complexities are
> also not easily resolvable.
> 
> > (and few comments) of all the files in mm/. If you want to get rid
> 
> I promise I'll add more comments :)

OK, but they should still go in their relevant files. Or as best as
possible. Right now it's just silly to have all this here when much
of it could be moved out to filemap.c, swap_state.c, page_alloc.c, etc.


> > of the page and don't care what it's count or dirtyness is, then
> > truncate_inode_pages_range is the correct API to use.
> >
> > (or you could extract out some of it so you can call it directly on
> > individual locked pages, if that helps).
>  
> The patch to move over to truncate_complete_page() would look like this.
> It's not a big win indeed.

No I don't mean to do this, but to move the truncate_inode_pages
code for truncating a single, locked, page into another function
in mm/truncate.c and then call that from here.

> 
> ---
>  mm/memory-failure.c |   14 ++++++--------
>  1 file changed, 6 insertions(+), 8 deletions(-)
> 
> --- sound-2.6.orig/mm/memory-failure.c
> +++ sound-2.6/mm/memory-failure.c
> @@ -327,20 +327,18 @@ static int me_pagecache_clean(struct pag
>  	if (!isolate_lru_page(p))
>  		page_cache_release(p);
>  
> -	if (page_has_private(p))
> -		do_invalidatepage(p, 0);
> -	if (page_has_private(p) && !try_to_release_page(p, GFP_NOIO))
> -		Dprintk(KERN_ERR "MCE %#lx: failed to release buffers\n",
> -			page_to_pfn(p));
> -
>  	/*
>  	 * remove_from_page_cache assumes (mapping && !mapped)
>  	 */
>  	if (page_mapping(p) && !page_mapped(p)) {
> -		remove_from_page_cache(p);
> -		page_cache_release(p);
> +		ClearPageMlocked(p);
> +		truncate_complete_page(p->mapping, p);
>  	}
>  
> +	if (page_has_private(p) && !try_to_release_page(p, GFP_NOIO))
> +		Dprintk(KERN_ERR "MCE %#lx: failed to release buffers\n",
> +			page_to_pfn(p));
> +
>  	return RECOVERED;
>  }
>  
> 
> > OK this is the point I was missing.
> > 
> > Should all be commented and put into mm/swap_state.c (or somewhere that
> > Hugh prefers).
> 
> But I doubt Hugh will welcome moving those bits into swap*.c ;)

Why not? If he has to look at it anyway, he probably rather looks
at fewer files :)

 
> > > Clean swap cache pages can be directly isolated. A later page fault will bring
> > > in the known good data from disk.
> > 
> > OK, but why do you ClearPageUptodate if it is just to be deleted from
> > swapcache anyway?
> 
> The ClearPageUptodate() is kind of a careless addition, in the hope
> that it will stop some random readers. Need more investigations.

OK. But it just muddies the waters in the meantime, so maybe take
such things out until there is a case for them.

 
> > > > You haven't waited on writeback here AFAIKS, and have you
> > > > *really* verified it is safe to call delete_from_swap_cache?
> > > 
> > > Good catch. I'll soon submit patches for handling the under
> > > read/write IO pages. In this patchset they are simply ignored.
> > 
> > Well that's quite important ;) I would suggest you just wait_on_page_writeback.
> > It is simple and should work. _Unless_ you can show it is a big problem that
> > needs equivalently big measures to fix ;)
> 
> Yes we could do wait_on_page_writeback() if necessary. The downside is,
> keeping a writeback page in the page cache opens a small time window for
> someone to access the page.

AFAIKS there already is such a window? You're doing lock_page and such.
No, it seems rather insane to do something like this here that no other
code in the mm ever does.


^ permalink raw reply	[flat|nested] 232+ messages in thread

* Re: [PATCH] [13/16] HWPOISON: The high level memory error handler in the VM v3
  2009-05-28 13:45           ` Andi Kleen
@ 2009-06-01 12:05             ` Nick Piggin
  -1 siblings, 0 replies; 232+ messages in thread
From: Nick Piggin @ 2009-06-01 12:05 UTC (permalink / raw)
  To: Andi Kleen
  Cc: hugh, riel, akpm, chris.mason, linux-kernel, linux-mm, fengguang.wu

On Thu, May 28, 2009 at 03:45:20PM +0200, Andi Kleen wrote:
> On Thu, May 28, 2009 at 02:08:54PM +0200, Nick Piggin wrote:
> > Then the data can not have been consumed, by DMA or otherwise? What
> 
> When the data was consumed we get a different machine check
> (or a different error if it was consumed by a IO device)
> 
> This code right now just handles the case of "CPU detected that a page
> is broken, but hasn't consumed it yet"

OK. Out of curiosity, how often do you expect to see uncorrectable ECC
errors?

 
> > about transient kernel references to the (pagecache/anonymous) page
> > (such as, find_get_page for read(2), or get_user_pages callers).	
> 
> There are always races, after all the CPU could be just about
> to consume it right now. If we lose the race there will be just
> another machine check that stops the consumption of bad data.
> The hardware takes care of that.
> 
> The code here doesn't try to be a 100% coverage of all
> cases (that's obviously impossible), just to handle
> common page types. I also originally had ideas for more handlers,
> but found out how hard it is to test, so I buried a lot of fancy
> ideas :-)
> 
> If there are left over references we complain at least.

Well I don't know about that, because you'd probably have races
between lookups and taking a reference. But never mind.

 
> > > > > +	/*
> > > > > +	 * remove_from_page_cache assumes (mapping && !mapped)
> > > > > +	 */
> > > > > +	if (page_mapping(p) && !page_mapped(p)) {
> > > > > +		remove_from_page_cache(p);
> > > > > +		page_cache_release(p);
> > > > > +	}
> > > > 
> > > > remove_mapping would probably be a better idea. Otherwise you can
> > > > probably introduce pagecache removal vs page fault races which
> > > > will make the kernel bug.
> > > 
> > > Can you be more specific about the problems?
> > 
> > Hmm, actually now that we hold the page lock over __do_fault (at least
> > for pagecache pages), this may not be able to trigger the race I was
> > thinking of (page becoming mapped). But I think still it is better
> > to use remove_mapping which is the standard way to remove such a page.
> 
> I had this originally, but Fengguang redid it because there was
> trouble with the reference count. remove_mapping always expects it to
> be 2, which we cannot guarantee.

OK, but it should still definitely use truncate code.

> > > > > +	if (mapping) {
> > > > > +		/*
> > > > > +		 * Truncate does the same, but we're not quite the same
> > > > > +		 * as truncate. Needs more checking, but keep it for now.
> > > > > +		 */
> > > > 
> > > > What's different about truncate? It would be good to reuse as much as possible.
> > > 
> > > Truncating removes the block on disk (we don't). Truncating shrinks
> > > the end of the file (we don't). It's more of a "temporal hole punch".
> > > Probably from the VM point of view it's very similar, but it's
> > > not the same.
> > 
> > Right, I just mean the pagecache side of the truncate. So you should
> > use truncate_inode_pages_range here.
> 
> Why?  I remember I was trying to use that function very early on but
> there was some problem.  For one, it does its own locking, which
> would conflict with ours.

Just extract the part where it has the page locked into a common
function.


> Also we already do a lot of the stuff it does (like unmapping).

That's OK, it will not try to unmap again if it sees !page_mapped.
 

> Is there anything concretely wrong with the current code?


/*
 * Truncate does the same, but we're not quite the same
 * as truncate. Needs more checking, but keep it for now.
 */

I guess that it duplicates this tricky truncate code and also
says it is different (but AFAIKS it is doing exactly the same
thing).



> > Well, the dirty data has never been corrupted before (ie. the data
> > in pagecache has been OK). It was just unable to make it back to
> > backing store. So a program could retry the write/fsync/etc or
> > try to write the data somewhere else.
> 
> In theory it could, but in practice it is very unlikely it would.

very unlikely to try rewriting the page again or taking evasive
action to save the data somewhere else? I think that's a bold
assumption.

At the very least, having a prompt "IO error, check your hotplug
device / network connection / etc and try again" I don't think
sounds unreasonable at all.


> > It kind of wants a new error code, but I can't imagine the difficulty
> > in doing that...
> 
> I don't think it's a good idea to change programs for this normally,
> most wouldn't anyways. Even kernel programmers have trouble with
> memory suddenly going bad, for user space programmers it would
> be pure voodoo.
 
No, I mean as something to distinguish it from the case of an IO
device going bad (in which case the userspace program has a number
of things it might try to do).


> > Just seems overengineered. We could rewrite any if/switch statement like
> > that (and actually the compiler probably will if it is beneficial).
> 
> The reason I like it is that it separates the functions cleanly;
> without that there would be a dispatcher from hell. Yes it's a bit
> ugly that there is a lot of manual bit checking around now too,
> but as you go into all the corner cases, originally clean code
> always tends to get uglier (and this is a really ugly problem).

Well... it is just writing another dispatcher from hell in a
different way really, isn't it? How is it so much better than
a simple switch or if/elseif/else statement?


^ permalink raw reply	[flat|nested] 232+ messages in thread

* Re: [PATCH] [13/16] HWPOISON: The high level memory error handler in the VM v3
  2009-06-01 11:50             ` Nick Piggin
@ 2009-06-01 14:05               ` Wu Fengguang
  -1 siblings, 0 replies; 232+ messages in thread
From: Wu Fengguang @ 2009-06-01 14:05 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andi Kleen, hugh, riel, akpm, chris.mason, linux-kernel, linux-mm

On Mon, Jun 01, 2009 at 07:50:46PM +0800, Nick Piggin wrote:
> On Thu, May 28, 2009 at 09:54:28PM +0800, Wu Fengguang wrote:
> > On Thu, May 28, 2009 at 08:23:57PM +0800, Nick Piggin wrote:
> > > On Thu, May 28, 2009 at 05:59:34PM +0800, Wu Fengguang wrote:
> > > > Hi Nick,
> > > >
> > > > > > +     /*
> > > > > > +      * remove_from_page_cache assumes (mapping && !mapped)
> > > > > > +      */
> > > > > > +     if (page_mapping(p) && !page_mapped(p)) {
> > > > > > +             remove_from_page_cache(p);
> > > > > > +             page_cache_release(p);
> > > > > > +     }
> > > > >
> > > > > remove_mapping would probably be a better idea. Otherwise you can
> > > > > probably introduce pagecache removal vs page fault races which
> > > > > will make the kernel bug.
> > > >
> > > > We use remove_mapping() at first, then discovered that it made strong
> > > > assumption on page_count=2.
> > > >
> > > > I guess it is safe from races since we are locking the page?
> > >
> > > Yes it probably should (although you will lose get_user_pages data, but
> > > I guess that's the aim anyway).
> >
> > Yes. We (and truncate) rely heavily on this logic:
> >
> >         retry:
> >                 lock_page(page);
> >                 if (page->mapping == NULL)
> >                         goto retry;
> >                 // do something on page
> >                 unlock_page(page);
> >
> > So that we can steal/isolate a page under its page lock.
> >
> > The truncate code does wait on writeback page, but we would like to
> > isolate the page ASAP, so as to avoid someone to find it in the page
> > cache (or swap cache) and then access its content.
> >
> > I see no obvious problems to isolate a writeback page from page cache
> > or swap cache. But also I'm not sure it won't break some assumption
> > in some corner of the kernel.
>
> The problem is that then you have lost synchronization in the
> pagecache. Nothing then prevents a new page from being put
> in there and trying to do IO to or from the same device as the
> currently running writeback.

[ I'm not set on getting rid of wait_on_page_writeback(),
  however I'm curious about the consequences of not doing it :)     ]

You are right in that IO can happen for a new page at the same file offset.
But I have analyzed that in another email:

: The reason truncate_inode_pages_range() has to wait on a writeback page
: is to ensure data integrity. Otherwise these two events could happen:
:         truncate page A at offset X
:         populate page B at offset X
: If A and B are both writeback pages, then B can hit the disk first and then
: be overwritten by A, which corrupts the data at offset X from the user's POV.
:
: But for hwpoison there are no such worries. If A is poisoned, we do
: our best to isolate it as well as intercept its IO. If the interception
: fails, it will trigger another machine check before hitting the disk.
:
: After all, a poisoned A means the data at offset X is already corrupted.
: It doesn't matter if another page B comes along.

Does that make sense?
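
The two-event ordering hazard quoted above reduces to a toy userspace model (all names here are invented for illustration; this is not kernel code): whichever writeback completes last wins, so without wait_on_page_writeback() the stale page's data can clobber the new page's.

```c
#include <assert.h>

static int disk_at_x;			/* toy model: the disk block at offset X */

static void writeback_completes(int data)
{
	disk_at_x = data;		/* last completion wins */
}

/* Replay the bad order: truncated page A (stale) completes after
 * new page B, so the user-visible data at offset X reverts to A's. */
int race_result(void)
{
	int page_a = 1;			/* truncated page A, still under writeback */
	int page_b = 2;			/* new page B populated at the same offset */

	writeback_completes(page_b);	/* B hits disk first... */
	writeback_completes(page_a);	/* ...then stale A overwrites it */
	return disk_at_x;
}
```

This is exactly the integrity argument truncate makes; the hwpoison counter-argument above is that A's data is already lost, so the stale overwrite is no additional harm.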

In fact, even under the assumption that a page won't be truncated during
writeback, nothing except the end_writeback_io handlers can actually
safely take advantage of it. There are not many such handlers, so it's
relatively easy to check them one by one.

> > > But I just don't like this one file having all that required knowledge
> >
> > Yes that's a big problem.
> >
> > One major complexity involves classifying the page into different known
> > types, by testing page flags, page_mapping, page_mapped, etc. This
> > is not avoidable.
>
> No.

If you don't know what kind of page it is, how do we know how to
properly isolate it? Or do you mean the current classifying code can be
simplified? Yeah, that's kind of possible.
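
The classification step being discussed can be sketched as a userspace mock (the struct and all names are invented; the real code in memory-failure.c dispatches on page flag tests such as PageDirty(), PageSwapCache(), page_mapping(), page_mapped()):

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace stand-in for the flag tests done on a real struct page. */
struct mock_page {
	bool dirty;
	bool swapcache;
	bool has_mapping;	/* stands in for page_mapping(p) != NULL */
};

enum page_kind {
	KIND_SWAPCACHE_DIRTY,
	KIND_SWAPCACHE_CLEAN,
	KIND_PAGECACHE_DIRTY,
	KIND_PAGECACHE_CLEAN,
	KIND_UNKNOWN,
};

/* Classify the page so the right isolation routine can be chosen. */
enum page_kind classify(const struct mock_page *p)
{
	if (p->swapcache)
		return p->dirty ? KIND_SWAPCACHE_DIRTY : KIND_SWAPCACHE_CLEAN;
	if (p->has_mapping)
		return p->dirty ? KIND_PAGECACHE_DIRTY : KIND_PAGECACHE_CLEAN;
	return KIND_UNKNOWN;	/* free, slab, reserved, ... handled elsewhere */
}
```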

>
> > Another major complexity lies in calling the isolation routines to
> > remove references from
> >         - PTE
> >         - page cache
> >         - swap cache
> >         - LRU list
> > They more or less make some assumptions about their operating environment
> > that we have to take care of.  Unfortunately these complexities are
> > also not easily resolvable.
> >
> > > (and few comments) of all the files in mm/. If you want to get rid
> >
> > I promise I'll add more comments :)
>
> OK, but they should still go in their relevant files. Or as best as
> possible. Right now it's just silly to have all this here when much
> of it could be moved out to filemap.c, swap_state.c, page_alloc.c, etc.

OK, I'll bear that point in mind.

> > > of the page and don't care what it's count or dirtyness is, then
> > > truncate_inode_pages_range is the correct API to use.
> > >
> > > (or you could extract out some of it so you can call it directly on
> > > individual locked pages, if that helps).
> >
> > The patch to move over to truncate_complete_page() would look like this.
> > It's not a big win indeed.
>
> No I don't mean to do this, but to move the truncate_inode_pages
> code for truncating a single, locked, page into another function
> in mm/truncate.c and then call that from here.

It seems to me that truncate_complete_page() is already the code
you want to move ;-) Or do you mean more code around the call site of
truncate_complete_page()?

                        lock_page(page);

                        wait_on_page_writeback(page);
We could do this.

                        if (page_mapped(page)) {
                                unmap_mapping_range(mapping,
                                  (loff_t)page->index<<PAGE_CACHE_SHIFT,
                                  PAGE_CACHE_SIZE, 0);
                        }
We need a rather complex unmap logic.

                        if (page->index > next)
                                next = page->index;
                        next++;
                        truncate_complete_page(mapping, page);
                        unlock_page(page);

Now it's obvious that reusing more code than truncate_complete_page()
is not easy (or natural).
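
The single-page helper Nick proposes (quoted above) can be sketched with userspace mocks. All names here are invented assumptions about the shape of such a helper, not actual mm/truncate.c code; the real thing would use lock_page/wait_on_page_writeback/unmap_mapping_range/truncate_complete_page:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct mock_mapping { int dummy; };

struct mock_page {
	struct mock_mapping *mapping;
	bool locked, writeback, mapped;
	unsigned long index;
};

static void lock_page(struct mock_page *p)		{ p->locked = true; }
static void unlock_page(struct mock_page *p)		{ p->locked = false; }
static void wait_on_writeback(struct mock_page *p)	{ p->writeback = false; }
static void unmap_page(struct mock_page *p)		{ p->mapped = false; }
static void remove_from_cache(struct mock_page *p)	{ p->mapping = NULL; }

/* Single-page truncate body, factored out of the scanning loop.
 * Safe on an already-unmapped page: the unmap step simply no-ops.
 * ->index is untouched, so the caller can keep the loop bookkeeping. */
void truncate_one_page(struct mock_page *p)
{
	lock_page(p);
	wait_on_writeback(p);
	if (p->mapped)
		unmap_page(p);
	remove_from_cache(p);	/* stands in for truncate_complete_page() */
	unlock_page(p);
}
```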

> > ---
> >  mm/memory-failure.c |   14 ++++++--------
> >  1 file changed, 6 insertions(+), 8 deletions(-)
> >
> > --- sound-2.6.orig/mm/memory-failure.c
> > +++ sound-2.6/mm/memory-failure.c
> > @@ -327,20 +327,18 @@ static int me_pagecache_clean(struct pag
> >  	if (!isolate_lru_page(p))
> >  		page_cache_release(p);
> >
> > -	if (page_has_private(p))
> > -		do_invalidatepage(p, 0);
> > -	if (page_has_private(p) && !try_to_release_page(p, GFP_NOIO))
> > -		Dprintk(KERN_ERR "MCE %#lx: failed to release buffers\n",
> > -			page_to_pfn(p));
> > -
> >  	/*
> >  	 * remove_from_page_cache assumes (mapping && !mapped)
> >  	 */
> >  	if (page_mapping(p) && !page_mapped(p)) {
> > -		remove_from_page_cache(p);
> > -		page_cache_release(p);
> > +		ClearPageMlocked(p);
> > +		truncate_complete_page(p->mapping, p);
> >  	}
> >
> > +	if (page_has_private(p) && !try_to_release_page(p, GFP_NOIO))
> > +		Dprintk(KERN_ERR "MCE %#lx: failed to release buffers\n",
> > +			page_to_pfn(p));
> > +
> >  	return RECOVERED;
> >  }
> >
> >
> > > OK this is the point I was missing.
> > >
> > > Should all be commented and put into mm/swap_state.c (or somewhere that
> > > Hugh prefers).
> >
> > But I doubt Hugh will welcome moving those bits into swap*.c ;)
>
> Why not? If he has to look at it anyway, he probably rather looks
> at fewer files :)

Heh. OK if that's more convenient - not a big issue for me really.

> > > > Clean swap cache pages can be directly isolated. A later page fault will bring
> > > > in the known good data from disk.
> > >
> > > OK, but why do you ClearPageUptodate if it is just to be deleted from
> > > swapcache anyway?
> >
> > The ClearPageUptodate() is kind of a careless addition, in the hope
> > that it will stop some random readers. Needs more investigation.
>
> OK. But it just muddies the waters in the meantime, so maybe take
> such things out until there is a case for them.

OK.

> > > > > You haven't waited on writeback here AFAIKS, and have you
> > > > > *really* verified it is safe to call delete_from_swap_cache?
> > > >
> > > > Good catch. I'll soon submit patches for handling pages under
> > > > read/write IO. In this patchset they are simply ignored.
> > >
> > > Well that's quite important ;) I would suggest you just wait_on_page_writeback.
> > > It is simple and should work. _Unless_ you can show it is a big problem that
> > > needs equivalently big measures to fix ;)
> >
> > Yes, we could do wait_on_page_writeback() if necessary. The downside is,
> > keeping a writeback page in the page cache opens a small time window for
> > someone to access the page.
>
> AFAIKS there already is such a window? You're doing lock_page and such.

You know I'm such a crazy guy - I'm going to do try_lock_page() for
intercepting under read IOs 8-)

> No, it seems rather insane to do something like this here that no other
> code in the mm ever does.

Yes it's kind of insane.  I'm interested in reasoning it out though.

Thanks,
Fengguang


^ permalink raw reply	[flat|nested] 232+ messages in thread

* Re: [PATCH] [13/16] HWPOISON: The high level memory error handler in the VM v3
  2009-06-01 14:05               ` Wu Fengguang
@ 2009-06-01 14:40                 ` Nick Piggin
  -1 siblings, 0 replies; 232+ messages in thread
From: Nick Piggin @ 2009-06-01 14:40 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andi Kleen, hugh, riel, akpm, chris.mason, linux-kernel, linux-mm

On Mon, Jun 01, 2009 at 10:05:53PM +0800, Wu Fengguang wrote:
> On Mon, Jun 01, 2009 at 07:50:46PM +0800, Nick Piggin wrote:
> > The problem is that then you have lost synchronization in the
> > pagecache. Nothing then prevents a new page from being put
> > in there and trying to do IO to or from the same device as the
> > currently running writeback.
> 
> [ I'm not setting my mind on getting rid of wait_on_page_writeback(),
>   however I'm curious about the consequences of not doing it :)     ]
> 
> You are right in that IO can happen for a new page at the same file offset.
> But I have analyzed that in another email:
> 
> : The reason truncate_inode_pages_range() has to wait on writeback page
> : is to ensure data integrity. Otherwise if there comes two events:
> :         truncate page A at offset X
> :         populate page B at offset X
> : If A and B are both writeback pages, then B can hit disk first and then
> : be overwritten by A, which corrupts the data at offset X from the user's POV.
> :
> : But for hwpoison, there are no such worries. If A is poisoned, we do
> : our best to isolate it as well as intercepting its IO. If the interception
> : fails, it will trigger another machine check before hitting the disk.
> :
> : After all, poisoned A means the data at offset X is already corrupted.
> : It doesn't matter if there comes another B page.
> 
> Does that make sense?

But you just said that you try to intercept the IO. So the underlying
data is not necessarily corrupt. And even if it was, what if it
was reinitialized to something else in the meantime (such as filesystem
metadata blocks)? You'd just be introducing worse possibilities for
corruption.

You will need to demonstrate a *big* advantage before doing crazy things
with writeback ;)
 

> > > > But I just don't like this one file having all that required knowledge
> > >
> > > Yes that's a big problem.
> > >
> > > One major complexity involves classifying the page into different known
> > > types, by testing page flags, page_mapping, page_mapped, etc. This
> > > is not avoidable.
> >
> > No.
> 
> If you don't know what kind of page it is, how do we know how to
> properly isolate it? Or do you mean the current classifying code can be
> simplified? Yeah, that's kind of possible.

No I just was agreeing that it is not avoidable ;)


> > > Another major complexity lies in calling the isolation routines to
> > > remove references from
> > >         - PTE
> > >         - page cache
> > >         - swap cache
> > >         - LRU list
> > > They more or less make some assumptions about their operating environment
> > > that we have to take care of.  Unfortunately these complexities are
> > > also not easily resolvable.
> > >
> > > > (and few comments) of all the files in mm/. If you want to get rid
> > >
> > > I promise I'll add more comments :)
> >
> > OK, but they should still go in their relevant files. Or as best as
> > possible. Right now it's just silly to have all this here when much
> > of it could be moved out to filemap.c, swap_state.c, page_alloc.c, etc.
> 
> OK, I'll bear that point in mind.
> 
> > > > of the page and don't care what it's count or dirtyness is, then
> > > > truncate_inode_pages_range is the correct API to use.
> > > >
> > > > (or you could extract out some of it so you can call it directly on
> > > > individual locked pages, if that helps).
> > >
> > > The patch to move over to truncate_complete_page() would look like this.
> > > It's not a big win indeed.
> >
> > No I don't mean to do this, but to move the truncate_inode_pages
> > code for truncating a single, locked, page into another function
> > in mm/truncate.c and then call that from here.
> 
> It seems to me that truncate_complete_page() is already the code
> you want to move ;-) Or do you mean more code around the call site of
> truncate_complete_page()?
> 
>                         lock_page(page);
> 
>                         wait_on_page_writeback(page);
> We could do this.
> 
>                         if (page_mapped(page)) {
>                                 unmap_mapping_range(mapping,
>                                   (loff_t)page->index<<PAGE_CACHE_SHIFT,
>                                   PAGE_CACHE_SIZE, 0);
>                         }
> We need a rather complex unmap logic.
> 
>                         if (page->index > next)
>                                 next = page->index;
>                         next++;
>                         truncate_complete_page(mapping, page);
>                         unlock_page(page);
> 
> Now it's obvious that reusing more code than truncate_complete_page()
> is not easy (or natural).

Just lock the page and wait for writeback, then do the truncate
work in another function. In your case if you've already unmapped
the page then it won't try to unmap again so no problem.

Truncating from pagecache does not change ->index so you can
move the loop logic out.


> > > Yes, we could do wait_on_page_writeback() if necessary. The downside is,
> > > keeping a writeback page in the page cache opens a small time window for
> > > someone to access the page.
> >
> > AFAIKS there already is such a window? You're doing lock_page and such.
> 
> You know I'm such a crazy guy - I'm going to do try_lock_page() for
> intercepting under read IOs 8-)
> 
> > No, it seems rather insane to do something like this here that no other
> > code in the mm ever does.
> 
> Yes it's kind of insane.  I'm interested in reasoning it out though.

I guess it is a good idea to start simple.

Considering that there are so many other types of pages that are
impossible to deal with or have holes, I very strongly doubt
it will be worth so much complexity just to close the gap from 90%
to 90.1%. But we'll see.


^ permalink raw reply	[flat|nested] 232+ messages in thread

* Re: [PATCH] [13/16] HWPOISON: The high level memory error handler in the VM v3
  2009-06-01 11:39                 ` Nick Piggin
@ 2009-06-01 18:19                   ` Andi Kleen
  -1 siblings, 0 replies; 232+ messages in thread
From: Andi Kleen @ 2009-06-01 18:19 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andi Kleen, Russ Anderson, hugh, riel, akpm, chris.mason,
	linux-kernel, linux-mm, fengguang.wu

> Surely you can test out these cases with debugging code or printks
> and then take them out of production code?

I made it a debugging printk now.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 232+ messages in thread

* Re: [PATCH] [13/16] HWPOISON: The high level memory error handler in the VM v3
  2009-06-01 11:50             ` Nick Piggin
@ 2009-06-01 18:32               ` Andi Kleen
  -1 siblings, 0 replies; 232+ messages in thread
From: Andi Kleen @ 2009-06-01 18:32 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Wu Fengguang, Andi Kleen, hugh, riel, akpm, chris.mason,
	linux-kernel, linux-mm

On Mon, Jun 01, 2009 at 01:50:46PM +0200, Nick Piggin wrote:
> > Another major complexity lies in calling the isolation routines to
> > remove references from
> >         - PTE
> >         - page cache
> >         - swap cache
> >         - LRU list
> > They more or less make some assumptions about their operating environment
> > that we have to take care of.  Unfortunately these complexities are
> > also not easily resolvable.
> > 
> > > (and few comments) of all the files in mm/. If you want to get rid
> > 
> > I promise I'll add more comments :)
> 
> OK, but they should still go in their relevant files. Or as best as
> possible. Right now it's just silly to have all this here when much
> of it could be moved out to filemap.c, swap_state.c, page_alloc.c, etc.

Can you be more specific about what that "all this" is?

> > > of the page and don't care what it's count or dirtyness is, then
> > > truncate_inode_pages_range is the correct API to use.
> > >
> > > (or you could extract out some of it so you can call it directly on
> > > individual locked pages, if that helps).
> >  
> > The patch to move over to truncate_complete_page() would look like this.
> > It's not a big win indeed.
> 
> No I don't mean to do this, but to move the truncate_inode_pages
> code for truncating a single, locked, page into another function
> in mm/truncate.c and then call that from here.

I took a look at that.  First, there's no direct equivalent of
me_pagecache_clean/dirty in truncate.c, and to be honest I don't
see a clean way to refactor any of the existing functions to
do the same.

Then memory-failure.c already calls into the other files for
pretty much anything interesting (do_invalidatepage, cancel_dirty_page,
try_to_free_mapping) -- there is very little that memory-failure.c
does on its own.

These are all already called from all over the kernel; for example,
there are 15+ callers of try_to_release_page outside truncate.c.

For do_invalidatepage and cancel_dirty_page it's not as clear cut, but there
is already precedent, with several callers outside truncate.c.

We could presumably move the swap cache functions, but given how simple
they are, and that they just call directly into the swap code anyway, is
there much value in it? Hugh, can you give guidance?

static int me_swapcache_dirty(struct page *p)
{
        /*
         * Dirty swap cache page: the in-memory data is corrupt and has
         * no good copy on disk, so drop the dirty bit to keep it from
         * being written back.
         */
        ClearPageDirty(p);

        /*
         * isolate_lru_page() returns 0 on success and takes its own
         * page reference, so drop that extra reference here.
         */
        if (!isolate_lru_page(p))
                page_cache_release(p);

        return DELAYED;
}

static int me_swapcache_clean(struct page *p)
{
        /* Discourage further readers; a later fault re-reads from disk. */
        ClearPageUptodate(p);

        if (!isolate_lru_page(p))
                page_cache_release(p);

        /* Safe to drop: the on-disk swap copy is still good. */
        delete_from_swap_cache(p);

        return RECOVERED;
}


>  
> > > > Clean swap cache pages can be directly isolated. A later page fault will bring
> > > > in the known good data from disk.
> > > 
> > > OK, but why do you ClearPageUptodate if it is just to be deleted from
> > > swapcache anyway?
> > 
> > The ClearPageUptodate() is kind of a careless addition, in the hope
> > that it will stop some random readers. Need more investigations.
> 
> OK. But it just muddies the waters in the meantime, so maybe take
> such things out until there is a case for them.

It's gone.

> > > > > You haven't waited on writeback here AFAIKS, and have you
> > > > > *really* verified it is safe to call delete_from_swap_cache?
> > > > 
> > > > Good catch. I'll soon submit patches for handling the under
> > > > read/write IO pages. In this patchset they are simply ignored.
> > > 
> > > Well that's quite important ;) I would suggest you just wait_on_page_writeback.
> > > It is simple and should work. _Unless_ you can show it is a big problem that
> > > needs equivalently big measures to fix ;)
> > 
> > Yes we could do wait_on_page_writeback() if necessary. The downside is,
> > keeping writeback page in page cache opens a small time window for
> > some one to access the page.
> 
> AFAIKS there already is such a window? You're doing lock_page and such.

Yes there already is plenty of window.

> No, it seems rather insane to do something like this here that no other
> code in the mm ever does.

Just because the rest of the VM doesn't do it doesn't mean it can't make sense.

But the writeback windows are probably too short to care about. I haven't
done numbers on those; if it's a significant percentage of memory in
some workload it might be worth it, otherwise not.

But all of that would be in the future, right now I just want to get
the basic facility in.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 232+ messages in thread

* Re: [PATCH] [13/16] HWPOISON: The high level memory error handler in the VM v3
  2009-06-01 12:05             ` Nick Piggin
@ 2009-06-01 18:51               ` Andi Kleen
  -1 siblings, 0 replies; 232+ messages in thread
From: Andi Kleen @ 2009-06-01 18:51 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andi Kleen, hugh, riel, akpm, chris.mason, linux-kernel,
	linux-mm, fengguang.wu

On Mon, Jun 01, 2009 at 02:05:38PM +0200, Nick Piggin wrote:
> On Thu, May 28, 2009 at 03:45:20PM +0200, Andi Kleen wrote:
> > On Thu, May 28, 2009 at 02:08:54PM +0200, Nick Piggin wrote:
> > > Then the data can not have been consumed, by DMA or otherwise? What
> > 
> > When the data was consumed we get a different machine check
> > (or a different error if it was consumed by an IO device)
> > 
> > This code right now just handles the case of "CPU detected that a page
> > is broken, but hasn't consumed it yet"
> 
> OK. Out of curiosity, how often do you expect to see uncorrectable ECC
> errors?

That's a difficult question. It depends on a lot of factors, e.g. 
how much memory you have (but memory sizes are growing all the time), 
how many machines you have (large clusters tend to turn infrequent errors 
into frequent ones), how well your cooling and power supply works etc.

I can't give you a single number at this point, sorry.

> > > > > > +	/*
> > > > > > +	 * remove_from_page_cache assumes (mapping && !mapped)
> > > > > > +	 */
> > > > > > +	if (page_mapping(p) && !page_mapped(p)) {
> > > > > > +		remove_from_page_cache(p);
> > > > > > +		page_cache_release(p);
> > > > > > +	}
> > > > > 
> > > > > remove_mapping would probably be a better idea. Otherwise you can
> > > > > probably introduce pagecache removal vs page fault races which
> > > > > will make the kernel bug.
> > > > 
> > > > Can you be more specific about the problems?
> > > 
> > > Hmm, actually now that we hold the page lock over __do_fault (at least
> > > for pagecache pages), this may not be able to trigger the race I was
> > > thinking of (page becoming mapped). But I think still it is better
> > > to use remove_mapping which is the standard way to remove such a page.
> > 
> > I had this originally, but Fengguang redid it because there was
> > trouble with the reference count. remove_mapping always expects it to
> > be 2, which we cannot guarantee.
> 
> OK, but it should still definitely use truncate code.

It does -- see other email.

> 
> > > > > > +	if (mapping) {
> > > > > > +		/*
> > > > > > +		 * Truncate does the same, but we're not quite the same
> > > > > > +		 * as truncate. Needs more checking, but keep it for now.
> > > > > > +		 */
> > > > > 
> > > > > What's different about truncate? It would be good to reuse as much as possible.
> > > > 
> > > > Truncating removes the block on disk (we don't). Truncating shrinks
> > > > the end of the file (we don't). It's more "temporal hole punch"
> > > > Probably from the VM point of view it's very similar, but it's
> > > > not the same.
> > > 
> > > Right, I just mean the pagecache side of the truncate. So you should
> > > use truncate_inode_pages_range here.
> > 
> > Why?  I remember I was trying to use that function very early on but
> > there was some problem.  For once it does its own locking which
> > would conflict with ours.
> 
> Just extract the part where it has the page locked into a common
> function.

That doesn't do some stuff we want to do, like try_to_release_buffers.
And there's the page count problem with remove_mapping.

That could be probably fixed, but to be honest I'm uncomfortable
fiddling with truncate internals.


>  
> 
> > Is there anything concretely wrong with the current code?
> 
> 
> /*
>  * Truncate does the same, but we're not quite the same
>  * as truncate. Needs more checking, but keep it for now.
>  */
> 
> I guess that it duplicates this tricky truncate code and also
> says it is different (but AFAIKS it is doing exactly the same
> thing).

It's not; there are various differences (like the reference count handling).

> > > Well, the dirty data has never been corrupted before (ie. the data
> > > in pagecache has been OK). It was just unable to make it back to
> > > backing store. So a program could retry the write/fsync/etc or
> > > try to write the data somewhere else.
> > 
> > In theory it could, but in practice it is very unlikely it would.
> 
> very unlikely to try rewriting the page again or taking evasive
> action to save the data somewhere else? I think that's a bold
> assumption.
> 
> At the very least, having a prompt "IO error, check your hotplug
> device / network connection / etc and try again" I don't think
> sounds unreasonable at all.

I'm not sure what you're suggesting here. Are you suggesting a
corrupted dirty page should return a different errno than a traditional
IO error?

I don't think having a different errno for this would be a good
idea. Programs expect EIO on IO error and not something else.
And from the POV of the program it's an IO error.

Or are you suggesting there should be a mechanism where an
application could query more details about a given error it
retrieved before? I think the latter would be useful in some
cases, but probably hard to implement. Definitely out of scope
for this effort.

> > > Just seems overengineered. We could rewrite any if/switch statement like
> > > that (and actually the compiler probably will if it is beneficial).
> > 
> > The reason I like it is that it separates the functions cleanly,
> > without that there would be a dispatcher from hell. Yes it's a bit
> > ugly that there is a lot of manual bit checking around now too,
> > but as you go into all the corner cases originally clean code
> > always tends to get more ugly (and this is a really ugly problem)
> 
> Well... it is just writing another dispatcher from hell in a
> different way really, isn't it? How is it so much better than
> a simple switch or if/elseif/else statement?

switch () wouldn't work; the dispatch relies on the order of the
table entries, with the first match winning.

if () would work, but it would be larger and IMHO much harder
to read than the table. I prefer the table, which is a reasonably
clean mechanism for this.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 232+ messages in thread

* Re: [PATCH] [13/16] HWPOISON: The high level memory error handler in the VM v3
  2009-06-01 14:05               ` Wu Fengguang
@ 2009-06-01 21:11                 ` Hugh Dickins
  -1 siblings, 0 replies; 232+ messages in thread
From: Hugh Dickins @ 2009-06-01 21:11 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Nick Piggin, Andi Kleen, riel, akpm, chris.mason, linux-kernel, linux-mm

On Mon, 1 Jun 2009, Wu Fengguang wrote:
> On Mon, Jun 01, 2009 at 07:50:46PM +0800, Nick Piggin wrote:
> > On Thu, May 28, 2009 at 09:54:28PM +0800, Wu Fengguang wrote:
> > > On Thu, May 28, 2009 at 08:23:57PM +0800, Nick Piggin wrote:
> > > >
> > > > Should all be commented and put into mm/swap_state.c (or somewhere that
> > > > Hugh prefers).
> > >
> > > But I doubt Hugh will welcome moving those bits into swap*.c ;)
> >
> > Why not? If he has to look at it anyway, he probably rather looks
> > at fewer files :)
> 
> Heh. OK if that's more convenient - not a big issue for me really.

Sorry for being so elusive, leaving you all guessing: it's kind of
you to consider me at all.  As I remarked to Andi in private mail
earlier, I'm so far behind on my promises (especially to KSM) that
I don't expect to be looking at HWPOISON for quite a while.

I don't think I'd mind about the number of files to look at.

Generally I agree with Nick, wanting the rmap-ish code to be in rmap.c
and the swap_state-ish code to be in swap_state.c and the swapfile-ish
code to be in swapfile.c etc.  (Though it's an acquired skill to work
out which is which of those two - one thing you can be sure of though,
if it's swap-related code, swap.c is strangely not the place for it.
Yikes, someone put swap_setup in there.)

But like most of us, I'm not so keen on #ifdefs: am I right to think
that if you distribute the hwpoison code around in its appropriate
source files, we'll have a nasty rash of #ifdefs all over?  We can
sometimes get away with the optimizer removing what's not needed,
but that only works in the simpler cases.

Maybe we should start out, as you have, with most of the hwpoison
code located in one file (rather like with migrate.c?); but hope
to refactor things and distribute it over time.

How seriously does the hwpoison work interfere with the assumptions
in other sourcefiles?  If it's playing tricks liable to confuse
someone reading through those other files, then it would be better
to place the hwpoison code in those files, even though #ifdefed.

There, how's that for a frustratingly equivocal answer?

Hugh

^ permalink raw reply	[flat|nested] 232+ messages in thread

* Re: [PATCH] [13/16] HWPOISON: The high level memory error handler in the VM v3
  2009-06-01 21:11                 ` Hugh Dickins
@ 2009-06-01 21:41                   ` Andi Kleen
  -1 siblings, 0 replies; 232+ messages in thread
From: Andi Kleen @ 2009-06-01 21:41 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Wu Fengguang, Nick Piggin, Andi Kleen, riel, akpm, chris.mason,
	linux-kernel, linux-mm

> Maybe we should start out, as you have, with most of the hwpoison
> code located in one file (rather like with migrate.c?); but hope
> to refactor things and distribute it over time.

That's the plan.

> 
> How seriously does the hwpoison work interfere with the assumptions
> in other sourcefiles?  If it's playing tricks liable to confuse

I hope only minimally. It can come in at any time (that's the
partly tricky part), but in general it tries to be very conservative
and non-intrusive, and just calls code used elsewhere.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 232+ messages in thread

* Re: [PATCH] [13/16] HWPOISON: The high level memory error handler in the VM v3
  2009-06-01 14:40                 ` Nick Piggin
@ 2009-06-02 11:14                   ` Wu Fengguang
  -1 siblings, 0 replies; 232+ messages in thread
From: Wu Fengguang @ 2009-06-02 11:14 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andi Kleen, hugh, riel, akpm, chris.mason, linux-kernel, linux-mm

On Mon, Jun 01, 2009 at 10:40:51PM +0800, Nick Piggin wrote:
> On Mon, Jun 01, 2009 at 10:05:53PM +0800, Wu Fengguang wrote:
> > On Mon, Jun 01, 2009 at 07:50:46PM +0800, Nick Piggin wrote:
> > > The problem is that then you have lost synchronization in the
> > > pagecache. Nothing then prevents a new page from being put
> > > in there and trying to do IO to or from the same device as the
> > > currently running writeback.
> > 
> > [ I'm not setting mine mind to get rid of wait_on_page_writeback(),
> >   however I'm curious about the consequences of not doing it :)     ]
> > 
> > You are right in that IO can happen for a new page at the same file offset.
> > But I have analyzed that in another email:
> > 
> > : The reason truncate_inode_pages_range() has to wait on writeback page
> > : is to ensure data integrity. Otherwise if there comes two events:
> > :         truncate page A at offset X
> > :         populate page B at offset X
> > : If A and B are all writeback pages, then B can hit disk first and then
> > : be overwritten by A. Which corrupts the data at offset X from user's POV.
> > :
> > : But for hwpoison, there are no such worries. If A is poisoned, we do
> > : our best to isolate it as well as intercepting its IO. If the interception
> > : fails, it will trigger another machine check before hitting the disk.
> > :
> > : After all, poisoned A means the data at offset X is already corrupted.
> > : It doesn't matter if there comes another B page.
> > 
> > Does that make sense?
> 
> But you just said that you try to intercept the IO. So the underlying
> data is not necessarily corrupt. And even if it was then what if it
> was reinitialized to something else in the meantime (such as filesystem
> metadata blocks?) You'd just be introducing worse possibilities for
> corruption.

The IO interception will be based on PFN instead of file offset, so it
won't affect innocent pages such as your example of reinitialized data.

poisoned dirty page == corrupt data      => process shall be killed
poisoned clean page == recoverable data  => process shall survive

In the case of a dirty hwpoison page, if we reload the old on-disk
data and let the application proceed with it, it may lead to *silent*
data corruption/inconsistency, because the application will first see
v2 and then v1, which is illogical and hence may mess up its internal
data structures.

> You will need to demonstrate a *big* advantage before doing crazy things
> with writeback ;)

OK. We can do two things about poisoned writeback pages:

1) to stop IO for them, thus preventing corrupted data from hitting
   disk and/or triggering further machine checks
2) to isolate them from page cache, thus preventing possible
   references in the writeback time window

1) is important, because there may be many writeback pages in a
   production system.

2) is good to have if possible, because the time window may grow large
   when the writeback IO queue is congested or the async write
   requests are held off by many sync read/write requests.

> > > > > But I just don't like this one file having all that required knowledge
> > > >
> > > > Yes that's a big problem.
> > > >
> > > > One major complexity involves classify the page into different known
> > > > types, by testing page flags, page_mapping, page_mapped, etc. This
> > > > is not avoidable.
> > >
> > > No.
> > 
> > If you don't know what kind of page it is, how do we know how to properly
> > isolate it? Or do you mean the current classifying code can be
> > simplified? Yeah that's kind of possible.
> 
> No I just was agreeing that it is not avoidable ;)

Ah OK.

> > > > Another major complexity is on calling the isolation routines to
> > > > remove references from
> > > >         - PTE
> > > >         - page cache
> > > >         - swap cache
> > > >         - LRU list
> > > > They more or less made some assumptions on their operating environment
> > > > that we have to take care of.  Unfortunately these complexities are
> > > > also not easily resolvable.
> > > >
> > > > > (and few comments) of all the files in mm/. If you want to get rid
> > > >
> > > > I promise I'll add more comments :)
> > >
> > > OK, but they should still go in their relevant files. Or as best as
> > > possible. Right now it's just silly to have all this here when much
> > > of it could be moved out to filemap.c, swap_state.c, page_alloc.c, etc.
> > 
> > OK, I'll bear that point in mind.
> > 
> > > > > of the page and don't care what it's count or dirtyness is, then
> > > > > truncate_inode_pages_range is the correct API to use.
> > > > >
> > > > > (or you could extract out some of it so you can call it directly on
> > > > > individual locked pages, if that helps).
> > > >
> > > > The patch to move over to truncate_complete_page() would look like this.
> > > > It's not a big win indeed.
> > >
> > > No I don't mean to do this, but to move the truncate_inode_pages
> > > code for truncating a single, locked, page into another function
> > > in mm/truncate.c and then call that from here.
> > 
> > It seems to me that truncate_complete_page() is already the code
> > you want to move ;-) Or you mean more code around the call site of
> > truncate_complete_page()?
> > 
> >                         lock_page(page);
> > 
> >                         wait_on_page_writeback(page);
> > We could do this.
> > 
> >                         if (page_mapped(page)) {
> >                                 unmap_mapping_range(mapping,
> >                                   (loff_t)page->index<<PAGE_CACHE_SHIFT,
> >                                   PAGE_CACHE_SIZE, 0);
> >                         }
> > We need a rather complex unmap logic.
> > 
> >                         if (page->index > next)
> >                                 next = page->index;
> >                         next++;
> >                         truncate_complete_page(mapping, page);
> >                         unlock_page(page);
> > 
> > Now it's obvious that reusing more code than truncate_complete_page()
> > is not easy (or natural).
> 
> Just lock the page and wait for writeback, then do the truncate
> work in another function. In your case if you've already unmapped
> the page then it won't try to unmap again so no problem.
> 
> Truncating from pagecache does not change ->index so you can
> move the loop logic out.

Right. So effectively the reusable function is exactly
truncate_complete_page(). As I said this reuse is not a big gain.

> > > > Yes we could do wait_on_page_writeback() if necessary. The downside is,
> > > > keeping a writeback page in the page cache opens a small time window for
> > > > someone to access the page.
> > >
> > > AFAIKS there already is such a window? You're doing lock_page and such.
> > 
> > You know I'm such a crazy guy - I'm going to do try_lock_page() for
> > intercepting under read IOs 8-)
> > 
> > > No, it seems rather insane to do something like this here that no other
> > > code in the mm ever does.
> > 
> > Yes it's kind of insane.  I'm interested in reasoning it out though.
> 
> I guess it is a good idea to start simple.

Agreed.

> Considering that there are so many other types of pages that are
> impossible to deal with or have holes, then I very strongly doubt
> it will be worth so much complexity for closing the gap from 90%
> to 90.1%. But we'll see.

Yes, the plan is to first focus on the more important cases.

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 232+ messages in thread

* Re: [PATCH] [13/16] HWPOISON: The high level memory error handler in the VM v3
  2009-06-01 18:32               ` Andi Kleen
@ 2009-06-02 12:00                 ` Nick Piggin
  -1 siblings, 0 replies; 232+ messages in thread
From: Nick Piggin @ 2009-06-02 12:00 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Wu Fengguang, hugh, riel, akpm, chris.mason, linux-kernel, linux-mm

On Mon, Jun 01, 2009 at 08:32:25PM +0200, Andi Kleen wrote:
> On Mon, Jun 01, 2009 at 01:50:46PM +0200, Nick Piggin wrote:
> > > Another major complexity is on calling the isolation routines to
> > > remove references from
> > >         - PTE
> > >         - page cache
> > >         - swap cache
> > >         - LRU list
> > > They more or less made some assumptions on their operating environment
> > > that we have to take care of.  Unfortunately these complexities are
> > > also not easily resolvable.
> > > 
> > > > (and few comments) of all the files in mm/. If you want to get rid
> > > 
> > > I promise I'll add more comments :)
> > 
> > OK, but they should still go in their relevant files. Or as best as
> > possible. Right now it's just silly to have all this here when much
> > of it could be moved out to filemap.c, swap_state.c, page_alloc.c, etc.
> 
> Can you be more specific about what that "all this" is? 

The functions which take action in response to a bad page being 
detected. They belong with the subsystem that the page belongs
to. I'm amazed this is causing so much argument or confusion
because it is how the rest of mm/ code is arranged. OK, Hugh has
a point about ifdefs, but OTOH we have lots of ifdefs like this.

 
> > > > of the page and don't care what it's count or dirtyness is, then
> > > > truncate_inode_pages_range is the correct API to use.
> > > >
> > > > (or you could extract out some of it so you can call it directly on
> > > > individual locked pages, if that helps).
> > >  
> > > The patch to move over to truncate_complete_page() would like this.
> > > It's not a big win indeed.
> > 
> > No I don't mean to do this, but to move the truncate_inode_pages
> > code for truncating a single, locked, page into another function
> > in mm/truncate.c and then call that from here.
> 
> I took a look at that.  First there's no direct equivalent of
> me_pagecache_clean/dirty in truncate.c and to be honest I don't
> see a clean way to refactor any of the existing functions to 
> do the same.

With all that writing you could have just done it. It's really
not a big deal and just avoids duplicating code. I attached an
(untested) patch.


> > No, it seems rather insane to do something like this here that no other
> > code in the mm ever does.
> 
> Just because the rest of the VM doesn't do it doesn't mean it can't make sense.

It is going to be possible to do it somehow surely, but it is insane
to try to add such constraints to the VM to close a few small windows
if you already have other large ones.

---
 mm/truncate.c |   24 ++++++++++++------------
 1 file changed, 12 insertions(+), 12 deletions(-)

Index: linux-2.6/mm/truncate.c
===================================================================
--- linux-2.6.orig/mm/truncate.c
+++ linux-2.6/mm/truncate.c
@@ -135,6 +135,16 @@ invalidate_complete_page(struct address_
 	return ret;
 }
 
+void truncate_inode_page(struct address_space *mapping, struct page *page)
+{
+	if (page_mapped(page)) {
+		unmap_mapping_range(mapping,
+		  (loff_t)page->index<<PAGE_CACHE_SHIFT,
+		  PAGE_CACHE_SIZE, 0);
+	}
+	truncate_complete_page(mapping, page);
+}
+
 /**
  * truncate_inode_pages - truncate range of pages specified by start & end byte offsets
  * @mapping: mapping to truncate
@@ -196,12 +206,7 @@ void truncate_inode_pages_range(struct a
 				unlock_page(page);
 				continue;
 			}
-			if (page_mapped(page)) {
-				unmap_mapping_range(mapping,
-				  (loff_t)page_index<<PAGE_CACHE_SHIFT,
-				  PAGE_CACHE_SIZE, 0);
-			}
-			truncate_complete_page(mapping, page);
+			truncate_inode_page(mapping, page);
 			unlock_page(page);
 		}
 		pagevec_release(&pvec);
@@ -238,15 +243,10 @@ void truncate_inode_pages_range(struct a
 				break;
 			lock_page(page);
 			wait_on_page_writeback(page);
-			if (page_mapped(page)) {
-				unmap_mapping_range(mapping,
-				  (loff_t)page->index<<PAGE_CACHE_SHIFT,
-				  PAGE_CACHE_SIZE, 0);
-			}
+			truncate_inode_page(mapping, page);
 			if (page->index > next)
 				next = page->index;
 			next++;
-			truncate_complete_page(mapping, page);
 			unlock_page(page);
 		}
 		pagevec_release(&pvec);

^ permalink raw reply	[flat|nested] 232+ messages in thread

* Re: [PATCH] [13/16] HWPOISON: The high level memory error handler in the VM v3
  2009-06-01 18:51               ` Andi Kleen
@ 2009-06-02 12:10                 ` Nick Piggin
  -1 siblings, 0 replies; 232+ messages in thread
From: Nick Piggin @ 2009-06-02 12:10 UTC (permalink / raw)
  To: Andi Kleen
  Cc: hugh, riel, akpm, chris.mason, linux-kernel, linux-mm, fengguang.wu

On Mon, Jun 01, 2009 at 08:51:47PM +0200, Andi Kleen wrote:
> On Mon, Jun 01, 2009 at 02:05:38PM +0200, Nick Piggin wrote:
> > On Thu, May 28, 2009 at 03:45:20PM +0200, Andi Kleen wrote:
> > > On Thu, May 28, 2009 at 02:08:54PM +0200, Nick Piggin wrote:
> > > > Then the data can not have been consumed, by DMA or otherwise? What
> > > 
> > > When the data was consumed we get a different machine check
> > > (or a different error if it was consumed by a IO device)
> > > 
> > > This code right now just handles the case of "CPU detected a page is
> > > broken, but hasn't consumed it yet"
> > 
> > OK. Out of curiosity, how often do you expect to see uncorrectable ECC
> > errors?
> 
> That's a difficult question. It depends on a lot of factors, e.g. 
> how much memory you have (but memory sizes are growing all the time), 
> how many machines you have (large clusters tend to turn infrequent errors 
> into frequent ones), how well your cooling and power supply works etc.
> 
> I can't give you a single number at this point, sorry.

That's OK. I don't doubt they happen, I was just curious. Although
for me with my pesant amounts of memory I never even see corrected
transient ECC errors ;)

 
> > Just extract the part where it has the page locked into a common
> > function.
> 
> That doesn't do some stuff we want to do, like try_to_release_buffers.
> And there's the page count problem with remove_mapping.
> 
> That could be probably fixed, but to be honest I'm uncomfortable
> fiddling with truncate internals.

You're looking at invalidate, which is different. See my
last patch.


> > > Is there anything concretely wrong with the current code?
> > 
> > 
> > /*
> >  * Truncate does the same, but we're not quite the same
> >  * as truncate. Needs more checking, but keep it for now.
> >  */
> > 
> > I guess that it duplicates this tricky truncate code and also
> > says it is different (but AFAIKS it is doing exactly the same
> > thing).
> 
> It's not, there are various differences (like the reference count)

No. If there are, then it *really* needs better documentation. I
don't think there are, though.

 
> > > > Well, the dirty data has never been corrupted before (ie. the data
> > > > in pagecache has been OK). It was just unable to make it back to
> > > > backing store. So a program could retry the write/fsync/etc or
> > > > try to write the data somewhere else.
> > > 
> > > In theory it could, but in practice it is very unlikely it would.
> > 
> > very unlikely to try rewriting the page again or taking evasive
> > action to save the data somewhere else? I think that's a bold
> > assumption.
> > 
> > At the very least, having a prompt "IO error, check your hotplug
> > device / network connection / etc and try again" I don't think
> > sounds unreasonable at all.
> 
> I'm not sure what you're suggesting here. Are you suggesting
> corrupted dirty page should return a different errno than a traditional 
> IO error?
> 
> I don't think having a different errno for this would be a good
> idea. Programs expect EIO on IO error and not something else.
> And from the POV of the program it's an IO error.
> 
> Or are you suggesting there should be a mechanism where
> an application could query about more details about a given
> error it retrieved before? I think the latter would be useful
> in some cases, but probably hard to implement. Definitely
> out of scope for this effort.

I'm suggesting that EIO is traditionally for when the data is still
dirty in pagecache and was not able to get back to the backing
store. Do you deny that?

And I think the application might try to handle the case of a
page becoming corrupted differently. Do you deny that?

OK, given the range of errors that APIs are defined to return,
then maybe EIO is the best option. I don't suppose it is possible
to expand them to return something else?
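To make the distinction concrete, here is a sketch of the userspace-side evasive action under discussion (illustrative policy and made-up names, not a real API): on a classic EIO the pagecache copy is intact, so retrying or saving the data elsewhere is sensible.

```c
#include <errno.h>

/* Illustrative only: how an application might react to a writeback
 * error, per the discussion above.  With a classic EIO the data is
 * still intact in pagecache, so it is worth retrying or saving to an
 * alternate location; other errors are treated as fatal here.
 * (Made-up policy and names.) */
enum write_error_reaction { REACT_FATAL, REACT_RETRY_OR_SAVE };

static enum write_error_reaction react_to_fsync_error(int saved_errno)
{
	return saved_errno == EIO ? REACT_RETRY_OR_SAVE : REACT_FATAL;
}
```

Note that if hwpoison also surfaces as plain EIO, the program cannot tell the two cases apart, which is the crux of the argument here.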


> > > > Just seems overengineered. We could rewrite any if/switch statement like
> > > > that (and actually the compiler probably will if it is beneficial).
> > > 
> > > The reason I like it is that it separates the functions cleanly,
> > > without that there would be a dispatcher from hell. Yes it's a bit
> > > ugly that there is a lot of manual bit checking around now too,
> > > but as you go into all the corner cases originally clean code
> > > always tends to get more ugly (and this is a really ugly problem)
> > 
> > Well... it is just writing another dispatcher from hell in a
> > different way really, isn't it? How is it so much better than
> > a simple switch or if/elseif/else statement?
> 
> switch () wouldn't work, it relies on the order.

Order of what?


> if () would work, but it would be larger and IMHO much harder
> to read than the table. I prefer the table, which is a reasonably
> clean mechanism to do that.

OK. Maybe I'll try send a patch to remove it and see how it looks
sometime. I think it's totally overengineered, but that's just me.
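For readers following the table debate: the style being defended is roughly a first-match-wins array scanned in order, which is why ordering matters and a plain switch on a single value cannot express it. A standalone sketch (hypothetical flag bits and handlers, not the actual hwpoison table):

```c
/* Order-dependent dispatch table in the style under discussion.
 * Entries are scanned top to bottom; the first entry whose relevant
 * bits match wins, so more specific states must come first.
 * Hypothetical flag values and handlers for illustration only. */
#define PG_dirty     0x1UL
#define PG_swapcache 0x2UL
#define PG_lru       0x4UL

struct error_state {
	unsigned long mask;  /* bits that matter for this entry */
	unsigned long value; /* required value of those bits */
	int (*action)(unsigned long flags);
};

static int me_swapcache(unsigned long f)       { (void)f; return 1; }
static int me_pagecache_dirty(unsigned long f) { (void)f; return 2; }
static int me_pagecache_clean(unsigned long f) { (void)f; return 3; }

static const struct error_state table[] = {
	{ PG_swapcache, PG_swapcache, me_swapcache },       /* most specific first */
	{ PG_dirty,     PG_dirty,     me_pagecache_dirty },
	{ 0,            0,            me_pagecache_clean }, /* catch-all */
};

static int handle_poisoned(unsigned long flags)
{
	unsigned int i;

	for (i = 0; i < sizeof(table) / sizeof(table[0]); i++)
		if ((flags & table[i].mask) == table[i].value)
			return table[i].action(flags);
	return 0;
}
```

A dirty swapcache page dispatches to the swapcache handler, not the dirty-pagecache one, purely because of entry order; reordering the rows changes behavior, which a switch statement cannot model.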



^ permalink raw reply	[flat|nested] 232+ messages in thread

* Re: [PATCH] [13/16] HWPOISON: The high level memory error handler in the VM v3
  2009-06-02 11:14                   ` Wu Fengguang
@ 2009-06-02 12:19                     ` Nick Piggin
  -1 siblings, 0 replies; 232+ messages in thread
From: Nick Piggin @ 2009-06-02 12:19 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andi Kleen, hugh, riel, akpm, chris.mason, linux-kernel, linux-mm

On Tue, Jun 02, 2009 at 07:14:07PM +0800, Wu Fengguang wrote:
> On Mon, Jun 01, 2009 at 10:40:51PM +0800, Nick Piggin wrote:
> > But you just said that you try to intercept the IO. So the underlying
> > data is not necessarily corrupt. And even if it was then what if it
> > was reinitialized to something else in the meantime (such as filesystem
> > metadata blocks?) You'd just be introducing worse possibilities for
> corruption.
> 
> The IO interception will be based on PFN instead of file offset, so it
> won't affect innocent pages such as your example of reinitialized data.

OK, if you could intercept the IO so it never happens at all, yes
of course that could work.


> poisoned dirty page == corrupt data      => process shall be killed
> poisoned clean page == recoverable data  => process shall survive
> 
> In the case of dirty hwpoison page, if we reload the on disk old data
> and let application proceed with it, it may lead to *silent* data
> corruption/inconsistency, because the application will first see v2
> then v1, which is illogical and hence may mess up its internal data
> structure.

Right, but how do you prevent that? There is no way to reconstruct the
most up-to-date data because it was destroyed.

 
> > You will need to demonstrate a *big* advantage before doing crazy things
> > with writeback ;)
> 
> OK. We can do two things about poisoned writeback pages:
> 
> 1) to stop IO for them, thus avoiding corrupted data hitting disk and/or
>    triggering further machine checks

1b) At which point, you invoke the end-io handlers, and the page is
no longer writeback.

> 2) to isolate them from page cache, thus preventing possible
>    references in the writeback time window

And then this is possible because you aren't violating mm
assumptions, due to 1b. This proceeds just like the pagecache
mce error handler case which exists now.

 
> > > Now it's obvious that reusing more code than truncate_complete_page()
> > > is not easy (or natural).
> > 
> > Just lock the page and wait for writeback, then do the truncate
> > work in another function. In your case if you've already unmapped
> > the page then it won't try to unmap again so no problem.
> > 
> > Truncating from pagecache does not change ->index so you can
> > move the loop logic out.
> 
> Right. So effectively the reusable function is exactly
> truncate_complete_page(). As I said this reuse is not a big gain.

Anyway, we don't have to argue about it. I already sent a patch
because it was so hard to do, so let's move past this ;)


> > > Yes it's kind of insane.  I'm interested in reasoning it out though.

Well, with the IO interception (I had missed this point), it seems
maybe no longer so insane. We could see how it looks.


> > I guess it is a good idea to start simple.
> 
> Agreed.
> 
> > Considering that there are so many other types of pages that are
> > impossible to deal with or have holes, then I very strongly doubt
> > it will be worth so much complexity for closing the gap from 90%
> > to 90.1%. But we'll see.
> 
> Yes, the plan is to first focus on the more important cases.

Great.

Thanks,
Nick


^ permalink raw reply	[flat|nested] 232+ messages in thread

* Re: [PATCH] [13/16] HWPOISON: The high level memory error handler in the VM v3
  2009-06-02 12:10                 ` Nick Piggin
@ 2009-06-02 12:34                   ` Andi Kleen
  -1 siblings, 0 replies; 232+ messages in thread
From: Andi Kleen @ 2009-06-02 12:34 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andi Kleen, hugh, riel, akpm, chris.mason, linux-kernel,
	linux-mm, fengguang.wu

On Tue, Jun 02, 2009 at 02:10:31PM +0200, Nick Piggin wrote:
> > > Just extract the part where it has the page locked into a common
> > > function.
> > 
> > That doesn't do some stuff we want to do, like try_to_release_buffers
> > And there's the page count problem with remove_mapping
> > 
> > That could be probably fixed, but to be honest I'm uncomfortable
> > fiddling with truncate internals.
> 
> You're looking at invalidate, which is different. See my
> last patch.

Hmm. 

> 
> > > > Is there anything concretely wrong with the current code?
> > > 
> > > 
> > > /*
> > >  * Truncate does the same, but we're not quite the same
> > >  * as truncate. Needs more checking, but keep it for now.
> > >  */
> > > 
> > > I guess that it duplicates this tricky truncate code and also
> > > says it is different (but AFAIKS it is doing exactly the same
> > > thing).
> > 
> > It's not, there are various differences (like the reference count)
> 
> No. If there are, then it *really* needs better documentation. I
> don't think there are, though.

Better documentation on what? You want a detailed listing in a comment
of how it is different from truncate?

To be honest I have some doubts about the usefulness of such a comment
(why stop at truncate and not list the differences from every other
page cache operation?) but if you insist (do you?) I can add one.

> I'm suggesting that EIO is traditionally for when the data still
> dirty in pagecache and was not able to get back to backing
> store. Do you deny that?

Yes. That is exactly the case in which memory-failure triggers EIO:
a memory error on a dirty file-mapped page.

> And I think the application might try to handle the case of a
> page becoming corrupted differently. Do you deny that?

You mean a clean file-mapped page? In this case there is no EIO,
memory-failure just drops the page and it is reloaded.

If the page is dirty we trigger EIO which as you said above is the 
right reaction.

> 
> OK, given the range of errors that APIs are defined to return,
> then maybe EIO is the best option. I don't suppose it is possible
> to expand them to return something else?

Expand the syscalls to return other errnos on specific
kinds of IO error?
 
Of course that's possible, but it has the problem that you 
would need to fix all the applications that expect EIO for
IO error. The latter I consider infeasible.

> > > > > Just seems overengineered. We could rewrite any if/switch statement like
> > > > > that (and actually the compiler probably will if it is beneficial).
> > > > 
> > > > The reason I like it is that it separates the functions cleanly,
> > > > without that there would be a dispatcher from hell. Yes it's a bit
> > > > ugly that there is a lot of manual bit checking around now too,
> > > > but as you go into all the corner cases originally clean code
> > > > always tends to get more ugly (and this is a really ugly problem)
> > > 
> > > Well... it is just writing another dispatcher from hell in a
> > > different way really, isn't it? How is it so much better than
> > > a simple switch or if/elseif/else statement?
> > 
> > switch () wouldn't work, it relies on the order.
> 
> Order of what?

Order of the bit tests. 

> 
> 
> > if () would work, but it would be larger and IMHO much harder
> > to read than the table. I prefer the table, which is a reasonably
> > clean mechanism to do that.
> 
> OK. Maybe I'll try send a patch to remove it and see how it looks
> sometime. I think it's totally overengineered, but that's just me.

I disagree on that.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 232+ messages in thread

* Re: [PATCH] [13/16] HWPOISON: The high level memory error handler in the VM v3
  2009-06-02 12:34                   ` Andi Kleen
@ 2009-06-02 12:37                     ` Nick Piggin
  -1 siblings, 0 replies; 232+ messages in thread
From: Nick Piggin @ 2009-06-02 12:37 UTC (permalink / raw)
  To: Andi Kleen
  Cc: hugh, riel, akpm, chris.mason, linux-kernel, linux-mm, fengguang.wu

On Tue, Jun 02, 2009 at 02:34:50PM +0200, Andi Kleen wrote:
> On Tue, Jun 02, 2009 at 02:10:31PM +0200, Nick Piggin wrote:
> > > It's not, there are various differences (like the reference count)
> > 
> > No. If there are, then it *really* needs better documentation. I
> > don't think there are, though.
> 
> Better documentation on what? You want a detailed listing in a comment
> of how it is different from truncate?
> 
> To be honest I have some doubts about the usefulness of such a comment
> (why stop at truncate and not list the differences from every other
> page cache operation?) but if you insist (do you?) I can add one.

Because I don't see any difference (see my previous patch). I
still don't know what it is supposed to be doing differently.
So if you reinvent your own that looks close enough to truncate
to warrant a comment to say /* this is close to truncate but
not quite */, then yes I insist that you say exactly why it is
not quite like truncate ;)

 
> > I'm suggesting that EIO is traditionally for when the data is still
> > dirty in pagecache and was not able to be written back to backing
> > store. Do you deny that?
> 
> Yes. That is exactly the case when memory-failure triggers EIO
> 
> Memory error on a dirty file mapped page.

But it is no longer dirty, and the problem was not that the data
was unable to be written back.


> > And I think the application might try to handle the case of a
> > page becoming corrupted differently. Do you deny that?
> 
> You mean a clean file-mapped page? In this case there is no EIO,
> memory-failure just drops the page and it is reloaded.
> 
> If the page is dirty we trigger EIO which as you said above is the 
> right reaction.

No, I mean the difference between the case of a dirty page unable to
be written to backing store, and the case of a dirty page becoming
corrupted.


> > OK, given the range of errors that APIs are defined to return,
> > then maybe EIO is the best option. I don't suppose it is possible
> > to expand them to return something else?
> 
> Expand the syscalls to return other errnos on specific
> kinds of IO error?
>  
> Of course that's possible, but it has the problem that you 
> would need to fix all the applications that expect EIO for
> IO error. The later I consider infeasible.

They would presumably exit or do some default thing, which I
think would be fine. Actually if your code catches them in the
act of manipulating a corrupted page (ie. if it is mmapped),
then it gets a SIGBUS.

^ permalink raw reply	[flat|nested] 232+ messages in thread

* Re: [PATCH] [13/16] HWPOISON: The high level memory error handler in the VM v3
  2009-06-02 12:00                 ` Nick Piggin
@ 2009-06-02 12:47                   ` Andi Kleen
  -1 siblings, 0 replies; 232+ messages in thread
From: Andi Kleen @ 2009-06-02 12:47 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andi Kleen, Wu Fengguang, hugh, riel, akpm, chris.mason,
	linux-kernel, linux-mm

On Tue, Jun 02, 2009 at 02:00:42PM +0200, Nick Piggin wrote:
> > On Mon, Jun 01, 2009 at 01:50:46PM +0200, Nick Piggin wrote:
> > > > Another major complexity is on calling the isolation routines to
> > > > remove references from
> > > >         - PTE
> > > >         - page cache
> > > >         - swap cache
> > > >         - LRU list
> > > > They more or less made some assumptions on their operating environment
> > > > that we have to take care of.  Unfortunately these complexities are
> > > > also not easily resolvable.
> > > > 
> > > > > (and few comments) of all the files in mm/. If you want to get rid
> > > > 
> > > > I promise I'll add more comments :)
> > > 
> > > OK, but they should still go in their relevant files. Or as best as
> > > possible. Right now it's just silly to have all this here when much
> > > of it could be moved out to filemap.c, swap_state.c, page_alloc.c, etc.
> > 
> > Can you be more specific about what that "all this" is? 
> 
> The functions which take action in response to a bad page being 
> detected. They belong with the subsystem that the page belongs
> to. I'm amazed this is causing so much argument or confusion
> because it is how the rest of mm/ code is arranged. OK, Hugh has
> a point about ifdefs, but OTOH we have lots of ifdefs like this.

Well we're already calling into that subsystem, just not with
a single function call.

> > > > > of the page and don't care what it's count or dirtyness is, then
> > > > > truncate_inode_pages_range is the correct API to use.
> > > > >
> > > > > (or you could extract out some of it so you can call it directly on
> > > > > individual locked pages, if that helps).
> > > >  
> > > > The patch to move over to truncate_complete_page() would look like this.
> > > > It's not a big win indeed.
> > > 
> > > No I don't mean to do this, but to move the truncate_inode_pages
> > > code for truncating a single, locked, page into another function
> > > in mm/truncate.c and then call that from here.
> > 
> > I took a look at that.  First there's no direct equivalent of
> > me_pagecache_clean/dirty in truncate.c and to be honest I don't
> > see a clean way to refactor any of the existing functions to 
> > do the same.
> 
> With all that writing you could have just done it. It's really

I would have done it if it made sense to me, but so far it hasn't.

The problem with your suggestion is that you paint the big picture,
but seem to skip over a lot of details. But details matter.

> not a big deal and just avoids duplicating code. I attached an
> (untested) patch.

Thanks. But the function in the patch is not doing the same as what
the me_pagecache_clean/dirty handlers are doing. For one, there is no error
checking, as in the second try_to_release_page().

Then it doesn't do all the IO error and missing mapping handling.

The page_mapped() check is useless because the pages are not 
mapped here etc.

We could probably call truncate_complete_page(), but then
we would also need to duplicate most of the checking outside
the function anyways and there wouldn't be any possibility
to share the clean/dirty variants. If you insist I can
do it, but I think it would be significantly worse code
than before and I'm reluctant to do that.

I also don't really see what the big deal is in just
calling these few functions directly. After all, we're not
truncating here, and they're all already called from other files.

> > > No, it seems rather insane to do something like this here that no other
> > > code in the mm ever does.
> > 
> > Just because the rest of the VM doesn't do it doesn't mean it can't make sense.
> 
> It is going to be possible to do it somehow surely, but it is insane
> to try to add such constraints to the VM to close a few small windows

We don't know currently if they are small. If they are small I would
agree with you, but that needs numbers. That said, fancy writeback handling
is currently not on my agenda.

> if you already have other large ones.

That's unclear too.

-Andi


^ permalink raw reply	[flat|nested] 232+ messages in thread

* Re: [PATCH] [13/16] HWPOISON: The high level memory error handler in the VM v3
  2009-06-02 12:19                     ` Nick Piggin
@ 2009-06-02 12:51                       ` Wu Fengguang
  -1 siblings, 0 replies; 232+ messages in thread
From: Wu Fengguang @ 2009-06-02 12:51 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andi Kleen, hugh, riel, akpm, chris.mason, linux-kernel, linux-mm

On Tue, Jun 02, 2009 at 08:19:40PM +0800, Nick Piggin wrote:
> On Tue, Jun 02, 2009 at 07:14:07PM +0800, Wu Fengguang wrote:
> > On Mon, Jun 01, 2009 at 10:40:51PM +0800, Nick Piggin wrote:
> > > But you just said that you try to intercept the IO. So the underlying
> > > data is not necessarily corrupt. And even if it was then what if it
> > > was reinitialized to something else in the meantime (such as filesystem
> > > metadata blocks?) You'd just be introducing worse possibilities for
> > > corruption.
> > 
> > The IO interception will be based on PFN instead of file offset, so it
> > won't affect innocent pages such as your example of reinitialized data.
> 
> OK, if you could intercept the IO so it never happens at all, yes
> of course that could work.
> 
> > poisoned dirty page == corrupt data      => process shall be killed
> > poisoned clean page == recoverable data  => process shall survive
> > 
> > In the case of dirty hwpoison page, if we reload the on disk old data
> > and let application proceed with it, it may lead to *silent* data
> > corruption/inconsistency, because the application will first see v2
> > then v1, which is illogical and hence may mess up its internal data
> > structure.
> 
> Right, but how do you prevent that? There is no way to reconstruct the
> most up-to-date data because it was destroyed.

Kill the application ruthlessly, rather than allow it to go rotten quietly.
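The dirty/clean policy quoted above can be sketched as a tiny decision function. This is a userspace simulation with made-up names, not kernel code; it only captures the reasoning, not the mechanism:

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative model of the policy: a poisoned dirty page means the only
 * up-to-date copy is gone, so the process must be killed; a poisoned
 * clean page can simply be dropped and reloaded from disk. */
enum hwpoison_action { ACTION_KILL_PROCESS, ACTION_RELOAD_FROM_DISK };

static enum hwpoison_action handle_poisoned_page(bool dirty)
{
    /* Dirty: reloading stale on-disk data would let the application see
     * v1 after v2, i.e. silent corruption, so killing is the safe call. */
    return dirty ? ACTION_KILL_PROCESS : ACTION_RELOAD_FROM_DISK;
}
```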

> > > You will need to demonstrate a *big* advantage before doing crazy things
> > > with writeback ;)
> > 
> > OK. We can do two things about poisoned writeback pages:
> > 
> > 1) to stop IO for them, thus avoid corrupted data to hit disk and/or
> >    trigger further machine checks
> 
> 1b) At which point, you invoke the end-io handlers, and the page is
> no longer writeback.
> 
> > 2) to isolate them from page cache, thus preventing possible
> >    references in the writeback time window
> 
> And then this is possible because you aren't violating mm
> assumptions due to 1b. This proceeds just as the existing
> pagecache mce error handler case which exists now.

Yeah, that's a good scheme - we are talking about two interception
schemes. Mine is a passive one and yours is an active one.

passive: check hwpoison pages at __generic_make_request()/elv_next_request() 
         (the code will be enabled by an mce_bad_io_pages counter)

active:  iterate all queued requests for hwpoison pages

Each has its merits and complexities.

I'll list the merits (+) and complexities (-) of the passive approach;
with them you automatically get the merits of the active one:

+ works in generic code and doesn't have to touch all deadline/as/cfq elevators
- the wait_on_page_writeback() puzzle because of the writeback time window

+ could also intercept the "cannot de-dirty for now" pages when they
  eventually go to writeback IO
- have to avoid filesystem references on PG_hwpoison pages, eg.
  - zeroing partial EOF page when i_size is not page aligned
  - calculating checksums
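The passive scheme above can be sketched in userspace terms. The counter name mce_bad_io_pages is taken from the discussion and is an assumption, not an existing kernel symbol; a real implementation would hook __generic_make_request()/elv_next_request() and match PFNs rather than this toy array:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_POISONED 16

static unsigned long poisoned_pfns[MAX_POISONED];
static int mce_bad_io_pages;            /* 0 => fast path, no scan at all */

static void mark_pfn_poisoned(unsigned long pfn)
{
    if (mce_bad_io_pages < MAX_POISONED)
        poisoned_pfns[mce_bad_io_pages++] = pfn;
}

/* Would be called from the IO submission path; returns true if the
 * request touches a poisoned page and must be failed instead of
 * letting corrupt data hit the disk. */
static bool request_hits_poison(const unsigned long *req_pfns, size_t n)
{
    if (mce_bad_io_pages == 0)          /* common case: nothing poisoned */
        return false;
    for (size_t i = 0; i < n; i++)
        for (int j = 0; j < mce_bad_io_pages; j++)
            if (req_pfns[i] == poisoned_pfns[j])
                return true;
    return false;
}
```

The counter gate is what keeps the check essentially free for healthy machines, which is the main selling point of the passive approach.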


> > > > Now it's obvious that reusing more code than truncate_complete_page()
> > > > is not easy (or natural).
> > > 
> > > Just lock the page and wait for writeback, then do the truncate
> > > work in another function. In your case if you've already unmapped
> > > the page then it won't try to unmap again so no problem.
> > > 
> > > Truncating from pagecache does not change ->index so you can
> > > move the loop logic out.
> > 
> > Right. So effectively the reusable function is exactly
> > truncate_complete_page(). As I said this reuse is not a big gain.
> 
> Anyway, we don't have to argue about it. I already sent a patch
> because it was so hard to do, so let's move past this ;)
> 
> 
> > > > Yes it's kind of insane.  I'm interested in reasoning it out though.
> 
> Well, with the IO interception (I missed this point), it seems
> maybe no longer so insane. We could see how it looks.

OK.

Thanks,
Fengguang


^ permalink raw reply	[flat|nested] 232+ messages in thread


* Re: [PATCH] [13/16] HWPOISON: The high level memory error handler in the VM v3
  2009-06-02 12:37                     ` Nick Piggin
@ 2009-06-02 12:55                       ` Andi Kleen
  -1 siblings, 0 replies; 232+ messages in thread
From: Andi Kleen @ 2009-06-02 12:55 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andi Kleen, hugh, riel, akpm, chris.mason, linux-kernel,
	linux-mm, fengguang.wu

On Tue, Jun 02, 2009 at 02:37:20PM +0200, Nick Piggin wrote:
> Because I don't see any difference (see my previous patch). I
> still don't know what it is supposed to be doing differently.
> So if you reinvent your own that looks close enough to truncate
> to warrant a comment to say /* this is close to truncate but
> not quite */, then yes I insist that you say exactly why it is
> not quite like truncate ;)

I will just delete that comment because it apparently causes so 
much confusion.

> 
>  
> > > I'm suggesting that EIO is traditionally for when the data still
> > > dirty in pagecache and was not able to get back to backing
> > > store. Do you deny that?
> > 
> > Yes. That is exactly the case when memory-failure triggers EIO
> > 
> > Memory error on a dirty file mapped page.
> 
> But it is no longer dirty, and the problem was not that the data
> was unable to be written back.

Sorry, I don't understand. What do you mean by "no longer dirty"?

Of course it's still dirty, just has to be discarded because it's 
corrupted.

> > > And I think the application might try to handle the case of a
> > > page becoming corrupted differently. Do you deny that?
> > 
> > You mean a clean file-mapped page? In this case there is no EIO,
> > memory-failure just drops the page and it is reloaded.
> > 
> > If the page is dirty we trigger EIO which as you said above is the 
> > right reaction.
> 
> No I mean the difference between the case of dirty page unable to
> be written to backing store, and the case of dirty page becoming
> corrupted.

Nick, I have really a hard time following you here.

What exactly do you want? 

A new errno? Or something else? If yes what precisely?

I currently don't see any sane way to report this to the application
through write().  That is because adding a new errno for something
is incredibly hard and often impossible, and that's certainly
the case here.

The application can detect it if it maps the 
shared page and waits for a SIGBUS, but not through write().

But I doubt there will really be any apps that do anything differently
here anyways. A clever app could retry a few times if it still
has a copy of the data, but that might even make sense on normal
IO errors (e.g. on a SAN).

> 
> 
> > > OK, given the range of errors that APIs are defined to return,
> > > then maybe EIO is the best option. I don't suppose it is possible
> > > to expand them to return something else?
> > 
> > Expand the syscalls to return other errnos on specific
> > kinds of IO error?
> >  
> > Of course that's possible, but it has the problem that you 
> > would need to fix all the applications that expect EIO for
> > IO error. The latter I consider infeasible.
> 
> They would presumably exit or do some default thing, which I
> think would be fine.

No, it's not fine if they would handle EIO; e.g. consider
a sophisticated database which likely has sophisticated
IO error mechanisms too (e.g. only abort the current commit)

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 232+ messages in thread


* Re: [PATCH] [13/16] HWPOISON: The high level memory error handler in the VM v3
  2009-06-02 12:47                   ` Andi Kleen
@ 2009-06-02 12:57                     ` Nick Piggin
  -1 siblings, 0 replies; 232+ messages in thread
From: Nick Piggin @ 2009-06-02 12:57 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Wu Fengguang, hugh, riel, akpm, chris.mason, linux-kernel, linux-mm

On Tue, Jun 02, 2009 at 02:47:57PM +0200, Andi Kleen wrote:
> On Tue, Jun 02, 2009 at 02:00:42PM +0200, Nick Piggin wrote:
> > not a big deal and just avoids duplicating code. I attached an
> > (untested) patch.
> 
> Thanks. But the function in the patch is not doing the same as what
> the me_pagecache_clean/dirty handlers are doing. For one, there is no error
> checking, as in the second try_to_release_page()
> 
> Then it doesn't do all the IO error and missing mapping handling.

Obviously I don't mean just use that single call for the entire
handler. You can set the EIO bit or whatever you like. The
"error handling" you have there also seems strange. You could
retain it, but the page is assured to be removed from pagecache.

 
> The page_mapped() check is useless because the pages are not 
> mapped here etc.

That's OK, it is a core part of the protocol to prevent
truncated pages from being mapped, so I like it to be in
that function.

(you are also doing extraneous page_mapped tests in your handler,
so surely your concern isn't from the perspective of this
error handler code)


> We could probably call truncate_complete_page(), but then
> we would also need to duplicate most of the checking outside
> the function anyways and there wouldn't be any possibility
> to share the clean/dirty variants. If you insist I can
> do it, but I think it would be significantly worse code
> than before and I'm reluctant to do that.

I can write you the patch for that too if you like.

 
> I don't also really see what the big deal is of just
> calling these few functions directly. After all we're not
> truncating here and they're all already called from other files.
>
> > > > No, it seems rather insane to do something like this here that no other
> > > > code in the mm ever does.
> > > 
> > > Just because the rest of the VM doesn't do it doesn't mean it might make sense.
> > 
> > It is going to be possible to do it somehow surely, but it is insane
> > to try to add such constraints to the VM to close a few small windows
> 
> We don't know currently if they are small. If they are small I would
> agree with you, but that needs numbers. That said fancy writeback handling
> is currently not on my agenda.

Yes, writeback pages are very limited, a tiny number at any time and
the fraction gets relatively smaller as total RAM size gets larger.


> > if you already have other large ones.
> 
> That's unclear too.

You can't do much about most kernel pages, and dirty metadata pages
are going to cause big problems too. User pagetable pages. Lots of
stuff.

^ permalink raw reply	[flat|nested] 232+ messages in thread


* Re: [PATCH] [13/16] HWPOISON: The high level memory error handler in the VM v3
  2009-06-02 12:47                   ` Andi Kleen
@ 2009-06-02 13:02                     ` Wu Fengguang
  -1 siblings, 0 replies; 232+ messages in thread
From: Wu Fengguang @ 2009-06-02 13:02 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Nick Piggin, hugh, riel, akpm, chris.mason, linux-kernel, linux-mm

On Tue, Jun 02, 2009 at 08:47:57PM +0800, Andi Kleen wrote:
> On Tue, Jun 02, 2009 at 02:00:42PM +0200, Nick Piggin wrote:
> > > On Mon, Jun 01, 2009 at 01:50:46PM +0200, Nick Piggin wrote:
> > > > > Another major complexity is on calling the isolation routines to
> > > > > remove references from
> > > > >         - PTE
> > > > >         - page cache
> > > > >         - swap cache
> > > > >         - LRU list
> > > > > They more or less made some assumptions on their operating environment
> > > > > that we have to take care of.  Unfortunately these complexities are
> > > > > also not easily resolvable.
> > > > > 
> > > > > > (and few comments) of all the files in mm/. If you want to get rid
> > > > > 
> > > > > I promise I'll add more comments :)
> > > > 
> > > > OK, but they should still go in their relevant files. Or as best as
> > > > possible. Right now it's just silly to have all this here when much
> > > > of it could be moved out to filemap.c, swap_state.c, page_alloc.c, etc.
> > > 
> > > Can you be more specific what that "all this" is? 
> > 
> > The functions which take action in response to a bad page being 
> > detected. They belong with the subsystem that the page belongs
> > to. I'm amazed this is causing so much argument or confusion
> > because it is how the rest of mm/ code is arranged. OK, Hugh has
> > a point about ifdefs, but OTOH we have lots of ifdefs like this.
> 
> Well we're already calling into that subsystem, just not with
> a single function call.
> 
> > > > > > of the page and don't care what it's count or dirtyness is, then
> > > > > > truncate_inode_pages_range is the correct API to use.
> > > > > >
> > > > > > (or you could extract out some of it so you can call it directly on
> > > > > > individual locked pages, if that helps).
> > > > >  
> > > > > The patch to move over to truncate_complete_page() would look like this.
> > > > > It's not a big win indeed.
> > > > 
> > > > No I don't mean to do this, but to move the truncate_inode_pages
> > > > code for truncating a single, locked, page into another function
> > > > in mm/truncate.c and then call that from here.
> > > 
> > > I took a look at that.  First there's no direct equivalent of
> > > me_pagecache_clean/dirty in truncate.c and to be honest I don't
> > > see a clean way to refactor any of the existing functions to 
> > > do the same.
> > 
> > With all that writing you could have just done it. It's really
> 
> I would have done it if it made sense to me, but so far it hasn't.
> 
> The problem with your suggestion is that you do the big picture,
> but seem to skip over a lot of details. But details matter.
> 
> > not a big deal and just avoids duplicating code. I attached an
> > (untested) patch.
> 
> Thanks. But the function in the patch is not doing the same as what
> the me_pagecache_clean/dirty handlers are doing. For one, there is no error
> checking, as in the second try_to_release_page()
> 
> Then it doesn't do all the IO error and missing mapping handling.
> 
> The page_mapped() check is useless because the pages are not 
> mapped here etc.
> 
> We could probably call truncate_complete_page(), but then
> we would also need to duplicate most of the checking outside
> the function anyways and there wouldn't be any possibility
> to share the clean/dirty variants. If you insist I can
> do it, but I think it would be significantly worse code
> than before and I'm reluctant to do that.
> 
> I don't also really see what the big deal is of just
> calling these few functions directly. After all we're not
> truncating here and they're all already called from other files.

Yes I like the current "one code block calling one elemental function
to isolate from one reference source" scenario:
         - PTE
         - page cache
         - swap cache
         - LRU list

Calling into the generic truncate code only messes up the concepts.
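The "one elemental function per reference source" structure can be modeled as a toy simulation. The bitmask bookkeeping here is purely illustrative; in the kernel the real work is done by try_to_unmap(), page cache and swap cache removal, and LRU isolation:

```c
#include <assert.h>

/* Toy model: each isolation step drops exactly one class of reference
 * on the poisoned page, mirroring the four sources listed above. */
enum {
    REF_PTE        = 1 << 0,   /* dropped by try_to_unmap()           */
    REF_PAGE_CACHE = 1 << 1,   /* dropped by page cache removal       */
    REF_SWAP_CACHE = 1 << 2,   /* dropped by swap cache removal       */
    REF_LRU        = 1 << 3,   /* dropped by LRU isolation            */
};

struct toy_page { unsigned refs; };

static void isolate_step(struct toy_page *p, unsigned ref)
{
    p->refs &= ~ref;           /* one elemental routine, one source */
}

static int fully_isolated(const struct toy_page *p)
{
    return p->refs == 0;
}
```

The point of keeping the steps separate is exactly what the simulation shows: each code block has one clearly delimited responsibility, and the page is only safe once every source has been handled.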

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 232+ messages in thread


* Re: [PATCH] [13/16] HWPOISON: The high level memory error handler in the VM v3
  2009-06-02 12:55                       ` Andi Kleen
@ 2009-06-02 13:03                         ` Nick Piggin
  -1 siblings, 0 replies; 232+ messages in thread
From: Nick Piggin @ 2009-06-02 13:03 UTC (permalink / raw)
  To: Andi Kleen
  Cc: hugh, riel, akpm, chris.mason, linux-kernel, linux-mm, fengguang.wu

On Tue, Jun 02, 2009 at 02:55:38PM +0200, Andi Kleen wrote:
> On Tue, Jun 02, 2009 at 02:37:20PM +0200, Nick Piggin wrote:
> > Because I don't see any difference (see my previous patch). I
> > still don't know what it is supposed to be doing differently.
> > So if you reinvent your own that looks close enough to truncate
> > to warrant a comment to say /* this is close to truncate but
> > not quite */, then yes I insist that you say exactly why it is
> > not quite like truncate ;)
> 
> I will just delete that comment because it apparently causes so 
> much confusion.

And replace it with something that actually clears up the
confusion.


> > > > I'm suggesting that EIO is traditionally for when the data still
> > > > dirty in pagecache and was not able to get back to backing
> > > > store. Do you deny that?
> > > 
> > > Yes. That is exactly the case when memory-failure triggers EIO
> > > 
> > > Memory error on a dirty file mapped page.
> > 
> > But it is no longer dirty, and the problem was not that the data
> > was unable to be written back.
> 
> Sorry, I don't understand. What do you mean by "no longer dirty"?
> 
> Of course it's still dirty, just has to be discarded because it's 
> corrupted.

The pagecache location is no longer dirty. Userspace only cares
about pagecache locations and their contents, not the page that
was once there and has now been taken out.

 
> > > > And I think the application might try to handle the case of a
> > > > page becoming corrupted differently. Do you deny that?
> > > 
> > > You mean a clean file-mapped page? In this case there is no EIO,
> > > memory-failure just drops the page and it is reloaded.
> > > 
> > > If the page is dirty we trigger EIO which as you said above is the 
> > > right reaction.
> > 
> > No I mean the difference between the case of dirty page unable to
> > be written to backing store, and the case of dirty page becoming
> > corrupted.
> 
> Nick, I really have a hard time following you here.
> 
> What exactly do you want? 
> 
> A new errno? Or something else? If yes what precisely?

Yeah a new errno would be nice. Precisely one to say that the
memory was corrupted.

 
> I currently don't see any sane way to report this to the application
> through write().  That is because adding a new errno for something
> is incredibly hard and often impossible, and that's certainly
> the case here.
> 
> The application can detect it if it maps the 
> shared page and waits for a SIGBUS, but not through write().
> 
> But I doubt there will really be any apps that do anything differently
> here anyway. A clever app could retry a few times if it still
> has a copy of the data, but that might even make sense on normal
> IO errors (e.g. on a SAN).

I'm sure some of the ones that really care would.


> > They would presumably exit or do some default thing, which I
> > think would be fine.
> 
> No it's not fine if they would handle EIO. e.g. consider
> a sophisticated database which likely has sophisticated
> IO error mechanisms too (e.g. only abort the current commit)

Umm, if it is a generic "this failed, we can abort" handler, then
why wouldn't that be the default case? The issue is if it does
something different specifically for EIO, specifically on the
assumption that the pagecache is still valid and dirty.


* Re: [PATCH] [13/16] HWPOISON: The high level memory error handler in the VM v3
  2009-06-02 13:20                           ` Andi Kleen
@ 2009-06-02 13:19                             ` Nick Piggin
  -1 siblings, 0 replies; 232+ messages in thread
From: Nick Piggin @ 2009-06-02 13:19 UTC (permalink / raw)
  To: Andi Kleen
  Cc: hugh, riel, akpm, chris.mason, linux-kernel, linux-mm, fengguang.wu

On Tue, Jun 02, 2009 at 03:20:02PM +0200, Andi Kleen wrote:
> On Tue, Jun 02, 2009 at 03:03:06PM +0200, Nick Piggin wrote:
> > > > > > I'm suggesting that EIO is traditionally for when the data is still
> > > > > > dirty in pagecache and was not able to get back to backing
> > > > > > store. Do you deny that?
> > > > > 
> > > > > Yes. That is exactly the case when memory-failure triggers EIO
> > > > > 
> > > > > Memory error on a dirty file mapped page.
> > > > 
> > > > But it is no longer dirty, and the problem was not that the data
> > > > was unable to be written back.
> > > 
> > > Sorry I don't understand. What do you mean with "no longer dirty"
> > > 
> > > Of course it's still dirty, just has to be discarded because it's 
> > > corrupted.
> > 
> > The pagecache location is no longer dirty. Userspace only cares
> > about pagecache locations and their contents, not the page that
> > was once there and has now been taken out.
> 
> Sorry, but that just sounds wrong to me. User space has no clue
> about the page cache, it just wants to know that the write it just

Umm, a pagecache location is an (inode, offset) pair, which is exactly
what userspace cares about, and applications obviously know there can
be a writeback cache there, because that's why fsync exists.


> did didn't reach the disk. And that's what happened and
> what we report here.

It didn't reach the disk, the dirty data was destroyed, and the
location will be recreated by the filesystem from some state that is
unknown to userspace. If you can't see how this is different from the
rest of our EIO conditions, then I can't spell it out any simpler.


> > > A new errno? Or something else? If yes what precisely?
> > 
> > Yeah a new errno would be nice. Precisely one to say that the
> > memory was corrupted.
> 
> Ok.  I firmly think a new errno is a bad idea because I don't
> want to explain to a lot of people how to fix their applications.
> Compatibility is important.

Fair enough, maybe EIO is the best option, but I just want
people to think about it.

 
> > > No it's not fine if they would handle EIO. e.g. consider
> > > a sophisticated database which likely has sophisticated
> > > IO error mechanisms too (e.g. only abort the current commit)
> > 
> > Umm, if it is a generic "this failed, we can abort" then why
> > would not this be the default case. The issue is if it does
> > something differnet specifically for EIO, and specifically
> > assuming the pagecache is still valid and dirty.
> 
> How would it make such an assumption?

I guess either way you have to make assumptions about how the
app uses errno.

 
> I assume that an application that does something with EIO will
> either retry a few times or give up. Either is ok here.

That's exactly the case where it is not OK: the dirty page has now
been removed from pagecache, so the subsequent fsync is going to
succeed and the app will think its dirty data has hit disk.



* Re: [PATCH] [13/16] HWPOISON: The high level memory error handler in the VM v3
  2009-06-02 13:03                         ` Nick Piggin
@ 2009-06-02 13:20                           ` Andi Kleen
  -1 siblings, 0 replies; 232+ messages in thread
From: Andi Kleen @ 2009-06-02 13:20 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andi Kleen, hugh, riel, akpm, chris.mason, linux-kernel,
	linux-mm, fengguang.wu

On Tue, Jun 02, 2009 at 03:03:06PM +0200, Nick Piggin wrote:
> > > > > I'm suggesting that EIO is traditionally for when the data is still
> > > > > dirty in pagecache and was not able to get back to backing
> > > > > store. Do you deny that?
> > > > 
> > > > Yes. That is exactly the case when memory-failure triggers EIO
> > > > 
> > > > Memory error on a dirty file mapped page.
> > > 
> > > But it is no longer dirty, and the problem was not that the data
> > > was unable to be written back.
> > 
> > Sorry, I don't understand. What do you mean by "no longer dirty"?
> > 
> > Of course it's still dirty, just has to be discarded because it's 
> > corrupted.
> 
> The pagecache location is no longer dirty. Userspace only cares
> about pagecache locations and their contents, not the page that
> was once there and has now been taken out.

Sorry, but that just sounds wrong to me. User space has no clue
about the page cache, it just wants to know that the write it just
did didn't reach the disk. And that's what happened and
what we report here.

Retries can make sense in some cases, but then user space
should do them for other EIO cases too.

> > > > > And I think the application might try to handle the case of a
> > > > > page becoming corrupted differently. Do you deny that?
> > > > 
> > > > You mean a clean file-mapped page? In this case there is no EIO,
> > > > memory-failure just drops the page and it is reloaded.
> > > > 
> > > > If the page is dirty we trigger EIO which as you said above is the 
> > > > right reaction.
> > > 
> > > No I mean the difference between the case of dirty page unable to
> > > be written to backing store, and the case of dirty page becoming
> > > corrupted.
> > 
> > Nick, I really have a hard time following you here.
> > 
> > What exactly do you want? 
> > 
> > A new errno? Or something else? If yes what precisely?
> 
> Yeah a new errno would be nice. Precisely one to say that the
> memory was corrupted.

Ok.  I firmly think a new errno is a bad idea because I don't
want to explain to a lot of people how to fix their applications.
Compatibility is important.

> > I currently don't see any sane way to report this to the application
> > through write().  That is because adding a new errno for something
> > is incredibly hard and often impossible, and that's certainly
> > the case here.
> > 
> > The application can detect it if it maps the 
> > shared page and waits for a SIGBUS, but not through write().
> > 
> > But I doubt there will really be any apps that do anything differently
> > here anyway. A clever app could retry a few times if it still
> > has a copy of the data, but that might even make sense on normal
> > IO errors (e.g. on a SAN).
> 
> I'm sure some of the ones that really care would.

Even if they would there would be still old binaries around.
Compatibility is important.

> 
> 
> > > They would presumably exit or do some default thing, which I
> > > think would be fine.
> > 
> > No it's not fine if they would handle EIO. e.g. consider
> > a sophisticated database which likely has sophisticated
> > IO error mechanisms too (e.g. only abort the current commit)
> 
> Umm, if it is a generic "this failed, we can abort" handler, then
> why wouldn't that be the default case? The issue is if it does
> something different specifically for EIO, specifically on the
> assumption that the pagecache is still valid and dirty.

How would it make such an assumption?

I assume that an application that does something with EIO will
either retry a few times or give up. Either is ok here.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.


* Re: [PATCH] [13/16] HWPOISON: The high level memory error handler in the VM v3
  2009-06-02 13:25                       ` Andi Kleen
@ 2009-06-02 13:24                         ` Nick Piggin
  -1 siblings, 0 replies; 232+ messages in thread
From: Nick Piggin @ 2009-06-02 13:24 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Wu Fengguang, hugh, riel, akpm, chris.mason, linux-kernel, linux-mm

On Tue, Jun 02, 2009 at 03:25:38PM +0200, Andi Kleen wrote:
> On Tue, Jun 02, 2009 at 02:57:13PM +0200, Nick Piggin wrote:
> > > > not a big deal and just avoids duplicating code. I attached an
> > > > (untested) patch.
> > > 
> > > Thanks. But the function in the patch is not doing the same as what
> > > the me_pagecache_clean/dirty handlers are doing. For one, there is no
> > > error checking, as in the second try_to_release_page()
> > > 
> > > Then it doesn't do all the IO error and missing mapping handling.
> > 
> > Obviously I don't mean just use that single call for the entire
> > handler. You can set the EIO bit or whatever you like. The
> > "error handling" you have there also seems strange. You could
> > retain it, but the page is assured to be removed from pagecache.
> 
> The reason this code double checks is that someone could hold a
> reference (remember we can come in at any time) that we cannot kill
> immediately.

Can't kill what? The page is gone from pagecache. Other kernel
references may remain, but I don't see why this code should consider
that a failure (rather than, for example, a raised error
count).

 
> > > The page_mapped() check is useless because the pages are not 
> > > mapped here etc.
> > 
> > That's OK, it is a core part of the protocol to prevent
> > truncated pages from being mapped, so I like it to be in
> > that function.
> > 
> > (you are also doing extraneous page_mapped tests in your handler,
> > so surely your concern isn't from the perspective of this
> > error handler code)
> 
> We do page_mapping() checks, not page_mapped checks.
> 
> I know details, but ...

+static int me_pagecache_clean(struct page *p)
+{
+       if (!isolate_lru_page(p))
+               page_cache_release(p);
+
+       if (page_has_private(p))
+               do_invalidatepage(p, 0);
+       if (page_has_private(p) && !try_to_release_page(p, GFP_NOIO))
+               Dprintk(KERN_ERR "MCE %#lx: failed to release buffers\n",
+                       page_to_pfn(p));
+
+       /*
+        * remove_from_page_cache assumes (mapping && !mapped)
+        */
+       if (page_mapping(p) && !page_mapped(p)) {
                               ^^^^^^^^^^^^^^^

+               remove_from_page_cache(p);
+               page_cache_release(p);
+       }
+
+       return RECOVERED;


> > > we would also need to duplicate most of the checking outside
> > > the function anyways and there wouldn't be any possibility
> > > to share the clean/dirty variants. If you insist I can
> > > do it, but I think it would be significantly worse code
> > > than before and I'm reluctant to do that.
> > 
> > I can write you the patch for that too if you like.
> 
> Ok, I will write it, but I will add a comment saying that Nick forced
> me to make the code worse :-)
> 
> It'll be fairly redundant at least.

If it's that bad, then I'll be happy to rewrite it for you.

 
> > > > if you already have other large ones.
> > > 
> > > That's unclear too.
> > 
> > You can't do much about most kernel pages, and dirty metadata pages
> > are both going to cause big problems. User pagetable pages. Lots of
> > stuff.
> 
> User page tables were on the todo list; these are actually relatively
> easy. The biggest issue is detecting them.
> 
> Metadata would likely need file system callbacks, which I would like to 
> avoid at this point.

So I just don't know why you argue the point, when you have lots
of large holes left.


* Re: [PATCH] [13/16] HWPOISON: The high level memory error handler in the VM v3
  2009-06-02 12:57                     ` Nick Piggin
@ 2009-06-02 13:25                       ` Andi Kleen
  -1 siblings, 0 replies; 232+ messages in thread
From: Andi Kleen @ 2009-06-02 13:25 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andi Kleen, Wu Fengguang, hugh, riel, akpm, chris.mason,
	linux-kernel, linux-mm

On Tue, Jun 02, 2009 at 02:57:13PM +0200, Nick Piggin wrote:
> > > not a big deal and just avoids duplicating code. I attached an
> > > (untested) patch.
> > 
> > Thanks. But the function in the patch is not doing the same as what
> > the me_pagecache_clean/dirty handlers are doing. For one, there is no
> > error checking, as in the second try_to_release_page()
> > 
> > Then it doesn't do all the IO error and missing mapping handling.
> 
> Obviously I don't mean just use that single call for the entire
> handler. You can set the EIO bit or whatever you like. The
> "error handling" you have there also seems strange. You could
> retain it, but the page is assured to be removed from pagecache.

The reason this code double checks is that someone could hold a
reference (remember we can come in at any time) that we cannot kill
immediately.

> > The page_mapped() check is useless because the pages are not 
> > mapped here etc.
> 
> That's OK, it is a core part of the protocol to prevent
> truncated pages from being mapped, so I like it to be in
> that function.
> 
> (you are also doing extraneous page_mapped tests in your handler,
> so surely your concern isn't from the perspective of this
> error handler code)

We do page_mapping() checks, not page_mapped checks.

I know details, but ...

> 
> 
> > We could probably call truncate_complete_page(), but then
> > we would also need to duplicate most of the checking outside
> > the function anyways and there wouldn't be any possibility
> > to share the clean/dirty variants. If you insist I can
> > do it, but I think it would be significantly worse code
> > than before and I'm reluctant to do that.
> 
> I can write you the patch for that too if you like.

Ok, I will write it, but I will add a comment saying that Nick forced
me to make the code worse :-)

It'll be fairly redundant at least.

> > > if you already have other large ones.
> > 
> > That's unclear too.
> 
> You can't do much about most kernel pages, and dirty metadata pages
> are both going to cause big problems. User pagetable pages. Lots of
> stuff.

User page tables were on the todo list; these are actually relatively
easy. The biggest issue is detecting them.

Metadata would likely need file system callbacks, which I would like to 
avoid at this point.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 232+ messages in thread

* Re: [PATCH] [13/16] HWPOISON: The high level memory error handler in the VM v3
@ 2009-06-02 13:25                       ` Andi Kleen
  0 siblings, 0 replies; 232+ messages in thread
From: Andi Kleen @ 2009-06-02 13:25 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andi Kleen, Wu Fengguang, hugh, riel, akpm, chris.mason,
	linux-kernel, linux-mm

On Tue, Jun 02, 2009 at 02:57:13PM +0200, Nick Piggin wrote:
> > > not a big deal and just avoids duplicating code. I attached an
> > > (untested) patch.
> > 
> > Thanks. But the function in the patch is not doing the same as what
> > the me_pagecache_clean/dirty are doing. For one, there is no error
> > checking, as in the second try_to_release_page()
> > 
> > Then it doesn't do all the IO error and missing mapping handling.
> 
> Obviously I don't mean just use that single call for the entire
> handler. You can set the EIO bit or whatever you like. The
> "error handling" you have there also seems strange. You could
> retain it, but the page is assured to be removed from pagecache.

The reason this code double checks is that someone could have
a reference (remember, we can come in at any time) that we cannot kill immediately.

> > The page_mapped() check is useless because the pages are not 
> > mapped here etc.
> 
> That's OK, it is a core part of the protocol to prevent
> truncated pages from being mapped, so I like it to be in
> that function.
> 
> (you are also doing extraneous page_mapped tests in your handler,
> so surely your concern isn't from the perspective of this
> error handler code)

We do page_mapping() checks, not page_mapped() checks.

I know, details, but ...

> 
> 
> > We could probably call truncate_complete_page(), but then
> > we would also need to duplicate most of the checking outside
> > the function anyways and there wouldn't be any possibility
> > to share the clean/dirty variants. If you insist I can
> > do it, but I think it would be significantly worse code
> > than before and I'm reluctant to do that.
> 
> I can write you the patch for that too if you like.

Ok I will write it, but I will add a comment saying that Nick forced
me to make the code worse @)

It'll be fairly redundant at least.

> > > if you already have other large ones.
> > 
> > That's unclear too.
> 
> You can't do much about most kernel pages, and dirty metadata pages
> are both going to cause big problems. User pagetable pages. Lots of
> stuff.

User page tables were on the todo list; they are actually relatively
easy. The biggest issue is detecting them.

Metadata would likely need file system callbacks, which I would like to 
avoid at this point.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

^ permalink raw reply	[flat|nested] 232+ messages in thread

* Re: [PATCH] [13/16] HWPOISON: The high level memory error handler in the VM v3
  2009-06-02 12:37                     ` Nick Piggin
@ 2009-06-02 13:30                       ` Wu Fengguang
  -1 siblings, 0 replies; 232+ messages in thread
From: Wu Fengguang @ 2009-06-02 13:30 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andi Kleen, hugh, riel, akpm, chris.mason, linux-kernel, linux-mm

On Tue, Jun 02, 2009 at 08:37:20PM +0800, Nick Piggin wrote:
> On Tue, Jun 02, 2009 at 02:34:50PM +0200, Andi Kleen wrote:
> > On Tue, Jun 02, 2009 at 02:10:31PM +0200, Nick Piggin wrote:
> > > > It's not, there are various differences (like the reference count)
> > > 
> > > No. If there are, then it *really* needs better documentation. I
> > > don't think there are, though.
> > 
> > Better documentation on what? You want a detailed listing in a comment
> > how it is different from truncate?
> > 
> > To be honest I have some doubts of the usefulness of such a comment
> > (why stop at truncate and not list the differences to every other
> > page cache operation? @) but if you're insist (do you?) I can add one.
> 
> Because I don't see any difference (see my previous patch). I
> still don't know what it is supposed to be doing differently.
> So if you reinvent your own that looks close enough to truncate
> to warrant a comment to say /* this is close to truncate but
> not quite */, then yes I insist that you say exactly why it is
> not quite like truncate ;)

The truncate topic is getting boring.  EIO is more interesting and imminent, hehe.

> > > I'm suggesting that EIO is traditionally for when the data still
> > > dirty in pagecache and was not able to get back to backing
> > > store. Do you deny that?
> > 
> > Yes. That is exactly the case when memory-failure triggers EIO
> > 
> > Memory error on a dirty file mapped page.
> 
> But it is no longer dirty, and the problem was not that the data
> was unable to be written back.

Or rather, cannot be written back ;)

> > > And I think the application might try to handle the case of a
> > > page becoming corrupted differently. Do you deny that?
> > 
> > You mean a clean file-mapped page? In this case there is no EIO,
> > memory-failure just drops the page and it is reloaded.
> > 
> > If the page is dirty we trigger EIO which as you said above is the 
> > right reaction.
> 
> No I mean the difference between the case of dirty page unable to
> be written to backing store, and the case of dirty page becoming
> corrupted.

legacy EIO:   may succeed on retry (after doing something)?
hwpoison EIO: a permanent, unrecoverable error

> > > OK, given the range of errors that APIs are defined to return,
> > > then maybe EIO is the best option. I don't suppose it is possible
> > > to expand them to return something else?
> > 
> > Expand the syscalls to return other errnos on specific
> > kinds of IO error?
> >  
> Of course that's possible, but it has the problem that you 
> would need to fix all the applications that expect EIO for
> IO error. The latter I consider infeasible.
> 
> They would presumably exit or do some default thing, which I
> think would be fine. Actually if your code catches them in the
> act of manipulating a corrupted page (ie. if it is mmapped),
> then it gets a SIGBUS.

That's OK.  filemap_fault() returns VM_FAULT_SIGBUS for legacy EIO,
while hwpoison pages will return VM_FAULT_HWPOISON. Both kill the
application, I guess?

read()/write() are the more interesting cases.

With read IO interception, the read() call will succeed.

The write() call has to fail. But interestingly, writes are
mostly delayed ones, and we have only one AS_EIO bit for the entire
file, which will be cleared after the EIO is reported. And the poisoned
page will be isolated (if isolation succeeds), so later read()/write()
calls won't even notice there was a poisoned page!

How are we going to fix this mess? EIO errors seem to be fuzzy and
temporary by nature, at least in the current implementation, and hard
to make exact and/or permanent in either implementation or
interface:
- can/shall we remember the exact EIO page? maybe not.
- can EIO reporting be permanent? sounds like a horrible user interface..
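The one-shot AS_EIO behaviour described above can be sketched as a toy model (illustrative C only; the sim_* names are invented, and this is a model of the semantics, not the real address_space code):

```c
#include <stdbool.h>

/* Toy model: one AS_EIO bit per file, as in struct address_space. */
struct sim_mapping {
    bool as_eio;
};

/* A poisoned dirty page is dropped; all we can record is the bit. */
static void sim_hwpoison_dirty_page(struct sim_mapping *m)
{
    m->as_eio = true;
}

/* Models filemap error reporting: EIO is returned once,
 * then the bit is cleared for all later callers. */
static int sim_fsync(struct sim_mapping *m)
{
    if (m->as_eio) {
        m->as_eio = false;   /* error consumed by the first caller */
        return -5;           /* -EIO */
    }
    return 0;
}
```

After one failing fsync, a second call reports success even though the dirty data is gone for good, which is exactly the fuzziness complained about above.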


Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 232+ messages in thread

* Re: [PATCH] [13/16] HWPOISON: The high level memory error handler in the VM v3
  2009-06-02 13:41                           ` Andi Kleen
@ 2009-06-02 13:40                             ` Nick Piggin
  -1 siblings, 0 replies; 232+ messages in thread
From: Nick Piggin @ 2009-06-02 13:40 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Wu Fengguang, hugh, riel, akpm, chris.mason, linux-kernel, linux-mm

On Tue, Jun 02, 2009 at 03:41:26PM +0200, Andi Kleen wrote:
> On Tue, Jun 02, 2009 at 03:24:41PM +0200, Nick Piggin wrote:
> > On Tue, Jun 02, 2009 at 03:25:38PM +0200, Andi Kleen wrote:
> > > The reason this code double checks is that someone could have
> > > a reference (remember, we can come in at any time) that we cannot kill immediately.
> > 
> > Can't kill what? The page is gone from pagecache. It may remain
> > other kernel references, but I don't see why this code will
> > consider this as a failure (and not, for example, a raised error
> > count).
> 
> It's a failure because the page was still used and not successfully
> isolated.

But you're predicating success on page_count, so there can be
other users anyway. You do check page_count later and emit
a different message in this case, but even that isn't enough
to tell you whether it has no more users.

I wouldn't have thought it's worth the complication, but
there is nothing preventing you using my truncate function
and also keeping this error check to test afterwards.
 

> > +        * remove_from_page_cache assumes (mapping && !mapped)
> > +        */
> > +       if (page_mapping(p) && !page_mapped(p)) {
> 
> Ok you're right. That one is not needed. I will remove it.
> 
> > > 
> > > User page tables were on the todo list; they are actually relatively
> > > easy. The biggest issue is detecting them.
> > > 
> > > Metadata would likely need file system callbacks, which I would like to 
> > > avoid at this point.
> > 
> > So I just don't know why you argue the point that you have lots
> > of large holes left.
> 
> I didn't argue that. My point was just that I currently don't have
> data on which holes are the worst on given workloads. If I figure out at
> some point that writeback pages are a significant part of some important
> workload I would be interested in tackling them.
> That said I think that's unlikely, but I'm not ruling it out.

Well, it sounds like maybe there is a sane way to do them with your
IO interception... but anyway let's not worry about this right
now ;)

^ permalink raw reply	[flat|nested] 232+ messages in thread

* Re: [PATCH] [13/16] HWPOISON: The high level memory error handler in the VM v3
  2009-06-02 13:24                         ` Nick Piggin
@ 2009-06-02 13:41                           ` Andi Kleen
  -1 siblings, 0 replies; 232+ messages in thread
From: Andi Kleen @ 2009-06-02 13:41 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andi Kleen, Wu Fengguang, hugh, riel, akpm, chris.mason,
	linux-kernel, linux-mm

On Tue, Jun 02, 2009 at 03:24:41PM +0200, Nick Piggin wrote:
> On Tue, Jun 02, 2009 at 03:25:38PM +0200, Andi Kleen wrote:
> > On Tue, Jun 02, 2009 at 02:57:13PM +0200, Nick Piggin wrote:
> > > > > not a big deal and just avoids duplicating code. I attached an
> > > > > (untested) patch.
> > > > 
> > > > Thanks. But the function in the patch is not doing the same as what
> > > > the me_pagecache_clean/dirty are doing. For one, there is no error
> > > > checking, as in the second try_to_release_page()
> > > > 
> > > > Then it doesn't do all the IO error and missing mapping handling.
> > > 
> > > Obviously I don't mean just use that single call for the entire
> > > handler. You can set the EIO bit or whatever you like. The
> > > "error handling" you have there also seems strange. You could
> > > retain it, but the page is assured to be removed from pagecache.
> > 
> > The reason this code double checks is that someone could have
> > a reference (remember, we can come in at any time) that we cannot kill immediately.
> 
> Can't kill what? The page is gone from pagecache. It may remain
> other kernel references, but I don't see why this code will
> consider this as a failure (and not, for example, a raised error
> count).

It's a failure because the page was still used and not successfully
isolated.

> +        * remove_from_page_cache assumes (mapping && !mapped)
> +        */
> +       if (page_mapping(p) && !page_mapped(p)) {

Ok you're right. That one is not needed. I will remove it.

> > 
> > User page tables was on the todo list, these are actually relatively
> > easy. The biggest issue is to detect them.
> > 
> > Metadata would likely need file system callbacks, which I would like to 
> > avoid at this point.
> 
> So I just don't know why you argue the point that you have lots
> of large holes left.

I didn't argue that. My point was just that I currently don't have
data on which holes are the worst on given workloads. If I figure out at
some point that writeback pages are a significant part of some important
workload I would be interested in tackling them.
That said I think that's unlikely, but I'm not ruling it out.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 232+ messages in thread

* Re: [PATCH] [13/16] HWPOISON: The high level memory error handler in the VM v3
  2009-06-02 13:19                             ` Nick Piggin
@ 2009-06-02 13:46                               ` Andi Kleen
  -1 siblings, 0 replies; 232+ messages in thread
From: Andi Kleen @ 2009-06-02 13:46 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andi Kleen, hugh, riel, akpm, chris.mason, linux-kernel,
	linux-mm, fengguang.wu

On Tue, Jun 02, 2009 at 03:19:37PM +0200, Nick Piggin wrote:
> > I assume that if an application does something with EIO it 
> > can either retry a few times or give up. Both is ok here.
> 
> That's exactly the case where it is not OK, because the
> dirty page was now removed from pagecache, so the subsequent
> fsync is going to succeed and the app will think its dirty
> data has hit disk.

Ok that's a fair point -- that's a hole in my scheme. I don't
know of a good way to fix it though. Do you?

I suspect adding a new errno would break more cases than it
would fix.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 232+ messages in thread

* Re: [PATCH] [13/16] HWPOISON: The high level memory error handler in the VM v3
  2009-06-02 12:57                     ` Nick Piggin
@ 2009-06-02 13:46                       ` Wu Fengguang
  -1 siblings, 0 replies; 232+ messages in thread
From: Wu Fengguang @ 2009-06-02 13:46 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andi Kleen, hugh, riel, akpm, chris.mason, linux-kernel, linux-mm

On Tue, Jun 02, 2009 at 08:57:13PM +0800, Nick Piggin wrote:
> On Tue, Jun 02, 2009 at 02:47:57PM +0200, Andi Kleen wrote:
> > On Tue, Jun 02, 2009 at 02:00:42PM +0200, Nick Piggin wrote:
> > > not a big deal and just avoids duplicating code. I attached an
> > > (untested) patch.
> > 
> > Thanks. But the function in the patch is not doing the same as what
> > the me_pagecache_clean/dirty are doing. For one, there is no error
> > checking, as in the second try_to_release_page()
> > 
> > Then it doesn't do all the IO error and missing mapping handling.
> 
> Obviously I don't mean just use that single call for the entire
> handler. You can set the EIO bit or whatever you like. The
> "error handling" you have there also seems strange. You could
> retain it, but the page is assured to be removed from pagecache.

You mean this?

        if (page_has_private(p) && !try_to_release_page(p, GFP_NOIO))
                return FAILED;

If page->private cannot be removed, that means some fs may start IO on it, so
we return FAILED.

> > The page_mapped() check is useless because the pages are not 
> > mapped here etc.
> 
> That's OK, it is a core part of the protocol to prevent
> truncated pages from being mapped, so I like it to be in
> that function.
 
Right.

> (you are also doing extraneous page_mapped tests in your handler,
> so surely your concern isn't from the perspective of this
> error handler code)
 
That's because the initial try_to_unmap() may fail, leaving the page still
mapped, while remove_from_page_cache() assumes !page_mapped().
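That constraint, that isolation must be reported as failed whenever the page is still mapped because remove_from_page_cache() assumes !page_mapped(), can be sketched as a toy model (illustrative C; the sim_* names are invented, not the actual memory-failure code):

```c
#include <stdbool.h>

enum sim_result { SIM_FAILED, SIM_RECOVERED };

/* Minimal stand-in for the page state relevant here. */
struct sim_page {
    bool has_mapping;  /* page_mapping(p) != NULL */
    bool mapped;       /* page_mapped(p)          */
};

/* If try_to_unmap() left users behind, the page cannot be pulled
 * from the page cache; report a failed isolation instead. */
static enum sim_result sim_isolate(struct sim_page *p)
{
    if (!p->has_mapping || p->mapped)
        return SIM_FAILED;
    p->has_mapping = false;  /* models remove_from_page_cache() */
    return SIM_RECOVERED;
}
```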

> > We could probably call truncate_complete_page(), but then
> > we would also need to duplicate most of the checking outside
> > the function anyways and there wouldn't be any possibility
> > to share the clean/dirty variants. If you insist I can
> > do it, but I think it would be significantly worse code
> > than before and I'm reluctant to do that.
> 
> I can write you the patch for that too if you like.

I have already posted one on truncate_complete_page(). Not the way you want it?
 
> > I don't also really see what the big deal is of just
> > calling these few functions directly. After all we're not
> > truncating here and they're all already called from other files.
> >
> > > > > No, it seems rather insane to do something like this here that no other
> > > > > code in the mm ever does.
> > > > 
> > > > Just because the rest of the VM doesn't do it doesn't mean it might make sense.
> > > 
> > > It is going to be possible to do it somehow surely, but it is insane
> > > to try to add such constraints to the VM to close a few small windows
> > 
> > We don't know currently if they are small. If they are small I would
> > agree with you, but that needs numbers. That said fancy writeback handling
> > is currently not on my agenda.
> 
> Yes, writeback pages are very limited, a tiny number at any time and
> the fraction gets relatively smaller as total RAM size gets larger.

Yes they are less interesting for now.
 
> > > if you already have other large ones.
> > 
> > That's unclear too.
> 
> You can't do much about most kernel pages, and dirty metadata pages
> are both going to cause big problems. User pagetable pages. Lots of
> stuff.

Yes, that's a network of pointers that's hard to untangle.

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 232+ messages in thread

* Re: [PATCH] [13/16] HWPOISON: The high level memory error handler in the VM v3
  2009-06-02 13:46                               ` Andi Kleen
@ 2009-06-02 13:47                                 ` Nick Piggin
  -1 siblings, 0 replies; 232+ messages in thread
From: Nick Piggin @ 2009-06-02 13:47 UTC (permalink / raw)
  To: Andi Kleen
  Cc: hugh, riel, akpm, chris.mason, linux-kernel, linux-mm, fengguang.wu

On Tue, Jun 02, 2009 at 03:46:10PM +0200, Andi Kleen wrote:
> On Tue, Jun 02, 2009 at 03:19:37PM +0200, Nick Piggin wrote:
> > > I assume that if an application does something with EIO it 
> > > can either retry a few times or give up. Both are ok here.
> > 
> > That's exactly the case where it is not OK, because the
> > dirty page was now removed from pagecache, so the subsequent
> > fsync is going to succeed and the app will think its dirty
> > data has hit disk.
> 
> Ok that's a fair point -- that's a hole in my scheme. I don't
> know of a good way to fix it though. Do you?
> 
> I suspect adding a new errno would break more cases than fixing
> them.

Right, I wasn't too serious about the new errno (although maybe
others have opinions about the feasibility of that?). Because
I just don't know the full consequences.

I was kind of thinking that we could SIGKILL them as they try
to access it or fsync it. But then the question is how long to
keep SIGKILLing? At one end of the scale you could do stupid
and simple and have another error flag in the mapping to do
the SIGKILL just once for the next read/write/fsync etc. Or
at the other end, you keep the page in the pagecache and
poisoned, and kill everyone until the page is explicitly truncated
by userspace. I don't really know...
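The "stupid and simple" end of the scale can be sketched as a one-shot error bit, by analogy with the kernel's AS_EIO flag (standalone simulation; AS_HWPOISON and the struct below are hypothetical names, not existing kernel interfaces):

```c
#include <assert.h>
#include <stdbool.h>

#define AS_HWPOISON 0x1UL   /* hypothetical one-shot per-mapping error bit */

struct mapping { unsigned long flags; };

/*
 * Like the kernel's test_and_clear_bit() pattern on AS_EIO: the next
 * read/write/fsync observes the error exactly once, then it is gone.
 */
static bool check_and_clear_error(struct mapping *m)
{
        bool had_error = (m->flags & AS_HWPOISON) != 0;
        m->flags &= ~AS_HWPOISON;
        return had_error;
}
```

The first fsync() after poisoning would fail, but a second one would succeed silently, which is the weakness of the one-shot scheme discussed here.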



* Re: [PATCH] [13/16] HWPOISON: The high level memory error handler in the VM v3
  2009-06-02 13:41                           ` Andi Kleen
@ 2009-06-02 13:53                             ` Wu Fengguang
  -1 siblings, 0 replies; 232+ messages in thread
From: Wu Fengguang @ 2009-06-02 13:53 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Nick Piggin, hugh, riel, akpm, chris.mason, linux-kernel, linux-mm

On Tue, Jun 02, 2009 at 09:41:26PM +0800, Andi Kleen wrote:
> On Tue, Jun 02, 2009 at 03:24:41PM +0200, Nick Piggin wrote:
> > On Tue, Jun 02, 2009 at 03:25:38PM +0200, Andi Kleen wrote:
> > > On Tue, Jun 02, 2009 at 02:57:13PM +0200, Nick Piggin wrote:
> > > > > > not a big deal and just avoids duplicating code. I attached an
> > > > > > (untested) patch.
> > > > > 
> > > > > Thanks. But the function in the patch is not doing the same as what
> > > > > the me_pagecache_clean/dirty are doing. For one, there is no error
> > > > > checking, as in the second try_to_release_page().
> > > > > 
> > > > > Then it doesn't do all the IO error and missing mapping handling.
> > > > 
> > > > Obviously I don't mean just use that single call for the entire
> > > > handler. You can set the EIO bit or whatever you like. The
> > > > "error handling" you have there also seems strange. You could
> > > > retain it, but the page is assured to be removed from pagecache.
> > > 
> > > The reason this code double-checks is that someone could have 
> > > a reference (remember we can come in at any time) that we cannot kill immediately.
> > 
> > Can't kill what? The page is gone from pagecache. It may remain
> > other kernel references, but I don't see why this code will
> > consider this as a failure (and not, for example, a raised error
> > count).
> 
> It's a failure because the page was still used and not successfully
> isolated.
> 
> > +        * remove_from_page_cache assumes (mapping && !mapped)
> > +        */
> > +       if (page_mapping(p) && !page_mapped(p)) {
> 
> Ok you're right. That one is not needed. I will remove it.

No! Please read the comment. In fact __remove_from_page_cache() has a

                BUG_ON(page_mapped(page));

Or, at least, correct that BUG_ON() line as well.

Thanks,
Fengguang



* Re: [PATCH] [13/16] HWPOISON: The high level memory error handler in the VM v3
  2009-06-02 13:47                                 ` Nick Piggin
@ 2009-06-02 14:05                                   ` Andi Kleen
  -1 siblings, 0 replies; 232+ messages in thread
From: Andi Kleen @ 2009-06-02 14:05 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andi Kleen, hugh, riel, akpm, chris.mason, linux-kernel,
	linux-mm, fengguang.wu

> I was kind of thinking that we could SIGKILL them as they try
> to access it or fsync it. But then the question is how long to
> keep SIGKILLing? At one end of the scale you could do stupid
> and simple and have another error flag in the mapping to do
> the SIGKILL just once for the next read/write/fsync etc. Or

It's pretty radical to SIGKILL on an IO error.

Perhaps we can make fsync give EIO again in this case 
with a new mapping flag. The question would be when
to clear that flag again. The devil is probably in the details.

> at the other end, you keep the page in the pagecache and
> poisoned, and kill everyone until the page is explicitly truncated
> by userspace. I don't really know...

We do that for the swapcache to avoid a similar problem, but
it's more a hack than a good solution.  I think it would be
worse for the page cache, because if you stop the program
then there's no reason to keep that around.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.


* Re: [PATCH] [13/16] HWPOISON: The high level memory error handler in the VM v3
  2009-06-02 13:53                             ` Wu Fengguang
@ 2009-06-02 14:06                               ` Andi Kleen
  -1 siblings, 0 replies; 232+ messages in thread
From: Andi Kleen @ 2009-06-02 14:06 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andi Kleen, Nick Piggin, hugh, riel, akpm, chris.mason,
	linux-kernel, linux-mm

> > Ok you're right. That one is not needed. I will remove it.
> 
> No! Please read the comment. In fact __remove_from_page_cache() has a
> 
>                 BUG_ON(page_mapped(page));
> 
> Or, at least correct that BUG_ON() line together.

Yes, but we already have them unmapped earlier and the poison check
in the page fault handler should prevent remapping.

So it really should not happen and if it happened we would deserve
the BUG.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.


* Re: [PATCH] [13/16] HWPOISON: The high level memory error handler in the VM v3
  2009-06-02 13:30                       ` Wu Fengguang
@ 2009-06-02 14:07                         ` Nick Piggin
  -1 siblings, 0 replies; 232+ messages in thread
From: Nick Piggin @ 2009-06-02 14:07 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andi Kleen, hugh, riel, akpm, chris.mason, linux-kernel, linux-mm

On Tue, Jun 02, 2009 at 09:30:19PM +0800, Wu Fengguang wrote:
> > No I mean the difference between the case of dirty page unable to
> > be written to backing store, and the case of dirty page becoming
> > corrupted.
> 
> legacy EIO:   may succeed on (do something then) retry?

Legacy EIO yes, I imagine most programs are assuming that the
cache is still the most recent (and valid) copy of the data.


> hwpoison EIO: a permanent unrecoverable error
>
> > They would presumably exit or do some default thing, which I
> > think would be fine. Actually if your code catches them in the
> > act of manipulating a corrupted page (ie. if it is mmapped),
> > then it gets a SIGBUS.
> 
> That's OK.  filemap_fault() returns VM_FAULT_SIGBUS for legacy EIO,
> while hwpoison pages will return VM_FAULT_HWPOISON. Both kill the
> application I guess?

Yes I was just using it to illustrate the difference. filemap_fault
does SIGBUS for read failures, sure, but if you msync and get an
EIO (legacy EIO), then it is not going to SIGBUS to all procs mapping
the page.

 
> read()/write() are the more interesting cases.

Yes.

 
> With read IO interception, the read() call will succeed.
> 
> The write() call has to fail. But interestingly writes are
> mostly delayed ones, and we have only one AS_EIO bit for the entire
> file, which will be cleared after the EIO reporting. And the poisoned
> page will be isolated (if succeed) and later read()/write() calls
> won't even notice there was a poisoned page!
> 
> How are we going to fix this mess? EIO errors seem to be fuzzy and
> temporary by nature at least in the current implementation, and hard

Well that is a problem too. It is questionable how long to keep
legacy EIO reporting around (I'm of the opinion that we really
need to keep them around forever and wait for either truncate or
add a new syscall to discard them). But this is another discussion
because we already have these existing semantics, so little point
to quickly change them :) 


> to be improved to be exact and/or permanent in both implementation and
> interface:
> - can/shall we remember the exact EIO page? maybe not.

If you add a new bit in the mapping, you could then call to the
error recovery code to do slowpath checking for overlapping page
offsets. It gets tricky if you want to allow the inode to be
reclaimed and still remember the errors ;)

> - can EIO reporting be permanent? sounds like a horrible user interface..

[Let's describe the ideal world:
 We'd have EBADMEM that everyone knows about, and we have a syscall
 that can clear these errors/bad pages. Maybe even another syscall
 which can read back the contents of this memory without being SIGBUSed
 or EBADMEMed.]

Now I have been of the opinion that our current (legacy) EIO should
be permanent (unless the pages end up being able to be written back),
and we should have another syscall to clear this condition.

Unaware applications may have some difficulties, but a cmd line utility
can clear these so it could easily be recovered...

I think this might work for hwpoison as well (whether it ends up using
EIO or something else).



* Re: [PATCH] [13/16] HWPOISON: The high level memory error handler in the VM v3
  2009-06-02 13:46                       ` Wu Fengguang
@ 2009-06-02 14:08                         ` Andi Kleen
  -1 siblings, 0 replies; 232+ messages in thread
From: Andi Kleen @ 2009-06-02 14:08 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Nick Piggin, Andi Kleen, hugh, riel, akpm, chris.mason,
	linux-kernel, linux-mm

> > > We could probably call truncate_complete_page(), but then
> > > we would also need to duplicate most of the checking outside
> > > the function anyways and there wouldn't be any possibility
> > > to share the clean/dirty variants. If you insist I can
> > > do it, but I think it would be significantly worse code
> > > than before and I'm reluctant to do that.
> > 
> > I can write you the patch for that too if you like.
> 
> I have already posted one on truncate_complete_page(). Not the way you want it?

Sorry I must have missed it (too much mail I guess). Can you repost please?

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.


* Re: [PATCH] [13/16] HWPOISON: The high level memory error handler in the VM v3
  2009-06-02 14:08                         ` Andi Kleen
@ 2009-06-02 14:10                           ` Wu Fengguang
  -1 siblings, 0 replies; 232+ messages in thread
From: Wu Fengguang @ 2009-06-02 14:10 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Nick Piggin, hugh, riel, akpm, chris.mason, linux-kernel, linux-mm

On Tue, Jun 02, 2009 at 10:08:30PM +0800, Andi Kleen wrote:
> > > > We could probably call truncate_complete_page(), but then
> > > > we would also need to duplicate most of the checking outside
> > > > the function anyways and there wouldn't be any possibility
> > > > to share the clean/dirty variants. If you insist I can
> > > > do it, but I think it would be significantly worse code
> > > > than before and I'm reluctant to do that.
> > > 
> > > I can write you the patch for that too if you like.
> > 
> > I have already posted one on truncate_complete_page(). Not the way you want it?
> 
> Sorry I must have missed it (too much mail I guess). Can you repost please?

OK, here it is, a further simplified one.

---
 mm/memory-failure.c |   13 +++----------
 1 file changed, 3 insertions(+), 10 deletions(-)

--- sound-2.6.orig/mm/memory-failure.c
+++ sound-2.6/mm/memory-failure.c
@@ -324,23 +324,16 @@ static int me_free(struct page *p)
  */
 static int me_pagecache_clean(struct page *p)
 {
+	if (page_mapping(p))
+                truncate_complete_page(p->mapping, p);
+
 	if (!isolate_lru_page(p))
 		page_cache_release(p);
 
-	if (page_has_private(p))
-		do_invalidatepage(p, 0);
 	if (page_has_private(p) && !try_to_release_page(p, GFP_NOIO))
 		Dprintk(KERN_ERR "MCE %#lx: failed to release buffers\n",
 			page_to_pfn(p));
 
-	/*
-	 * remove_from_page_cache assumes (mapping && !mapped)
-	 */
-	if (page_mapping(p) && !page_mapped(p)) {
-		remove_from_page_cache(p);
-		page_cache_release(p);
-	}
-
 	return RECOVERED;
 }
 


* Re: [PATCH] [13/16] HWPOISON: The high level memory error handler in the VM v3
  2009-06-02 14:06                               ` Andi Kleen
@ 2009-06-02 14:12                                 ` Wu Fengguang
  -1 siblings, 0 replies; 232+ messages in thread
From: Wu Fengguang @ 2009-06-02 14:12 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Nick Piggin, hugh, riel, akpm, chris.mason, linux-kernel, linux-mm

On Tue, Jun 02, 2009 at 10:06:39PM +0800, Andi Kleen wrote:
> > > Ok you're right. That one is not needed. I will remove it.
> > 
> > No! Please read the comment. In fact __remove_from_page_cache() has a
> > 
> >                 BUG_ON(page_mapped(page));
> > 
> > Or, at least correct that BUG_ON() line together.
> 
> Yes, but we already have them unmapped earlier and the poison check

But you commented "try_to_unmap can fail temporarily due to races."

That's self-contradictory.

> in the page fault handler should prevent remapping.
> 
> So it really should not happen and if it happened we would deserve
> the BUG.
> 
> -Andi
> 
> -- 
> ak@linux.intel.com -- Speaking for myself only.


* Re: [PATCH] [13/16] HWPOISON: The high level memory error handler in the VM v3
  2009-06-02 14:10                           ` Wu Fengguang
@ 2009-06-02 14:14                             ` Nick Piggin
  -1 siblings, 0 replies; 232+ messages in thread
From: Nick Piggin @ 2009-06-02 14:14 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andi Kleen, hugh, riel, akpm, chris.mason, linux-kernel, linux-mm

On Tue, Jun 02, 2009 at 10:10:31PM +0800, Wu Fengguang wrote:
> On Tue, Jun 02, 2009 at 10:08:30PM +0800, Andi Kleen wrote:
> > > > > We could probably call truncate_complete_page(), but then
> > > > > we would also need to duplicate most of the checking outside
> > > > > the function anyways and there wouldn't be any possibility
> > > > > to share the clean/dirty variants. If you insist I can
> > > > > do it, but I think it would be significantly worse code
> > > > > than before and I'm reluctant to do that.
> > > > 
> > > > I can write you the patch for that too if you like.
> > > 
> > > I have already posted one on truncate_complete_page(). Not the way you want it?
> > 
> > Sorry I must have missed it (too much mail I guess). Can you repost please?
> 
> OK, here it is, a more simplified one.

I prefer mine because I don't want truncate_complete_page escaping
mm/truncate.c, since the caller has to deal with truncate races.


> ---
>  mm/memory-failure.c |   13 +++----------
>  1 file changed, 3 insertions(+), 10 deletions(-)
> 
> --- sound-2.6.orig/mm/memory-failure.c
> +++ sound-2.6/mm/memory-failure.c
> @@ -324,23 +324,16 @@ static int me_free(struct page *p)
>   */
>  static int me_pagecache_clean(struct page *p)
>  {
> +	if (page_mapping(p))
> +                truncate_complete_page(p->mapping, p);
> +
>  	if (!isolate_lru_page(p))
>  		page_cache_release(p);
>  
> -	if (page_has_private(p))
> -		do_invalidatepage(p, 0);
>  	if (page_has_private(p) && !try_to_release_page(p, GFP_NOIO))
>  		Dprintk(KERN_ERR "MCE %#lx: failed to release buffers\n",
>  			page_to_pfn(p));
>  
> -	/*
> -	 * remove_from_page_cache assumes (mapping && !mapped)
> -	 */
> -	if (page_mapping(p) && !page_mapped(p)) {
> -		remove_from_page_cache(p);
> -		page_cache_release(p);
> -	}
> -
>  	return RECOVERED;
>  }



* Re: [PATCH] [13/16] HWPOISON: The high level memory error handler in the VM v3
  2009-06-02 14:12                                 ` Wu Fengguang
@ 2009-06-02 14:21                                   ` Nick Piggin
  -1 siblings, 0 replies; 232+ messages in thread
From: Nick Piggin @ 2009-06-02 14:21 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andi Kleen, hugh, riel, akpm, chris.mason, linux-kernel, linux-mm

On Tue, Jun 02, 2009 at 10:12:22PM +0800, Wu Fengguang wrote:
> On Tue, Jun 02, 2009 at 10:06:39PM +0800, Andi Kleen wrote:
> > > > Ok you're right. That one is not needed. I will remove it.
> > > 
> > > No! Please read the comment. In fact __remove_from_page_cache() has a
> > > 
> > >                 BUG_ON(page_mapped(page));
> > > 
> > > Or, at least correct that BUG_ON() line together.
> > 
> > Yes, but we already have them unmapped earlier and the poison check
> 
> But you commented "try_to_unmap can fail temporarily due to races."
> 
> That's self-contradictory.

If you use the bloody code I posted (and suggested from the start),
then you DON'T HAVE TO WORRY ABOUT THIS, because it is handled by
the subsystem that knows about it.

How anybody can say it will make your code overcomplicated or "is
not much improvement" is just totally beyond me.



* Re: [PATCH] [13/16] HWPOISON: The high level memory error handler in the VM v3
  2009-06-02 12:51                       ` Wu Fengguang
@ 2009-06-02 14:33                         ` Nick Piggin
  -1 siblings, 0 replies; 232+ messages in thread
From: Nick Piggin @ 2009-06-02 14:33 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andi Kleen, hugh, riel, akpm, chris.mason, linux-kernel, linux-mm

On Tue, Jun 02, 2009 at 08:51:34PM +0800, Wu Fengguang wrote:
> On Tue, Jun 02, 2009 at 08:19:40PM +0800, Nick Piggin wrote:
> > On Tue, Jun 02, 2009 at 07:14:07PM +0800, Wu Fengguang wrote:
> > > On Mon, Jun 01, 2009 at 10:40:51PM +0800, Nick Piggin wrote:
> > > > But you just said that you try to intercept the IO. So the underlying
> > > > data is not necessarily corrupt. And even if it was then what if it
> > > > was reinitialized to something else in the meantime (such as filesystem
> > > > metadata blocks?) You'd just be introducing worse possibilities for
> > > > corruption.
> > > 
> > > The IO interception will be based on PFN instead of file offset, so it
> > > won't affect innocent pages such as your example of reinitialized data.
> > 
> > OK, if you could intercept the IO so it never happens at all, yes
> > of course that could work.
> > 
> > > poisoned dirty page == corrupt data      => process shall be killed
> > > poisoned clean page == recoverable data  => process shall survive
> > > 
> > > In the case of dirty hwpoison page, if we reload the on disk old data
> > > and let application proceed with it, it may lead to *silent* data
> > > corruption/inconsistency, because the application will first see v2
> > > then v1, which is illogical and hence may mess up its internal data
> > > structure.
> > 
> > Right, but how do you prevent that? There is no way to reconstruct the
> > most up-to-date data because it was destroyed.
> 
> To kill the application ruthlessly, rather than allow it go rotten quietly.

Right, but you don't because you just do EIO in a lot of cases. See
EIO subthread.


> > > > You will need to demonstrate a *big* advantage before doing crazy things
> > > > with writeback ;)
> > > 
> > > OK. We can do two things about poisoned writeback pages:
> > > 
> > > 1) to stop IO for them, thus avoid corrupted data to hit disk and/or
> > >    trigger further machine checks
> > 
> > 1b) At which point, you invoke the end-io handlers, and the page is
> > no longer writeback.
> > 
> > > 2) to isolate them from page cache, thus preventing possible
> > >    references in the writeback time window
> > 
> > And then this is possible because you aren't violating mm
> > assumptions due to 1b. This proceeds just as the existing
> > pagecache mce error handler case which exists now.
> 
> Yeah that's a good scheme - we are talking about two interception
> schemes. Mine is a passive one and yours is an active one.

Oh, hmm, not quite. I had assumed your IO interception is based
on another MCE from DMA transfer (Andi said you get another exception
in that case).

If you are just hoping to get an MCE from CPU access in order to
intercept IO, then you may as well not bother because it is not
closing the window much (very likely that the page will never be
touched again by the CPU).

So if you can get an MCE from the DMA, then you would fail the
request, which will automatically clear writeback, so your CPU MCE
handler never has to bother with writeback pages.
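The dirty/clean distinction quoted earlier in this message can be boiled down to a tiny decision function. This is purely illustrative (the enum and helper names are invented, not kernel API): a poisoned dirty page means the only up-to-date copy is gone, so the process must be killed rather than silently fed stale data; a poisoned clean page can be reloaded from disk, so the process may survive.

```c
#include <assert.h>
#include <stdbool.h>

enum hwpoison_action { RECOVER, KILL_PROCESS };

static enum hwpoison_action poison_policy(bool page_dirty)
{
    if (page_dirty)
        return KILL_PROCESS;  /* corrupt data: fail loudly, not silently */
    return RECOVER;           /* an identical copy exists on disk */
}
```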



* Re: [PATCH] [13/16] HWPOISON: The high level memory error handler in the VM v3
  2009-06-02 12:47                   ` Andi Kleen
@ 2009-06-02 15:09                     ` Nick Piggin
  -1 siblings, 0 replies; 232+ messages in thread
From: Nick Piggin @ 2009-06-02 15:09 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Wu Fengguang, hugh, riel, akpm, chris.mason, linux-kernel, linux-mm

On Tue, Jun 02, 2009 at 02:47:57PM +0200, Andi Kleen wrote:
> On Tue, Jun 02, 2009 at 02:00:42PM +0200, Nick Piggin wrote:

[snip: reusing truncate.c code]

> > With all that writing you could have just done it. It's really
> 
> I would have done it if it made sense to me, but so far it hasn't.
> 
> The problem with your suggestion is that you do the big picture,
> but seem to skip over a lot of details. But details matter.

BTW, just to answer this point: the reason for this may be
that the default response to my concerns seems to have been
"you're wrong". Not "I don't understand, can you
elaborate?", and not "I don't agree because ...".

I may well be wrong (in this case I'm quite sure I'm not),
but if you say I'm wrong, then I assume that you understand
what I'm talking about and have a fair idea of the details.

Anyway don't worry. I get that a lot. I do really want to
help get this merged.



* Re: [PATCH] [13/16] HWPOISON: The high level memory error handler in the VM v3
  2009-06-02 13:46                       ` Wu Fengguang
@ 2009-06-02 15:17                         ` Nick Piggin
  -1 siblings, 0 replies; 232+ messages in thread
From: Nick Piggin @ 2009-06-02 15:17 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andi Kleen, hugh, riel, akpm, chris.mason, linux-kernel, linux-mm

On Tue, Jun 02, 2009 at 09:46:59PM +0800, Wu Fengguang wrote:
> On Tue, Jun 02, 2009 at 08:57:13PM +0800, Nick Piggin wrote:
> > Obviously I don't mean just use that single call for the entire
> > handler. You can set the EIO bit or whatever you like. The
> > "error handling" you have there also seems strange. You could
> > retain it, but the page is assured to be removed from pagecache.
> 
> You mean this?
> 
>         if (page_has_private(p) && !try_to_release_page(p, GFP_NOIO))
>                 return FAILED;
> 
> If page->private cannot be removed, that means some fs may start IO on it, so
> we return FAILED.

Hmm, if you're handling buffercache here then possibly yes.
But if you throw out dirty buffer cache then you're probably
corrupting your filesystem just as badly as (or even worse
than) a couple of flipped bits. It just seems ad-hoc.

I guess it is best-effort in most places though, and this
doesn't take much effort. But being best-effort means that
it is hard, even for someone who knows exactly what all
the code does, to know what your intentions or intended
semantics are in places like this. So short comments would help.


* Re: [PATCH] [13/16] HWPOISON: The high level memory error handler in the VM v3
  2009-06-02 15:09                     ` Nick Piggin
@ 2009-06-02 17:19                       ` Andi Kleen
  -1 siblings, 0 replies; 232+ messages in thread
From: Andi Kleen @ 2009-06-02 17:19 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andi Kleen, Wu Fengguang, hugh, riel, akpm, chris.mason,
	linux-kernel, linux-mm

On Tue, Jun 02, 2009 at 05:09:52PM +0200, Nick Piggin wrote:
> On Tue, Jun 02, 2009 at 02:47:57PM +0200, Andi Kleen wrote:
> > On Tue, Jun 02, 2009 at 02:00:42PM +0200, Nick Piggin wrote:
> 
> [snip: reusing truncate.c code]
> 
> > > With all that writing you could have just done it. It's really
> > 
> > I would have done it if it made sense to me, but so far it hasn't.
> > 
> > The problem with your suggestion is that you do the big picture,
> > but seem to skip over a lot of details. But details matter.
> 
> BTW. just to answer this point. The reason maybe for this
> is because the default response to my concerns seems to
> have been "you're wrong". Not "i don't understand, can you
> detail", and not "i don't agree because ...".

Sorry, I didn't want to imply you're wrong. I apologize if
it came across that way. I understand that you understand this
code very well. I realize the one above came out a bit flamey,
but it wasn't really intended like that.

The disagreement right now seems to be more how the 
code is structured. Typically there's no clear "right" or "wrong"
with these things anyways.

I'll take a look at your suggestion this evening and see
how it comes out.

> Anyway don't worry. I get that a lot. I do really want to
> help get this merged.

I wanted to thank you for your great reviewing work, even if I didn't
agree with everything :) But I think the disagreements were quite
small and only about relatively unimportant things.

-Andi


-- 
ak@linux.intel.com -- Speaking for myself only.


* Re: [PATCH] [13/16] HWPOISON: The high level memory error handler in the VM v3
  2009-06-02 15:17                         ` Nick Piggin
@ 2009-06-02 17:27                           ` Andi Kleen
  -1 siblings, 0 replies; 232+ messages in thread
From: Andi Kleen @ 2009-06-02 17:27 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Wu Fengguang, Andi Kleen, hugh, riel, akpm, chris.mason,
	linux-kernel, linux-mm

> Hmm, if you're handling buffercache here then possibly yes.

Good question, will check.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.


* Re: [PATCH] [13/16] HWPOISON: The high level memory error handler in the VM v3
  2009-06-02 17:19                       ` Andi Kleen
@ 2009-06-03  6:24                         ` Nick Piggin
  -1 siblings, 0 replies; 232+ messages in thread
From: Nick Piggin @ 2009-06-03  6:24 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Wu Fengguang, hugh, riel, akpm, chris.mason, linux-kernel, linux-mm

On Tue, Jun 02, 2009 at 07:19:15PM +0200, Andi Kleen wrote:
> On Tue, Jun 02, 2009 at 05:09:52PM +0200, Nick Piggin wrote:
> > On Tue, Jun 02, 2009 at 02:47:57PM +0200, Andi Kleen wrote:
> > > On Tue, Jun 02, 2009 at 02:00:42PM +0200, Nick Piggin wrote:
> > 
> > [snip: reusing truncate.c code]
> > 
> > > > With all that writing you could have just done it. It's really
> > > 
> > > I would have done it if it made sense to me, but so far it hasn't.
> > > 
> > > The problem with your suggestion is that you do the big picture,
> > > but seem to skip over a lot of details. But details matter.
> > 
> > BTW. just to answer this point. The reason maybe for this
> > is because the default response to my concerns seems to
> > have been "you're wrong". Not "i don't understand, can you
> > detail", and not "i don't agree because ...".
> 
> Sorry, I didn't want to imply you're wrong. I apologize if
> it came over this way. I understand you understand this code
> very well. I realize the one above came out 
> a bit flamey, but it wasn't really intended like this.

Ah, it's OK :) Actually that was going too far; most of the
time you gave constructive responses. Just one or two
sticking points, but probably I was getting carried away
as well. Nothing personal of course!

 
> I'll take a look at your suggestion this evening and see
> how it comes out.

Cool.

 
> > Anyway don't worry. I get that a lot. I do really want to
> > help get this merged.
> 
> I wanted to thank you for your great reviewing work, even if I didn't
> agree with everything :) But I think the disagreement were quite
> small and only relatively unimportant things.

Yes, I see nothing fundamentally wrong with the design...

Thanks,
Nick


* Re: [PATCH] [13/16] HWPOISON: The high level memory error handler in the VM v3
  2009-06-02 17:27                           ` Andi Kleen
@ 2009-06-03  9:35                             ` Nick Piggin
  -1 siblings, 0 replies; 232+ messages in thread
From: Nick Piggin @ 2009-06-03  9:35 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Wu Fengguang, hugh, riel, akpm, chris.mason, linux-kernel, linux-mm

On Tue, Jun 02, 2009 at 07:27:15PM +0200, Andi Kleen wrote:
> > Hmm, if you're handling buffercache here then possibly yes.
> 
> Good question, will check.

BTW, now that I think about it, buffercache is probably not a good
idea to truncate (truncate, as in: remove from pagecache), because
filesystems can assume that, with just a reference on the page,
it will not be truncated.

This code will cause ext2 (the first one I looked at) to
oops.

And this is not predicated on PagePrivate or page_has_buffers,
because filesystems are free to directly operate on their own
metadata buffercache pages.

So I think it would be a good idea to exclude buffercache from
here completely until it can be shown to be safe. Actually you
*can* use the invalidate_mapping_pages path, which will check
refcounts etc (or a derivative thereof, similarly to my truncate
patch).
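The reason the invalidate path is safer can be shown with a toy model: invalidation refuses to drop a page that still has extra references, unlike a blind truncate. Everything below (the struct, the refcount convention, the helper name) is simulated for illustration, not the real kernel API:

```c
#include <assert.h>
#include <stdbool.h>

struct page {
    int refcount;   /* 1 == only the cache's own reference remains */
    bool in_cache;
};

static bool try_invalidate(struct page *p)
{
    /* an extra reference means a filesystem or other user may still be
     * operating on the page; leave it in place rather than oops later */
    if (p->refcount > 1)
        return false;
    p->in_cache = false;
    return true;
}
```

Under this model a page a filesystem is actively using simply stays in the cache, which is the property truncation by itself does not give.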




* Re: [PATCH] [13/16] HWPOISON: The high level memory error handler in the VM v3
  2009-06-02 12:51                       ` Wu Fengguang
@ 2009-06-03 10:21                         ` Jens Axboe
  -1 siblings, 0 replies; 232+ messages in thread
From: Jens Axboe @ 2009-06-03 10:21 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Nick Piggin, Andi Kleen, hugh, riel, akpm, chris.mason,
	linux-kernel, linux-mm

On Tue, Jun 02 2009, Wu Fengguang wrote:
> > And then this is possible because you aren't violating mm
> > assumptions due to 1b. This proceeds just as the existing
> > pagecache mce error handler case which exists now.
> 
> Yeah that's a good scheme - we are talking about two interception
> scheme. Mine is passive one and yours is active one.
> 
> passive: check hwpoison pages at __generic_make_request()/elv_next_request() 
>          (the code will be enabled by an mce_bad_io_pages counter)

That's not a feasible approach at all, it'll add an O(N) scan of the bio
at queue time. Ditto for the elv_next_request() approach.

What would be cheaper is to check the pages at dma map time, since you
have to scan the request anyway. That means putting it in
blk_rq_map_sg() or similar.
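Jens' point can be sketched as a toy model: gate the per-page poison test behind a global "any poisoned pages at all?" counter so the common case pays a single compare, and do the per-page walk only where the request is scanned anyway. The names below are illustrative stand-ins, not the real block-layer API.

```c
#include <assert.h>

/* Models the proposed mce_bad_io_pages counter: zero means no
 * poisoned pages exist anywhere, so the per-page check is skipped. */
static int toy_bad_io_pages;

struct toy_page {
	int poisoned;
};

/* Toy stand-in for blk_rq_map_sg(): walk the request's pages (which
 * the real function must do anyway) and filter out poisoned ones. */
static int toy_rq_map_sg(struct toy_page **pages, int n)
{
	int i, mapped = 0;

	for (i = 0; i < n; i++) {
		if (toy_bad_io_pages && pages[i]->poisoned)
			continue;	/* drop poisoned page from the list */
		mapped++;
	}
	return mapped;
}
```

With the counter at zero the poison branch is never taken, so unaffected systems pay almost nothing.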

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 232+ messages in thread

* Re: [PATCH] [13/16] HWPOISON: The high level memory error handler in the VM v3
  2009-06-03  9:35                             ` Nick Piggin
@ 2009-06-03 11:24                               ` Andi Kleen
  -1 siblings, 0 replies; 232+ messages in thread
From: Andi Kleen @ 2009-06-03 11:24 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andi Kleen, Wu Fengguang, hugh, riel, akpm, chris.mason,
	linux-kernel, linux-mm

On Wed, Jun 03, 2009 at 11:35:46AM +0200, Nick Piggin wrote:
> On Tue, Jun 02, 2009 at 07:27:15PM +0200, Andi Kleen wrote:
> > > Hmm, if you're handling buffercache here then possibly yes.
> > 
> > Good question, will check.
> 
> BTW. now that I think about it, buffercache is probably not a good
> idea to truncate (truncate, as-in: remove from pagecache). Because
> filesystems can assume that with just a reference on the page, then
> it will not be truncated.

Yes I understand. Need to check for this, but I'm not sure
how we can reliably detect it based on the struct page alone. I guess we have 
to look at the mapping.

> So I think it would be a good idea to exclude buffercache from
> here completely until it can be shown to be safe. Actually you

Agreed. Just need to figure out how.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 232+ messages in thread

* Re: [PATCH] [13/16] HWPOISON: The high level memory error handler in the VM v3
  2009-06-01 18:32               ` Andi Kleen
@ 2009-06-03 15:51                 ` Wu Fengguang
  -1 siblings, 0 replies; 232+ messages in thread
From: Wu Fengguang @ 2009-06-03 15:51 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Nick Piggin, hugh, riel, akpm, chris.mason, linux-kernel, linux-mm

On Tue, Jun 02, 2009 at 02:32:25AM +0800, Andi Kleen wrote:
[snip]
> > > > > Clean swap cache pages can be directly isolated. A later page fault will bring
> > > > > in the known good data from disk.
> > > > 
> > > > OK, but why do you ClearPageUptodate if it is just to be deleted from
> > > > swapcache anyway?
> > > 
> > > The ClearPageUptodate() is kind of a careless addition, in the hope
> > > that it will stop some random readers. Need more investigations.
> > 
> > OK. But it just muddies the waters in the meantime, so maybe take
> > such things out until there is a case for them.
> 
> It's gone

Andi, I'd recommend re-adding ClearPageUptodate() for dirty swap cache
pages. It will then make shmem_getpage() return EIO for
- shmem_fault()        => kill app with VM_FAULT_SIGBUS
- shmem_readpage()     => fail splice()/sendfile() etc.
- shmem_write_begin()  => fail splice()/sendfile() etc.
which is exactly what we wanted. Note that the EIO here is permanent.
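The effect being described can be sketched as a toy model of how a cleared uptodate bit propagates out of shmem_getpage(). This is a simplification with invented names and constants; the real shmem paths are considerably more involved.

```c
#include <assert.h>

#define TOY_EIO			5	/* stand-in for -EIO */
#define TOY_VM_FAULT_SIGBUS	2	/* stand-in for VM_FAULT_SIGBUS */

struct toy_page {
	int uptodate;	/* models PageUptodate(); cleared on poison */
};

/* shmem_getpage(): a page that is not uptodate yields an error, and
 * the error is permanent because nothing ever sets the bit again on
 * a poisoned page. */
static int toy_shmem_getpage(struct toy_page *p)
{
	return p->uptodate ? 0 : -TOY_EIO;
}

/* shmem_fault(): the -EIO becomes SIGBUS for the faulting process;
 * the readpage/write_begin paths fail splice()/sendfile() similarly. */
static int toy_shmem_fault(struct toy_page *p)
{
	return toy_shmem_getpage(p) < 0 ? TOY_VM_FAULT_SIGBUS : 0;
}
```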

I'll continue to do some experiments on its normal read/write behaviors.

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 232+ messages in thread

* Re: [PATCH] [13/16] HWPOISON: The high level memory error handler in the VM v3
  2009-06-03 15:51                 ` Wu Fengguang
@ 2009-06-03 16:05                   ` Andi Kleen
  -1 siblings, 0 replies; 232+ messages in thread
From: Andi Kleen @ 2009-06-03 16:05 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andi Kleen, Nick Piggin, hugh, riel, akpm, chris.mason,
	linux-kernel, linux-mm

On Wed, Jun 03, 2009 at 11:51:33PM +0800, Wu Fengguang wrote:
> On Tue, Jun 02, 2009 at 02:32:25AM +0800, Andi Kleen wrote:
> [snip]
> > > > > > Clean swap cache pages can be directly isolated. A later page fault will bring
> > > > > > in the known good data from disk.
> > > > > 
> > > > > OK, but why do you ClearPageUptodate if it is just to be deleted from
> > > > > swapcache anyway?
> > > > 
> > > > The ClearPageUptodate() is kind of a careless addition, in the hope
> > > > that it will stop some random readers. Need more investigations.
> > > 
> > > OK. But it just muddies the waters in the meantime, so maybe take
> > > such things out until there is a case for them.
> > 
> > It's gone
> 
> Andi, I'd recommend to re-add ClearPageUptodate() for dirty swap cache
> pages. It will then make shmem_getpage() return EIO for
> - shmem_fault()        => kill app with VM_FAULT_SIGBUS
> - shmem_readpage()     => fail splice()/sendfile() etc.
> - shmem_write_begin()  => fail splice()/sendfile() etc.
> which is exactly what we wanted. Note that the EIO here is permanent.

Done.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 232+ messages in thread

* Re: [PATCH] [13/16] HWPOISON: The high level memory error handler in  the VM v3
  2009-05-28 14:50             ` Wu Fengguang
@ 2009-06-04  6:25               ` Nai Xia
  -1 siblings, 0 replies; 232+ messages in thread
From: Nai Xia @ 2009-06-04  6:25 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andi Kleen, Nick Piggin, hugh, riel, akpm, chris.mason,
	linux-kernel, linux-mm

On Thu, May 28, 2009 at 10:50 PM, Wu Fengguang <fengguang.wu@intel.com> wrote:
> On Thu, May 28, 2009 at 09:45:20PM +0800, Andi Kleen wrote:
>> On Thu, May 28, 2009 at 02:08:54PM +0200, Nick Piggin wrote:
>
> [snip]
>
>> >
>> > BTW. I don't know if you are checking for PG_writeback often enough?
>> > You can't remove a PG_writeback page from pagecache. The normal
>> > pattern is lock_page(page); wait_on_page_writeback(page); which I
>>
>> So pages can be in writeback without being locked? I still
>> wasn't able to find such a case (in fact unless I'm misreading
>> the code badly the writeback bit is only used by NFS and a few
>> obscure cases)
>
> Yes the writeback page is typically not locked. Only read IO requires
> to be exclusive. Read IO is in fact page *writer*, while writeback IO
> is page *reader* :-)

Sorry if this is somewhat off topic,
I am trying to get a good understanding of PG_writeback & PG_locked ;)

So you are saying PG_writeback & PG_locked act like a read/write lock?
I notice wait_on_page_writeback(page) seems to always be called with
the page locked -- that is the semantics of a writer waiting to take a
lock while it is held by some reader: the callers (e.g.
truncate_inode_pages_range() and invalidate_inode_pages2_range()) are
the writers waiting for the writeback readers (as you clarified) to
finish their job, right?

So do you think it would be sane to group the two bits together to
form a real read/write lock, one which does not care about the
_number_ of readers?


>
> The writeback bit is _widely_ used.  test_set_page_writeback() is
> directly used by NFS/AFS etc. But its main user is in fact
> set_page_writeback(), which is called in 26 places.
>
>> > think would be safest
>>
>> Okay. I'll just add it after the page lock.
>>
>> > (then you never have to bother with the writeback bit again)
>>
>> Until Fengguang does something fancy with it.
>
> Yes I'm going to do it without wait_on_page_writeback().
>
> The reason truncate_inode_pages_range() has to wait on writeback page
> is to ensure data integrity. Otherwise if there comes two events:
>        truncate page A at offset X
>        populate page B at offset X
> If A and B are all writeback pages, then B can hit disk first and then
> be overwritten by A. Which corrupts the data at offset X from user's POV.
>
> But for hwpoison, there are no such worries. If A is poisoned, we do
> our best to isolate it as well as intercepting its IO. If the interception
> fails, it will trigger another machine check before hitting the disk.
>
> After all, poisoned A means the data at offset X is already corrupted.
> It doesn't matter if there comes another B page.
>
> Thanks,
> Fengguang
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
>

^ permalink raw reply	[flat|nested] 232+ messages in thread

* Re: [PATCH] [13/16] HWPOISON: The high level memory error handler in  the VM v3
  2009-06-04  6:25               ` Nai Xia
@ 2009-06-07 16:02                 ` Wu Fengguang
  -1 siblings, 0 replies; 232+ messages in thread
From: Wu Fengguang @ 2009-06-07 16:02 UTC (permalink / raw)
  To: Nai Xia
  Cc: Andi Kleen, Nick Piggin, hugh, riel, akpm, chris.mason,
	linux-kernel, linux-mm

On Thu, Jun 04, 2009 at 02:25:24PM +0800, Nai Xia wrote:
> On Thu, May 28, 2009 at 10:50 PM, Wu Fengguang <fengguang.wu@intel.com> wrote:
> > On Thu, May 28, 2009 at 09:45:20PM +0800, Andi Kleen wrote:
> >> On Thu, May 28, 2009 at 02:08:54PM +0200, Nick Piggin wrote:
> >
> > [snip]
> >
> >> >
> >> > BTW. I don't know if you are checking for PG_writeback often enough?
> >> > You can't remove a PG_writeback page from pagecache. The normal
> >> > pattern is lock_page(page); wait_on_page_writeback(page); which I
> >>
> >> So pages can be in writeback without being locked? I still
> >> wasn't able to find such a case (in fact unless I'm misreading
> >> the code badly the writeback bit is only used by NFS and a few
> >> obscure cases)
> >
> > Yes the writeback page is typically not locked. Only read IO requires
> > to be exclusive. Read IO is in fact page *writer*, while writeback IO
> > is page *reader* :-)
> 
> Sorry for maybe somewhat a little bit off topic,
> I am trying to get a good understanding of PG_writeback & PG_locked ;)
> 
> So you are saying PG_writeback & PG_locked are acting like a read/write lock?
> I notice wait_on_page_writeback(page) seems always called with page locked --

No. Note that pages are not locked in wait_on_page_writeback_range().

> that is the semantics of a writer waiting to get the lock while it's
> acquired by
> some reader:The caller(e.g. truncate_inode_pages_range()  and
> invalidate_inode_pages2_range()) are the writers waiting for
> writeback readers (as you clarified ) to finish their job, right ?

Sorry if my metaphor confused you. These are not typical reader/writer
problems; they are more about data integrity.

Pages have to be "not under writeback" when truncated.
Otherwise data loss is possible:

1) create a file with one page (page A)
2) truncate page A that is under writeback
3) write to file, which creates page B
4) sync file, which sends page B to disk quickly

Now if page B reaches disk before A, the new data will be overwritten
by truncated old data, which corrupts the file.
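The hazard in steps 1)-4) reduces to a last-completion-wins model of the disk block at offset X (illustrative only): if truncate does not wait for page A's in-flight writeback, A's IO may complete after B's and leave stale data on disk.

```c
#include <assert.h>

/* One disk block at a given offset; whichever IO completes last wins. */
struct toy_disk {
	char data;
};

static void toy_io_complete(struct toy_disk *d, char payload)
{
	d->data = payload;
}
```

Waiting on writeback before the truncate guarantees A's completion is ordered before B's submission, so B's data is always the last to land.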

> So do you think the idea is sane to group the two bits together
> to form a real read/write lock, which does not care about the _number_
> of readers ?

We don't care about the number of readers here. So please forget about it.

Thanks,
Fengguang

> > The writeback bit is _widely_ used.  test_set_page_writeback() is
> > directly used by NFS/AFS etc. But its main user is in fact
> > set_page_writeback(), which is called in 26 places.
> >
> >> > think would be safest
> >>
> >> Okay. I'll just add it after the page lock.
> >>
> >> > (then you never have to bother with the writeback bit again)
> >>
> >> Until Fengguang does something fancy with it.
> >
> > Yes I'm going to do it without wait_on_page_writeback().
> >
> > The reason truncate_inode_pages_range() has to wait on writeback page
> > is to ensure data integrity. Otherwise if there comes two events:
> >        truncate page A at offset X
> >        populate page B at offset X
> > If A and B are all writeback pages, then B can hit disk first and then
> > be overwritten by A. Which corrupts the data at offset X from user's POV.
> >
> > But for hwpoison, there are no such worries. If A is poisoned, we do
> > our best to isolate it as well as intercepting its IO. If the interception
> > fails, it will trigger another machine check before hitting the disk.
> >
> > After all, poisoned A means the data at offset X is already corrupted.
> > It doesn't matter if there comes another B page.
> >
> > Thanks,
> > Fengguang
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > Please read the FAQ at  http://www.tux.org/lkml/
> >

^ permalink raw reply	[flat|nested] 232+ messages in thread

* Re: [PATCH] [13/16] HWPOISON: The high level memory error handler in  the VM v3
  2009-06-07 16:02                 ` Wu Fengguang
@ 2009-06-08 11:06                   ` Nai Xia
  -1 siblings, 0 replies; 232+ messages in thread
From: Nai Xia @ 2009-06-08 11:06 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andi Kleen, Nick Piggin, hugh, riel, akpm, chris.mason,
	linux-kernel, linux-mm

On Mon, Jun 8, 2009 at 12:02 AM, Wu Fengguang<fengguang.wu@intel.com> wrote:
> On Thu, Jun 04, 2009 at 02:25:24PM +0800, Nai Xia wrote:
>> On Thu, May 28, 2009 at 10:50 PM, Wu Fengguang <fengguang.wu@intel.com> wrote:
>> > On Thu, May 28, 2009 at 09:45:20PM +0800, Andi Kleen wrote:
>> >> On Thu, May 28, 2009 at 02:08:54PM +0200, Nick Piggin wrote:
>> >
>> > [snip]
>> >
>> >> >
>> >> > BTW. I don't know if you are checking for PG_writeback often enough?
>> >> > You can't remove a PG_writeback page from pagecache. The normal
>> >> > pattern is lock_page(page); wait_on_page_writeback(page); which I
>> >>
>> >> So pages can be in writeback without being locked? I still
>> >> wasn't able to find such a case (in fact unless I'm misreading
>> >> the code badly the writeback bit is only used by NFS and a few
>> >> obscure cases)
>> >
>> > Yes the writeback page is typically not locked. Only read IO requires
>> > to be exclusive. Read IO is in fact page *writer*, while writeback IO
>> > is page *reader* :-)
>>
>> Sorry for maybe somewhat a little bit off topic,
>> I am trying to get a good understanding of PG_writeback & PG_locked ;)
>>
>> So you are saying PG_writeback & PG_locked are acting like a read/write lock?
>> I notice wait_on_page_writeback(page) seems always called with page locked --
>
> No. Note that pages are not locked in wait_on_page_writeback_range().

I see. This function seems to be called mostly on the sync path;
it just waits for data to be synchronized to disk.
There are no writers from the pages' POV, so no lock.
I missed this case, but my argument about the read/write lock role
still seems consistent. :)

>
>> that is the semantics of a writer waiting to get the lock while it's
>> acquired by
>> some reader:The caller(e.g. truncate_inode_pages_range()  and
>> invalidate_inode_pages2_range()) are the writers waiting for
>> writeback readers (as you clarified ) to finish their job, right ?
>
> Sorry if my metaphor confused you. But they are not typical
> reader/writer problems, but more about data integrities.

No, you didn't :)
Actually, you made the mixed roles of those bits clear to me.

>
> Pages have to be "not under writeback" when truncated.
> Otherwise data lost is possible:
>
> 1) create a file with one page (page A)
> 2) truncate page A that is under writeback
> 3) write to file, which creates page B
> 4) sync file, which sends page B to disk quickly
>
> Now if page B reaches disk before A, the new data will be overwritten
> by truncated old data, which corrupts the file.

I fully understand this scenario which you had already clarified in a
previous message. :)

1. someone makes index1 -> page A
2. Path P1 acts as a *reader* of the cache page at index1 by
    setting PG_writeback on, while at the same time being a *writer*
    of the corresponding file blocks.
3. Another path P2 comes in and truncates page A; it is a writer
    of the same cache page.
4. Yet another path P3 comes as a writer of the cache page,
     making it point to page B: index1 --> page B.
5. Path P4 comes and writes back the cache page (and sets PG_writeback).
   It is a reader of the cache page and a writer of the file blocks.

The corruption occurs because P1 & P4 race when writing the file blocks.
But the _root_ of this race is that nothing serializes them on the
file-block side, and the stream of reads above was inconsistent because
of the writers (P2 & P3) to the cache page at index1.

Note that the "sync file" step is somewhat irrelevant: even without it,
the race may still exist. I know you wanted to show me that it makes
the corruption easier to trigger.

So I think the simple logic is:
1) if you want to truncate/change the mapping from an index to a struct *page,
test the writeback bit, because the path writing back the file blocks is a
reader of this mapping.
2) if a writeback path wants to start a "read" of this mapping with
test_set_page_writeback() or set_page_writeback(), he had better be sure the
page is locked, to keep out writers to this index --> struct *page mapping.

This is really the behavior of a read/write lock, right?

wait_on_page_writeback_range() looks different only because "sync"
operates on "struct page"; it is not sensitive to the index --> struct *page
mapping. It does not care whether the pages returned by pagevec_lookup_tag()
still hold that mapping when wait_on_page_writeback(page) runs.
Here, PG_writeback is only a status flag for the "struct page", not a lock
bit for the index --> struct *page mapping.

>
>> So do you think the idea is sane to group the two bits together
>> to form a real read/write lock, which does not care about the _number_
>> of readers ?
>
> We don't care number of readers here. So please forget about it.
Yeah, I meant that the number of readers is not important.

I still hold that these two bits in some ways act like a _sparse_
read/write lock. But I am going to drop the idea of making them a pure
lock, since PG_writeback does have another meaning -- the page is being
written back; for the sync path it is only a status flag.
Making them a pure read/write lock would definitely lose that, or at least
distort it.


I hope I've made myself understandable; correct me if I'm wrong, and
many thanks for your time and patience. :-)


Nai Xia

>
> Thanks,
> Fengguang
>
>> > The writeback bit is _widely_ used.  test_set_page_writeback() is
>> > directly used by NFS/AFS etc. But its main user is in fact
>> > set_page_writeback(), which is called in 26 places.
>> >
>> >> > think would be safest
>> >>
>> >> Okay. I'll just add it after the page lock.
>> >>
>> >> > (then you never have to bother with the writeback bit again)
>> >>
>> >> Until Fengguang does something fancy with it.
>> >
>> > Yes I'm going to do it without wait_on_page_writeback().
>> >
>> > The reason truncate_inode_pages_range() has to wait on writeback page
>> > is to ensure data integrity. Otherwise if there comes two events:
>> >        truncate page A at offset X
>> >        populate page B at offset X
>> > If A and B are all writeback pages, then B can hit disk first and then
>> > be overwritten by A. Which corrupts the data at offset X from user's POV.
>> >
>> > But for hwpoison, there are no such worries. If A is poisoned, we do
>> > our best to isolate it as well as intercepting its IO. If the interception
>> > fails, it will trigger another machine check before hitting the disk.
>> >
>> > After all, poisoned A means the data at offset X is already corrupted.
>> > It doesn't matter if there comes another B page.
>> >
>> > Thanks,
>> > Fengguang
>> > --
>> > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
>> > the body of a message to majordomo@vger.kernel.org
>> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> > Please read the FAQ at  http://www.tux.org/lkml/
>> >
>

^ permalink raw reply	[flat|nested] 232+ messages in thread


* Re: [PATCH] [13/16] HWPOISON: The high level memory error handler in  the VM v3
  2009-06-08 11:06                   ` Nai Xia
@ 2009-06-08 12:31                     ` Wu Fengguang
  -1 siblings, 0 replies; 232+ messages in thread
From: Wu Fengguang @ 2009-06-08 12:31 UTC (permalink / raw)
  To: Nai Xia
  Cc: Andi Kleen, Nick Piggin, hugh, riel, akpm, chris.mason,
	linux-kernel, linux-mm

On Mon, Jun 08, 2009 at 07:06:12PM +0800, Nai Xia wrote:
> On Mon, Jun 8, 2009 at 12:02 AM, Wu Fengguang<fengguang.wu@intel.com> wrote:
> > On Thu, Jun 04, 2009 at 02:25:24PM +0800, Nai Xia wrote:
> >> On Thu, May 28, 2009 at 10:50 PM, Wu Fengguang <fengguang.wu@intel.com> wrote:
> >> > On Thu, May 28, 2009 at 09:45:20PM +0800, Andi Kleen wrote:
> >> >> On Thu, May 28, 2009 at 02:08:54PM +0200, Nick Piggin wrote:
> >> >
> >> > [snip]
> >> >
> >> >> >
> >> >> > BTW. I don't know if you are checking for PG_writeback often enough?
> >> >> > You can't remove a PG_writeback page from pagecache. The normal
> >> >> > pattern is lock_page(page); wait_on_page_writeback(page); which I
> >> >>
> >> >> So pages can be in writeback without being locked? I still
> >> >> wasn't able to find such a case (in fact unless I'm misreading
> >> >> the code badly the writeback bit is only used by NFS and a few
> >> >> obscure cases)
> >> >
> >> > Yes the writeback page is typically not locked. Only read IO requires
> >> > to be exclusive. Read IO is in fact page *writer*, while writeback IO
> >> > is page *reader* :-)
> >>
> >> Sorry for maybe somewhat a little bit off topic,
> >> I am trying to get a good understanding of PG_writeback & PG_locked ;)
> >>
> >> So you are saying PG_writeback & PG_locked are acting like a read/write lock?
> >> I notice wait_on_page_writeback(page) seems always called with page locked --
> >
> > No. Note that pages are not locked in wait_on_page_writeback_range().
> 
> I see. This function seems mostly called  on the sync path,
> it just waits for data being synchronized to disk.
> No writers from the pages' POV, so no lock.
> I missed this case, but my argument about the role of read/write lock.
> seems still consistent. :)

It's more constrained. Normal read/write locks allow concurrent readers,
however fsync() must wait for previous IO to finish before starting
its own IO.

> >
> >> that is the semantics of a writer waiting to get the lock while it's
> >> acquired by
> >> some reader:The caller(e.g. truncate_inode_pages_range()  and
> >> invalidate_inode_pages2_range()) are the writers waiting for
> >> writeback readers (as you clarified ) to finish their job, right ?
> >
> > Sorry if my metaphor confused you. But they are not typical
> > reader/writer problems, but more about data integrities.
> 
> No, you didn't :)
> Actually, you make me clear about the mixed roles for
> those bits.
> 
> >
> > Pages have to be "not under writeback" when truncated.
> > Otherwise data lost is possible:
> >
> > 1) create a file with one page (page A)
> > 2) truncate page A that is under writeback
> > 3) write to file, which creates page B
> > 4) sync file, which sends page B to disk quickly
> >
> > Now if page B reaches disk before A, the new data will be overwritten
> > by truncated old data, which corrupts the file.
> 
> I fully understand this scenario which you had already clarified in a
> previous message. :)
> 
> 1. someone make index1-> page A
> 2. Path P1 is acting as a *reader* to a cache page at index1 by
>     setting PG_writeback on, while at the same time as a *writer* to
>     the corresponding file blocks.
> 3. Another path P2 comes in and  truncate page A, he is the writer
>     to the same cache page.
> 4. Yet another path P3 comes  as the writer to the cache page
>      making it points to page B: index1--> page B.
> 5. Path P4 comes writing back the cache page(and set PG_writeback).
>    He is the reader of the cache page and the writer to the file blocks.
> 
> The corrupts occur because P1 & P4 races when writing file blocks.
> But the _root_ of this racing is because nothing is used to serialize
> them on the side of writing the file blocks and above stream reading was
> inconsistent because of the writers(P2 & P3) to cache page at index1.
>
> Note that the "sync file" is somewhat irrelevant, even without "sync file",
> the racing still may exists. I know you must want to show me that this could
> make the corruption more easy to occur.
>
> So I think the simple logic is:
> 1) if you want to truncate/change the mapping from a index to a struct *page,
> test writeback bit because the writebacker to the file blocks is the reader
> of this mapping.
> 2) if a writebacker want to start a read of this mapping with
> test_set_page_writeback()
> or set_page_writeback(), he'd be sure this page is locked to keep out the
> writers to this mapping of index-->struct *page.
> 
> This is really behavior of a read/write lock, right ?

Please, that's a dangerous idea. A page can be written to at any time
when writeback to disk is under way. Does PG_writeback (your reader
lock) prevent page data writers?  NO.

Thanks,
Fengguang

> wait_on_page_writeback_range() looks different only because "sync"
> operates on "struct page", it's not sensitive to index-->struct *page mapping.
> It does care about if pages returned by pagevec_lookup_tag() are
> still maintains the mapping when wait_on_page_writeback(page).
> Here, PG_writeback is only a status flag for "struct page" not a lock bit for
> index->struct *page mapping.
> 
> >
> >> So do you think the idea is sane to group the two bits together
> >> to form a real read/write lock, which does not care about the _number_
> >> of readers ?
> >
> > We don't care number of readers here. So please forget about it.
> Yeah, I meant number of readers is not important.
> 
> I still hold that these two bits in some way act like a _sparse_
> read/write lock.
> But I am going to drop the idea of making them a pure lock, since PG_writeback
> does has other meaning -- the page is being writing back: for sync
> path, it's only
> a status flag.
> Making a pure read/write lock definitely will lose that or at least distort it.
> 
> 
> Hoping I've made my words understandable, correct me if wrong, and
> many thanks for your time and patience. :-)
> 
> 
> Nai Xia
> 
> >
> > Thanks,
> > Fengguang
> >
> >> > The writeback bit is _widely_ used.  test_set_page_writeback() is
> >> > directly used by NFS/AFS etc. But its main user is in fact
> >> > set_page_writeback(), which is called in 26 places.
> >> >
> >> >> > think would be safest
> >> >>
> >> >> Okay. I'll just add it after the page lock.
> >> >>
> >> >> > (then you never have to bother with the writeback bit again)
> >> >>
> >> >> Until Fengguang does something fancy with it.
> >> >
> >> > Yes I'm going to do it without wait_on_page_writeback().
> >> >
> >> > The reason truncate_inode_pages_range() has to wait on writeback page
> >> > is to ensure data integrity. Otherwise if there comes two events:
> >> >        truncate page A at offset X
> >> >        populate page B at offset X
> >> > If A and B are all writeback pages, then B can hit disk first and then
> >> > be overwritten by A. Which corrupts the data at offset X from user's POV.
> >> >
> >> > But for hwpoison, there are no such worries. If A is poisoned, we do
> >> > our best to isolate it as well as intercepting its IO. If the interception
> >> > fails, it will trigger another machine check before hitting the disk.
> >> >
> >> > After all, poisoned A means the data at offset X is already corrupted.
> >> > It doesn't matter if there comes another B page.
> >> >
> >> > Thanks,
> >> > Fengguang
> >> > --
> >> > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> >> > the body of a message to majordomo@vger.kernel.org
> >> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >> > Please read the FAQ at  http://www.tux.org/lkml/
> >> >
> >

^ permalink raw reply	[flat|nested] 232+ messages in thread


* Re: [PATCH] [13/16] HWPOISON: The high level memory error handler in  the VM v3
  2009-06-08 12:31                     ` Wu Fengguang
@ 2009-06-08 14:46                       ` Nai Xia
  -1 siblings, 0 replies; 232+ messages in thread
From: Nai Xia @ 2009-06-08 14:46 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andi Kleen, Nick Piggin, hugh, riel, akpm, chris.mason,
	linux-kernel, linux-mm

On Mon, Jun 8, 2009 at 8:31 PM, Wu Fengguang<fengguang.wu@intel.com> wrote:
> On Mon, Jun 08, 2009 at 07:06:12PM +0800, Nai Xia wrote:
>> On Mon, Jun 8, 2009 at 12:02 AM, Wu Fengguang<fengguang.wu@intel.com> wrote:
>> > On Thu, Jun 04, 2009 at 02:25:24PM +0800, Nai Xia wrote:
>> >> On Thu, May 28, 2009 at 10:50 PM, Wu Fengguang <fengguang.wu@intel.com> wrote:
>> >> > On Thu, May 28, 2009 at 09:45:20PM +0800, Andi Kleen wrote:
>> >> >> On Thu, May 28, 2009 at 02:08:54PM +0200, Nick Piggin wrote:
>> >> >
>> >> > [snip]
>> >> >
>> >> >> >
>> >> >> > BTW. I don't know if you are checking for PG_writeback often enough?
>> >> >> > You can't remove a PG_writeback page from pagecache. The normal
>> >> >> > pattern is lock_page(page); wait_on_page_writeback(page); which I
>> >> >>
>> >> >> So pages can be in writeback without being locked? I still
>> >> >> wasn't able to find such a case (in fact unless I'm misreading
>> >> >> the code badly the writeback bit is only used by NFS and a few
>> >> >> obscure cases)
>> >> >
>> >> > Yes the writeback page is typically not locked. Only read IO requires
>> >> > to be exclusive. Read IO is in fact page *writer*, while writeback IO
>> >> > is page *reader* :-)
>> >>
>> >> Sorry for maybe somewhat a little bit off topic,
>> >> I am trying to get a good understanding of PG_writeback & PG_locked ;)
>> >>
>> >> So you are saying PG_writeback & PG_locked are acting like a read/write lock?
>> >> I notice wait_on_page_writeback(page) seems always called with page locked --
>> >
>> > No. Note that pages are not locked in wait_on_page_writeback_range().
>>
>> I see. This function seems mostly called  on the sync path,
>> it just waits for data being synchronized to disk.
>> No writers from the pages' POV, so no lock.
>> I missed this case, but my argument about the role of read/write lock.
>> seems still consistent. :)
>
> It's more constrained. Normal read/write locks allow concurrent readers,
> however fsync() must wait for previous IO to finish before starting
> its own IO.

Oh, yes, this is what I called "mixed roles": one use as a lock, one as a
status flag, twisted together on the same path, which breaks the pure
read-lock semantics.

>
>> >
>> >> that is the semantics of a writer waiting to get the lock while it's
>> >> acquired by
>> >> some reader:The caller(e.g. truncate_inode_pages_range()  and
>> >> invalidate_inode_pages2_range()) are the writers waiting for
>> >> writeback readers (as you clarified ) to finish their job, right ?
>> >
>> > Sorry if my metaphor confused you. But they are not typical
>> > reader/writer problems, but more about data integrities.
>>
>> No, you didn't :)
>> Actually, you make me clear about the mixed roles for
>> those bits.
>>
>> >
>> > Pages have to be "not under writeback" when truncated.
>> > Otherwise data lost is possible:
>> >
>> > 1) create a file with one page (page A)
>> > 2) truncate page A that is under writeback
>> > 3) write to file, which creates page B
>> > 4) sync file, which sends page B to disk quickly
>> >
>> > Now if page B reaches disk before A, the new data will be overwritten
>> > by truncated old data, which corrupts the file.
>>
>> I fully understand this scenario which you had already clarified in a
>> previous message. :)
>>
>> 1. someone make index1-> page A
>> 2. Path P1 is acting as a *reader* to a cache page at index1 by
>>     setting PG_writeback on, while at the same time as a *writer* to
>>     the corresponding file blocks.
>> 3. Another path P2 comes in and  truncate page A, he is the writer
>>     to the same cache page.
>> 4. Yet another path P3 comes  as the writer to the cache page
>>      making it points to page B: index1--> page B.
>> 5. Path P4 comes writing back the cache page(and set PG_writeback).
>>    He is the reader of the cache page and the writer to the file blocks.
>>
>> The corrupts occur because P1 & P4 races when writing file blocks.
>> But the _root_ of this racing is because nothing is used to serialize
>> them on the side of writing the file blocks and above stream reading was
>> inconsistent because of the writers(P2 & P3) to cache page at index1.
>>
>> Note that the "sync file" is somewhat irrelevant, even without "sync file",
>> the racing still may exists. I know you must want to show me that this could
>> make the corruption more easy to occur.
>>
>> So I think the simple logic is:
>> 1) if you want to truncate/change the mapping from a index to a struct *page,
>> test writeback bit because the writebacker to the file blocks is the reader
>> of this mapping.
>> 2) if a writebacker want to start a read of this mapping with
>> test_set_page_writeback()
>> or set_page_writeback(), he'd be sure this page is locked to keep out the
>> writers to this mapping of index-->struct *page.
>>
>> This is really behavior of a read/write lock, right ?
>
> Please, that's a dangerous idea. A page can be written to at any time
> when writeback to disk is under way. Does PG_writeback (your reader
> lock) prevents page data writers?  NO.

I meant PG_writeback stops writers to the index-->struct page mapping.

I think I should make my statements more concise and the "reader/writer"
terms less vague.

Here we care about write/read operations on the index-->struct page mapping,
not read/write operations on the page content.

Anyone who wants to change this mapping is a writer; he should take
the page lock.
Anyone who wants to reference this mapping is a reader; writers should
wait for him. And when this reader wants to take a reference, he should wait
for anyone who is changing this mapping (e.g. a page truncator).

When a path sets PG_writeback on a page, it needs this index-->struct page
mapping to be 100% valid, right? (Otherwise it may lead to corruption.)
So writeback routines are readers of this index-->struct page mapping.
(Oh well, if we can put the other role of PG_writeback aside.)

OK, OK, since PG_locked means much more than just protecting
the per-page mapping, which makes the lock abstraction even less clear,
indeed, forget about it.

>
> Thanks,
> Fengguang
>
>> wait_on_page_writeback_range() looks different only because "sync"
>> operates on "struct page", it's not sensitive to index-->struct *page mapping.
>> It does care about if pages returned by pagevec_lookup_tag() are
>> still maintains the mapping when wait_on_page_writeback(page).
>> Here, PG_writeback is only a status flag for "struct page" not a lock bit for
>> index->struct *page mapping.
>>
>> >
>> >> So do you think the idea is sane to group the two bits together
>> >> to form a real read/write lock, which does not care about the _number_
>> >> of readers ?
>> >
>> > We don't care number of readers here. So please forget about it.
>> Yeah, I meant number of readers is not important.
>>
>> I still hold that these two bits in some way act like a _sparse_
>> read/write lock.
>> But I am going to drop the idea of making them a pure lock, since PG_writeback
>> does has other meaning -- the page is being writing back: for sync
>> path, it's only
>> a status flag.
>> Making a pure read/write lock definitely will lose that or at least distort it.
>>
>>
>> Hoping I've made my words understandable, correct me if wrong, and
>> many thanks for your time and patience. :-)
>>
>>
>> Nai Xia
>>
>> >
>> > Thanks,
>> > Fengguang
>> >
>> >> > The writeback bit is _widely_ used.  test_set_page_writeback() is
>> >> > directly used by NFS/AFS etc. But its main user is in fact
>> >> > set_page_writeback(), which is called in 26 places.
>> >> >
>> >> >> > think would be safest
>> >> >>
>> >> >> Okay. I'll just add it after the page lock.
>> >> >>
>> >> >> > (then you never have to bother with the writeback bit again)
>> >> >>
>> >> >> Until Fengguang does something fancy with it.
>> >> >
>> >> > Yes I'm going to do it without wait_on_page_writeback().
>> >> >
>> >> > The reason truncate_inode_pages_range() has to wait on writeback page
>> >> > is to ensure data integrity. Otherwise if there comes two events:
>> >> >        truncate page A at offset X
>> >> >        populate page B at offset X
>> >> > If A and B are all writeback pages, then B can hit disk first and then
>> >> > be overwritten by A. Which corrupts the data at offset X from user's POV.
>> >> >
>> >> > But for hwpoison, there are no such worries. If A is poisoned, we do
>> >> > our best to isolate it as well as intercepting its IO. If the interception
>> >> > fails, it will trigger another machine check before hitting the disk.
>> >> >
>> >> > After all, poisoned A means the data at offset X is already corrupted.
>> >> > It doesn't matter if there comes another B page.
>> >> >
>> >> > Thanks,
>> >> > Fengguang
>> >> > --
>> >> > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
>> >> > the body of a message to majordomo@vger.kernel.org
>> >> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> >> > Please read the FAQ at  http://www.tux.org/lkml/
>> >> >
>> >
>

^ permalink raw reply	[flat|nested] 232+ messages in thread


* Re: [PATCH] [13/16] HWPOISON: The high level memory error handler in  the VM v3
  2009-06-08 14:46                       ` Nai Xia
@ 2009-06-09  6:48                         ` Wu Fengguang
  -1 siblings, 0 replies; 232+ messages in thread
From: Wu Fengguang @ 2009-06-09  6:48 UTC (permalink / raw)
  To: Nai Xia
  Cc: Andi Kleen, Nick Piggin, hugh, riel, akpm, chris.mason,
	linux-kernel, linux-mm

On Mon, Jun 08, 2009 at 10:46:53PM +0800, Nai Xia wrote:
> On Mon, Jun 8, 2009 at 8:31 PM, Wu Fengguang<fengguang.wu@intel.com> wrote:
> > On Mon, Jun 08, 2009 at 07:06:12PM +0800, Nai Xia wrote:
> >> On Mon, Jun 8, 2009 at 12:02 AM, Wu Fengguang<fengguang.wu@intel.com> wrote:
> >> > On Thu, Jun 04, 2009 at 02:25:24PM +0800, Nai Xia wrote:
> >> >> On Thu, May 28, 2009 at 10:50 PM, Wu Fengguang <fengguang.wu@intel.com> wrote:
> >> >> > On Thu, May 28, 2009 at 09:45:20PM +0800, Andi Kleen wrote:
> >> >> >> On Thu, May 28, 2009 at 02:08:54PM +0200, Nick Piggin wrote:
> >> >> >
> >> >> > [snip]
> >> >> >
> >> >> >> >
> >> >> >> > BTW. I don't know if you are checking for PG_writeback often enough?
> >> >> >> > You can't remove a PG_writeback page from pagecache. The normal
> >> >> >> > pattern is lock_page(page); wait_on_page_writeback(page); which I
> >> >> >>
> >> >> >> So pages can be in writeback without being locked? I still
> >> >> >> wasn't able to find such a case (in fact unless I'm misreading
> >> >> >> the code badly the writeback bit is only used by NFS and a few
> >> >> >> obscure cases)
> >> >> >
> >> >> > Yes the writeback page is typically not locked. Only read IO requires
> >> >> > to be exclusive. Read IO is in fact page *writer*, while writeback IO
> >> >> > is page *reader* :-)
> >> >>
> >> >> Sorry for maybe somewhat a little bit off topic,
> >> >> I am trying to get a good understanding of PG_writeback & PG_locked ;)
> >> >>
> >> >> So you are saying PG_writeback & PG_locked are acting like a read/write lock?
> >> >> I notice wait_on_page_writeback(page) seems always called with page locked --
> >> >
> >> > No. Note that pages are not locked in wait_on_page_writeback_range().
> >>
> >> I see. This function seems mostly called  on the sync path,
> >> it just waits for data being synchronized to disk.
> >> No writers from the pages' POV, so no lock.
> >> I missed this case, but my argument about the role of read/write lock.
> >> seems still consistent. :)
> >
> > It's more constrained. Normal read/write locks allow concurrent readers,
> > however fsync() must wait for previous IO to finish before starting
> > its own IO.
> 
> Oh, yes, this is what I called "mixed roles". One for lock, one for
> status flag, twisted together in the same path, making the read lock
> semantics totally broken.
> 
> >
> >> >
> >> >> that is the semantics of a writer waiting to get the lock while it's
> >> >> acquired by
> >> >> some reader:The caller(e.g. truncate_inode_pages_range()  and
> >> >> invalidate_inode_pages2_range()) are the writers waiting for
> >> >> writeback readers (as you clarified ) to finish their job, right ?
> >> >
> >> > Sorry if my metaphor confused you. But they are not typical
> >> > reader/writer problems, but more about data integrities.
> >>
> >> No, you didn't :)
> >> Actually, you make me clear about the mixed roles for
> >> those bits.
> >>
> >> >
> >> > Pages have to be "not under writeback" when truncated.
> >> > Otherwise data lost is possible:
> >> >
> >> > 1) create a file with one page (page A)
> >> > 2) truncate page A that is under writeback
> >> > 3) write to file, which creates page B
> >> > 4) sync file, which sends page B to disk quickly
> >> >
> >> > Now if page B reaches disk before A, the new data will be overwritten
> >> > by truncated old data, which corrupts the file.
> >>
> >> I fully understand this scenario which you had already clarified in a
> >> previous message. :)
> >>
> >> 1. someone make index1-> page A
> >> 2. Path P1 is acting as a *reader* to a cache page at index1 by
> >>     setting PG_writeback on, while at the same time as a *writer* to
> >>     the corresponding file blocks.
> >> 3. Another path P2 comes in and  truncate page A, he is the writer
> >>     to the same cache page.
> >> 4. Yet another path P3 comes  as the writer to the cache page
> >>      making it points to page B: index1--> page B.
> >> 5. Path P4 comes writing back the cache page(and set PG_writeback).
> >>    He is the reader of the cache page and the writer to the file blocks.
> >>
> >> The corrupts occur because P1 & P4 races when writing file blocks.
> >> But the _root_ of this racing is because nothing is used to serialize
> >> them on the side of writing the file blocks and above stream reading was
> >> inconsistent because of the writers(P2 & P3) to cache page at index1.
> >>
> >> Note that the "sync file" is somewhat irrelevant, even without "sync file",
> >> the racing still may exists. I know you must want to show me that this could
> >> make the corruption more easy to occur.
> >>
> >> So I think the simple logic is:
> >> 1) if you want to truncate/change the mapping from a index to a struct *page,
> >> test writeback bit because the writebacker to the file blocks is the reader
> >> of this mapping.
> >> 2) if a writebacker want to start a read of this mapping with
> >> test_set_page_writeback()
> >> or set_page_writeback(), he'd be sure this page is locked to keep out the
> >> writers to this mapping of index-->struct *page.
> >>
> >> This is really behavior of a read/write lock, right ?
> >
> > Please, that's a dangerous idea. A page can be written to at any time
> > when writeback to disk is under way. Does PG_writeback (your reader
> > lock) prevents page data writers?  NO.
> 
> I meant PG_writeback stops writers to index---->struct page mapping.

It's protected by the radix tree RCU locks. Period.

If you are referring to the reverse mapping: page->mapping is protected
by PG_locked. No one should assume that it won't change under
page writeback.

Thanks,
Fengguang

> I think I should make my statements more concise and the "reader/writer"
> less vague.
> 
> Here we care about the write/read operation for index---->struct page mapping.
> Not for read/write operation for the page content.
> 
> Anyone who wants to change this mapping  is a writer, he should take
> page lock.
> Anyone who wants to reference this mapping is a reader, writers should
> wait for him. And when this reader wants to get ref, he should wait for
> anyone one who is changing this mapping(e.g. page truncater).
> 
> When a path sets PG_writeback on a page, it need this index-->struct page
> mapping be 100% valid right? (otherwise may leads to corruption.)
> So writeback routines are readers of this index-->struct page mapping.
> (oh, well if we can put the other role of PG_writeback aside)
> 
> Ok,Ok, since PG_locked does mean much more than just protecting
> the per-page mapping which makes the lock abstraction even less clear.
> so indeed, forget about it.
> 
> >
> > Thanks,
> > Fengguang
> >
> >> wait_on_page_writeback_range() looks different only because "sync"
> >> operates on "struct page", it's not sensitive to index-->struct *page mapping.
> >> It does care about if pages returned by pagevec_lookup_tag() are
> >> still maintains the mapping when wait_on_page_writeback(page).
> >> Here, PG_writeback is only a status flag for "struct page" not a lock bit for
> >> index->struct *page mapping.
> >>
> >> >
> >> >> So do you think the idea is sane to group the two bits together
> >> >> to form a real read/write lock, which does not care about the _number_
> >> >> of readers ?
> >> >
> >> > We don't care number of readers here. So please forget about it.
> >> Yeah, I meant number of readers is not important.
> >>
> >> I still hold that these two bits in some way act like a _sparse_
> >> read/write lock.
> >> But I am going to drop the idea of making them a pure lock, since PG_writeback
> >> does has other meaning -- the page is being writing back: for sync
> >> path, it's only
> >> a status flag.
> >> Making a pure read/write lock definitely will lose that or at least distort it.
> >>
> >>
> >> Hoping I've made my words understandable, correct me if wrong, and
> >> many thanks for your time and patience. :-)
> >>
> >>
> >> Nai Xia
> >>
> >> >
> >> > Thanks,
> >> > Fengguang
> >> >
> >> >> > The writeback bit is _widely_ used.  test_set_page_writeback() is
> >> >> > directly used by NFS/AFS etc. But its main user is in fact
> >> >> > set_page_writeback(), which is called in 26 places.
> >> >> >
> >> >> >> > think would be safest
> >> >> >>
> >> >> >> Okay. I'll just add it after the page lock.
> >> >> >>
> >> >> >> > (then you never have to bother with the writeback bit again)
> >> >> >>
> >> >> >> Until Fengguang does something fancy with it.
> >> >> >
> >> >> > Yes I'm going to do it without wait_on_page_writeback().
> >> >> >
> >> >> > The reason truncate_inode_pages_range() has to wait on writeback page
> >> >> > is to ensure data integrity. Otherwise if there comes two events:
> >> >> >        truncate page A at offset X
> >> >> >        populate page B at offset X
> >> >> > If A and B are all writeback pages, then B can hit disk first and then
> >> >> > be overwritten by A. Which corrupts the data at offset X from user's POV.
> >> >> >
> >> >> > But for hwpoison, there are no such worries. If A is poisoned, we do
> >> >> > our best to isolate it as well as intercepting its IO. If the interception
> >> >> > fails, it will trigger another machine check before hitting the disk.
> >> >> >
> >> >> > After all, poisoned A means the data at offset X is already corrupted.
> >> >> > It doesn't matter if there comes another B page.
> >> >> >
> >> >> > Thanks,
> >> >> > Fengguang
> >> >> > --
> >> >> > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> >> >> > the body of a message to majordomo@vger.kernel.org
> >> >> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >> >> > Please read the FAQ at  http://www.tux.org/lkml/
> >> >> >
> >> >
> >

^ permalink raw reply	[flat|nested] 232+ messages in thread

* Re: [PATCH] [13/16] HWPOISON: The high level memory error handler in  the VM v3
@ 2009-06-09  6:48                         ` Wu Fengguang
  0 siblings, 0 replies; 232+ messages in thread
From: Wu Fengguang @ 2009-06-09  6:48 UTC (permalink / raw)
  To: Nai Xia
  Cc: Andi Kleen, Nick Piggin, hugh, riel, akpm, chris.mason,
	linux-kernel, linux-mm

On Mon, Jun 08, 2009 at 10:46:53PM +0800, Nai Xia wrote:
> On Mon, Jun 8, 2009 at 8:31 PM, Wu Fengguang<fengguang.wu@intel.com> wrote:
> > On Mon, Jun 08, 2009 at 07:06:12PM +0800, Nai Xia wrote:
> >> On Mon, Jun 8, 2009 at 12:02 AM, Wu Fengguang<fengguang.wu@intel.com> wrote:
> >> > On Thu, Jun 04, 2009 at 02:25:24PM +0800, Nai Xia wrote:
> >> >> On Thu, May 28, 2009 at 10:50 PM, Wu Fengguang <fengguang.wu@intel.com> wrote:
> >> >> > On Thu, May 28, 2009 at 09:45:20PM +0800, Andi Kleen wrote:
> >> >> >> On Thu, May 28, 2009 at 02:08:54PM +0200, Nick Piggin wrote:
> >> >> >
> >> >> > [snip]
> >> >> >
> >> >> >> >
> >> >> >> > BTW. I don't know if you are checking for PG_writeback often enough?
> >> >> >> > You can't remove a PG_writeback page from pagecache. The normal
> >> >> >> > pattern is lock_page(page); wait_on_page_writeback(page); which I
> >> >> >>
> >> >> >> So pages can be in writeback without being locked? I still
> >> >> >> wasn't able to find such a case (in fact unless I'm misreading
> >> >> >> the code badly the writeback bit is only used by NFS and a few
> >> >> >> obscure cases)
> >> >> >
> >> >> > Yes the writeback page is typically not locked. Only read IO requires
> >> >> > to be exclusive. Read IO is in fact page *writer*, while writeback IO
> >> >> > is page *reader* :-)
> >> >>
> >> >> Sorry for maybe somewhat a little bit off topic,
> >> >> I am trying to get a good understanding of PG_writeback & PG_locked ;)
> >> >>
> >> >> So you are saying PG_writeback & PG_locked are acting like a read/write lock?
> >> >> I notice wait_on_page_writeback(page) seems always called with page locked --
> >> >
> >> > No. Note that pages are not locked in wait_on_page_writeback_range().
> >>
> >> I see. This function seems mostly called A on the sync path,
> >> it just waits for data being synchronized to disk.
> >> No writers from the pages' POV, so no lock.
> >> I missed this case, but my argument about the role of read/write lock.
> >> seems still consistent. :)
> >
> > It's more constrained. Normal read/write locks allow concurrent readers,
> > however fsync() must wait for previous IO to finish before starting
> > its own IO.
> 
> Oh, yes, this is what I called "mixed roles". One for lock, one for
> status flag, twisted together in the same path, making the read lock
> semantics totally broken.
> 
> >
> >> >
> >> >> that is the semantics of a writer waiting to get the lock while it's
> >> >> acquired by
> >> >> some reader:The caller(e.g. truncate_inode_pages_range() A and
> >> >> invalidate_inode_pages2_range()) are the writers waiting for
> >> >> writeback readers (as you clarified ) to finish their job, right ?
> >> >
> >> > Sorry if my metaphor confused you. But they are not typical
> >> > reader/writer problems, but more about data integrities.
> >>
> >> No, you didn't :)
> >> Actually, you make me clear about the mixed roles for
> >> those bits.
> >>
> >> >
> >> > Pages have to be "not under writeback" when truncated.
> >> > Otherwise data lost is possible:
> >> >
> >> > 1) create a file with one page (page A)
> >> > 2) truncate page A that is under writeback
> >> > 3) write to file, which creates page B
> >> > 4) sync file, which sends page B to disk quickly
> >> >
> >> > Now if page B reaches disk before A, the new data will be overwritten
> >> > by truncated old data, which corrupts the file.
> >>
> >> I fully understand this scenario which you had already clarified in a
> >> previous message. :)
> >>
> >> 1. someone make index1-> page A
> >> 2. Path P1 is acting as a *reader* to a cache page at index1 by
> >> A  A  setting PG_writeback on, while at the same time as a *writer* to
> >> A  A  the corresponding file blocks.
> >> 3. Another path P2 comes in and A truncate page A, he is the writer
> >> A  A  to the same cache page.
> >> 4. Yet another path P3 comes A as the writer to the cache page
> >> A  A  A making it points to page B: index1--> page B.
> >> 5. Path P4 comes writing back the cache page(and set PG_writeback).
> >> A  A He is the reader of the cache page and the writer to the file blocks.
> >>
> >> The corrupts occur because P1 & P4 races when writing file blocks.
> >> But the _root_ of this racing is because nothing is used to serialize
> >> them on the side of writing the file blocks and above stream reading was
> >> inconsistent because of the writers(P2 & P3) to cache page at index1.
> >>
> >> Note that the "sync file" is somewhat irrelevant, even without "sync file",
> >> the racing still may exists. I know you must want to show me that this could
> >> make the corruption more easy to occur.
> >>
> >> So I think the simple logic is:
> >> 1) if you want to truncate/change the mapping from a index to a struct *page,
> >> test writeback bit because the writebacker to the file blocks is the reader
> >> of this mapping.
> >> 2) if a writebacker want to start a read of this mapping with
> >> test_set_page_writeback()
> >> or set_page_writeback(), he'd be sure this page is locked to keep out the
> >> writers to this mapping of index-->struct *page.
> >>
> >> This is really behavior of a read/write lock, right ?
> >
> > Please, that's a dangerous idea. A page can be written to at any time
> > when writeback to disk is under way. Does PG_writeback (your reader
> > lock) prevents page data writers? A NO.
> 
> I meant PG_writeback stops writers to index---->struct page mapping.

It's protected by the radix tree RCU locks. Period.

If you are referring to the reverse mapping: page->mapping is procted
by PG_lock. No one should make assumption that it won't change under
page writeback.

Thanks,
Fengguang

> I think I should make my statements more concise and the "reader/writer"
> less vague.
> 
> Here we care about the write/read operation for index---->struct page mapping.
> Not for read/write operation for the page content.
> 
> Anyone who wants to change this mapping  is a writer, he should take
> page lock.
> Anyone who wants to reference this mapping is a reader, writers should
> wait for him. And when this reader wants to get ref, he should wait for
> anyone one who is changing this mapping(e.g. page truncater).
> 
> When a path sets PG_writeback on a page, it need this index-->struct page
> mapping be 100% valid right? (otherwise may leads to corruption.)
> So writeback routines are readers of this index-->struct page mapping.
> (oh, well if we can put the other role of PG_writeback aside)
> 
> Ok,Ok, since PG_locked does mean much more than just protecting
> the per-page mapping which makes the lock abstraction even less clear.
> so indeed, forget about it.
> 
> >
> > Thanks,
> > Fengguang
> >
> >> wait_on_page_writeback_range() looks different only because "sync"
> >> operates on "struct page", it's not sensitive to index-->struct *page mapping.
> >> It does care about if pages returned by pagevec_lookup_tag() are
> >> still maintains the mapping when wait_on_page_writeback(page).
> >> Here, PG_writeback is only a status flag for "struct page" not a lock bit for
> >> index->struct *page mapping.
> >>
> >> >
> >> >> So do you think the idea is sane to group the two bits together
> >> >> to form a real read/write lock, which does not care about the _number_
> >> >> of readers ?
> >> >
> >> > We don't care number of readers here. So please forget about it.
> >> Yeah, I meant number of readers is not important.
> >>
> >> I still hold that these two bits in some way act like a _sparse_
> >> read/write lock.
> >> But I am going to drop the idea of making them a pure lock, since PG_writeback
> >> does has other meaning -- the page is being writing back: for sync
> >> path, it's only
> >> a status flag.
> >> Making a pure read/write lock definitely will lose that or at least distort it.
> >>
> >>
> >> I hope I've made myself understandable; correct me if I'm wrong, and
> >> many thanks for your time and patience. :-)
> >>
> >>
> >> Nai Xia
> >>
> >> >
> >> > Thanks,
> >> > Fengguang
> >> >
> >> >> > The writeback bit is _widely_ used. A test_set_page_writeback() is
> >> >> > directly used by NFS/AFS etc. But its main user is in fact
> >> >> > set_page_writeback(), which is called in 26 places.
> >> >> >
> >> >> >> > think would be safest
> >> >> >>
> >> >> >> Okay. I'll just add it after the page lock.
> >> >> >>
> >> >> >> > (then you never have to bother with the writeback bit again)
> >> >> >>
> >> >> >> Until Fengguang does something fancy with it.
> >> >> >
> >> >> > Yes I'm going to do it without wait_on_page_writeback().
> >> >> >
> >> >> > The reason truncate_inode_pages_range() has to wait on writeback pages
> >> >> > is to ensure data integrity. Otherwise two events may come in:
> >> >> >         truncate page A at offset X
> >> >> >         populate page B at offset X
> >> >> > If A and B are both writeback pages, then B can hit the disk first and then
> >> >> > be overwritten by A, which corrupts the data at offset X from the user's POV.
> >> >> >
> >> >> > But for hwpoison, there are no such worries. If A is poisoned, we do
> >> >> > our best to isolate it as well as to intercept its IO. If the interception
> >> >> > fails, it will trigger another machine check before hitting the disk.
> >> >> >
> >> >> > After all, a poisoned A means the data at offset X is already corrupted.
> >> >> > It doesn't matter if another B page comes along.
> >> >> >
> >> >> > Thanks,
> >> >> > Fengguang
> >> >> > --
> >> >> > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> >> >> > the body of a message to majordomo@vger.kernel.org
> >> >> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >> >> > Please read the FAQ at  http://www.tux.org/lkml/
> >> >> >
> >> >
> >
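The truncate/writeback ordering hazard Fengguang describes above can be sketched as a toy model. This is purely illustrative Python, not kernel code; the names (submit_write, the in-memory disk dict) are invented stand-ins for asynchronous block I/O:

```python
# Toy model (not kernel code) of the hazard described above: page A at
# offset X is still under writeback when it is truncated, page B is then
# populated at the same offset, and B's write reaches "disk" before the
# stale, in-flight write of A does.

disk = {}  # offset -> data on stable storage
X = 0

def submit_write(offset, data):
    # Queue an asynchronous write; the returned callback models I/O completion.
    def complete():
        disk[offset] = data
    return complete

# Page A at offset X is already under writeback when truncate runs.
complete_A = submit_write(X, "stale data from page A")

# If truncate did NOT wait for writeback, page B could be populated and
# written out at the same offset while A's I/O is still in flight.
complete_B = submit_write(X, "new data from page B")

# I/O completes out of order: B first, then the delayed write of A.
complete_B()
complete_A()

# From the user's point of view, offset X now holds stale (corrupt) data.
assert disk[X] == "stale data from page A"
```

This is the integrity argument for why truncate normally waits on writeback, and why hwpoison can skip the wait: a poisoned page means the data at that offset is already lost.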

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 232+ messages in thread

* Re: [PATCH] [13/16] HWPOISON: The high level memory error handler in  the VM v3
  2009-06-09  6:48                         ` Wu Fengguang
@ 2009-06-09 10:48                           ` Nick Piggin
  -1 siblings, 0 replies; 232+ messages in thread
From: Nick Piggin @ 2009-06-09 10:48 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Nai Xia, Andi Kleen, hugh, riel, akpm, chris.mason, linux-kernel,
	linux-mm

On Tue, Jun 09, 2009 at 02:48:55PM +0800, Wu Fengguang wrote:
> On Mon, Jun 08, 2009 at 10:46:53PM +0800, Nai Xia wrote:
> > I meant PG_writeback stops writers to index---->struct page mapping.
> 
> It's protected by the radix tree RCU locks. Period.
> 
> If you are referring to the reverse mapping: page->mapping is protected
> by PG_lock. No one should make the assumption that it won't change under
> page writeback.

Well... I think probably PG_writeback should be enough. Phrased another
way: I think it is a very bad idea to truncate PG_writeback pages out of
pagecache. Does anything actually do that?


* Re: [PATCH] [13/16] HWPOISON: The high level memory error handler in  the VM v3
  2009-06-09 10:48                           ` Nick Piggin
@ 2009-06-09 12:15                             ` Wu Fengguang
  -1 siblings, 0 replies; 232+ messages in thread
From: Wu Fengguang @ 2009-06-09 12:15 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Nai Xia, Andi Kleen, hugh, riel, akpm, chris.mason, linux-kernel,
	linux-mm

On Tue, Jun 09, 2009 at 06:48:25PM +0800, Nick Piggin wrote:
> On Tue, Jun 09, 2009 at 02:48:55PM +0800, Wu Fengguang wrote:
> > On Mon, Jun 08, 2009 at 10:46:53PM +0800, Nai Xia wrote:
> > > I meant PG_writeback stops writers to index---->struct page mapping.
> > 
> > It's protected by the radix tree RCU locks. Period.
> > 
> > If you are referring to the reverse mapping: page->mapping is protected
> > by PG_lock. No one should make the assumption that it won't change under
> > page writeback.
> 
> Well... I think probably PG_writeback should be enough. Phrased another
> way: I think it is a very bad idea to truncate PG_writeback pages out of
> pagecache. Does anything actually do that?

There shall be no one. OK, I will follow that convention.

But as I stated, it is only safe to rely on the fact "no one truncates
PG_writeback pages" in end_writeback_io handlers. And I suspect that if
such a handler does exist, it could be trivially converted to
take the page lock.

Thanks,
Fengguang


* Re: [PATCH] [13/16] HWPOISON: The high level memory error handler in  the VM v3
  2009-06-09 12:15                             ` Wu Fengguang
@ 2009-06-09 12:17                               ` Nick Piggin
  -1 siblings, 0 replies; 232+ messages in thread
From: Nick Piggin @ 2009-06-09 12:17 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Nai Xia, Andi Kleen, hugh, riel, akpm, chris.mason, linux-kernel,
	linux-mm

On Tue, Jun 09, 2009 at 08:15:10PM +0800, Wu Fengguang wrote:
> On Tue, Jun 09, 2009 at 06:48:25PM +0800, Nick Piggin wrote:
> > On Tue, Jun 09, 2009 at 02:48:55PM +0800, Wu Fengguang wrote:
> > > On Mon, Jun 08, 2009 at 10:46:53PM +0800, Nai Xia wrote:
> > > > I meant PG_writeback stops writers to index---->struct page mapping.
> > > 
> > > It's protected by the radix tree RCU locks. Period.
> > > 
> > > If you are referring to the reverse mapping: page->mapping is protected
> > > by PG_lock. No one should make the assumption that it won't change under
> > > page writeback.
> > 
> > Well... I think probably PG_writeback should be enough. Phrased another
> > way: I think it is a very bad idea to truncate PG_writeback pages out of
> > pagecache. Does anything actually do that?
> 
> There shall be no one. OK, I will follow that convention.
> 
> But as I stated, it is only safe to rely on the fact "no one truncates
> PG_writeback pages" in end_writeback_io handlers. And I suspect if
> there does exist such a handler, it could be trivially converted to
> take the page lock.

Well, the writeback submitter first sets writeback, then unlocks
the page. I don't think he wants a truncate coming in at that point.
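The ordering Nick describes can be sketched with a small thread model. This is an illustrative simplification, assuming a page-lock plus writeback-flag discipline like the one discussed; the Python names are invented stand-ins for lock_page()/set_page_writeback()/wait_on_page_writeback(), not the real kernel implementations:

```python
import threading

# Simplified stand-ins for PG_locked and PG_writeback (illustration only).
class Page:
    def __init__(self):
        self.lock = threading.Lock()    # models PG_locked
        self.cond = threading.Condition()
        self.writeback = False          # models PG_writeback

def set_page_writeback(page):
    with page.cond:
        page.writeback = True

def end_page_writeback(page):
    with page.cond:
        page.writeback = False
        page.cond.notify_all()

def wait_on_page_writeback(page):
    with page.cond:
        while page.writeback:
            page.cond.wait()

events = []

def submitter(page):
    # The submitter sets writeback while holding the page lock, then unlocks;
    # the I/O itself proceeds asynchronously after the lock is dropped.
    with page.lock:
        set_page_writeback(page)
    events.append("io submitted")

def truncate(page):
    # Truncate takes the page lock and refuses to remove a page that is
    # still under writeback.
    with page.lock:
        wait_on_page_writeback(page)
        events.append("page removed")

page = Page()
submitter(page)

t = threading.Thread(target=truncate, args=(page,))
t.start()
end_page_writeback(page)   # models I/O completion
t.join()

assert events == ["io submitted", "page removed"]
```

With this discipline the truncate can never remove a page between writeback submission and I/O completion, which is the convention agreed on in the thread.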



* Re: [PATCH] [13/16] HWPOISON: The high level memory error handler in  the VM v3
  2009-06-09 12:17                               ` Nick Piggin
@ 2009-06-09 12:47                                 ` Wu Fengguang
  -1 siblings, 0 replies; 232+ messages in thread
From: Wu Fengguang @ 2009-06-09 12:47 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Nai Xia, Andi Kleen, hugh, riel, akpm, chris.mason, linux-kernel,
	linux-mm

On Tue, Jun 09, 2009 at 08:17:22PM +0800, Nick Piggin wrote:
> On Tue, Jun 09, 2009 at 08:15:10PM +0800, Wu Fengguang wrote:
> > On Tue, Jun 09, 2009 at 06:48:25PM +0800, Nick Piggin wrote:
> > > On Tue, Jun 09, 2009 at 02:48:55PM +0800, Wu Fengguang wrote:
> > > > On Mon, Jun 08, 2009 at 10:46:53PM +0800, Nai Xia wrote:
> > > > > I meant PG_writeback stops writers to index---->struct page mapping.
> > > > 
> > > > It's protected by the radix tree RCU locks. Period.
> > > > 
> > > > If you are referring to the reverse mapping: page->mapping is protected
> > > > by PG_lock. No one should make the assumption that it won't change under
> > > > page writeback.
> > > 
> > > Well... I think probably PG_writeback should be enough. Phrased another
> > > way: I think it is a very bad idea to truncate PG_writeback pages out of
> > > pagecache. Does anything actually do that?
> > 
> > There shall be no one. OK, I will follow that convention.
> > 
> > But as I stated, it is only safe to rely on the fact "no one truncates
> > PG_writeback pages" in end_writeback_io handlers. And I suspect if
> > there does exist such a handler, it could be trivially converted to
> > take the page lock.
> 
> Well, the writeback submitter first sets writeback, then unlocks
> the page. I don't think he wants a truncate coming in at that point.

OK. I think we've mostly agreed on the consequences of PG_writeback vs
truncation. I'll follow the least surprise principle and stop here, hehe.

Thanks,
Fengguang


* Re: [PATCH] [13/16] HWPOISON: The high level memory error handler in  the VM v3
  2009-06-09 12:47                                 ` Wu Fengguang
@ 2009-06-09 13:36                                   ` Nai Xia
  -1 siblings, 0 replies; 232+ messages in thread
From: Nai Xia @ 2009-06-09 13:36 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Nick Piggin, Andi Kleen, hugh, riel, akpm, chris.mason,
	linux-kernel, linux-mm

On Tue, Jun 9, 2009 at 8:47 PM, Wu Fengguang<fengguang.wu@intel.com> wrote:
> On Tue, Jun 09, 2009 at 08:17:22PM +0800, Nick Piggin wrote:
>> On Tue, Jun 09, 2009 at 08:15:10PM +0800, Wu Fengguang wrote:
>> > On Tue, Jun 09, 2009 at 06:48:25PM +0800, Nick Piggin wrote:
>> > > On Tue, Jun 09, 2009 at 02:48:55PM +0800, Wu Fengguang wrote:
>> > > > On Mon, Jun 08, 2009 at 10:46:53PM +0800, Nai Xia wrote:
>> > > > > I meant PG_writeback stops writers to index---->struct page mapping.
>> > > >
>> > > > It's protected by the radix tree RCU locks. Period.
>> > > >
>> > > > If you are referring to the reverse mapping: page->mapping is protected
>> > > > by PG_lock. No one should make the assumption that it won't change under
>> > > > page writeback.
>> > >
>> > > Well... I think probably PG_writeback should be enough. Phrased another
>> > > way: I think it is a very bad idea to truncate PG_writeback pages out of
>> > > pagecache. Does anything actually do that?
>> >
>> > There shall be no one. OK, I will follow that convention.
>> >
>> > But as I stated, it is only safe to rely on the fact "no one truncates
>> > PG_writeback pages" in end_writeback_io handlers. And I suspect if
>> > there does exist such a handler, it could be trivially converted to
>> > take the page lock.
>>
>> Well, the writeback submitter first sets writeback, then unlocks
>> the page. I don't think he wants a truncate coming in at that point.
>
> OK. I think we've mostly agreed on the consequences of PG_writeback vs
> truncation. I'll follow the least surprise principle and stop here, hehe.

And thank you both for your time & patience, :-)

Best Regards,
Nai Xia

>
> Thanks,
> Fengguang
>


end of thread, other threads:[~2009-06-09 13:36 UTC | newest]

Thread overview: 232+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-05-27 20:12 [PATCH] [0/16] HWPOISON: Intro Andi Kleen
2009-05-27 20:12 ` Andi Kleen
2009-05-27 20:12 ` [PATCH] [1/16] HWPOISON: Add page flag for poisoned pages Andi Kleen
2009-05-27 20:12   ` Andi Kleen
2009-05-27 20:35   ` Larry H.
2009-05-27 20:35     ` Larry H.
2009-05-27 21:15   ` Alan Cox
2009-05-27 21:15     ` Alan Cox
2009-05-28  7:54     ` Andi Kleen
2009-05-28  7:54       ` Andi Kleen
2009-05-29 16:10       ` Rik van Riel
2009-05-29 16:10         ` Rik van Riel
2009-05-29 16:37         ` Andi Kleen
2009-05-29 16:37           ` Andi Kleen
2009-05-29 16:34           ` Rik van Riel
2009-05-29 16:34             ` Rik van Riel
2009-05-29 18:24             ` Andi Kleen
2009-05-29 18:24               ` Andi Kleen
2009-05-29 18:26               ` Rik van Riel
2009-05-29 18:26                 ` Rik van Riel
2009-05-29 18:42                 ` Andi Kleen
2009-05-29 18:42                   ` Andi Kleen
2009-05-27 20:12 ` [PATCH] [2/16] HWPOISON: Export poison flag in /proc/kpageflags Andi Kleen
2009-05-27 20:12   ` Andi Kleen
2009-05-29 16:37   ` Rik van Riel
2009-05-29 16:37     ` Rik van Riel
2009-05-27 20:12 ` [PATCH] [3/16] HWPOISON: Export some rmap vma locking to outside world Andi Kleen
2009-05-27 20:12   ` Andi Kleen
2009-05-27 20:12 ` [PATCH] [4/16] HWPOISON: Add support for poison swap entries v2 Andi Kleen
2009-05-27 20:12   ` Andi Kleen
2009-05-28  8:46   ` Hidehiro Kawai
2009-05-28  8:46     ` Hidehiro Kawai
2009-05-28  9:11     ` Wu Fengguang
2009-05-28  9:11       ` Wu Fengguang
2009-05-28 10:42     ` Andi Kleen
2009-05-28 10:42       ` Andi Kleen
2009-05-27 20:12 ` [PATCH] [5/16] HWPOISON: Add new SIGBUS error codes for hardware poison signals Andi Kleen
2009-05-27 20:12   ` Andi Kleen
2009-05-27 20:12 ` [PATCH] [6/16] HWPOISON: Add basic support for poisoned pages in fault handler v2 Andi Kleen
2009-05-27 20:12   ` Andi Kleen
2009-05-29  4:15   ` Hidehiro Kawai
2009-05-29  4:15     ` Hidehiro Kawai
2009-05-29  6:28     ` Andi Kleen
2009-05-29  6:28       ` Andi Kleen
2009-05-27 20:12 ` [PATCH] [7/16] HWPOISON: Add various poison checks in mm/memory.c Andi Kleen
2009-05-27 20:12   ` Andi Kleen
2009-05-27 20:12 ` [PATCH] [8/16] HWPOISON: x86: Add VM_FAULT_HWPOISON handling to x86 page fault handler Andi Kleen
2009-05-27 20:12   ` Andi Kleen
2009-05-27 20:12 ` [PATCH] [9/16] HWPOISON: Use bitmask/action code for try_to_unmap behaviour Andi Kleen
2009-05-27 20:12   ` Andi Kleen
2009-05-28  7:27   ` Nick Piggin
2009-05-28  7:27     ` Nick Piggin
2009-05-28  8:03     ` Andi Kleen
2009-05-28  8:03       ` Andi Kleen
2009-05-28  8:28       ` Nick Piggin
2009-05-28  8:28         ` Nick Piggin
2009-05-28  9:02         ` Andi Kleen
2009-05-28  9:02           ` Andi Kleen
2009-05-28 12:26           ` Nick Piggin
2009-05-28 12:26             ` Nick Piggin
2009-05-27 20:12 ` [PATCH] [10/16] HWPOISON: Handle hardware poisoned pages in try_to_unmap Andi Kleen
2009-05-27 20:12   ` Andi Kleen
2009-05-27 20:12 ` [PATCH] [11/16] HWPOISON: Handle poisoned pages in set_page_dirty() Andi Kleen
2009-05-27 20:12   ` Andi Kleen
2009-05-27 20:12 ` [PATCH] [12/16] HWPOISON: check and isolate corrupted free pages Andi Kleen
2009-05-27 20:12   ` Andi Kleen
2009-05-27 20:12 ` [PATCH] [13/16] HWPOISON: The high level memory error handler in the VM v3 Andi Kleen
2009-05-27 20:12   ` Andi Kleen
2009-05-28  8:26   ` Nick Piggin
2009-05-28  8:26     ` Nick Piggin
2009-05-28  9:31     ` Andi Kleen
2009-05-28  9:31       ` Andi Kleen
2009-05-28 12:08       ` Nick Piggin
2009-05-28 12:08         ` Nick Piggin
2009-05-28 13:45         ` Andi Kleen
2009-05-28 13:45           ` Andi Kleen
2009-05-28 14:50           ` Wu Fengguang
2009-05-28 14:50             ` Wu Fengguang
2009-06-04  6:25             ` Nai Xia
2009-06-04  6:25               ` Nai Xia
2009-06-07 16:02               ` Wu Fengguang
2009-06-07 16:02                 ` Wu Fengguang
2009-06-08 11:06                 ` Nai Xia
2009-06-08 11:06                   ` Nai Xia
2009-06-08 12:31                   ` Wu Fengguang
2009-06-08 12:31                     ` Wu Fengguang
2009-06-08 14:46                     ` Nai Xia
2009-06-08 14:46                       ` Nai Xia
2009-06-09  6:48                       ` Wu Fengguang
2009-06-09  6:48                         ` Wu Fengguang
2009-06-09 10:48                         ` Nick Piggin
2009-06-09 10:48                           ` Nick Piggin
2009-06-09 12:15                           ` Wu Fengguang
2009-06-09 12:15                             ` Wu Fengguang
2009-06-09 12:17                             ` Nick Piggin
2009-06-09 12:17                               ` Nick Piggin
2009-06-09 12:47                               ` Wu Fengguang
2009-06-09 12:47                                 ` Wu Fengguang
2009-06-09 13:36                                 ` Nai Xia
2009-06-09 13:36                                   ` Nai Xia
2009-05-28 16:56           ` Russ Anderson
2009-05-28 16:56             ` Russ Anderson
2009-05-30  6:42             ` Andi Kleen
2009-05-30  6:42               ` Andi Kleen
2009-06-01 11:39               ` Nick Piggin
2009-06-01 11:39                 ` Nick Piggin
2009-06-01 18:19                 ` Andi Kleen
2009-06-01 18:19                   ` Andi Kleen
2009-06-01 12:05           ` Nick Piggin
2009-06-01 12:05             ` Nick Piggin
2009-06-01 18:51             ` Andi Kleen
2009-06-01 18:51               ` Andi Kleen
2009-06-02 12:10               ` Nick Piggin
2009-06-02 12:10                 ` Nick Piggin
2009-06-02 12:34                 ` Andi Kleen
2009-06-02 12:34                   ` Andi Kleen
2009-06-02 12:37                   ` Nick Piggin
2009-06-02 12:37                     ` Nick Piggin
2009-06-02 12:55                     ` Andi Kleen
2009-06-02 12:55                       ` Andi Kleen
2009-06-02 13:03                       ` Nick Piggin
2009-06-02 13:03                         ` Nick Piggin
2009-06-02 13:20                         ` Andi Kleen
2009-06-02 13:20                           ` Andi Kleen
2009-06-02 13:19                           ` Nick Piggin
2009-06-02 13:19                             ` Nick Piggin
2009-06-02 13:46                             ` Andi Kleen
2009-06-02 13:46                               ` Andi Kleen
2009-06-02 13:47                               ` Nick Piggin
2009-06-02 13:47                                 ` Nick Piggin
2009-06-02 14:05                                 ` Andi Kleen
2009-06-02 14:05                                   ` Andi Kleen
2009-06-02 13:30                     ` Wu Fengguang
2009-06-02 13:30                       ` Wu Fengguang
2009-06-02 14:07                       ` Nick Piggin
2009-06-02 14:07                         ` Nick Piggin
2009-05-28  9:59     ` Wu Fengguang
2009-05-28  9:59       ` Wu Fengguang
2009-05-28 10:11       ` Andi Kleen
2009-05-28 10:11         ` Andi Kleen
2009-05-28 10:33         ` Wu Fengguang
2009-05-28 10:33           ` Wu Fengguang
2009-05-28 10:51           ` Andi Kleen
2009-05-28 10:51             ` Andi Kleen
2009-05-28 11:03             ` Wu Fengguang
2009-05-28 11:03               ` Wu Fengguang
2009-05-28 12:15             ` Nick Piggin
2009-05-28 12:15               ` Nick Piggin
2009-05-28 13:48               ` Andi Kleen
2009-05-28 13:48                 ` Andi Kleen
2009-05-28 12:23       ` Nick Piggin
2009-05-28 12:23         ` Nick Piggin
2009-05-28 13:54         ` Wu Fengguang
2009-05-28 13:54           ` Wu Fengguang
2009-06-01 11:50           ` Nick Piggin
2009-06-01 11:50             ` Nick Piggin
2009-06-01 14:05             ` Wu Fengguang
2009-06-01 14:05               ` Wu Fengguang
2009-06-01 14:40               ` Nick Piggin
2009-06-01 14:40                 ` Nick Piggin
2009-06-02 11:14                 ` Wu Fengguang
2009-06-02 11:14                   ` Wu Fengguang
2009-06-02 12:19                   ` Nick Piggin
2009-06-02 12:19                     ` Nick Piggin
2009-06-02 12:51                     ` Wu Fengguang
2009-06-02 12:51                       ` Wu Fengguang
2009-06-02 14:33                       ` Nick Piggin
2009-06-02 14:33                         ` Nick Piggin
2009-06-03 10:21                       ` Jens Axboe
2009-06-03 10:21                         ` Jens Axboe
2009-06-01 21:11               ` Hugh Dickins
2009-06-01 21:11                 ` Hugh Dickins
2009-06-01 21:41                 ` Andi Kleen
2009-06-01 21:41                   ` Andi Kleen
2009-06-01 18:32             ` Andi Kleen
2009-06-01 18:32               ` Andi Kleen
2009-06-02 12:00               ` Nick Piggin
2009-06-02 12:00                 ` Nick Piggin
2009-06-02 12:47                 ` Andi Kleen
2009-06-02 12:47                   ` Andi Kleen
2009-06-02 12:57                   ` Nick Piggin
2009-06-02 12:57                     ` Nick Piggin
2009-06-02 13:25                     ` Andi Kleen
2009-06-02 13:25                       ` Andi Kleen
2009-06-02 13:24                       ` Nick Piggin
2009-06-02 13:24                         ` Nick Piggin
2009-06-02 13:41                         ` Andi Kleen
2009-06-02 13:41                           ` Andi Kleen
2009-06-02 13:40                           ` Nick Piggin
2009-06-02 13:40                             ` Nick Piggin
2009-06-02 13:53                           ` Wu Fengguang
2009-06-02 13:53                             ` Wu Fengguang
2009-06-02 14:06                             ` Andi Kleen
2009-06-02 14:06                               ` Andi Kleen
2009-06-02 14:12                               ` Wu Fengguang
2009-06-02 14:12                                 ` Wu Fengguang
2009-06-02 14:21                                 ` Nick Piggin
2009-06-02 14:21                                   ` Nick Piggin
2009-06-02 13:46                     ` Wu Fengguang
2009-06-02 13:46                       ` Wu Fengguang
2009-06-02 14:08                       ` Andi Kleen
2009-06-02 14:08                         ` Andi Kleen
2009-06-02 14:10                         ` Wu Fengguang
2009-06-02 14:10                           ` Wu Fengguang
2009-06-02 14:14                           ` Nick Piggin
2009-06-02 14:14                             ` Nick Piggin
2009-06-02 15:17                       ` Nick Piggin
2009-06-02 15:17                         ` Nick Piggin
2009-06-02 17:27                         ` Andi Kleen
2009-06-02 17:27                           ` Andi Kleen
2009-06-03  9:35                           ` Nick Piggin
2009-06-03  9:35                             ` Nick Piggin
2009-06-03 11:24                             ` Andi Kleen
2009-06-03 11:24                               ` Andi Kleen
2009-06-02 13:02                   ` Wu Fengguang
2009-06-02 13:02                     ` Wu Fengguang
2009-06-02 15:09                   ` Nick Piggin
2009-06-02 15:09                     ` Nick Piggin
2009-06-02 17:19                     ` Andi Kleen
2009-06-02 17:19                       ` Andi Kleen
2009-06-03  6:24                       ` Nick Piggin
2009-06-03  6:24                         ` Nick Piggin
2009-06-03 15:51               ` Wu Fengguang
2009-06-03 15:51                 ` Wu Fengguang
2009-06-03 16:05                 ` Andi Kleen
2009-06-03 16:05                   ` Andi Kleen
2009-05-27 20:12 ` [PATCH] [14/16] HWPOISON: FOR TESTING: Enable memory failure code unconditionally Andi Kleen
2009-05-27 20:12   ` Andi Kleen
2009-05-27 20:12 ` [PATCH] [15/16] HWPOISON: Add madvise() based injector for hardware poisoned pages v3 Andi Kleen
2009-05-27 20:12   ` Andi Kleen
2009-05-27 20:12 ` [PATCH] [16/16] HWPOISON: Add simple debugfs interface to inject hwpoison on arbitary PFNs Andi Kleen
2009-05-27 20:12   ` Andi Kleen
