[PATCH 0/5] [RFC] HWPOISON incremental fixes

linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH 0/5] [RFC] HWPOISON incremental fixes
@ 2009-06-11 14:22 Wu Fengguang
  2009-06-11 14:22 ` [PATCH 1/5] HWPOISON: define VM_FAULT_HWPOISON to 0 when feature is disabled Wu Fengguang
                   ` (5 more replies)
  0 siblings, 6 replies; 42+ messages in thread
From: Wu Fengguang @ 2009-06-11 14:22 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, Hugh Dickins, Nick Piggin, Andi Kleen, riel, chris.mason,
	linux-mm, Wu, Fengguang

Hi all,

Here are the hwpoison fixes that aims to address Nick and Hugh's concerns.
Note that
- the early kill option is dropped for .31. It's obscure option and complex
  code and is not must have for .31. Maybe Andi also aims this option for
  notifying KVM, but right now KVM is not ready to handle that.
- It seems that even fsync() processes are not easy to catch, so I abandoned
  the SIGKILL on fsync() idea. Instead, I choose to fail any attempt to
  populate the poisoned file with new pages, so that the corrupted page offset
  won't be repopulated with outdated data. This seems to be a safe way to allow
  the process to continue running while still be able to promise good (but not
  complete) data consistency.
- I didn't implement the PANIC-on-corrupted-data option. Instead, I guess
  sending uevent notification to user space will be a more flexible scheme?

Thanks,
Fengguang
-- 


^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH 1/5] HWPOISON: define VM_FAULT_HWPOISON to 0 when feature is disabled
  2009-06-11 14:22 [PATCH 0/5] [RFC] HWPOISON incremental fixes Wu Fengguang
@ 2009-06-11 14:22 ` Wu Fengguang
  2009-06-11 15:44   ` Rik van Riel
                     ` (2 more replies)
  2009-06-11 14:22 ` [PATCH 2/5] HWPOISON: fix tasklist_lock/anon_vma locking order Wu Fengguang
                   ` (4 subsequent siblings)
  5 siblings, 3 replies; 42+ messages in thread
From: Wu Fengguang @ 2009-06-11 14:22 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, Nick Piggin, Wu Fengguang, Hugh Dickins, Andi Kleen, riel,
	chris.mason, linux-mm

[-- Attachment #1: hwpoison-remove-ifdef.patch --]
[-- Type: text/plain, Size: 1480 bytes --]

So as to eliminate one #ifdef in the c source.

Proposed by Nick Piggin.

CC: Nick Piggin <npiggin@suse.de>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 arch/x86/mm/fault.c |    3 +--
 include/linux/mm.h  |    7 ++++++-
 2 files changed, 7 insertions(+), 3 deletions(-)

--- sound-2.6.orig/arch/x86/mm/fault.c
+++ sound-2.6/arch/x86/mm/fault.c
@@ -819,14 +819,13 @@ do_sigbus(struct pt_regs *regs, unsigned
 	tsk->thread.error_code	= error_code;
 	tsk->thread.trap_no	= 14;
 
-#ifdef CONFIG_MEMORY_FAILURE
 	if (fault & VM_FAULT_HWPOISON) {
 		printk(KERN_ERR
 	"MCE: Killing %s:%d due to hardware memory corruption fault at %lx\n",
 			tsk->comm, tsk->pid, address);
 		code = BUS_MCEERR_AR;
 	}
-#endif
+
 	force_sig_info_fault(SIGBUS, code, address, tsk);
 }
 
--- sound-2.6.orig/include/linux/mm.h
+++ sound-2.6/include/linux/mm.h
@@ -702,11 +702,16 @@ static inline int page_mapped(struct pag
 #define VM_FAULT_SIGBUS	0x0002
 #define VM_FAULT_MAJOR	0x0004
 #define VM_FAULT_WRITE	0x0008	/* Special case for get_user_pages */
-#define VM_FAULT_HWPOISON 0x0010	/* Hit poisoned page */
 
 #define VM_FAULT_NOPAGE	0x0100	/* ->fault installed the pte, not return page */
 #define VM_FAULT_LOCKED	0x0200	/* ->fault locked the returned page */
 
+#ifdef CONFIG_MEMORY_FAILURE
+#define VM_FAULT_HWPOISON 0x0010	/* Hit poisoned page */
+#else
+#define VM_FAULT_HWPOISON 0
+#endif
+
 #define VM_FAULT_ERROR	(VM_FAULT_OOM | VM_FAULT_SIGBUS | VM_FAULT_HWPOISON)
 
 /*

-- 


^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH 2/5] HWPOISON: fix tasklist_lock/anon_vma locking order
  2009-06-11 14:22 [PATCH 0/5] [RFC] HWPOISON incremental fixes Wu Fengguang
  2009-06-11 14:22 ` [PATCH 1/5] HWPOISON: define VM_FAULT_HWPOISON to 0 when feature is disabled Wu Fengguang
@ 2009-06-11 14:22 ` Wu Fengguang
  2009-06-11 15:59   ` Rik van Riel
  2009-06-12 10:03   ` Andi Kleen
  2009-06-11 14:22 ` [PATCH 3/5] HWPOISON: remove early kill option for now Wu Fengguang
                   ` (3 subsequent siblings)
  5 siblings, 2 replies; 42+ messages in thread
From: Wu Fengguang @ 2009-06-11 14:22 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, Nick Piggin, Wu Fengguang, Hugh Dickins, Andi Kleen, riel,
	chris.mason, linux-mm

[-- Attachment #1: hwpoison-lock-order.patch --]
[-- Type: text/plain, Size: 1331 bytes --]

To avoid possible deadlock. Proposed by Nick Piggin:

  You have tasklist_lock(R) nesting outside i_mmap_lock, and inside anon_vma
  lock. And anon_vma lock nests inside i_mmap_lock.

  This seems fragile. If rwlocks ever become FIFO or tasklist_lock changes
  type (maybe -rt kernels do it), then you could have a task holding
  anon_vma lock and waiting for tasklist_lock, and another holding tasklist
  lock and waiting for i_mmap_lock, and another holding i_mmap_lock and
  waiting for anon_vma lock.

CC: Nick Piggin <npiggin@suse.de>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/memory-failure.c |    9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

--- sound-2.6.orig/mm/memory-failure.c
+++ sound-2.6/mm/memory-failure.c
@@ -215,12 +215,14 @@ static void collect_procs_anon(struct pa
 {
 	struct vm_area_struct *vma;
 	struct task_struct *tsk;
-	struct anon_vma *av = page_lock_anon_vma(page);
+	struct anon_vma *av;
 
+	read_lock(&tasklist_lock);
+
+	av = page_lock_anon_vma(page);
 	if (av == NULL) /* Not actually mapped anymore */
-		return;
+		goto out;
 
-	read_lock(&tasklist_lock);
 	for_each_process (tsk) {
 		if (!tsk->mm)
 			continue;
@@ -230,6 +232,7 @@ static void collect_procs_anon(struct pa
 		}
 	}
 	page_unlock_anon_vma(av);
+out:
 	read_unlock(&tasklist_lock);
 }
 

-- 


^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH 3/5] HWPOISON: remove early kill option for now
  2009-06-11 14:22 [PATCH 0/5] [RFC] HWPOISON incremental fixes Wu Fengguang
  2009-06-11 14:22 ` [PATCH 1/5] HWPOISON: define VM_FAULT_HWPOISON to 0 when feature is disabled Wu Fengguang
  2009-06-11 14:22 ` [PATCH 2/5] HWPOISON: fix tasklist_lock/anon_vma locking order Wu Fengguang
@ 2009-06-11 14:22 ` Wu Fengguang
  2009-06-11 16:06   ` Rik van Riel
  2009-06-12  9:59   ` Andi Kleen
  2009-06-11 14:22 ` [PATCH 4/5] HWPOISON: report sticky EIO for poisoned file Wu Fengguang
                   ` (2 subsequent siblings)
  5 siblings, 2 replies; 42+ messages in thread
From: Wu Fengguang @ 2009-06-11 14:22 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, Nick Piggin, Hugh Dickins, Wu Fengguang, Andi Kleen, riel,
	chris.mason, linux-mm

[-- Attachment #1: hwpoison-remove-early-kill.patch --]
[-- Type: text/plain, Size: 14475 bytes --]

It needs more thoughts, and is not a must have for .31.

CC: Nick Piggin <npiggin@suse.de>
CC: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 Documentation/sysctl/vm.txt |   28 ---
 include/linux/mm.h          |    1 
 include/linux/rmap.h        |    6 
 kernel/sysctl.c             |   13 -
 mm/filemap.c                |    4 
 mm/memory-failure.c         |  272 ----------------------------------
 mm/rmap.c                   |    8 -
 7 files changed, 3 insertions(+), 329 deletions(-)

--- sound-2.6.orig/mm/memory-failure.c
+++ sound-2.6/mm/memory-failure.c
@@ -48,251 +48,9 @@
 #include <linux/backing-dev.h>
 #include "internal.h"
 
-int sysctl_memory_failure_early_kill __read_mostly = 1;
-
 atomic_long_t mce_bad_pages __read_mostly = ATOMIC_LONG_INIT(0);
 
 /*
- * Send all the processes who have the page mapped an ``action optional''
- * signal.
- */
-static int kill_proc_ao(struct task_struct *t, unsigned long addr, int trapno,
-			unsigned long pfn)
-{
-	struct siginfo si;
-	int ret;
-
-	printk(KERN_ERR
-	       "MCE %#lx: Killing %s:%d early due to hardware memory corruption\n",
-	       pfn, t->comm, t->pid);
-	si.si_signo = SIGBUS;
-	si.si_errno = 0;
-	si.si_code = BUS_MCEERR_AO;
-	si.si_addr = (void *)addr;
-#ifdef __ARCH_SI_TRAPNO
-	si.si_trapno = trapno;
-#endif
-	si.si_addr_lsb = PAGE_SHIFT;
-	/*
-	 * Don't use force here, it's convenient if the signal
-	 * can be temporarily blocked.
-	 * This could cause a loop when the user sets SIGBUS
-	 * to SIG_IGN, but hopefully noone will do that?
-	 */
-	ret = send_sig_info(SIGBUS, &si, t);  /* synchronous? */
-	if (ret < 0)
-		printk(KERN_INFO "MCE: Error sending signal to %s:%d: %d\n",
-		       t->comm, t->pid, ret);
-	return ret;
-}
-
-/*
- * Kill all processes that have a poisoned page mapped and then isolate
- * the page.
- *
- * General strategy:
- * Find all processes having the page mapped and kill them.
- * But we keep a page reference around so that the page is not
- * actually freed yet.
- * Then stash the page away
- *
- * There's no convenient way to get back to mapped processes
- * from the VMAs. So do a brute-force search over all
- * running processes.
- *
- * Remember that machine checks are not common (or rather
- * if they are common you have other problems), so this shouldn't
- * be a performance issue.
- *
- * Also there are some races possible while we get from the
- * error detection to actually handle it.
- */
-
-struct to_kill {
-	struct list_head nd;
-	struct task_struct *tsk;
-	unsigned long addr;
-	unsigned addr_valid:1;
-};
-
-/*
- * Failure handling: if we can't find or can't kill a process there's
- * not much we can do. We just print a message and ignore otherwise.
- */
-
-/*
- * Schedule a process for later kill.
- * Uses GFP_ATOMIC allocations to avoid potential recursions in the VM.
- * TBD would GFP_NOIO be enough?
- */
-static void add_to_kill(struct task_struct *tsk, struct page *p,
-			struct vm_area_struct *vma,
-			struct list_head *to_kill,
-			struct to_kill **tkc)
-{
-	struct to_kill *tk;
-
-	if (*tkc) {
-		tk = *tkc;
-		*tkc = NULL;
-	} else {
-		tk = kmalloc(sizeof(struct to_kill), GFP_ATOMIC);
-		if (!tk) {
-			printk(KERN_ERR
-		"MCE: Out of memory while machine check handling\n");
-			return;
-		}
-	}
-	tk->addr = page_address_in_vma(p, vma);
-	tk->addr_valid = 1;
-
-	/*
-	 * In theory we don't have to kill when the page was
-	 * munmaped. But it could be also a mremap. Since that's
-	 * likely very rare kill anyways just out of paranoia, but use
-	 * a SIGKILL because the error is not contained anymore.
-	 */
-	if (tk->addr == -EFAULT) {
-		pr_debug("MCE: Unable to find user space address %lx in %s\n",
-			 page_to_pfn(p), tsk->comm);
-		tk->addr_valid = 0;
-	}
-	get_task_struct(tsk);
-	tk->tsk = tsk;
-	list_add_tail(&tk->nd, to_kill);
-}
-
-/*
- * Kill the processes that have been collected earlier.
- *
- * Only do anything when DOIT is set, otherwise just free the list
- * (this is used for clean pages which do not need killing)
- * Also when FAIL is set do a force kill because something went
- * wrong earlier.
- */
-static void kill_procs_ao(struct list_head *to_kill, int doit, int trapno,
-			  int fail, unsigned long pfn)
-{
-	struct to_kill *tk, *next;
-
-	list_for_each_entry_safe (tk, next, to_kill, nd) {
-		if (doit) {
-			/*
-			 * In case something went wrong with munmaping
-			 * make sure the process doesn't catch the
-			 * signal and then access the memory. Just kill it.
-			 * the signal handlers
-			 */
-			if (fail || tk->addr_valid == 0) {
-				printk(KERN_ERR
-		"MCE %#lx: forcibly killing %s:%d because of failure to unmap corrupted page\n",
-					pfn, tk->tsk->comm, tk->tsk->pid);
-				force_sig(SIGKILL, tk->tsk);
-			}
-
-			/*
-			 * In theory the process could have mapped
-			 * something else on the address in-between. We could
-			 * check for that, but we need to tell the
-			 * process anyways.
-			 */
-			else if (kill_proc_ao(tk->tsk, tk->addr, trapno,
-					      pfn) < 0)
-				printk(KERN_ERR
-		"MCE %#lx: Cannot send advisory machine check signal to %s:%d\n",
-					pfn, tk->tsk->comm, tk->tsk->pid);
-		}
-		put_task_struct(tk->tsk);
-		kfree(tk);
-	}
-}
-
-/*
- * Collect processes when the error hit an anonymous page.
- */
-static void collect_procs_anon(struct page *page, struct list_head *to_kill,
-			       struct to_kill **tkc)
-{
-	struct vm_area_struct *vma;
-	struct task_struct *tsk;
-	struct anon_vma *av;
-
-	read_lock(&tasklist_lock);
-
-	av = page_lock_anon_vma(page);
-	if (av == NULL) /* Not actually mapped anymore */
-		goto out;
-
-	for_each_process (tsk) {
-		if (!tsk->mm)
-			continue;
-		list_for_each_entry (vma, &av->head, anon_vma_node) {
-			if (vma->vm_mm == tsk->mm)
-				add_to_kill(tsk, page, vma, to_kill, tkc);
-		}
-	}
-	page_unlock_anon_vma(av);
-out:
-	read_unlock(&tasklist_lock);
-}
-
-/*
- * Collect processes when the error hit a file mapped page.
- */
-static void collect_procs_file(struct page *page, struct list_head *to_kill,
-			       struct to_kill **tkc)
-{
-	struct vm_area_struct *vma;
-	struct task_struct *tsk;
-	struct prio_tree_iter iter;
-	struct address_space *mapping = page_mapping(page);
-
-	/*
-	 * A note on the locking order between the two locks.
-	 * We don't rely on this particular order.
-	 * If you have some other code that needs a different order
-	 * feel free to switch them around. Or add a reverse link
-	 * from mm_struct to task_struct, then this could be all
-	 * done without taking tasklist_lock and looping over all tasks.
-	 */
-
-	read_lock(&tasklist_lock);
-	spin_lock(&mapping->i_mmap_lock);
-	for_each_process(tsk) {
-		pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
-
-		if (!tsk->mm)
-			continue;
-
-		vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff,
-				     pgoff)
-			if (vma->vm_mm == tsk->mm)
-				add_to_kill(tsk, page, vma, to_kill, tkc);
-	}
-	spin_unlock(&mapping->i_mmap_lock);
-	read_unlock(&tasklist_lock);
-}
-
-/*
- * Collect the processes who have the corrupted page mapped to kill.
- * This is done in two steps for locking reasons.
- * First preallocate one tokill structure outside the spin locks,
- * so that we can kill at least one process reasonably reliable.
- */
-static void collect_procs(struct page *page, struct list_head *tokill)
-{
-	struct to_kill *tk;
-
-	tk = kmalloc(sizeof(struct to_kill), GFP_KERNEL);
-	/* memory allocation failure is implicitly handled */
-	if (PageAnon(page))
-		collect_procs_anon(page, tokill, &tk);
-	else
-		collect_procs_file(page, tokill, &tk);
-	kfree(tk);
-}
-
-/*
  * Error handlers for various types of pages.
  */
 
@@ -599,7 +357,6 @@ static void hwpoison_user_mappings(struc
 				   int trapno)
 {
 	enum ttu_flags ttu = TTU_UNMAP | TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS;
-	int kill = sysctl_memory_failure_early_kill;
 	struct address_space *mapping;
 	LIST_HEAD(tokill);
 	int ret;
@@ -633,7 +390,6 @@ static void hwpoison_user_mappings(struc
 		if (page_mkclean(p))
 			SetPageDirty(p);
 		else {
-			kill = 0;
 			ttu |= TTU_IGNORE_HWPOISON;
 			printk(KERN_INFO
 	"MCE %#lx: corrupted page was clean: dropped without side effects\n",
@@ -642,22 +398,6 @@ static void hwpoison_user_mappings(struc
 	}
 
 	/*
-	 * First collect all the processes that have the page
-	 * mapped.  This has to be done before try_to_unmap,
-	 * because ttu takes the rmap data structures down.
-	 *
-	 * This also has the side effect to propagate the dirty
-	 * bit from PTEs into the struct page. This is needed
-	 * to actually decide if something needs to be killed
-	 * or errored, or if it's ok to just drop the page.
-	 *
-	 * Error handling: We ignore errors here because
-	 * there's nothing that can be done.
-	 */
-	if (kill)
-		collect_procs(p, &tokill);
-
-	/*
 	 * try_to_unmap can fail temporarily due to races.
 	 * Try a few times (RED-PEN better strategy?)
 	 */
@@ -671,18 +411,6 @@ static void hwpoison_user_mappings(struc
 	if (ret != SWAP_SUCCESS)
 		printk(KERN_ERR "MCE %#lx: failed to unmap page (mapcount=%d)\n",
 				pfn, page_mapcount(p));
-
-	/*
-	 * Now that the dirty bit has been propagated to the
-	 * struct page and all unmaps done we can decide if
-	 * killing is needed or not.  Only kill when the page
-	 * was dirty, otherwise the tokill list is merely
-	 * freed.  When there was a problem unmapping earlier
-	 * use a more force-full uncatchable kill to prevent
-	 * any accesses to the poisoned memory.
-	 */
-	kill_procs_ao(&tokill, !!PageDirty(p), trapno,
-		      ret != SWAP_SUCCESS, pfn);
 }
 
 /**
--- sound-2.6.orig/Documentation/sysctl/vm.txt
+++ sound-2.6/Documentation/sysctl/vm.txt
@@ -32,7 +32,6 @@ Currently, these files are in /proc/sys/
 - legacy_va_layout
 - lowmem_reserve_ratio
 - max_map_count
-- memory_failure_early_kill
 - min_free_kbytes
 - min_slab_ratio
 - min_unmapped_ratio
@@ -54,6 +53,7 @@ Currently, these files are in /proc/sys/
 - vfs_cache_pressure
 - zone_reclaim_mode
 
+
 ==============================================================
 
 block_dump
@@ -275,32 +275,6 @@ e.g., up to one or two maps per allocati
 
 The default value is 65536.
 
-=============================================================
-
-memory_failure_early_kill:
-
-Control how to kill processes when uncorrected memory error (typically
-a 2bit error in a memory module) is detected in the background by hardware
-that cannot be handled by the kernel. In some cases (like the page
-still having a valid copy on disk) the kernel will handle the failure
-transparently without affecting any applications. But if there is
-no other uptodate copy of the data it will kill to prevent any data
-corruptions from propagating.
-
-1: Kill all processes that have the corrupted and not reloadable page mapped
-as soon as the corruption is detected.  Note this is not supported
-for a few types of pages, like kernel internally allocated data or
-the swap cache, but works for the majority of user pages.
-
-0: Only unmap the corrupted page from all processes and only kill a process
-who tries to access it.
-
-The kill is done using a catchable SIGBUS with BUS_MCEERR_AO, so processes can
-handle this if they want to.
-
-This is only active on architectures/platforms with advanced machine
-check handling and depends on the hardware capabilities.
-
 ==============================================================
 
 min_free_kbytes:
--- sound-2.6.orig/include/linux/mm.h
+++ sound-2.6/include/linux/mm.h
@@ -1331,7 +1331,6 @@ extern int account_locked_memory(struct 
 extern void refund_locked_memory(struct mm_struct *mm, size_t size);
 
 extern void memory_failure(unsigned long pfn, int trapno);
-extern int sysctl_memory_failure_early_kill;
 extern atomic_long_t mce_bad_pages;
 
 #endif /* __KERNEL__ */
--- sound-2.6.orig/kernel/sysctl.c
+++ sound-2.6/kernel/sysctl.c
@@ -1319,19 +1319,6 @@ static struct ctl_table vm_table[] = {
 		.mode		= 0644,
 		.proc_handler	= &scan_unevictable_handler,
 	},
-#ifdef CONFIG_MEMORY_FAILURE
-       {
-               .ctl_name       = CTL_UNNUMBERED,
-               .procname       = "memory_failure_early_kill",
-               .data           = &sysctl_memory_failure_early_kill,
-               .maxlen         = sizeof(sysctl_memory_failure_early_kill),
-               .mode           = 0644,
-               .proc_handler   = &proc_dointvec_minmax,
-               .strategy       = &sysctl_intvec,
-               .extra1         = &zero,
-               .extra2         = &one,
-       },
-#endif
 
 /*
  * NOTE: do not add new entries to this table unless you have read
--- sound-2.6.orig/mm/filemap.c
+++ sound-2.6/mm/filemap.c
@@ -105,10 +105,6 @@
  *
  *  ->task->proc_lock
  *    ->dcache_lock		(proc_pid_lookup)
- *
- *  (code doesn't rely on that order, so you could switch it around)
- *  ->tasklist_lock             (memory_failure, collect_procs_ao)
- *    ->i_mmap_lock
  */
 
 /*
--- sound-2.6.orig/mm/rmap.c
+++ sound-2.6/mm/rmap.c
@@ -36,10 +36,6 @@
  *                 mapping->tree_lock (widely used, in set_page_dirty,
  *                           in arch-dependent flush_dcache_mmap_lock,
  *                           within inode_lock in __sync_single_inode)
- *
- * (code doesn't rely on that order so it could be switched around)
- * ->tasklist_lock
- *   anon_vma->lock      (memory_failure, collect_procs_anon)
  */
 
 #include <linux/mm.h>
@@ -195,7 +191,7 @@ void __init anon_vma_init(void)
  * Getting a lock on a stable anon_vma from a page off the LRU is
  * tricky: page_lock_anon_vma rely on RCU to guard against the races.
  */
-struct anon_vma *page_lock_anon_vma(struct page *page)
+static struct anon_vma *page_lock_anon_vma(struct page *page)
 {
 	struct anon_vma *anon_vma;
 	unsigned long anon_mapping;
@@ -215,7 +211,7 @@ out:
 	return NULL;
 }
 
-void page_unlock_anon_vma(struct anon_vma *anon_vma)
+static void page_unlock_anon_vma(struct anon_vma *anon_vma)
 {
 	spin_unlock(&anon_vma->lock);
 	rcu_read_unlock();
--- sound-2.6.orig/include/linux/rmap.h
+++ sound-2.6/include/linux/rmap.h
@@ -129,12 +129,6 @@ int try_to_munlock(struct page *);
 int page_wrprotect(struct page *page, int *odirect_sync, int count_offset);
 #endif
 
-/*
- * Called by memory-failure.c to kill processes.
- */
-struct anon_vma *page_lock_anon_vma(struct page *page);
-void page_unlock_anon_vma(struct anon_vma *anon_vma);
-
 #else	/* !CONFIG_MMU */
 
 #define anon_vma_init()		do {} while (0)

-- 


^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH 4/5] HWPOISON: report sticky EIO for poisoned file
  2009-06-11 14:22 [PATCH 0/5] [RFC] HWPOISON incremental fixes Wu Fengguang
                   ` (2 preceding siblings ...)
  2009-06-11 14:22 ` [PATCH 3/5] HWPOISON: remove early kill option for now Wu Fengguang
@ 2009-06-11 14:22 ` Wu Fengguang
  2009-06-11 16:31   ` Rik van Riel
  2009-06-12 10:07   ` Andi Kleen
  2009-06-11 14:22 ` [PATCH 5/5] HWPOISON: use the safer invalidate page for possible metadata pages Wu Fengguang
  2009-06-12 10:56 ` [PATCH 0/5] [RFC] HWPOISON incremental fixes Andi Kleen
  5 siblings, 2 replies; 42+ messages in thread
From: Wu Fengguang @ 2009-06-11 14:22 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, Wu Fengguang, Hugh Dickins, Nick Piggin, Andi Kleen, riel,
	chris.mason, linux-mm

[-- Attachment #1: hwpoison-more-sticky-eio.patch --]
[-- Type: text/plain, Size: 2562 bytes --]

This makes the EIO reports on write(), fsync(), or the NFS close()
sticky enough. The only way to get rid of it may be

	echo 3 > /proc/sys/vm/drop_caches

Note that the impacted process will only be killed if it mapped the page.
XXX
via read()/write()/fsync() instead of memory mapped reads/writes, simply
because it's very hard to find them.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 include/linux/pagemap.h |   13 +++++++++++++
 mm/filemap.c            |   11 +++++++++++
 mm/memory-failure.c     |    2 +-
 3 files changed, 25 insertions(+), 1 deletion(-)

--- sound-2.6.orig/include/linux/pagemap.h
+++ sound-2.6/include/linux/pagemap.h
@@ -23,6 +23,7 @@ enum mapping_flags {
 	AS_ENOSPC	= __GFP_BITS_SHIFT + 1,	/* ENOSPC on async write */
 	AS_MM_ALL_LOCKS	= __GFP_BITS_SHIFT + 2,	/* under mm_take_all_locks() */
 	AS_UNEVICTABLE	= __GFP_BITS_SHIFT + 3,	/* e.g., ramdisk, SHM_LOCK */
+	AS_HWPOISON	= __GFP_BITS_SHIFT + 4,	/* hardware memory corruption */
 };
 
 static inline void mapping_set_error(struct address_space *mapping, int error)
@@ -52,6 +53,18 @@ static inline int mapping_unevictable(st
 	return !!mapping;
 }
 
+#ifdef CONFIG_MEMORY_FAILURE
+static inline int mapping_hwpoison(struct address_space *mapping)
+{
+	return test_bit(AS_HWPOISON, &mapping->flags);
+}
+#else
+static inline int mapping_hwpoison(struct address_space *mapping)
+{
+	return 0;
+}
+#endif
+
 static inline gfp_t mapping_gfp_mask(struct address_space * mapping)
 {
 	return (__force gfp_t)mapping->flags & __GFP_BITS_MASK;
--- sound-2.6.orig/mm/filemap.c
+++ sound-2.6/mm/filemap.c
@@ -302,6 +302,8 @@ int wait_on_page_writeback_range(struct 
 		ret = -ENOSPC;
 	if (test_and_clear_bit(AS_EIO, &mapping->flags))
 		ret = -EIO;
+	if (mapping_hwpoison(mapping))
+		ret = -EIO;
 
 	return ret;
 }
@@ -460,6 +462,15 @@ int add_to_page_cache_locked(struct page
 
 	VM_BUG_ON(!PageLocked(page));
 
+	/*
+	 * Hardware corrupted page will be removed from mapping,
+	 * so we want to deny (possibly) reloading the old data.
+	 */
+	if (unlikely(mapping_hwpoison(mapping))) {
+		error = -EIO;
+		goto out;
+	}
+
 	error = mem_cgroup_cache_charge(page, current->mm,
 					gfp_mask & GFP_RECLAIM_MASK);
 	if (error)
--- sound-2.6.orig/mm/memory-failure.c
+++ sound-2.6/mm/memory-failure.c
@@ -184,7 +184,7 @@ static int me_pagecache_dirty(struct pag
 		 * the first EIO, but we're not worse than other parts
 		 * of the kernel.
 		 */
-		mapping_set_error(mapping, EIO);
+		set_bit(AS_HWPOISON, &mapping->flags);
 	}
 
 	return me_pagecache_clean(p, pfn);

-- 


^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH 5/5] HWPOISON: use the safer invalidate page for possible metadata pages
  2009-06-11 14:22 [PATCH 0/5] [RFC] HWPOISON incremental fixes Wu Fengguang
                   ` (3 preceding siblings ...)
  2009-06-11 14:22 ` [PATCH 4/5] HWPOISON: report sticky EIO for poisoned file Wu Fengguang
@ 2009-06-11 14:22 ` Wu Fengguang
  2009-06-11 16:36   ` Rik van Riel
  2009-06-12 10:56 ` [PATCH 0/5] [RFC] HWPOISON incremental fixes Andi Kleen
  5 siblings, 1 reply; 42+ messages in thread
From: Wu Fengguang @ 2009-06-11 14:22 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, Wu Fengguang, Hugh Dickins, Nick Piggin, Andi Kleen, riel,
	chris.mason, linux-mm

[-- Attachment #1: hwpoison-skip-metadata.patch --]
[-- Type: text/plain, Size: 2479 bytes --]

As recommended by Nick Piggin.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 include/linux/mm.h  |    1 +
 mm/memory-failure.c |   23 +++++++++++++----------
 mm/truncate.c       |    3 +--
 3 files changed, 15 insertions(+), 12 deletions(-)

--- sound-2.6.orig/mm/memory-failure.c
+++ sound-2.6/mm/memory-failure.c
@@ -113,6 +113,10 @@ static int me_pagecache_clean(struct pag
 	if (!isolate_lru_page(p))
 		page_cache_release(p);
 
+	mapping = page_mapping(p);
+	if (mapping == NULL)
+		return RECOVERED;
+
 	/*
 	 * Now truncate the page in the page cache. This is really
 	 * more like a "temporary hole punch"
@@ -120,20 +124,19 @@ static int me_pagecache_clean(struct pag
 	 * has a reference, because it could be file system metadata
 	 * and that's not safe to truncate.
 	 */
-	mapping = page_mapping(p);
-	if (mapping && S_ISBLK(mapping->host->i_mode) && page_count(p) > 1) {
+	if (!S_ISREG(mapping->host->i_mode) &&
+	    !invalidate_complete_page(mapping, p)) {
 		printk(KERN_ERR
-		       "MCE %#lx: page looks like a unsupported file system metadata page\n",
+		       "MCE %#lx: failed to invalidate metadata page\n",
 		       pfn);
 		return FAILED;
 	}
-	if (mapping) {
-		truncate_inode_page(mapping, p);
-		if (page_has_private(p) && !try_to_release_page(p, GFP_NOIO)) {
-			pr_debug(KERN_ERR "MCE %#lx: failed to release buffers\n",
-				 pfn);
-			return FAILED;
-		}
+
+	truncate_inode_page(mapping, p);
+	if (page_has_private(p) && !try_to_release_page(p, GFP_NOIO)) {
+		pr_debug(KERN_ERR "MCE %#lx: failed to release buffers\n",
+			 pfn);
+		return FAILED;
 	}
 	return RECOVERED;
 }
--- sound-2.6.orig/include/linux/mm.h
+++ sound-2.6/include/linux/mm.h
@@ -817,6 +817,7 @@ extern int vmtruncate(struct inode * ino
 extern int vmtruncate_range(struct inode * inode, loff_t offset, loff_t end);
 
 void truncate_inode_page(struct address_space *mapping, struct page *page);
+int invalidate_complete_page(struct address_space *mapping, struct page *page);
 
 #ifdef CONFIG_MMU
 extern int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
--- sound-2.6.orig/mm/truncate.c
+++ sound-2.6/mm/truncate.c
@@ -118,8 +118,7 @@ truncate_complete_page(struct address_sp
  *
  * Returns non-zero if the page was successfully invalidated.
  */
-static int
-invalidate_complete_page(struct address_space *mapping, struct page *page)
+int invalidate_complete_page(struct address_space *mapping, struct page *page)
 {
 	int ret;
 

-- 


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 1/5] HWPOISON: define VM_FAULT_HWPOISON to 0 when feature is disabled
  2009-06-11 14:22 ` [PATCH 1/5] HWPOISON: define VM_FAULT_HWPOISON to 0 when feature is disabled Wu Fengguang
@ 2009-06-11 15:44   ` Rik van Riel
  2009-06-12 10:00   ` Andi Kleen
  2009-06-12 11:22   ` Ingo Molnar
  2 siblings, 0 replies; 42+ messages in thread
From: Rik van Riel @ 2009-06-11 15:44 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, LKML, Nick Piggin, Hugh Dickins, Andi Kleen,
	chris.mason, linux-mm

Wu Fengguang wrote:
> So as to eliminate one #ifdef in the c source.
> 
> Proposed by Nick Piggin.
> 
> CC: Nick Piggin <npiggin@suse.de>
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>

Acked-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 2/5] HWPOISON: fix tasklist_lock/anon_vma locking order
  2009-06-11 14:22 ` [PATCH 2/5] HWPOISON: fix tasklist_lock/anon_vma locking order Wu Fengguang
@ 2009-06-11 15:59   ` Rik van Riel
  2009-06-12 10:03   ` Andi Kleen
  1 sibling, 0 replies; 42+ messages in thread
From: Rik van Riel @ 2009-06-11 15:59 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, LKML, Nick Piggin, Hugh Dickins, Andi Kleen,
	chris.mason, linux-mm

Wu Fengguang wrote:
> To avoid possible deadlock. Proposed by Nick Piggin:
> 
>   You have tasklist_lock(R) nesting outside i_mmap_lock, and inside anon_vma
>   lock. And anon_vma lock nests inside i_mmap_lock.
> 
>   This seems fragile. If rwlocks ever become FIFO or tasklist_lock changes
>   type (maybe -rt kernels do it), then you could have a task holding
>   anon_vma lock and waiting for tasklist_lock, and another holding tasklist
>   lock and waiting for i_mmap_lock, and another holding i_mmap_lock and
>   waiting for anon_vma lock.
> 
> CC: Nick Piggin <npiggin@suse.de>
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>

Acked-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 3/5] HWPOISON: remove early kill option for now
  2009-06-11 14:22 ` [PATCH 3/5] HWPOISON: remove early kill option for now Wu Fengguang
@ 2009-06-11 16:06   ` Rik van Riel
  2009-06-12  9:59   ` Andi Kleen
  1 sibling, 0 replies; 42+ messages in thread
From: Rik van Riel @ 2009-06-11 16:06 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, LKML, Nick Piggin, Hugh Dickins, Andi Kleen,
	chris.mason, linux-mm

Wu Fengguang wrote:
> It needs more thoughts, and is not a must have for .31.
> 
> CC: Nick Piggin <npiggin@suse.de>
> CC: Hugh Dickins <hugh.dickins@tiscali.co.uk>
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>

Fair enough.  It's not an absolute must-have and still needs
a little bit more work.

Acked-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 4/5] HWPOISON: report sticky EIO for poisoned file
  2009-06-11 14:22 ` [PATCH 4/5] HWPOISON: report sticky EIO for poisoned file Wu Fengguang
@ 2009-06-11 16:31   ` Rik van Riel
  2009-06-12 10:07   ` Andi Kleen
  1 sibling, 0 replies; 42+ messages in thread
From: Rik van Riel @ 2009-06-11 16:31 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, LKML, Hugh Dickins, Nick Piggin, Andi Kleen,
	chris.mason, linux-mm

Wu Fengguang wrote:
> This makes the EIO reports on write(), fsync(), or the NFS close()
> sticky enough. The only way to get rid of it may be
> 
> 	echo 3 > /proc/sys/vm/drop_caches
> 
> Note that the impacted process will only be killed if it mapped the page.
> XXX
> via read()/write()/fsync() instead of memory mapped reads/writes, simply
> because it's very hard to find them.
> 
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>

Acked-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 5/5] HWPOISON: use the safer invalidate page for possible metadata pages
  2009-06-11 14:22 ` [PATCH 5/5] HWPOISON: use the safer invalidate page for possible metadata pages Wu Fengguang
@ 2009-06-11 16:36   ` Rik van Riel
  0 siblings, 0 replies; 42+ messages in thread
From: Rik van Riel @ 2009-06-11 16:36 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, LKML, Hugh Dickins, Nick Piggin, Andi Kleen,
	chris.mason, linux-mm

Wu Fengguang wrote:
> As recommended by Nick Piggin.
> 
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>

Acked-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 3/5] HWPOISON: remove early kill option for now
  2009-06-11 14:22 ` [PATCH 3/5] HWPOISON: remove early kill option for now Wu Fengguang
  2009-06-11 16:06   ` Rik van Riel
@ 2009-06-12  9:59   ` Andi Kleen
  1 sibling, 0 replies; 42+ messages in thread
From: Andi Kleen @ 2009-06-12  9:59 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, LKML, Nick Piggin, Hugh Dickins, Andi Kleen, riel,
	chris.mason, linux-mm

On Thu, Jun 11, 2009 at 10:22:42PM +0800, Wu Fengguang wrote:
> It needs more thoughts, and is not a must have for .31.

Please don't do that. I don't think the problem Hugh described is fatal
and there are some scenarios where it is needed.


-Andi

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 1/5] HWPOISON: define VM_FAULT_HWPOISON to 0 when feature is disabled
  2009-06-11 14:22 ` [PATCH 1/5] HWPOISON: define VM_FAULT_HWPOISON to 0 when feature is disabled Wu Fengguang
  2009-06-11 15:44   ` Rik van Riel
@ 2009-06-12 10:00   ` Andi Kleen
  2009-06-12 13:15     ` Wu Fengguang
  2009-06-12 11:22   ` Ingo Molnar
  2 siblings, 1 reply; 42+ messages in thread
From: Andi Kleen @ 2009-06-12 10:00 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, LKML, Nick Piggin, Hugh Dickins, Andi Kleen, riel,
	chris.mason, linux-mm

On Thu, Jun 11, 2009 at 10:22:40PM +0800, Wu Fengguang wrote:
> So as to eliminate one #ifdef in the c source.
> 
> Proposed by Nick Piggin.

Some older gccs didn't eliminate string constants for this,
please check you don't get the string in the object file with 
gcc 3.2 with the CONFIG disabled now.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 2/5] HWPOISON: fix tasklist_lock/anon_vma locking order
  2009-06-11 14:22 ` [PATCH 2/5] HWPOISON: fix tasklist_lock/anon_vma locking order Wu Fengguang
  2009-06-11 15:59   ` Rik van Riel
@ 2009-06-12 10:03   ` Andi Kleen
  2009-06-12 10:07     ` Nick Piggin
  2009-06-12 13:27     ` Wu Fengguang
  1 sibling, 2 replies; 42+ messages in thread
From: Andi Kleen @ 2009-06-12 10:03 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, LKML, Nick Piggin, Hugh Dickins, Andi Kleen, riel,
	chris.mason, linux-mm

On Thu, Jun 11, 2009 at 10:22:41PM +0800, Wu Fengguang wrote:
> To avoid possible deadlock. Proposed by Nick Piggin:

I disagree with the description. There's no possible deadlock right now.
It would be purely out of paranoia.

> 
>   You have tasklist_lock(R) nesting outside i_mmap_lock, and inside anon_vma
>   lock. And anon_vma lock nests inside i_mmap_lock.
> 
>   This seems fragile. If rwlocks ever become FIFO or tasklist_lock changes

I was a bit dubious on this reasoning. If rwlocks become FIFO a lot of
stuff will likely break.

>   type (maybe -rt kernels do it), then you could have a task holding

I think they tried but backed off quickly again

It's ok with a less scare-mongering description.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 4/5] HWPOISON: report sticky EIO for poisoned file
  2009-06-11 14:22 ` [PATCH 4/5] HWPOISON: report sticky EIO for poisoned file Wu Fengguang
  2009-06-11 16:31   ` Rik van Riel
@ 2009-06-12 10:07   ` Andi Kleen
  2009-06-12 13:41     ` Wu Fengguang
  1 sibling, 1 reply; 42+ messages in thread
From: Andi Kleen @ 2009-06-12 10:07 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, LKML, Hugh Dickins, Nick Piggin, Andi Kleen, riel,
	chris.mason, linux-mm

On Thu, Jun 11, 2009 at 10:22:43PM +0800, Wu Fengguang wrote:
> This makes the EIO reports on write(), fsync(), or the NFS close()
> sticky enough. The only way to get rid of it may be
> 
> 	echo 3 > /proc/sys/vm/drop_caches
> 
> Note that the impacted process will only be killed if it mapped the page.
> XXX
> via read()/write()/fsync() instead of memory mapped reads/writes, simply
> because it's very hard to find them.

I don't like the special case bit. Conceptually we shouldn't need
to handle hwpoison specially here; it's just like a standard error. 

It makes hwpoison look more intrusive than it really is :)

I think it would be better to simply make
the standard EIO sticky; that would fix a lot of other issues too (e.g.
better reporting of metadata errors) But that's something for post .31.

For .31 I think hwpoison can live fine with non sticky errors; it was
more a problem of the test suite anyways which we worked around.

So better drop this patch for now.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 2/5] HWPOISON: fix tasklist_lock/anon_vma locking order
  2009-06-12 10:03   ` Andi Kleen
@ 2009-06-12 10:07     ` Nick Piggin
  2009-06-12 13:27     ` Wu Fengguang
  1 sibling, 0 replies; 42+ messages in thread
From: Nick Piggin @ 2009-06-12 10:07 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Wu Fengguang, Andrew Morton, LKML, Hugh Dickins, riel,
	chris.mason, linux-mm

On Fri, Jun 12, 2009 at 12:03:08PM +0200, Andi Kleen wrote:
> On Thu, Jun 11, 2009 at 10:22:41PM +0800, Wu Fengguang wrote:
> > To avoid possible deadlock. Proposed by Nick Piggin:
> 
> I disagree with the description. There's no possible deadlock right now.
> It would be purely out of paranoia.
> 
> > 
> >   You have tasklist_lock(R) nesting outside i_mmap_lock, and inside anon_vma
> >   lock. And anon_vma lock nests inside i_mmap_lock.
> > 
> >   This seems fragile. If rwlocks ever become FIFO or tasklist_lock changes
> 
> I was a bit dubious on this reasoning. If rwlocks become FIFO a lot of
> stuff will likely break.
> 
> >   type (maybe -rt kernels do it), then you could have a task holding
> 
> I think they tried but backed off quickly again
> 
> It's ok with a less scare-mongering description.

There's simply no good reason to invert ordering of locks like
this.


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 0/5] [RFC] HWPOISON incremental fixes
  2009-06-11 14:22 [PATCH 0/5] [RFC] HWPOISON incremental fixes Wu Fengguang
                   ` (4 preceding siblings ...)
  2009-06-11 14:22 ` [PATCH 5/5] HWPOISON: use the safer invalidate page for possible metadata pages Wu Fengguang
@ 2009-06-12 10:56 ` Andi Kleen
  2009-06-12 13:59   ` Wu Fengguang
  5 siblings, 1 reply; 42+ messages in thread
From: Andi Kleen @ 2009-06-12 10:56 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, LKML, Hugh Dickins, Nick Piggin, Andi Kleen, riel,
	chris.mason, linux-mm

On Thu, Jun 11, 2009 at 10:22:39PM +0800, Wu Fengguang wrote:
> Hi all,
> 
> Here are the hwpoison fixes that aims to address Nick and Hugh's concerns.
> Note that
> - the early kill option is dropped for .31. It's obscure option and complex
>   code and is not must have for .31. Maybe Andi also aims this option for
>   notifying KVM, but right now KVM is not ready to handle that.

KVM is ready to handle it, patches for that have been submitted and
are queued.

Also without early kill it's not really possible right now to recover
in the guest. Also for some other scenarios early kill is much easier
to handle than late kill: for late kill you always have to bail
out of your current execution context, while early kill that can be 
done out of line (e.g. by just dropping a corrupted object similar to 
what the kernel does). That's a much nicer and gentle model than late
kill.

Of course very few programs will try to handle this, but if any does
it's better to make it easier for them. 

That we send too many signals in a few cases is not fatal right now
I think. Remember always the alternative is to die completely.

So please don't drop that code right now.

> - It seems that even fsync() processes are not easy to catch, so I abandoned
>   the SIGKILL on fsync() idea. Instead, I choose to fail any attempt to
>   populate the poisoned file with new pages, so that the corrupted page offset
>   won't be repopulated with outdated data. This seems to be a safe way to allow
>   the process to continue running while still be able to promise good (but not
>   complete) data consistency.

The fsync() error reporting is already broken anyways, even without hwpoison,
for metadata errors which also only rely on the address space bit and not the
page and run into all the same problems.

I don't think we need to be better here than normal metadata.

Possibly if metadata can be fixed then hwpoison will be fixed too in the
same pass. But that's something longer term.

> - I didn't implement the PANIC-on-corrupted-data option. Instead, I guess
>   sending uevent notification to user space will be a more flexible scheme?

Normally you can get very aggressive panics by setting the x86 mce tolerant 
modus to 0 (default is 1); i suspect that will be good enough.

If other architectures add hwpoison support presumably they can add
a similar tunable.

Doing that in the low level handler is better than in the high level
VM because there are some corruption cases which are not reported
to high level (e.g. not affecting memory directly)

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 1/5] HWPOISON: define VM_FAULT_HWPOISON to 0 when feature is disabled
  2009-06-11 14:22 ` [PATCH 1/5] HWPOISON: define VM_FAULT_HWPOISON to 0 when feature is disabled Wu Fengguang
  2009-06-11 15:44   ` Rik van Riel
  2009-06-12 10:00   ` Andi Kleen
@ 2009-06-12 11:22   ` Ingo Molnar
  2009-06-12 12:57     ` Wu Fengguang
  2 siblings, 1 reply; 42+ messages in thread
From: Ingo Molnar @ 2009-06-12 11:22 UTC (permalink / raw)
  To: Wu Fengguang, Thomas Gleixner, H. Peter Anvin, Peter Zijlstra
  Cc: Andrew Morton, LKML, Nick Piggin, Hugh Dickins, Andi Kleen, riel,
	chris.mason, linux-mm


* Wu Fengguang <fengguang.wu@intel.com> wrote:

> So as to eliminate one #ifdef in the c source.
> 
> Proposed by Nick Piggin.
> 
> CC: Nick Piggin <npiggin@suse.de>
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> ---
>  arch/x86/mm/fault.c |    3 +--
>  include/linux/mm.h  |    7 ++++++-
>  2 files changed, 7 insertions(+), 3 deletions(-)
> 
> --- sound-2.6.orig/arch/x86/mm/fault.c
> +++ sound-2.6/arch/x86/mm/fault.c
> @@ -819,14 +819,13 @@ do_sigbus(struct pt_regs *regs, unsigned
>  	tsk->thread.error_code	= error_code;
>  	tsk->thread.trap_no	= 14;
>  
> -#ifdef CONFIG_MEMORY_FAILURE
>  	if (fault & VM_FAULT_HWPOISON) {
>  		printk(KERN_ERR
>  	"MCE: Killing %s:%d due to hardware memory corruption fault at %lx\n",
>  			tsk->comm, tsk->pid, address);
>  		code = BUS_MCEERR_AR;
>  	}
> -#endif

Btw., anything like this should happen in close cooperation with the 
x86 tree, not as some pure MM feature. I dont see Cc:s and nothing 
that indicates that realization. What's going on here?

It is not at all clear to me whether propagating hardware failures 
this widely is desired from a general design POV. Most desktop 
hardware wont give a damn about this (and if a hardware fault 
happens you want to get as far from the crappy hardware as possible) 
so i'm not sure how relevant it is and how well tested it will 
become in practice.

I.e. really some wider discussion needs to happen on this.

	Ingo

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 1/5] HWPOISON: define VM_FAULT_HWPOISON to 0 when feature is disabled
  2009-06-12 11:22   ` Ingo Molnar
@ 2009-06-12 12:57     ` Wu Fengguang
  2009-06-12 13:17       ` Ingo Molnar
  0 siblings, 1 reply; 42+ messages in thread
From: Wu Fengguang @ 2009-06-12 12:57 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Thomas Gleixner, H. Peter Anvin, Peter Zijlstra, Andrew Morton,
	LKML, Nick Piggin, Hugh Dickins, Andi Kleen, riel, chris.mason,
	linux-mm

Hi Ingo,

On Fri, Jun 12, 2009 at 07:22:58PM +0800, Ingo Molnar wrote:
> 
> * Wu Fengguang <fengguang.wu@intel.com> wrote:
> 
> > So as to eliminate one #ifdef in the c source.
> > 
> > Proposed by Nick Piggin.
> > 
> > CC: Nick Piggin <npiggin@suse.de>
> > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> > ---
> >  arch/x86/mm/fault.c |    3 +--
> >  include/linux/mm.h  |    7 ++++++-
> >  2 files changed, 7 insertions(+), 3 deletions(-)
> > 
> > --- sound-2.6.orig/arch/x86/mm/fault.c
> > +++ sound-2.6/arch/x86/mm/fault.c
> > @@ -819,14 +819,13 @@ do_sigbus(struct pt_regs *regs, unsigned
> >  	tsk->thread.error_code	= error_code;
> >  	tsk->thread.trap_no	= 14;
> >  
> > -#ifdef CONFIG_MEMORY_FAILURE
> >  	if (fault & VM_FAULT_HWPOISON) {
> >  		printk(KERN_ERR
> >  	"MCE: Killing %s:%d due to hardware memory corruption fault at %lx\n",
> >  			tsk->comm, tsk->pid, address);
> >  		code = BUS_MCEERR_AR;
> >  	}
> > -#endif
> 
> Btw., anything like this should happen in close cooperation with the 
> x86 tree, not as some pure MM feature. I dont see Cc:s and nothing 
> that indicates that realization. What's going on here?

Ah sorry for the ignorance!  Andi has a nice overview of the big
picture here: http://lkml.org/lkml/2009/6/3/371

In the above chunk, the process is trying to access the already
corrupted page and thus shall be killed, otherwise it will either
silently consume corrupted data, or will trigger another (deadly)
MCE event and bring down the whole machine.

VM_FAULT_HWPOISON is tagged by the hwpoison code to indicate that the
previously mapped page contains corrupted data, and is unrecoverable
because there are no valid on-disk copy that can be reloaded.

> It is not at all clear to me whether propagating hardware failures 
> this widely is desired from a general design POV. Most desktop 
> hardware wont give a damn about this (and if a hardware fault 
> happens you want to get as far from the crappy hardware as possible) 
> so i'm not sure how relevant it is and how well tested it will 
> become in practice.

Intel Nehalem-EX will have this feature, and is going to ship in
volume servers in the coming years. Given that the servers may
well be equipped with tons of memory, memory failures (especially
soft errors http://en.wikipedia.org/wiki/Soft_error) become
un-ignorable.

Sunspot Maximum is underway by 2011 and we must be prepared for it ;)

> I.e. really some wider discussion needs to happen on this.

OK.


Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 1/5] HWPOISON: define VM_FAULT_HWPOISON to 0 when feature is disabled
  2009-06-12 10:00   ` Andi Kleen
@ 2009-06-12 13:15     ` Wu Fengguang
  0 siblings, 0 replies; 42+ messages in thread
From: Wu Fengguang @ 2009-06-12 13:15 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Andrew Morton, LKML, Nick Piggin, Hugh Dickins, riel,
	chris.mason, linux-mm

On Fri, Jun 12, 2009 at 06:00:50PM +0800, Andi Kleen wrote:
> On Thu, Jun 11, 2009 at 10:22:40PM +0800, Wu Fengguang wrote:
> > So as to eliminate one #ifdef in the c source.
> > 
> > Proposed by Nick Piggin.
> 
> Some older gccs didn't eliminate string constants for this,
> please check you don't get the string in the object file with 
> gcc 3.2 with the CONFIG disabled now.

Well I don't have gcc 3.2 at hand and it's missing from apt source.. 

Thanks,
Fengguang


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 1/5] HWPOISON: define VM_FAULT_HWPOISON to 0 when feature is disabled
  2009-06-12 12:57     ` Wu Fengguang
@ 2009-06-12 13:17       ` Ingo Molnar
  2009-06-12 13:33         ` Wu Fengguang
                           ` (3 more replies)
  0 siblings, 4 replies; 42+ messages in thread
From: Ingo Molnar @ 2009-06-12 13:17 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Thomas Gleixner, H. Peter Anvin, Peter Zijlstra, Andrew Morton,
	LKML, Nick Piggin, Hugh Dickins, Andi Kleen, riel, chris.mason,
	linux-mm, Linus Torvalds


* Wu Fengguang <fengguang.wu@intel.com> wrote:

> Hi Ingo,
> 
> On Fri, Jun 12, 2009 at 07:22:58PM +0800, Ingo Molnar wrote:
> > 
> > * Wu Fengguang <fengguang.wu@intel.com> wrote:
> > 
> > > So as to eliminate one #ifdef in the c source.
> > > 
> > > Proposed by Nick Piggin.
> > > 
> > > CC: Nick Piggin <npiggin@suse.de>
> > > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> > > ---
> > >  arch/x86/mm/fault.c |    3 +--
> > >  include/linux/mm.h  |    7 ++++++-
> > >  2 files changed, 7 insertions(+), 3 deletions(-)
> > > 
> > > --- sound-2.6.orig/arch/x86/mm/fault.c
> > > +++ sound-2.6/arch/x86/mm/fault.c
> > > @@ -819,14 +819,13 @@ do_sigbus(struct pt_regs *regs, unsigned
> > >  	tsk->thread.error_code	= error_code;
> > >  	tsk->thread.trap_no	= 14;
> > >  
> > > -#ifdef CONFIG_MEMORY_FAILURE
> > >  	if (fault & VM_FAULT_HWPOISON) {
> > >  		printk(KERN_ERR
> > >  	"MCE: Killing %s:%d due to hardware memory corruption fault at %lx\n",
> > >  			tsk->comm, tsk->pid, address);
> > >  		code = BUS_MCEERR_AR;
> > >  	}
> > > -#endif
> > 
> > Btw., anything like this should happen in close cooperation with 
> > the x86 tree, not as some pure MM feature. I dont see Cc:s and 
> > nothing that indicates that realization. What's going on here?
> 
> Ah sorry for the ignorance!  Andi has a nice overview of the big 
> picture here: http://lkml.org/lkml/2009/6/3/371
> 
> In the above chunk, the process is trying to access the already 
> corrupted page and thus shall be killed, otherwise it will either 
> silently consume corrupted data, or will trigger another (deadly) 
> MCE event and bring down the whole machine.

This seems like trying to handle a failure mode that cannot be and 
shouldnt be 'handled' really. If there's an 'already corrupted' page 
then the box should go down hard and fast, and we should not risk 
_even more user data corruption_ by trying to 'continue' in the hope 
of having hit some 'harmless' user process that can be killed ...

So i find the whole feature rather dubious - what's the point? We 
should panic at this point - we just corrupted user data so that 
piece of hardware cannot be trusted. Nor can any subsequent kernel 
bug messages be trusted.

Do we really want this in the core Linux VM and in the architecture 
pagefault handling code and elsewhere? Am i the only one who finds 
this concept of 'handling' user data corruption rather dubious?

	Ingo

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 2/5] HWPOISON: fix tasklist_lock/anon_vma locking order
  2009-06-12 10:03   ` Andi Kleen
  2009-06-12 10:07     ` Nick Piggin
@ 2009-06-12 13:27     ` Wu Fengguang
  2009-06-12 14:04       ` Wu Fengguang
  1 sibling, 1 reply; 42+ messages in thread
From: Wu Fengguang @ 2009-06-12 13:27 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Andrew Morton, LKML, Nick Piggin, Hugh Dickins, riel,
	chris.mason, linux-mm

On Fri, Jun 12, 2009 at 06:03:08PM +0800, Andi Kleen wrote:
> On Thu, Jun 11, 2009 at 10:22:41PM +0800, Wu Fengguang wrote:
> > To avoid possible deadlock. Proposed by Nick Piggin:
> 
> I disagree with the description. There's no possible deadlock right now.
> It would be purely out of paranoia.
> 
> > 
> >   You have tasklist_lock(R) nesting outside i_mmap_lock, and inside anon_vma
> >   lock. And anon_vma lock nests inside i_mmap_lock.
> > 
> >   This seems fragile. If rwlocks ever become FIFO or tasklist_lock changes
> 
> I was a bit dubious on this reasoning. If rwlocks become FIFO a lot of
> stuff will likely break.
> 
> >   type (maybe -rt kernels do it), then you could have a task holding
> 
> I think they tried but backed off quickly again
> 
> It's ok with a less scare-mongering description.

Why not merge it into the original patch and add a simple changelog
line there? I tried the last 6.5 patchset and it didn't apply cleanly
to the latest -mm tree. And this patch was updated:

---
HWPOISON: Refactor truncate to allow direct truncating of page v3
From: Nick Piggin <npiggin@suse.de>

Extract out truncate_inode_page() out of the truncate path so that
it can be used by memory-failure.c

[AK: description, headers, fix typos]
v2: Some white space changes from Fengguang Wu 
v3: add comments

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 include/linux/mm.h |    2 ++
 mm/truncate.c      |   34 ++++++++++++++++++++++------------
 2 files changed, 24 insertions(+), 12 deletions(-)

--- linux.orig/mm/truncate.c
+++ linux/mm/truncate.c
@@ -135,6 +135,26 @@ invalidate_complete_page(struct address_
 	return ret;
 }
 
+/*
+ * Remove one page from its pagecache mapping. The page must be locked.
+ * This does not truncate the file on disk, it performs the pagecache
+ * side of the truncate operation. Dirty data will be discarded, and
+ * concurrent page references are ignored.
+ *
+ * Generic mm/fs code cannot call this on filesystem metadata mappings
+ * because those can assume that a page reference is enough to pin the
+ * page to its mapping.
+ */
+void truncate_inode_page(struct address_space *mapping, struct page *page)
+{
+	if (page_mapped(page)) {
+		unmap_mapping_range(mapping,
+				   (loff_t)page->index << PAGE_CACHE_SHIFT,
+				   PAGE_CACHE_SIZE, 0);
+	}
+	truncate_complete_page(mapping, page);
+}
+
 /**
  * truncate_inode_pages - truncate range of pages specified by start & end byte offsets
  * @mapping: mapping to truncate
@@ -196,12 +216,7 @@ void truncate_inode_pages_range(struct a
 				unlock_page(page);
 				continue;
 			}
-			if (page_mapped(page)) {
-				unmap_mapping_range(mapping,
-				  (loff_t)page_index<<PAGE_CACHE_SHIFT,
-				  PAGE_CACHE_SIZE, 0);
-			}
-			truncate_complete_page(mapping, page);
+			truncate_inode_page(mapping, page);
 			unlock_page(page);
 		}
 		pagevec_release(&pvec);
@@ -238,15 +253,10 @@ void truncate_inode_pages_range(struct a
 				break;
 			lock_page(page);
 			wait_on_page_writeback(page);
-			if (page_mapped(page)) {
-				unmap_mapping_range(mapping,
-				  (loff_t)page->index<<PAGE_CACHE_SHIFT,
-				  PAGE_CACHE_SIZE, 0);
-			}
+			truncate_inode_page(mapping, page);
 			if (page->index > next)
 				next = page->index;
 			next++;
-			truncate_complete_page(mapping, page);
 			unlock_page(page);
 		}
 		pagevec_release(&pvec);
--- linux.orig/include/linux/mm.h
+++ linux/include/linux/mm.h
@@ -808,6 +808,8 @@ static inline void unmap_shared_mapping_
 extern int vmtruncate(struct inode * inode, loff_t offset);
 extern int vmtruncate_range(struct inode * inode, loff_t offset, loff_t end);
 
+void truncate_inode_page(struct address_space *mapping, struct page *page);
+
 #ifdef CONFIG_MMU
 extern int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 			unsigned long address, int write_access);

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 1/5] HWPOISON: define VM_FAULT_HWPOISON to 0 when feature is disabled
  2009-06-12 13:17       ` Ingo Molnar
@ 2009-06-12 13:33         ` Wu Fengguang
  2009-06-12 15:36           ` Ingo Molnar
  2009-06-12 13:58         ` Andi Kleen
                           ` (2 subsequent siblings)
  3 siblings, 1 reply; 42+ messages in thread
From: Wu Fengguang @ 2009-06-12 13:33 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Thomas Gleixner, H. Peter Anvin, Peter Zijlstra, Andrew Morton,
	LKML, Nick Piggin, Hugh Dickins, Andi Kleen, riel, chris.mason,
	linux-mm, Linus Torvalds

On Fri, Jun 12, 2009 at 09:17:54PM +0800, Ingo Molnar wrote:
> 
> * Wu Fengguang <fengguang.wu@intel.com> wrote:
> 
> > Hi Ingo,
> > 
> > On Fri, Jun 12, 2009 at 07:22:58PM +0800, Ingo Molnar wrote:
> > > 
> > > * Wu Fengguang <fengguang.wu@intel.com> wrote:
> > > 
> > > > So as to eliminate one #ifdef in the c source.
> > > > 
> > > > Proposed by Nick Piggin.
> > > > 
> > > > CC: Nick Piggin <npiggin@suse.de>
> > > > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> > > > ---
> > > >  arch/x86/mm/fault.c |    3 +--
> > > >  include/linux/mm.h  |    7 ++++++-
> > > >  2 files changed, 7 insertions(+), 3 deletions(-)
> > > > 
> > > > --- sound-2.6.orig/arch/x86/mm/fault.c
> > > > +++ sound-2.6/arch/x86/mm/fault.c
> > > > @@ -819,14 +819,13 @@ do_sigbus(struct pt_regs *regs, unsigned
> > > >  	tsk->thread.error_code	= error_code;
> > > >  	tsk->thread.trap_no	= 14;
> > > >  
> > > > -#ifdef CONFIG_MEMORY_FAILURE
> > > >  	if (fault & VM_FAULT_HWPOISON) {
> > > >  		printk(KERN_ERR
> > > >  	"MCE: Killing %s:%d due to hardware memory corruption fault at %lx\n",
> > > >  			tsk->comm, tsk->pid, address);
> > > >  		code = BUS_MCEERR_AR;
> > > >  	}
> > > > -#endif
> > > 
> > > Btw., anything like this should happen in close cooperation with 
> > > the x86 tree, not as some pure MM feature. I dont see Cc:s and 
> > > nothing that indicates that realization. What's going on here?
> > 
> > Ah sorry for the ignorance!  Andi has a nice overview of the big 
> > picture here: http://lkml.org/lkml/2009/6/3/371
> > 
> > In the above chunk, the process is trying to access the already 
> > corrupted page and thus shall be killed, otherwise it will either 
> > silently consume corrupted data, or will trigger another (deadly) 
> > MCE event and bring down the whole machine.
> 
> This seems like trying to handle a failure mode that cannot be and 
> shouldnt be 'handled' really. If there's an 'already corrupted' page 
> then the box should go down hard and fast, and we should not risk 
> _even more user data corruption_ by trying to 'continue' in the hope 
> of having hit some 'harmless' user process that can be killed ...
> 
> So i find the whole feature rather dubious - what's the point? We 
> should panic at this point - we just corrupted user data so that 
> piece of hardware cannot be trusted. Nor can any subsequent kernel 
> bug messages be trusted.
> 
> Do we really want this in the core Linux VM and in the architecture 
> pagefault handling code and elsewhere? Am i the only one who finds 
> this concept of 'handling' user data corruption rather dubious?

- The corrupted data only impacts one or more process(es)
- The corrupted data has not be consumed yet

The data corruption has not caused real hurt yet, and can be isolated
to prevent future accesses.  So it makes sense to just kill the
impacted process(es).

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 4/5] HWPOISON: report sticky EIO for poisoned file
  2009-06-12 10:07   ` Andi Kleen
@ 2009-06-12 13:41     ` Wu Fengguang
  0 siblings, 0 replies; 42+ messages in thread
From: Wu Fengguang @ 2009-06-12 13:41 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Andrew Morton, LKML, Hugh Dickins, Nick Piggin, riel,
	chris.mason, linux-mm

On Fri, Jun 12, 2009 at 06:07:16PM +0800, Andi Kleen wrote:
> On Thu, Jun 11, 2009 at 10:22:43PM +0800, Wu Fengguang wrote:
> > This makes the EIO reports on write(), fsync(), or the NFS close()
> > sticky enough. The only way to get rid of it may be
> > 
> > 	echo 3 > /proc/sys/vm/drop_caches
> > 
> > Note that the impacted process will only be killed if it mapped the page.
> > XXX
> > via read()/write()/fsync() instead of memory mapped reads/writes, simply
> > because it's very hard to find them.
> 
> I don't like the special case bit. Conceptually we shouldn't need
> to handle hwpoison specially here; it's just like a standard error. 
> 
> It makes hwpoison look more intrusive than it really is :)

This is already far more less intrusive than Nick's SIGKILL idea ;)

> I think it would be better to simply make
> the standard EIO sticky; that would fix a lot of other issues too (e.g.
> better reporting of metadata errors) But that's something for post .31.

Sure, fixing standard EIO is not the task of this patchset.

> For .31 I think hwpoison can live fine with non sticky errors; it was
> more a problem of the test suite anyways which we worked around.
> 
> So better drop this patch for now.

OK, if people in this list agree it to be intrusive ;)

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 1/5] HWPOISON: define VM_FAULT_HWPOISON to 0 when feature is disabled
  2009-06-12 13:17       ` Ingo Molnar
  2009-06-12 13:33         ` Wu Fengguang
@ 2009-06-12 13:58         ` Andi Kleen
  2009-06-12 15:28         ` Linus Torvalds
  2009-06-12 15:45         ` Ingo Molnar
  3 siblings, 0 replies; 42+ messages in thread
From: Andi Kleen @ 2009-06-12 13:58 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Wu Fengguang, Thomas Gleixner, H. Peter Anvin, Peter Zijlstra,
	Andrew Morton, LKML, Nick Piggin, Hugh Dickins, Andi Kleen, riel,
	chris.mason, linux-mm, Linus Torvalds

> > In the above chunk, the process is trying to access the already 
> > corrupted page and thus shall be killed, otherwise it will either 
> > silently consume corrupted data, or will trigger another (deadly) 
> > MCE event and bring down the whole machine.
> 
> This seems like trying to handle a failure mode that cannot be and 
> shouldnt be 'handled' really. If there's an 'already corrupted' page 

I must slightly correct Fengguang, there is no silently consumed
corrupted data (at least not unless you disable machine checks, don't do
that). Or rather if the hardware cannot contain the error
it will definitely cause a panic and never call this.

Memory does have soft errors and the more memory you have the more
errors. Normally hardware hides that from you by correcting it, but
in some cases you can get multi-bit errors which lead to
uncorrected errors the hardware cannot hide.

This does not necessarily mean that the hardware is
broken; for example it can be caused by cosmic particles hitting
a unlucky transistor. So it can really happen in normal
operation.

The hardware contains these errors (it is marked "poisoned" in caches); 
if it was ever consumed there would be another machine check. 
The only problem is just that the other machine check couldn't be survived;
you would need to reboot.

But the hardware also supports telling the OS when it first detects
the error, which can be often before the error is consumed.

So hwpoison tries to get rid of the pages first when it can be safely
done with some help of the VM. This works because a lot of pages are 
"expendable", e.g. clean pages that can be just reloaded from disk or 
belonging to a process (so only that process is affected)

Another important use case is with virtualization: for a KVM guest you
only want to kill the guest that owns the affected memory, but not
the others. This is one of the features needed for KVM to catch up
with other hypervisors.

I should add it's generic infrastructure not only intended for x86, but
for generically for other architectures. e.g. we will get IA64 support
at some point and probably others too.

> then the box should go down hard and fast, and we should not risk 
> _even more user data corruption_ by trying to 'continue' in the hope 
> of having hit some 'harmless' user process that can be killed ...

Typically when there is a severe hardware failure (e.g. DIMM
completely dying) the box comes down pretty quickly due to multiple machine 
checks. I agree with you in this case panic is best. This code
does not really change that.

This infrastructure is more for handling standard error rates
where you can get an occasional error without panicing the box.
Panicing doesn't really help in this case.

> 
> So i find the whole feature rather dubious - what's the point? We 
> should panic at this point - we just corrupted user data so that 
> piece of hardware cannot be trusted. Nor can any subsequent kernel 

That's not an accurate description of what happens in real memory.
Even good memory has a error rate, how large it is depends on the 
environment. And obviously the error frequency rises with more DIMMs:
more transistors = more potential errors which also means more potential
uncorrected errors.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 0/5] [RFC] HWPOISON incremental fixes
  2009-06-12 10:56 ` [PATCH 0/5] [RFC] HWPOISON incremental fixes Andi Kleen
@ 2009-06-12 13:59   ` Wu Fengguang
  0 siblings, 0 replies; 42+ messages in thread
From: Wu Fengguang @ 2009-06-12 13:59 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Andrew Morton, LKML, Hugh Dickins, Nick Piggin, riel,
	chris.mason, linux-mm

On Fri, Jun 12, 2009 at 06:56:10PM +0800, Andi Kleen wrote:
> On Thu, Jun 11, 2009 at 10:22:39PM +0800, Wu Fengguang wrote:
> > Hi all,
> > 
> > Here are the hwpoison fixes that aims to address Nick and Hugh's concerns.
> > Note that
> > - the early kill option is dropped for .31. It's obscure option and complex
> >   code and is not must have for .31. Maybe Andi also aims this option for
> >   notifying KVM, but right now KVM is not ready to handle that.
> 
> KVM is ready to handle it, patches for that have been submitted and
> are queued.

Ah OK, with that we'd better to have early kill for .31.

> Also without early kill it's not really possible right now to recover
> in the guest. Also for some other scenarios early kill is much easier
> to handle than late kill: for late kill you always have to bail
> out of your current execution context, while early kill that can be 
> done out of line (e.g. by just dropping a corrupted object similar to 
> what the kernel does). That's a much nicer and gentle model than late
> kill.
> 
> Of course very few programs will try to handle this, but if any does
> it's better to make it easier for them. 

For KVM it's important to send the notification ASAP. Otherwise the
corrupted guest pages may get referenced or go to IO. The same holds
true for applications that do a lot of caching by itself, eg. the big
databases.

> That we send too many signals in a few cases is not fatal right now
> I think. Remember always the alternative is to die completely.

Yeah the possibility of killing innocent processes should be small.

> So please don't drop that code right now.

OK.

> 
> > - It seems that even fsync() processes are not easy to catch, so I abandoned
> >   the SIGKILL on fsync() idea. Instead, I choose to fail any attempt to
> >   populate the poisoned file with new pages, so that the corrupted page offset
> >   won't be repopulated with outdated data. This seems to be a safe way to allow
> >   the process to continue running while still be able to promise good (but not
> >   complete) data consistency.
> 
> The fsync() error reporting is already broken anyways, even without hwpoison,
> for metadata errors which also only rely on the address space bit and not the
> page and run into all the same problems.
> 
> I don't think we need to be better here than normal metadata.
> 
> Possibly if metadata can be fixed then hwpoison will be fixed too in the
> same pass. But that's something longer term.

Yes I admit setting sticky EIO on bdev will be disastrous..

> > - I didn't implement the PANIC-on-corrupted-data option. Instead, I guess
> >   sending uevent notification to user space will be a more flexible scheme?
> 
> Normally you can get very aggressive panics by setting the x86 mce tolerant 
> modus to 0 (default is 1); i suspect that will be good enough.

OK.

> If other architectures add hwpoison support presumably they can add
> a similar tunable.
> 
> Doing that in the low level handler is better than in the high level
> VM because there are some corruption cases which are not reported
> to high level (e.g. not affecting memory directly)

I have worked out some basic code to do the uevent notification :)

It looks like this (the hwpoison_control bits shall be separated out).
Basically it sends the following vars:

+       add_uevent_var(hpc->env, "EVENT=poison");
+       add_uevent_var(hpc->env, "PFN=%#lx", hpc->pfn);
+       add_uevent_var(hpc->env, "PAGE_FLAGS=%#Lx", page_uflags(p));
+       add_uevent_var(hpc->env, "PAGE_COUNT=%d", page_count(p));
+       add_uevent_var(hpc->env, "PAGE_MAPCOUNT=%d", page_mapcount(p));
+               add_uevent_var(hpc->env, "PAGE_DEV=%02x:%02x",
+               add_uevent_var(hpc->env, "PAGE_INODE=%lu", mapping->host->i_ino);
+               add_uevent_var(hpc->env, "PAGE_INDEX=%lu", hpc->page->index);
+       add_uevent_var(hpc->env, "RESULT=%s", action_name[hpc->result]);

Thanks,
Fengguang

---
 mm/memory-failure.c |  137 +++++++++++++++++++++++++++++++++++++-----
 1 file changed, 121 insertions(+), 16 deletions(-)

--- sound-2.6.orig/mm/memory-failure.c
+++ sound-2.6/mm/memory-failure.c
@@ -68,6 +68,54 @@ static const char *action_name[] = {
 	[RECOVERED] = "Recovered",
 };
 
+static struct kset *hwpoison_kset;
+static struct kobject hwpoison_kobj;
+
+struct hwpoison_uevent_env {
+	char *vars[16];
+	char content[200];
+	int count;
+	int index;
+};
+
+struct hwpoison_control {
+	struct kobj_uevent_env *env;
+	unsigned long pfn;
+	struct page *page;
+	int result;
+};
+
+static void hwpoison_uevent_page(struct hwpoison_control *hpc)
+{
+	struct page *p = compound_head(hpc->page);
+
+	add_uevent_var(hpc->env, "EVENT=poison");
+	add_uevent_var(hpc->env, "PFN=%#lx", hpc->pfn);
+
+	add_uevent_var(hpc->env, "PAGE_FLAGS=%#Lx", page_uflags(p));
+	add_uevent_var(hpc->env, "PAGE_COUNT=%d", page_count(p));
+	add_uevent_var(hpc->env, "PAGE_MAPCOUNT=%d", page_mapcount(p));
+}
+
+static void hwpoison_uevent_file(struct hwpoison_control *hpc)
+{
+	struct address_space *mapping = page_mapping(hpc->page);
+
+	if (mapping && mapping->host) {
+		add_uevent_var(hpc->env, "PAGE_DEV=%02x:%02x",
+			       MAJOR(mapping->host->i_sb->s_dev),
+			       MINOR(mapping->host->i_sb->s_dev));
+		add_uevent_var(hpc->env, "PAGE_INODE=%lu", mapping->host->i_ino);
+		add_uevent_var(hpc->env, "PAGE_INDEX=%lu", hpc->page->index);
+	}
+}
+
+static void hwpoison_uevent_send(struct hwpoison_control *hpc)
+{
+	add_uevent_var(hpc->env, "RESULT=%s", action_name[hpc->result]);
+	kobject_uevent_env(&hwpoison_kobj, KOBJ_CHANGE, hpc->env->envp);
+}
+
 /*
  * Error hit kernel page.
  * Do nothing, try to be lucky and not touch this instead. For a few cases we
@@ -323,24 +371,24 @@ static struct page_state {
 	{ 0,		0,		"unknown page state", me_unknown },
 };
 
-static void action_result(unsigned long pfn, char *msg, int result)
+static void action_result(struct hwpoison_control *hpc, char *msg, int result)
 {
+	hpc->result = result;
 	printk(KERN_ERR "MCE %#lx: %s%s page recovery: %s\n",
-		pfn, PageDirty(pfn_to_page(pfn)) ? "dirty " : "",
+		hpc->pfn, PageDirty(hpc->page) ? "dirty " : "",
 		msg, action_name[result]);
 }
 
-static void page_action(struct page_state *ps, struct page *p,
-			unsigned long pfn)
+static void page_action(struct page_state *ps, struct hwpoison_control *hpc)
 {
-	int result;
+	struct page *p = hpc->page;
 
-	result = ps->action(p, pfn);
-	action_result(pfn, ps->msg, result);
+	hpc->result = ps->action(p, hpc->pfn);
+	action_result(hpc, ps->msg, hpc->result);
 	if (page_count(p) != 1)
 		printk(KERN_ERR
 		       "MCE %#lx: %s page still referenced by %d users\n",
-		       pfn, ps->msg, page_count(p) - 1);
+		       hpc->pfn, ps->msg, page_count(p) - 1);
 
 	/* Could do more checks here if page looks ok */
 	atomic_long_add(1, &mce_bad_pages);
@@ -437,6 +485,13 @@ void memory_failure(unsigned long pfn, i
 {
 	struct page_state *ps;
 	struct page *p;
+	struct hwpoison_control hpc;
+
+	hpc.env = kzalloc(sizeof(struct kobj_uevent_env), GFP_NOIO);
+	if (!hpc.env) {
+		printk(KERN_ERR "MCE %#lx: out of memory: Failed\n", pfn);
+		return;
+	}
 
 	if (!pfn_valid(pfn)) {
 		printk(KERN_ERR
@@ -446,9 +501,13 @@ void memory_failure(unsigned long pfn, i
 	}
 
 	p = pfn_to_page(pfn);
+	hpc.pfn = pfn;
+	hpc.page = p;
+	hwpoison_uevent_page(&hpc);
+
 	if (TestSetPageHWPoison(p)) {
-		action_result(pfn, "already hardware poisoned", IGNORED);
-		return;
+		action_result(&hpc, "already hardware poisoned", IGNORED);
+		goto out;
 	}
 
 	/*
@@ -463,8 +522,8 @@ void memory_failure(unsigned long pfn, i
 	 * that may make page_freeze_refs()/page_unfreeze_refs() mismatch.
 	 */
 	if (!get_page_unless_zero(compound_head(p))) {
-		action_result(pfn, "free or high order kernel", IGNORED);
-		return;
+		action_result(&hpc, "free or high order kernel", IGNORED);
+		goto out;
 	}
 
 	/*
@@ -484,16 +543,62 @@ void memory_failure(unsigned long pfn, i
 	 * Torn down by someone else?
 	 */
 	if (PageLRU(p) && !PageSwapCache(p) && p->mapping == NULL) {
-		action_result(pfn, "already truncated LRU", IGNORED);
-		goto out;
+		action_result(&hpc, "already truncated LRU", IGNORED);
+		goto out_unlock;
 	}
 
+	hwpoison_uevent_file(&hpc);
+
 	for (ps = error_states;; ps++) {
 		if ((p->flags & ps->mask) == ps->res) {
-			page_action(ps, p, pfn);
+			page_action(ps, &hpc);
 			break;
 		}
 	}
-out:
+out_unlock:
 	unlock_page(p);
+out:
+	hwpoison_uevent_send(&hpc);
+}
+
+static void hwpoison_release(struct kobject *kobj)
+{
+}
+
+static struct kobj_type hwpoison_ktype = {
+	.release = hwpoison_release,
+};
+
+static int create_hwpoison_obj(void)
+{
+	int err;
+
+	hwpoison_kset = kset_create_and_add("hwpoison", NULL, mm_kobj);
+	if (!hwpoison_kset)
+		return -ENOMEM;
+
+	hwpoison_kobj.kset = hwpoison_kset;
+
+	err = kobject_init_and_add(&hwpoison_kobj, &hwpoison_ktype, NULL,
+				   "hwpoison");
+	if (err)
+		return -ENOMEM;
+
+	kobject_uevent(&hwpoison_kobj, KOBJ_ADD);
+
+	return 0;
 }
+
+
+static int __init hwpoison_init(void)
+{
+	return create_hwpoison_obj();
+}
+
+static void __exit hwpoison_exit(void)
+{
+	kset_unregister(hwpoison_kset);
+}
+
+module_init(hwpoison_init);
+module_exit(hwpoison_exit);

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 2/5] HWPOISON: fix tasklist_lock/anon_vma locking order
  2009-06-12 13:27     ` Wu Fengguang
@ 2009-06-12 14:04       ` Wu Fengguang
  0 siblings, 0 replies; 42+ messages in thread
From: Wu Fengguang @ 2009-06-12 14:04 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Andrew Morton, LKML, Nick Piggin, Hugh Dickins, riel,
	chris.mason, linux-mm

On Fri, Jun 12, 2009 at 09:27:14PM +0800, Wu Fengguang wrote:
> On Fri, Jun 12, 2009 at 06:03:08PM +0800, Andi Kleen wrote:
> > On Thu, Jun 11, 2009 at 10:22:41PM +0800, Wu Fengguang wrote:
> > > To avoid possible deadlock. Proposed by Nick Piggin:
> > 
> > I disagree with the description. There's no possible deadlock right now.
> > It would be purely out of paranoia.
> > 
> > > 
> > >   You have tasklist_lock(R) nesting outside i_mmap_lock, and inside anon_vma
> > >   lock. And anon_vma lock nests inside i_mmap_lock.
> > > 
> > >   This seems fragile. If rwlocks ever become FIFO or tasklist_lock changes
> > 
> > I was a bit dubious on this reasoning. If rwlocks become FIFO a lot of
> > stuff will likely break.
> > 
> > >   type (maybe -rt kernels do it), then you could have a task holding
> > 
> > I think they tried but backed off quickly again
> > 
> > It's ok with a less scare-mongering description.
> 
> Why not merge it into the original patch and add a simple changelog
> line there? I tried the last 6.5 patchset and it didn't apply cleanly
> to the latest -mm tree. And this patch was updated:

Andy, please apply this patch too. It fixed a kernel panic: when
invalid PFN is feed to the debug injection interface, action_result()
will try to do a pfn_to_page() on it..

Thanks,
Fengguang

---
 mm/memory-failure.c |    4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

--- sound-2.6.orig/mm/memory-failure.c
+++ sound-2.6/mm/memory-failure.c
@@ -439,7 +439,9 @@ void memory_failure(unsigned long pfn, i
 	struct page *p;
 
 	if (!pfn_valid(pfn)) {
-		action_result(pfn, "memory outside kernel control", IGNORED);
+		printk(KERN_ERR
+		       "MCE %#lx: memory outside kernel control: Ignored\n",
+		       pfn);
 		return;
 	}
 

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 1/5] HWPOISON: define VM_FAULT_HWPOISON to 0 when feature is disabled
  2009-06-12 13:17       ` Ingo Molnar
  2009-06-12 13:33         ` Wu Fengguang
  2009-06-12 13:58         ` Andi Kleen
@ 2009-06-12 15:28         ` Linus Torvalds
  2009-06-12 15:35           ` Ingo Molnar
  2009-06-12 15:45         ` Ingo Molnar
  3 siblings, 1 reply; 42+ messages in thread
From: Linus Torvalds @ 2009-06-12 15:28 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Wu Fengguang, Thomas Gleixner, H. Peter Anvin, Peter Zijlstra,
	Andrew Morton, LKML, Nick Piggin, Hugh Dickins, Andi Kleen, riel,
	chris.mason, linux-mm



On Fri, 12 Jun 2009, Ingo Molnar wrote:
> 
> This seems like trying to handle a failure mode that cannot be and 
> shouldnt be 'handled' really. If there's an 'already corrupted' page 
> then the box should go down hard and fast, and we should not risk 
> _even more user data corruption_ by trying to 'continue' in the hope 
> of having hit some 'harmless' user process that can be killed ...

No, the box should _not_ go down hard-and-fast. That's the last thing we 
should *ever* do.

We need to log it. Often at a user level (ie we want to make sure it 
actually hits syslog, possibly goes out the network, maybe pops up a 
window, whatever).

Shutting down the machine is the last thing we ever want to do. 

The whole "let's panic" mentality is a disease.

		Linus

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 1/5] HWPOISON: define VM_FAULT_HWPOISON to 0 when feature is disabled
  2009-06-12 15:28         ` Linus Torvalds
@ 2009-06-12 15:35           ` Ingo Molnar
  2009-06-12 16:05             ` Rik van Riel
                               ` (2 more replies)
  0 siblings, 3 replies; 42+ messages in thread
From: Ingo Molnar @ 2009-06-12 15:35 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Wu Fengguang, Thomas Gleixner, H. Peter Anvin, Peter Zijlstra,
	Andrew Morton, LKML, Nick Piggin, Hugh Dickins, Andi Kleen, riel,
	chris.mason, linux-mm

* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Fri, 12 Jun 2009, Ingo Molnar wrote:
> > 
> > This seems like trying to handle a failure mode that cannot be 
> > and shouldnt be 'handled' really. If there's an 'already 
> > corrupted' page then the box should go down hard and fast, and 
> > we should not risk _even more user data corruption_ by trying to 
> > 'continue' in the hope of having hit some 'harmless' user 
> > process that can be killed ...
> 
> No, the box should _not_ go down hard-and-fast. That's the last 
> thing we should *ever* do.
> 
> We need to log it. Often at a user level (ie we want to make sure 
> it actually hits syslog, possibly goes out the network, maybe pops 
> up a window, whatever).
> 
> Shutting down the machine is the last thing we ever want to do.
> 
> The whole "let's panic" mentality is a disease.

No doubt about that - and i'm removing BUG_ON()s and panic()s 
wherever i can and havent added a single new one myself in the past 
5 years or so, its a disease.

If a fault hits a harmless piece of the system, then the log message 
will make it out and people know what happened. hwpoison does not 
affect that at all. If the fault hits the critical path towards 
gettig the log message out - then we wont get a log message, 
hwpoison or not.

My point is that hwpoison allows the _ignoring_ of hardware problems 
and thus pushes more buggy hardware up the pipeline.

Clusters will be running with this under the (false IMO) assumption 
that the kernel will tell the admin when something bad happened and 
the machine can limp along otherwise.

So i think hwpoison simply does not affect our ability to get log 
messages out - but it sure allows crappier hardware to be used.
Am i wrong about that for some reason?

	Ingo

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 1/5] HWPOISON: define VM_FAULT_HWPOISON to 0 when feature is disabled
  2009-06-12 13:33         ` Wu Fengguang
@ 2009-06-12 15:36           ` Ingo Molnar
  2009-06-12 16:14             ` Wu Fengguang
  2009-06-12 17:55             ` Theodore Tso
  0 siblings, 2 replies; 42+ messages in thread
From: Ingo Molnar @ 2009-06-12 15:36 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Thomas Gleixner, H. Peter Anvin, Peter Zijlstra, Andrew Morton,
	LKML, Nick Piggin, Hugh Dickins, Andi Kleen, riel, chris.mason,
	linux-mm, Linus Torvalds

* Wu Fengguang <fengguang.wu@intel.com> wrote:

> On Fri, Jun 12, 2009 at 09:17:54PM +0800, Ingo Molnar wrote:
> > 
> > * Wu Fengguang <fengguang.wu@intel.com> wrote:
> > 
> > > Hi Ingo,
> > > 
> > > On Fri, Jun 12, 2009 at 07:22:58PM +0800, Ingo Molnar wrote:
> > > > 
> > > > * Wu Fengguang <fengguang.wu@intel.com> wrote:
> > > > 
> > > > > So as to eliminate one #ifdef in the c source.
> > > > > 
> > > > > Proposed by Nick Piggin.
> > > > > 
> > > > > CC: Nick Piggin <npiggin@suse.de>
> > > > > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> > > > > ---
> > > > >  arch/x86/mm/fault.c |    3 +--
> > > > >  include/linux/mm.h  |    7 ++++++-
> > > > >  2 files changed, 7 insertions(+), 3 deletions(-)
> > > > > 
> > > > > --- sound-2.6.orig/arch/x86/mm/fault.c
> > > > > +++ sound-2.6/arch/x86/mm/fault.c
> > > > > @@ -819,14 +819,13 @@ do_sigbus(struct pt_regs *regs, unsigned
> > > > >  	tsk->thread.error_code	= error_code;
> > > > >  	tsk->thread.trap_no	= 14;
> > > > >  
> > > > > -#ifdef CONFIG_MEMORY_FAILURE
> > > > >  	if (fault & VM_FAULT_HWPOISON) {
> > > > >  		printk(KERN_ERR
> > > > >  	"MCE: Killing %s:%d due to hardware memory corruption fault at %lx\n",
> > > > >  			tsk->comm, tsk->pid, address);
> > > > >  		code = BUS_MCEERR_AR;
> > > > >  	}
> > > > > -#endif
> > > > 
> > > > Btw., anything like this should happen in close cooperation with 
> > > > the x86 tree, not as some pure MM feature. I dont see Cc:s and 
> > > > nothing that indicates that realization. What's going on here?
> > > 
> > > Ah sorry for the ignorance!  Andi has a nice overview of the big 
> > > picture here: http://lkml.org/lkml/2009/6/3/371
> > > 
> > > In the above chunk, the process is trying to access the already 
> > > corrupted page and thus shall be killed, otherwise it will either 
> > > silently consume corrupted data, or will trigger another (deadly) 
> > > MCE event and bring down the whole machine.
> > 
> > This seems like trying to handle a failure mode that cannot be and 
> > shouldnt be 'handled' really. If there's an 'already corrupted' page 
> > then the box should go down hard and fast, and we should not risk 
> > _even more user data corruption_ by trying to 'continue' in the hope 
> > of having hit some 'harmless' user process that can be killed ...
> > 
> > So i find the whole feature rather dubious - what's the point? We 
> > should panic at this point - we just corrupted user data so that 
> > piece of hardware cannot be trusted. Nor can any subsequent kernel 
> > bug messages be trusted.
> > 
> > Do we really want this in the core Linux VM and in the architecture 
> > pagefault handling code and elsewhere? Am i the only one who finds 
> > this concept of 'handling' user data corruption rather dubious?
> 
> - The corrupted data only impacts one or more process(es)
> - The corrupted data has not be consumed yet
> 
> The data corruption has not caused real hurt yet, and can be 
> isolated to prevent future accesses.  So it makes sense to just 
> kill the impacted process(es).

Dunno, this just looks like a license to allow more crappy hardware, 
hm? I'm all for _logging_ errors, but hwpoison is not about that: it 
is about allowing the hardware to limp along in 'enterprise' setups, 
with a (false looking) 'guarantee' that everything is fine.

There's no guarantee that the fault doesnt hit something critical - 
and by allowing 'harmless' faults we push up the noise level.

Any move from us to make faulty hardware more acceptable by 
"handling" it in a percentage of cases (and crashing/corrupting in 
other cases) is futile IMHO - it just sends the wrong general 
message.

I.e. i think this thinking misses the general harm on for example 
the quality of kernel bugreports: if such a system corrupts memory, 
and crashes in a weird way - we'll get a weird kernel-crash report. 
If it 'only' corrupts some user process in a 'harmless' way, we wont 
get a crash report. Say the kernel crashes in 10% of the cases, 
user-space crashes in 90% of the cases.

If we allow that 90% to continue, we make the 10% "bad" crash 
proportion more prominent in our stats too. I.e. by allowing 
'harmless' bugs to be more acceptable in practice, we indirectly 
increase the proportion of _bad_ crashes as well.

Do you accept that general point or am i wrong?

Computing along the von Neumann principles really depends on having 
a sufficiently well working piece of hardware that one can trust 
with a reasonable certainty. Probabilistic computing is fine too in 
certain isolated fields where you say want some probabilistic result 
to begin with (say the result of some property of the physical 
world) - but in general purpose hardware i doubt it's the right kind 
of approach ...

	Ingo

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 1/5] HWPOISON: define VM_FAULT_HWPOISON to 0 when feature is disabled
  2009-06-12 13:17       ` Ingo Molnar
                           ` (2 preceding siblings ...)
  2009-06-12 15:28         ` Linus Torvalds
@ 2009-06-12 15:45         ` Ingo Molnar
  2009-06-12 16:12           ` Linus Torvalds
  3 siblings, 1 reply; 42+ messages in thread
From: Ingo Molnar @ 2009-06-12 15:45 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Thomas Gleixner, H. Peter Anvin, Peter Zijlstra, Andrew Morton,
	LKML, Nick Piggin, Hugh Dickins, Andi Kleen, riel, chris.mason,
	linux-mm, Linus Torvalds


* Ingo Molnar <mingo@elte.hu> wrote:

> So i find the whole feature rather dubious - what's the point? We 
> should panic at this point - we just corrupted user data so that 
> piece of hardware cannot be trusted. Nor can any subsequent kernel 
> bug messages be trusted.

s/panic/print nasty message

Not sure what i was smoking there. Sysadmin will panic from a memory 
corruption message anyway.

What feels wrong to me is: kill a process and - in the case where 
that process is restartable or expendable - suggest to the admin 
that "everything is fine, we handled it just fine".

	Ingo

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 1/5] HWPOISON: define VM_FAULT_HWPOISON to 0 when feature is disabled
  2009-06-12 15:35           ` Ingo Molnar
@ 2009-06-12 16:05             ` Rik van Riel
  2009-06-12 16:37             ` H. Peter Anvin
  2009-06-15  6:52             ` Nick Piggin
  2 siblings, 0 replies; 42+ messages in thread
From: Rik van Riel @ 2009-06-12 16:05 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Wu Fengguang, Thomas Gleixner, H. Peter Anvin,
	Peter Zijlstra, Andrew Morton, LKML, Nick Piggin, Hugh Dickins,
	Andi Kleen, chris.mason, linux-mm

Ingo Molnar wrote:

> So i think hwpoison simply does not affect our ability to get log 
> messages out - but it sure allows crappier hardware to be used.
> Am i wrong about that for some reason?

You are :)

A 2-bit memory error can be a temporary failure, eg.
due to a cosmic ray.  If bit errors could be prevented
in hardware, there would be no reason to have ECC at all.

The only reason to stop using that page is because we
do not know for sure whether the error was temporary
or permanent (or dependent on a particular bit pattern).

Userspace needs to be notified that some data disappeared,
if it did - for clean pagecache and swap cache pages, the
kernel can simply take the page away and wait for a page
fault...

The sysadmin needs to know that something happened too,
because the hardware *might* have a problem.

However, a 2-bit error does not imply that the hardware
actually needs to be replaced.

-- 
All rights reversed.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 1/5] HWPOISON: define VM_FAULT_HWPOISON to 0 when feature is disabled
  2009-06-12 15:45         ` Ingo Molnar
@ 2009-06-12 16:12           ` Linus Torvalds
  0 siblings, 0 replies; 42+ messages in thread
From: Linus Torvalds @ 2009-06-12 16:12 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Wu Fengguang, Thomas Gleixner, H. Peter Anvin, Peter Zijlstra,
	Andrew Morton, LKML, Nick Piggin, Hugh Dickins, Andi Kleen, riel,
	chris.mason, linux-mm



On Fri, 12 Jun 2009, Ingo Molnar wrote:
> 
> What feels wrong to me is: kill a process and - in the case where 
> that process is restartable or expendable - suggest to the admin 
> that "everything is fine, we handled it just fine".

That is absolutely _not_ how any real-life sysadmin would ever react.

				Linus

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 1/5] HWPOISON: define VM_FAULT_HWPOISON to 0 when feature is disabled
  2009-06-12 15:36           ` Ingo Molnar
@ 2009-06-12 16:14             ` Wu Fengguang
  2009-06-12 18:07               ` Alan Cox
  2009-06-12 17:55             ` Theodore Tso
  1 sibling, 1 reply; 42+ messages in thread
From: Wu Fengguang @ 2009-06-12 16:14 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Thomas Gleixner, H. Peter Anvin, Peter Zijlstra, Andrew Morton,
	LKML, Nick Piggin, Hugh Dickins, Andi Kleen, riel, chris.mason,
	linux-mm, Linus Torvalds

On Fri, Jun 12, 2009 at 11:36:20PM +0800, Ingo Molnar wrote:
> 
> * Wu Fengguang <fengguang.wu@intel.com> wrote:
> 
> > On Fri, Jun 12, 2009 at 09:17:54PM +0800, Ingo Molnar wrote:
> > > 
> > > * Wu Fengguang <fengguang.wu@intel.com> wrote:
> > > 
> > > > Hi Ingo,
> > > > 
> > > > On Fri, Jun 12, 2009 at 07:22:58PM +0800, Ingo Molnar wrote:
> > > > > 
> > > > > * Wu Fengguang <fengguang.wu@intel.com> wrote:
> > > > > 
> > > > > > So as to eliminate one #ifdef in the c source.
> > > > > > 
> > > > > > Proposed by Nick Piggin.
> > > > > > 
> > > > > > CC: Nick Piggin <npiggin@suse.de>
> > > > > > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> > > > > > ---
> > > > > >  arch/x86/mm/fault.c |    3 +--
> > > > > >  include/linux/mm.h  |    7 ++++++-
> > > > > >  2 files changed, 7 insertions(+), 3 deletions(-)
> > > > > > 
> > > > > > --- sound-2.6.orig/arch/x86/mm/fault.c
> > > > > > +++ sound-2.6/arch/x86/mm/fault.c
> > > > > > @@ -819,14 +819,13 @@ do_sigbus(struct pt_regs *regs, unsigned
> > > > > >  	tsk->thread.error_code	= error_code;
> > > > > >  	tsk->thread.trap_no	= 14;
> > > > > >  
> > > > > > -#ifdef CONFIG_MEMORY_FAILURE
> > > > > >  	if (fault & VM_FAULT_HWPOISON) {
> > > > > >  		printk(KERN_ERR
> > > > > >  	"MCE: Killing %s:%d due to hardware memory corruption fault at %lx\n",
> > > > > >  			tsk->comm, tsk->pid, address);
> > > > > >  		code = BUS_MCEERR_AR;
> > > > > >  	}
> > > > > > -#endif
> > > > > 
> > > > > Btw., anything like this should happen in close cooperation with 
> > > > > the x86 tree, not as some pure MM feature. I dont see Cc:s and 
> > > > > nothing that indicates that realization. What's going on here?
> > > > 
> > > > Ah sorry for the ignorance!  Andi has a nice overview of the big 
> > > > picture here: http://lkml.org/lkml/2009/6/3/371
> > > > 
> > > > In the above chunk, the process is trying to access the already 
> > > > corrupted page and thus shall be killed, otherwise it will either 
> > > > silently consume corrupted data, or will trigger another (deadly) 
> > > > MCE event and bring down the whole machine.
> > > 
> > > This seems like trying to handle a failure mode that cannot be and 
> > > shouldnt be 'handled' really. If there's an 'already corrupted' page 
> > > then the box should go down hard and fast, and we should not risk 
> > > _even more user data corruption_ by trying to 'continue' in the hope 
> > > of having hit some 'harmless' user process that can be killed ...
> > > 
> > > So i find the whole feature rather dubious - what's the point? We 
> > > should panic at this point - we just corrupted user data so that 
> > > piece of hardware cannot be trusted. Nor can any subsequent kernel 
> > > bug messages be trusted.
> > > 
> > > Do we really want this in the core Linux VM and in the architecture 
> > > pagefault handling code and elsewhere? Am i the only one who finds 
> > > this concept of 'handling' user data corruption rather dubious?
> > 
> > - The corrupted data only impacts one or more process(es)
> > - The corrupted data has not be consumed yet
> > 
> > The data corruption has not caused real hurt yet, and can be 
> > isolated to prevent future accesses.  So it makes sense to just 
> > kill the impacted process(es).
> 
> Dunno, this just looks like a license to allow more crappy hardware, 
> hm? I'm all for _logging_ errors, but hwpoison is not about that: it 
> is about allowing the hardware to limp along in 'enterprise' setups, 
> with a (false looking) 'guarantee' that everything is fine.
> 
> There's no guarantee that the fault doesnt hit something critical - 
> and by allowing 'harmless' faults we push up the noise level.
> 
> Any move from us to make faulty hardware more acceptable by 
> "handling" it in a percentage of cases (and crashing/corrupting in 
> other cases) is futile IMHO - it just sends the wrong general 
> message.
> 
> I.e. i think this thinking misses the general harm on for example 
> the quality of kernel bugreports: if such a system corrupts memory, 
> and crashes in a weird way - we'll get a weird kernel-crash report. 
> If it 'only' corrupts some user process in a 'harmless' way, we wont 
> get a crash report. Say the kernel crashes in 10% of the cases, 
> user-space crashes in 90% of the cases.
> 
> If we allow that 90% to continue, we make the 10% "bad" crash 
> proportion more prominent in our stats too. I.e. by allowing 
> 'harmless' bugs to be more acceptable in practice, we indirectly 
> increase the proportion of _bad_ crashes as well.
> 
> Do you accept that general point or am i wrong?
> 
> Computing along the von Neumann principles really depends on having 
> a sufficiently well working piece of hardware that one can trust 
> with a reasonable certainty. Probabilistic computing is fine too in 
> certain isolated fields where you say want some probabilistic result 
> to begin with (say the result of some property of the physical 
> world) - but in general purpose hardware i doubt it's the right kind 
> of approach ...

NAND flash is crappy - it is continuously rotting - it's wrong to
encourage its usage by inventing wear leveling and checksum algorithms
and to make SSD on top of them.

wireless network is crappy - it so much more unreliable than fibre networks.

PC servers are crappy - google invented the google file system? Damn it!


HWPOISON is a reliability enabling feature - if it enables prevalent
of crappy hardwares, let's celebrate changing the world~~

Thanks,
Fengguang


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 1/5] HWPOISON: define VM_FAULT_HWPOISON to 0 when feature is disabled
  2009-06-12 15:35           ` Ingo Molnar
  2009-06-12 16:05             ` Rik van Riel
@ 2009-06-12 16:37             ` H. Peter Anvin
  2009-06-12 16:48               ` Ingo Molnar
  2009-06-15  7:04               ` Nick Piggin
  2009-06-15  6:52             ` Nick Piggin
  2 siblings, 2 replies; 42+ messages in thread
From: H. Peter Anvin @ 2009-06-12 16:37 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Wu Fengguang, Thomas Gleixner, Peter Zijlstra,
	Andrew Morton, LKML, Nick Piggin, Hugh Dickins, Andi Kleen, riel,
	chris.mason, linux-mm

Ingo Molnar wrote:
> 
> So i think hwpoison simply does not affect our ability to get log 
> messages out - but it sure allows crappier hardware to be used.
> Am i wrong about that for some reason?
> 

Crappy hardware isn't the kind of hardware that is likely to have the
hwpoison features, just like crappy hardware generally doesn't even have
ECC -- or even basic parity checking (I personally think non-ECC memory
should be considered a crime against humanity in this day and age.)

You're making the fundamental assumption that failover and hardware
replacement is a relatively cheap and fast operation.  In high
reliability applications, of course, failover is always an option -- it
*HAS* to be an option -- but that doesn't mean that hardware replacement
is cheap, fast or even possible -- and now you've blown your failover
option.

These kinds of features are used when extremely high reliability is
required, think for example a telco core router.  A page error may have
happened due to stray radiation or through power supply glitches (which
happen even in the best of systems), but if they are a pattern, a box
needs to be replaced.  *How quickly* a box can be taken out of service
and replaced can vary greatly, and its urgency depend on patterns;
furthermore, in the meantime the device has to work the best it can.

Consider, for example, a control computer on the Hubble Space Telescope
-- the only way to replace it is by space shuttle, and you can safely
guarantee that *that* won't happen in a heartbeat.  On the new Herschel
Space Observatory, not even the space shuttle can help: if the computers
die, *or* if bad data gets fed to its control system, the spacecraft is
lost.  As such, it's of paramount importance for the computers to (a)
continue to provide service at the level the hardware is capable of
doing, (b) as accurately as possible continually assess and report that
level of service, and (c) not allow a failure to pass undetected.  A lot
of failures are simple one-time events (especially in space, a high-rad
environment), others reflect decaying hardware but can be isolated (e.g.
a RAM cell which has developed a short circuit, or a CPU core which has
a damaged ALU), while others yet reflect a general ill health of the
system that cannot be recovered.

What these kinds of features do is it gives the overall-system designers
and the administrators more options.

(Note: this is an hpa position statement, not necessarily an Intel one.)

	-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 1/5] HWPOISON: define VM_FAULT_HWPOISON to 0 when feature is disabled
  2009-06-12 16:37             ` H. Peter Anvin
@ 2009-06-12 16:48               ` Ingo Molnar
  2009-06-15  7:04               ` Nick Piggin
  1 sibling, 0 replies; 42+ messages in thread
From: Ingo Molnar @ 2009-06-12 16:48 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Linus Torvalds, Wu Fengguang, Thomas Gleixner, Peter Zijlstra,
	Andrew Morton, LKML, Nick Piggin, Hugh Dickins, Andi Kleen, riel,
	chris.mason, linux-mm


* H. Peter Anvin <hpa@zytor.com> wrote:

> Ingo Molnar wrote:
> > 
> > So i think hwpoison simply does not affect our ability to get 
> > log messages out - but it sure allows crappier hardware to be 
> > used. Am i wrong about that for some reason?
> 
> Crappy hardware isn't the kind of hardware that is likely to have 
> the hwpoison features, just like crappy hardware generally doesn't 
> even have ECC -- or even basic parity checking (I personally think 
> non-ECC memory should be considered a crime against humanity in 
> this day and age.)
> 
> You're making the fundamental assumption that failover and 
> hardware replacement is a relatively cheap and fast operation.  In 
> high reliability applications, of course, failover is always an 
> option -- it *HAS* to be an option -- but that doesn't mean that 
> hardware replacement is cheap, fast or even possible -- and now 
> you've blown your failover option.
> 
> These kinds of features are used when extremely high reliability 
> is required, think for example a telco core router.  A page error 
> may have happened due to stray radiation or through power supply 
> glitches (which happen even in the best of systems), but if they 
> are a pattern, a box needs to be replaced.  *How quickly* a box 
> can be taken out of service and replaced can vary greatly, and its 
> urgency depend on patterns; furthermore, in the meantime the 
> device has to work the best it can.
> 
> Consider, for example, a control computer on the Hubble Space 
> Telescope -- the only way to replace it is by space shuttle, and 
> you can safely guarantee that *that* won't happen in a heartbeat.  
> On the new Herschel Space Observatory, not even the space shuttle 
> can help: if the computers die, *or* if bad data gets fed to its 
> control system, the spacecraft is lost.  As such, it's of 
> paramount importance for the computers to (a) continue to provide 
> service at the level the hardware is capable of doing, (b) as 
> accurately as possible continually assess and report that level of 
> service, and (c) not allow a failure to pass undetected.  A lot of 
> failures are simple one-time events (especially in space, a 
> high-rad environment), others reflect decaying hardware but can be 
> isolated (e.g. a RAM cell which has developed a short circuit, or 
> a CPU core which has a damaged ALU), while others yet reflect a 
> general ill health of the system that cannot be recovered.
> 
> What these kinds of features do is it gives the overall-system 
> designers and the administrators more options.

Ok, these arguments are pretty convincing - thanks everyone for the
detailed explanation.

	Ingo

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 1/5] HWPOISON: define VM_FAULT_HWPOISON to 0 when feature is disabled
  2009-06-12 15:36           ` Ingo Molnar
  2009-06-12 16:14             ` Wu Fengguang
@ 2009-06-12 17:55             ` Theodore Tso
  1 sibling, 0 replies; 42+ messages in thread
From: Theodore Tso @ 2009-06-12 17:55 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Wu Fengguang, Thomas Gleixner, H. Peter Anvin, Peter Zijlstra,
	Andrew Morton, LKML, Nick Piggin, Hugh Dickins, Andi Kleen, riel,
	chris.mason, linux-mm, Linus Torvalds

On Fri, Jun 12, 2009 at 05:36:20PM +0200, Ingo Molnar wrote:
> > The data corruption has not caused real hurt yet, and can be 
> > isolated to prevent future accesses.  So it makes sense to just 
> > kill the impacted process(es).
> 
> Dunno, this just looks like a license to allow more crappy hardware, 
> hm? I'm all for _logging_ errors, but hwpoison is not about that: it 
> is about allowing the hardware to limp along in 'enterprise' setups, 
> with a (false looking) 'guarantee' that everything is fine.

This should be tunable; in some cases, logging it is the right thing
to do; I imagine that in the case of the desktop OS, the user would
appreciate being given *some* chance to save the document he or she
has spent the past hour working on before the system goes down "hard
and fast".

In other cases, the sysadmin is using a high-availability setup in an
enterprise deployment, and there he or she would want the system to
immediately shutdown so the hot standby can take over.

	    	     	    		    	 - Ted

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 1/5] HWPOISON: define VM_FAULT_HWPOISON to 0 when feature is disabled
  2009-06-12 16:14             ` Wu Fengguang
@ 2009-06-12 18:07               ` Alan Cox
  0 siblings, 0 replies; 42+ messages in thread
From: Alan Cox @ 2009-06-12 18:07 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Ingo Molnar, Thomas Gleixner, H. Peter Anvin, Peter Zijlstra,
	Andrew Morton, LKML, Nick Piggin, Hugh Dickins, Andi Kleen, riel,
	chris.mason, linux-mm, Linus Torvalds

> HWPOISON is a reliability enabling feature - if it enables prevalent
> of crappy hardwares, let's celebrate changing the world~~

HWPOISON is not in the most part a reliability enabling feature. Nothing
of the sort.

The existing behaviour is that your system goes kerblam on such a serious
error. The replacement behaviour is that bits of your machine go kerblam
in unpredictable ways.

In both cases you actually improve your reliability with clustering and
failover.

There are a few special cases its potentially useful - lots of VMs being
one where you have some chance of a controlled failure of a bounded
system. But even in that case I know if I was admin my scripts would read

   if (hwpoison_error)
        migrate_all_guests()
        mail admin
        schedule replacement of the machine

I'm not actually sure teaching hwpoison to handle anything but losing
entire guest OS systems is useful.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 1/5] HWPOISON: define VM_FAULT_HWPOISON to 0 when feature is disabled
  2009-06-12 15:35           ` Ingo Molnar
  2009-06-12 16:05             ` Rik van Riel
  2009-06-12 16:37             ` H. Peter Anvin
@ 2009-06-15  6:52             ` Nick Piggin
  2009-06-16 20:27               ` Russ Anderson
  2 siblings, 1 reply; 42+ messages in thread
From: Nick Piggin @ 2009-06-15  6:52 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Wu Fengguang, Thomas Gleixner, H. Peter Anvin,
	Peter Zijlstra, Andrew Morton, LKML, Hugh Dickins, Andi Kleen,
	riel, chris.mason, linux-mm

On Fri, Jun 12, 2009 at 05:35:01PM +0200, Ingo Molnar wrote:
> 
> * Linus Torvalds <torvalds@linux-foundation.org> wrote:
> 
> > On Fri, 12 Jun 2009, Ingo Molnar wrote:
> > > 
> > > This seems like trying to handle a failure mode that cannot be 
> > > and shouldnt be 'handled' really. If there's an 'already 
> > > corrupted' page then the box should go down hard and fast, and 
> > > we should not risk _even more user data corruption_ by trying to 
> > > 'continue' in the hope of having hit some 'harmless' user 
> > > process that can be killed ...
> > 
> > No, the box should _not_ go down hard-and-fast. That's the last 
> > thing we should *ever* do.
> > 
> > We need to log it. Often at a user level (ie we want to make sure 
> > it actually hits syslog, possibly goes out the network, maybe pops 
> > up a window, whatever).
> > 
> > Shutting down the machine is the last thing we ever want to do.
> > 
> > The whole "let's panic" mentality is a disease.
> 
> No doubt about that - and i'm removing BUG_ON()s and panic()s 
> wherever i can and havent added a single new one myself in the past 
> 5 years or so, its a disease.

In HA failover systems you often do want to panic ASAP (after logging
to serial cosole I guess) if anything like this happens so the system
can be rebooted with minimal chance of data corruption spreading.
 

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 1/5] HWPOISON: define VM_FAULT_HWPOISON to 0 when feature is disabled
  2009-06-12 16:37             ` H. Peter Anvin
  2009-06-12 16:48               ` Ingo Molnar
@ 2009-06-15  7:04               ` Nick Piggin
  1 sibling, 0 replies; 42+ messages in thread
From: Nick Piggin @ 2009-06-15  7:04 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Ingo Molnar, Linus Torvalds, Wu Fengguang, Thomas Gleixner,
	Peter Zijlstra, Andrew Morton, LKML, Hugh Dickins, Andi Kleen,
	riel, chris.mason, linux-mm

On Fri, Jun 12, 2009 at 09:37:24AM -0700, H. Peter Anvin wrote:
> Ingo Molnar wrote:
> > 
> > So i think hwpoison simply does not affect our ability to get log 
> > messages out - but it sure allows crappier hardware to be used.
> > Am i wrong about that for some reason?
> > 
> 
> Crappy hardware isn't the kind of hardware that is likely to have the
> hwpoison features, just like crappy hardware generally doesn't even have
> ECC -- or even basic parity checking (I personally think non-ECC memory
> should be considered a crime against humanity in this day and age.)

What I would find interesting with this hwpoison would be the probability 
difference between detecting an uncorrected error, and undetected errors.

 
> These kinds of features are used when extremely high reliability is
> required, think for example a telco core router.  A page error may have
> happened due to stray radiation or through power supply glitches (which
> happen even in the best of systems), but if they are a pattern, a box
> needs to be replaced.  *How quickly* a box can be taken out of service
> and replaced can vary greatly, and its urgency depend on patterns;
> furthermore, in the meantime the device has to work the best it can.

I don't know how much improvements that hwpoison will give. Significant
amount of RAM cannot be corrected, so especially on like a core router
or embedded system which does not use a lot of disk/pagecache, then it
is probably more like 2x improvement rather than an order of magnitude
improvement.


> Consider, for example, a control computer on the Hubble Space Telescope
> -- the only way to replace it is by space shuttle, and you can safely
> guarantee that *that* won't happen in a heartbeat.  On the new Herschel
> Space Observatory, not even the space shuttle can help: if the computers
> die, *or* if bad data gets fed to its control system, the spacecraft is
> lost.  As such, it's of paramount importance for the computers to (a)
> continue to provide service at the level the hardware is capable of
> doing, (b) as accurately as possible continually assess and report that
> level of service, and (c) not allow a failure to pass undetected.  A lot
> of failures are simple one-time events (especially in space, a high-rad
> environment), others reflect decaying hardware but can be isolated (e.g.
> a RAM cell which has developed a short circuit, or a CPU core which has
> a damaged ALU), while others yet reflect a general ill health of the
> system that cannot be recovered.

I guess most of these examples have to go far beyond this and use
multiply redundant computation and voting systems and quickly
reboot members that are kicked out. :)

Not that it is a detrement of hwpoison. If they used Linux I'm
sure they would like to panic on uncorrected error too (but would
probably not bother trying to do heuristic recovery).

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 1/5] HWPOISON: define VM_FAULT_HWPOISON to 0 when feature is disabled
  2009-06-15  6:52             ` Nick Piggin
@ 2009-06-16 20:27               ` Russ Anderson
  2009-06-17  7:51                 ` Nick Piggin
  0 siblings, 1 reply; 42+ messages in thread
From: Russ Anderson @ 2009-06-16 20:27 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Ingo Molnar, Linus Torvalds, Wu Fengguang, Thomas Gleixner,
	H. Peter Anvin, Peter Zijlstra, Andrew Morton, LKML,
	Hugh Dickins, Andi Kleen, riel, chris.mason, linux-mm, rja

On Mon, Jun 15, 2009 at 08:52:32AM +0200, Nick Piggin wrote:
> On Fri, Jun 12, 2009 at 05:35:01PM +0200, Ingo Molnar wrote:
> > * Linus Torvalds <torvalds@linux-foundation.org> wrote:
> > > On Fri, 12 Jun 2009, Ingo Molnar wrote:
> > > > 
> > > > This seems like trying to handle a failure mode that cannot be 
> > > > and shouldnt be 'handled' really. If there's an 'already 
> > > > corrupted' page then the box should go down hard and fast, and 
> > > > we should not risk _even more user data corruption_ by trying to 
> > > > 'continue' in the hope of having hit some 'harmless' user 
> > > > process that can be killed ...
> > > 
> > > No, the box should _not_ go down hard-and-fast. That's the last 
> > > thing we should *ever* do.
> > > 
> > > We need to log it. Often at a user level (ie we want to make sure 
> > > it actually hits syslog, possibly goes out the network, maybe pops 
> > > up a window, whatever).
> > > 
> > > Shutting down the machine is the last thing we ever want to do.
> > > 
> > > The whole "let's panic" mentality is a disease.
> > 
> > No doubt about that - and i'm removing BUG_ON()s and panic()s 
> > wherever i can and havent added a single new one myself in the past 
> > 5 years or so, its a disease.
> 
> In HA failover systems you often do want to panic ASAP (after logging
> to serial cosole I guess) if anything like this happens so the system
> can be rebooted with minimal chance of data corruption spreading.

The whole point of hardware data poisoning is to avoid having to 
panic the system due to the potential of undetected data corruption,
because the corrupt data is always marked bad.  This has worked
well on ia64 where applications that encounter bad data are killed
and the memory poisoned and not reallocated, avoiding a system panic.

This has been used at customer sites for a few years.  The type
customers that really check their data.  It is nice to see
the hardware poison feature moving to the x86 "mainstream".



-- 
Russ Anderson, OS RAS/Partitioning Project Lead  
SGI - Silicon Graphics Inc          rja@sgi.com

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 1/5] HWPOISON: define VM_FAULT_HWPOISON to 0 when feature is disabled
  2009-06-16 20:27               ` Russ Anderson
@ 2009-06-17  7:51                 ` Nick Piggin
  0 siblings, 0 replies; 42+ messages in thread
From: Nick Piggin @ 2009-06-17  7:51 UTC (permalink / raw)
  To: Russ Anderson
  Cc: Ingo Molnar, Linus Torvalds, Wu Fengguang, Thomas Gleixner,
	H. Peter Anvin, Peter Zijlstra, Andrew Morton, LKML,
	Hugh Dickins, Andi Kleen, riel, chris.mason, linux-mm

On Tue, Jun 16, 2009 at 03:27:26PM -0500, Russ Anderson wrote:
> On Mon, Jun 15, 2009 at 08:52:32AM +0200, Nick Piggin wrote:
> > On Fri, Jun 12, 2009 at 05:35:01PM +0200, Ingo Molnar wrote:
> > > * Linus Torvalds <torvalds@linux-foundation.org> wrote:
> > > > On Fri, 12 Jun 2009, Ingo Molnar wrote:
> > > > > 
> > > > > This seems like trying to handle a failure mode that cannot be 
> > > > > and shouldnt be 'handled' really. If there's an 'already 
> > > > > corrupted' page then the box should go down hard and fast, and 
> > > > > we should not risk _even more user data corruption_ by trying to 
> > > > > 'continue' in the hope of having hit some 'harmless' user 
> > > > > process that can be killed ...
> > > > 
> > > > No, the box should _not_ go down hard-and-fast. That's the last 
> > > > thing we should *ever* do.
> > > > 
> > > > We need to log it. Often at a user level (ie we want to make sure 
> > > > it actually hits syslog, possibly goes out the network, maybe pops 
> > > > up a window, whatever).
> > > > 
> > > > Shutting down the machine is the last thing we ever want to do.
> > > > 
> > > > The whole "let's panic" mentality is a disease.
> > > 
> > > No doubt about that - and i'm removing BUG_ON()s and panic()s 
> > > wherever i can and havent added a single new one myself in the past 
> > > 5 years or so, its a disease.
> > 
> > In HA failover systems you often do want to panic ASAP (after logging
> > to serial cosole I guess) if anything like this happens so the system
> > can be rebooted with minimal chance of data corruption spreading.
> 
> The whole point of hardware data poisoning is to avoid having to 
> panic the system due to the potential of undetected data corruption,
> because the corrupt data is always marked bad.  This has worked
> well on ia64 where applications that encounter bad data are killed
> and the memory poisoned and not reallocated, avoiding a system panic.
> 
> This has been used at customer sites for a few years.  The type
> customers that really check their data.  It is nice to see
> the hardware poison feature moving to the x86 "mainstream".

So long as you can get an MCE and panic if the corrupt data
actually gets consumed anywhere, then yes a "corrupt data
detected but not consumed" exception would not require a
panic.

I don't know enough about the arch details to know what kinds
of exceptions happen when.


^ permalink raw reply	[flat|nested] 42+ messages in thread

end of thread, other threads:[~2009-06-17  7:51 UTC | newest]

Thread overview: 42+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-06-11 14:22 [PATCH 0/5] [RFC] HWPOISON incremental fixes Wu Fengguang
2009-06-11 14:22 ` [PATCH 1/5] HWPOISON: define VM_FAULT_HWPOISON to 0 when feature is disabled Wu Fengguang
2009-06-11 15:44   ` Rik van Riel
2009-06-12 10:00   ` Andi Kleen
2009-06-12 13:15     ` Wu Fengguang
2009-06-12 11:22   ` Ingo Molnar
2009-06-12 12:57     ` Wu Fengguang
2009-06-12 13:17       ` Ingo Molnar
2009-06-12 13:33         ` Wu Fengguang
2009-06-12 15:36           ` Ingo Molnar
2009-06-12 16:14             ` Wu Fengguang
2009-06-12 18:07               ` Alan Cox
2009-06-12 17:55             ` Theodore Tso
2009-06-12 13:58         ` Andi Kleen
2009-06-12 15:28         ` Linus Torvalds
2009-06-12 15:35           ` Ingo Molnar
2009-06-12 16:05             ` Rik van Riel
2009-06-12 16:37             ` H. Peter Anvin
2009-06-12 16:48               ` Ingo Molnar
2009-06-15  7:04               ` Nick Piggin
2009-06-15  6:52             ` Nick Piggin
2009-06-16 20:27               ` Russ Anderson
2009-06-17  7:51                 ` Nick Piggin
2009-06-12 15:45         ` Ingo Molnar
2009-06-12 16:12           ` Linus Torvalds
2009-06-11 14:22 ` [PATCH 2/5] HWPOISON: fix tasklist_lock/anon_vma locking order Wu Fengguang
2009-06-11 15:59   ` Rik van Riel
2009-06-12 10:03   ` Andi Kleen
2009-06-12 10:07     ` Nick Piggin
2009-06-12 13:27     ` Wu Fengguang
2009-06-12 14:04       ` Wu Fengguang
2009-06-11 14:22 ` [PATCH 3/5] HWPOISON: remove early kill option for now Wu Fengguang
2009-06-11 16:06   ` Rik van Riel
2009-06-12  9:59   ` Andi Kleen
2009-06-11 14:22 ` [PATCH 4/5] HWPOISON: report sticky EIO for poisoned file Wu Fengguang
2009-06-11 16:31   ` Rik van Riel
2009-06-12 10:07   ` Andi Kleen
2009-06-12 13:41     ` Wu Fengguang
2009-06-11 14:22 ` [PATCH 5/5] HWPOISON: use the safer invalidate page for possible metadata pages Wu Fengguang
2009-06-11 16:36   ` Rik van Riel
2009-06-12 10:56 ` [PATCH 0/5] [RFC] HWPOISON incremental fixes Andi Kleen
2009-06-12 13:59   ` Wu Fengguang

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).