linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] [0/31] HWPOISON 2.6.33 pre-merge posting
@ 2009-12-08 21:16 Andi Kleen
  2009-12-08 21:16 ` [PATCH] [1/31] HWPOISON: Add Andi Kleen as hwpoison maintainer to MAINTAINERS Andi Kleen
                   ` (30 more replies)
  0 siblings, 31 replies; 61+ messages in thread
From: Andi Kleen @ 2009-12-08 21:16 UTC (permalink / raw)
  To: fengguang.wu, linux-kernel, linux-mm


These are the hwpoison updates for 2.6.33
I plan to send the following patchkit to Linus in a few days.
Any additional review would be appreciated.

Major new features:
- Be more aggressive at flushing caches to get access to a page
- Various fixes for the core memory_failure path
- Handle free memory better by detecting higher-order buddy pages
reliably too.
- Reliable return value for memory_failure. This allows to implement
some other functionality later on.
- New soft offlining feature:
Offline a page without killing a process.
This allows to implement predictive failure analysis for memory, by
watching error trends per page and offlining a page that has too many
corrected errors.  The policy is all in user space; the kernel just 
offlines the page and reports the errors.
The current git mcelog has support for using this interface.
- Provide a new sysfs interface for both hard and soft offlining.
The existing debugfs interface is still there.
- unpoison support
Unpoison a page. This is mainly for testing, it does not do unpoisioning
on the hardware level.
- hwpoison filter
Various filters to the hwpoison PFN error injection, including 
memcg, page type, block device and others.
This is used by the mce-test stress suite to protect the test suite itself
and

This touches some code outside hwpoison, mostly for the memcg support
and for the page types. All these changes are straight-forward,
are in linux-next and have been posted before.

-Andi


^ permalink raw reply	[flat|nested] 61+ messages in thread

* [PATCH] [1/31] HWPOISON: Add Andi Kleen as hwpoison maintainer to MAINTAINERS
  2009-12-08 21:16 [PATCH] [0/31] HWPOISON 2.6.33 pre-merge posting Andi Kleen
@ 2009-12-08 21:16 ` Andi Kleen
  2009-12-08 21:16 ` [PATCH] [2/31] HWPOISON: Be more aggressive at freeing non LRU caches Andi Kleen
                   ` (29 subsequent siblings)
  30 siblings, 0 replies; 61+ messages in thread
From: Andi Kleen @ 2009-12-08 21:16 UTC (permalink / raw)
  To: fengguang.wu, linux-kernel, linux-mm


Signed-off-by: Andi Kleen <ak@linux.intel.com>

---
 MAINTAINERS |    9 +++++++++
 1 file changed, 9 insertions(+)

Index: linux/MAINTAINERS
===================================================================
--- linux.orig/MAINTAINERS
+++ linux/MAINTAINERS
@@ -2322,6 +2322,15 @@ W:	http://www.kernel.org/pub/linux/kerne
 S:	Maintained
 F:	drivers/hwmon/hdaps.c
 
+HWPOISON MEMORY FAILURE HANDLING
+M:	Andi Kleen <andi@firstfloor.org>
+L:	linux-mm@kvack.org
+L:	linux-kernel@vger.kernel.org
+T:	git git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6.git hwpoison
+S:	Maintained
+F:	mm/memory-failure.c
+F:	mm/hwpoison-inject.c
+
 HYPERVISOR VIRTUAL CONSOLE DRIVER
 L:	linuxppc-dev@ozlabs.org
 S:	Odd Fixes

^ permalink raw reply	[flat|nested] 61+ messages in thread

* [PATCH] [2/31] HWPOISON: Be more aggressive at freeing non LRU caches
  2009-12-08 21:16 [PATCH] [0/31] HWPOISON 2.6.33 pre-merge posting Andi Kleen
  2009-12-08 21:16 ` [PATCH] [1/31] HWPOISON: Add Andi Kleen as hwpoison maintainer to MAINTAINERS Andi Kleen
@ 2009-12-08 21:16 ` Andi Kleen
  2009-12-08 21:16 ` [PATCH] [3/31] page-types: add standard GPL license header Andi Kleen
                   ` (28 subsequent siblings)
  30 siblings, 0 replies; 61+ messages in thread
From: Andi Kleen @ 2009-12-08 21:16 UTC (permalink / raw)
  To: fengguang.wu, linux-kernel, linux-mm


shake_page handles more types of page caches than lru_drain_all()

- per cpu page allocator pages
- per CPU LRU

Stops early when the page became free.

Used in followon patches.

Signed-off-by: Andi Kleen <ak@linux.intel.com>

---
 include/linux/mm.h  |    1 +
 mm/memory-failure.c |   22 ++++++++++++++++++++++
 2 files changed, 23 insertions(+)

Index: linux/mm/memory-failure.c
===================================================================
--- linux.orig/mm/memory-failure.c
+++ linux/mm/memory-failure.c
@@ -83,6 +83,28 @@ static int kill_proc_ao(struct task_stru
 }
 
 /*
+ * When a unknown page type is encountered drain as many buffers as possible
+ * in the hope to turn the page into a LRU or free page, which we can handle.
+ */
+void shake_page(struct page *p)
+{
+	if (!PageSlab(p)) {
+		lru_add_drain_all();
+		if (PageLRU(p))
+			return;
+		drain_all_pages();
+		if (PageLRU(p) || is_free_buddy_page(p))
+			return;
+	}
+	/*
+	 * Could call shrink_slab here (which would also
+	 * shrink other caches). Unfortunately that might
+	 * also access the corrupted page, which could be fatal.
+	 */
+}
+EXPORT_SYMBOL_GPL(shake_page);
+
+/*
  * Kill all processes that have a poisoned page mapped and then isolate
  * the page.
  *
Index: linux/include/linux/mm.h
===================================================================
--- linux.orig/include/linux/mm.h
+++ linux/include/linux/mm.h
@@ -1320,6 +1320,7 @@ extern void memory_failure(unsigned long
 extern int __memory_failure(unsigned long pfn, int trapno, int ref);
 extern int sysctl_memory_failure_early_kill;
 extern int sysctl_memory_failure_recovery;
+extern void shake_page(struct page *p);
 extern atomic_long_t mce_bad_pages;
 
 #endif /* __KERNEL__ */

^ permalink raw reply	[flat|nested] 61+ messages in thread

* [PATCH] [3/31] page-types: add standard GPL license header
  2009-12-08 21:16 [PATCH] [0/31] HWPOISON 2.6.33 pre-merge posting Andi Kleen
  2009-12-08 21:16 ` [PATCH] [1/31] HWPOISON: Add Andi Kleen as hwpoison maintainer to MAINTAINERS Andi Kleen
  2009-12-08 21:16 ` [PATCH] [2/31] HWPOISON: Be more aggressive at freeing non LRU caches Andi Kleen
@ 2009-12-08 21:16 ` Andi Kleen
  2009-12-08 21:16 ` [PATCH] [4/31] HWPOISON: remove the anonymous entry Andi Kleen
                   ` (27 subsequent siblings)
  30 siblings, 0 replies; 61+ messages in thread
From: Andi Kleen @ 2009-12-08 21:16 UTC (permalink / raw)
  To: fengguang.wu, fengguang.wu, linux-kernel, linux-mm


From: Wu Fengguang <fengguang.wu@intel.com>

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>

---
 Documentation/vm/page-types.c |   15 +++++++++++++--
 1 file changed, 13 insertions(+), 2 deletions(-)

Index: linux/Documentation/vm/page-types.c
===================================================================
--- linux.orig/Documentation/vm/page-types.c
+++ linux/Documentation/vm/page-types.c
@@ -1,11 +1,22 @@
 /*
  * page-types: Tool for querying page flags
  *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License as published by the Free
+ * Software Foundation; version 2.
+ *
+ * This program is distributed in the hope that it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ * You should find a copy of v2 of the GNU General Public License somewhere on
+ * your Linux system; if not, write to the Free Software Foundation, Inc., 59
+ * Temple Place, Suite 330, Boston, MA 02111-1307 USA.
+ *
  * Copyright (C) 2009 Intel corporation
  *
  * Authors: Wu Fengguang <fengguang.wu@intel.com>
- *
- * Released under the General Public License (GPL).
  */
 
 #define _LARGEFILE64_SOURCE

^ permalink raw reply	[flat|nested] 61+ messages in thread

* [PATCH] [4/31] HWPOISON: remove the anonymous entry
  2009-12-08 21:16 [PATCH] [0/31] HWPOISON 2.6.33 pre-merge posting Andi Kleen
                   ` (2 preceding siblings ...)
  2009-12-08 21:16 ` [PATCH] [3/31] page-types: add standard GPL license header Andi Kleen
@ 2009-12-08 21:16 ` Andi Kleen
  2009-12-08 21:16 ` [PATCH] [5/31] HWPOISON: return ENXIO on invalid page number Andi Kleen
                   ` (26 subsequent siblings)
  30 siblings, 0 replies; 61+ messages in thread
From: Andi Kleen @ 2009-12-08 21:16 UTC (permalink / raw)
  To: fengguang.wu, fengguang.wu, linux-kernel, linux-mm


From: Wu Fengguang <fengguang.wu@intel.com>

(PG_swapbacked && !PG_lru) pages should not happen.
Better to treat them as unknown pages.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>

---
 mm/memory-failure.c |    1 -
 1 file changed, 1 deletion(-)

Index: linux/mm/memory-failure.c
===================================================================
--- linux.orig/mm/memory-failure.c
+++ linux/mm/memory-failure.c
@@ -611,7 +611,6 @@ static struct page_state {
 
 	{ lru|dirty,	lru|dirty,	"LRU",		me_pagecache_dirty },
 	{ lru|dirty,	lru,		"clean LRU",	me_pagecache_clean },
-	{ swapbacked,	swapbacked,	"anonymous",	me_pagecache_clean },
 
 	/*
 	 * Catchall entry: must be at end.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* [PATCH] [5/31] HWPOISON: return ENXIO on invalid page number
  2009-12-08 21:16 [PATCH] [0/31] HWPOISON 2.6.33 pre-merge posting Andi Kleen
                   ` (3 preceding siblings ...)
  2009-12-08 21:16 ` [PATCH] [4/31] HWPOISON: remove the anonymous entry Andi Kleen
@ 2009-12-08 21:16 ` Andi Kleen
  2009-12-08 21:16 ` [PATCH] [6/31] HWPOISON: avoid grabbing the page count multiple times during madvise injection Andi Kleen
                   ` (25 subsequent siblings)
  30 siblings, 0 replies; 61+ messages in thread
From: Andi Kleen @ 2009-12-08 21:16 UTC (permalink / raw)
  To: fengguang.wu, fengguang.wu, linux-kernel, linux-mm


From: Wu Fengguang <fengguang.wu@intel.com>

Use a different errno than the usual EIO for invalid page numbers. 
This is mainly for better reporting for the injector.

This also avoids calling action_result() with invalid pfn.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>

---
 mm/memory-failure.c |   12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

Index: linux/mm/memory-failure.c
===================================================================
--- linux.orig/mm/memory-failure.c
+++ linux/mm/memory-failure.c
@@ -620,13 +620,11 @@ static struct page_state {
 
 static void action_result(unsigned long pfn, char *msg, int result)
 {
-	struct page *page = NULL;
-	if (pfn_valid(pfn))
-		page = pfn_to_page(pfn);
+	struct page *page = pfn_to_page(pfn);
 
 	printk(KERN_ERR "MCE %#lx: %s%s page recovery: %s\n",
 		pfn,
-		page && PageDirty(page) ? "dirty " : "",
+		PageDirty(page) ? "dirty " : "",
 		msg, action_name[result]);
 }
 
@@ -752,8 +750,10 @@ int __memory_failure(unsigned long pfn,
 		panic("Memory failure from trap %d on page %lx", trapno, pfn);
 
 	if (!pfn_valid(pfn)) {
-		action_result(pfn, "memory outside kernel control", IGNORED);
-		return -EIO;
+		printk(KERN_ERR
+		       "MCE %#lx: memory outside kernel control\n",
+		       pfn);
+		return -ENXIO;
 	}
 
 	p = pfn_to_page(pfn);

^ permalink raw reply	[flat|nested] 61+ messages in thread

* [PATCH] [6/31] HWPOISON: avoid grabbing the page count multiple times during madvise injection
  2009-12-08 21:16 [PATCH] [0/31] HWPOISON 2.6.33 pre-merge posting Andi Kleen
                   ` (4 preceding siblings ...)
  2009-12-08 21:16 ` [PATCH] [5/31] HWPOISON: return ENXIO on invalid page number Andi Kleen
@ 2009-12-08 21:16 ` Andi Kleen
  2009-12-08 21:16 ` [PATCH] [7/31] HWPOISON: Turn ref argument into flags argument Andi Kleen
                   ` (24 subsequent siblings)
  30 siblings, 0 replies; 61+ messages in thread
From: Andi Kleen @ 2009-12-08 21:16 UTC (permalink / raw)
  To: fengguang.wu, fengguang.wu, linux-kernel, linux-mm


From: Wu Fengguang <fengguang.wu@intel.com>

If page is double referenced in madvise_hwpoison() and __memory_failure(),
remove_mapping() will fail because it expects page_count=2. Fix it by
not grabbing extra page count in __memory_failure().

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>

---
 mm/madvise.c        |    1 -
 mm/memory-failure.c |    8 ++++----
 2 files changed, 4 insertions(+), 5 deletions(-)

Index: linux/mm/madvise.c
===================================================================
--- linux.orig/mm/madvise.c
+++ linux/mm/madvise.c
@@ -238,7 +238,6 @@ static int madvise_hwpoison(unsigned lon
 		       page_to_pfn(p), start);
 		/* Ignore return value for now */
 		__memory_failure(page_to_pfn(p), 0, 1);
-		put_page(p);
 	}
 	return ret;
 }
Index: linux/mm/memory-failure.c
===================================================================
--- linux.orig/mm/memory-failure.c
+++ linux/mm/memory-failure.c
@@ -629,7 +629,7 @@ static void action_result(unsigned long
 }
 
 static int page_action(struct page_state *ps, struct page *p,
-			unsigned long pfn, int ref)
+			unsigned long pfn)
 {
 	int result;
 	int count;
@@ -637,7 +637,7 @@ static int page_action(struct page_state
 	result = ps->action(p, pfn);
 	action_result(pfn, ps->msg, result);
 
-	count = page_count(p) - 1 - ref;
+	count = page_count(p) - 1;
 	if (count != 0)
 		printk(KERN_ERR
 		       "MCE %#lx: %s page still referenced by %d users\n",
@@ -775,7 +775,7 @@ int __memory_failure(unsigned long pfn,
 	 * In fact it's dangerous to directly bump up page count from 0,
 	 * that may make page_freeze_refs()/page_unfreeze_refs() mismatch.
 	 */
-	if (!get_page_unless_zero(compound_head(p))) {
+	if (!ref && !get_page_unless_zero(compound_head(p))) {
 		action_result(pfn, "free or high order kernel", IGNORED);
 		return PageBuddy(compound_head(p)) ? 0 : -EBUSY;
 	}
@@ -823,7 +823,7 @@ int __memory_failure(unsigned long pfn,
 	res = -EBUSY;
 	for (ps = error_states;; ps++) {
 		if (((p->flags | lru_flag)& ps->mask) == ps->res) {
-			res = page_action(ps, p, pfn, ref);
+			res = page_action(ps, p, pfn);
 			break;
 		}
 	}

^ permalink raw reply	[flat|nested] 61+ messages in thread

* [PATCH] [7/31] HWPOISON: Turn ref argument into flags argument
  2009-12-08 21:16 [PATCH] [0/31] HWPOISON 2.6.33 pre-merge posting Andi Kleen
                   ` (5 preceding siblings ...)
  2009-12-08 21:16 ` [PATCH] [6/31] HWPOISON: avoid grabbing the page count multiple times during madvise injection Andi Kleen
@ 2009-12-08 21:16 ` Andi Kleen
  2009-12-08 21:16 ` [PATCH] [8/31] HWPOISON: abort on failed unmap Andi Kleen
                   ` (23 subsequent siblings)
  30 siblings, 0 replies; 61+ messages in thread
From: Andi Kleen @ 2009-12-08 21:16 UTC (permalink / raw)
  To: fengguang.wu, linux-kernel, linux-mm


Now that "ref" is just a boolean turn it into
a flags argument. First step is only a single flag
that makes the code's intention more clear, but more
may follow.

Signed-off-by: Andi Kleen <ak@linux.intel.com>

---
 include/linux/mm.h  |    5 ++++-
 mm/madvise.c        |    2 +-
 mm/memory-failure.c |    5 +++--
 3 files changed, 8 insertions(+), 4 deletions(-)

Index: linux/include/linux/mm.h
===================================================================
--- linux.orig/include/linux/mm.h
+++ linux/include/linux/mm.h
@@ -1316,8 +1316,11 @@ extern int account_locked_memory(struct
 				 size_t size);
 extern void refund_locked_memory(struct mm_struct *mm, size_t size);
 
+enum mf_flags {
+	MF_COUNT_INCREASED = 1 << 0,
+};
 extern void memory_failure(unsigned long pfn, int trapno);
-extern int __memory_failure(unsigned long pfn, int trapno, int ref);
+extern int __memory_failure(unsigned long pfn, int trapno, int flags);
 extern int sysctl_memory_failure_early_kill;
 extern int sysctl_memory_failure_recovery;
 extern void shake_page(struct page *p);
Index: linux/mm/memory-failure.c
===================================================================
--- linux.orig/mm/memory-failure.c
+++ linux/mm/memory-failure.c
@@ -739,7 +739,7 @@ static void hwpoison_user_mappings(struc
 		      ret != SWAP_SUCCESS, pfn);
 }
 
-int __memory_failure(unsigned long pfn, int trapno, int ref)
+int __memory_failure(unsigned long pfn, int trapno, int flags)
 {
 	unsigned long lru_flag;
 	struct page_state *ps;
@@ -775,7 +775,8 @@ int __memory_failure(unsigned long pfn,
 	 * In fact it's dangerous to directly bump up page count from 0,
 	 * that may make page_freeze_refs()/page_unfreeze_refs() mismatch.
 	 */
-	if (!ref && !get_page_unless_zero(compound_head(p))) {
+	if (!(flags & MF_COUNT_INCREASED) &&
+		!get_page_unless_zero(compound_head(p))) {
 		action_result(pfn, "free or high order kernel", IGNORED);
 		return PageBuddy(compound_head(p)) ? 0 : -EBUSY;
 	}
Index: linux/mm/madvise.c
===================================================================
--- linux.orig/mm/madvise.c
+++ linux/mm/madvise.c
@@ -237,7 +237,7 @@ static int madvise_hwpoison(unsigned lon
 		printk(KERN_INFO "Injecting memory failure for page %lx at %lx\n",
 		       page_to_pfn(p), start);
 		/* Ignore return value for now */
-		__memory_failure(page_to_pfn(p), 0, 1);
+		__memory_failure(page_to_pfn(p), 0, MF_COUNT_INCREASED);
 	}
 	return ret;
 }

^ permalink raw reply	[flat|nested] 61+ messages in thread

* [PATCH] [8/31] HWPOISON: abort on failed unmap
  2009-12-08 21:16 [PATCH] [0/31] HWPOISON 2.6.33 pre-merge posting Andi Kleen
                   ` (6 preceding siblings ...)
  2009-12-08 21:16 ` [PATCH] [7/31] HWPOISON: Turn ref argument into flags argument Andi Kleen
@ 2009-12-08 21:16 ` Andi Kleen
  2009-12-08 21:16 ` [PATCH] [9/31] HWPOISON: comment the possible set_page_dirty() race Andi Kleen
                   ` (22 subsequent siblings)
  30 siblings, 0 replies; 61+ messages in thread
From: Andi Kleen @ 2009-12-08 21:16 UTC (permalink / raw)
  To: fengguang.wu, fengguang.wu, linux-kernel, linux-mm


From: Wu Fengguang <fengguang.wu@intel.com>

Don't try to isolate a still mapped page. Otherwise we will hit the
BUG_ON(page_mapped(page)) in __remove_from_page_cache().

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>

---
 mm/memory-failure.c |   20 +++++++++++++++-----
 1 file changed, 15 insertions(+), 5 deletions(-)

Index: linux/mm/memory-failure.c
===================================================================
--- linux.orig/mm/memory-failure.c
+++ linux/mm/memory-failure.c
@@ -657,7 +657,7 @@ static int page_action(struct page_state
  * Do all that is necessary to remove user space mappings. Unmap
  * the pages and send SIGBUS to the processes if the data was dirty.
  */
-static void hwpoison_user_mappings(struct page *p, unsigned long pfn,
+static int hwpoison_user_mappings(struct page *p, unsigned long pfn,
 				  int trapno)
 {
 	enum ttu_flags ttu = TTU_UNMAP | TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS;
@@ -667,15 +667,18 @@ static void hwpoison_user_mappings(struc
 	int i;
 	int kill = 1;
 
-	if (PageReserved(p) || PageCompound(p) || PageSlab(p) || PageKsm(p))
-		return;
+	if (PageReserved(p) || PageSlab(p))
+		return SWAP_SUCCESS;
 
 	/*
 	 * This check implies we don't kill processes if their pages
 	 * are in the swap cache early. Those are always late kills.
 	 */
 	if (!page_mapped(p))
-		return;
+		return SWAP_SUCCESS;
+
+	if (PageCompound(p) || PageKsm(p))
+		return SWAP_FAIL;
 
 	if (PageSwapCache(p)) {
 		printk(KERN_ERR
@@ -737,6 +740,8 @@ static void hwpoison_user_mappings(struc
 	 */
 	kill_procs_ao(&tokill, !!PageDirty(p), trapno,
 		      ret != SWAP_SUCCESS, pfn);
+
+	return ret;
 }
 
 int __memory_failure(unsigned long pfn, int trapno, int flags)
@@ -809,8 +814,13 @@ int __memory_failure(unsigned long pfn,
 
 	/*
 	 * Now take care of user space mappings.
+	 * Abort on fail: __remove_from_page_cache() assumes unmapped page.
 	 */
-	hwpoison_user_mappings(p, pfn, trapno);
+	if (hwpoison_user_mappings(p, pfn, trapno) != SWAP_SUCCESS) {
+		printk(KERN_ERR "MCE %#lx: cannot unmap page, give up\n", pfn);
+		res = -EBUSY;
+		goto out;
+	}
 
 	/*
 	 * Torn down by someone else?

^ permalink raw reply	[flat|nested] 61+ messages in thread

* [PATCH] [9/31] HWPOISON: comment the possible set_page_dirty() race
  2009-12-08 21:16 [PATCH] [0/31] HWPOISON 2.6.33 pre-merge posting Andi Kleen
                   ` (7 preceding siblings ...)
  2009-12-08 21:16 ` [PATCH] [8/31] HWPOISON: abort on failed unmap Andi Kleen
@ 2009-12-08 21:16 ` Andi Kleen
  2009-12-08 21:16 ` [PATCH] [10/31] HWPOISON: comment dirty swapcache pages Andi Kleen
                   ` (21 subsequent siblings)
  30 siblings, 0 replies; 61+ messages in thread
From: Andi Kleen @ 2009-12-08 21:16 UTC (permalink / raw)
  To: fengguang.wu, fengguang.wu, linux-kernel, linux-mm


From: Wu Fengguang <fengguang.wu@intel.com>

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>

---
 mm/memory-failure.c |    2 ++
 1 file changed, 2 insertions(+)

Index: linux/mm/memory-failure.c
===================================================================
--- linux.orig/mm/memory-failure.c
+++ linux/mm/memory-failure.c
@@ -689,6 +689,8 @@ static int hwpoison_user_mappings(struct
 	/*
 	 * Propagate the dirty bit from PTEs to struct page first, because we
 	 * need this to decide if we should kill or just drop the page.
+	 * XXX: the dirty test could be racy: set_page_dirty() may not always
+	 * be called inside page lock (it's recommended but not enforced).
 	 */
 	mapping = page_mapping(p);
 	if (!PageDirty(p) && mapping && mapping_cap_writeback_dirty(mapping)) {

^ permalink raw reply	[flat|nested] 61+ messages in thread

* [PATCH] [10/31] HWPOISON: comment dirty swapcache pages
  2009-12-08 21:16 [PATCH] [0/31] HWPOISON 2.6.33 pre-merge posting Andi Kleen
                   ` (8 preceding siblings ...)
  2009-12-08 21:16 ` [PATCH] [9/31] HWPOISON: comment the possible set_page_dirty() race Andi Kleen
@ 2009-12-08 21:16 ` Andi Kleen
  2009-12-08 21:16 ` [PATCH] [11/31] HWPOISON: introduce delete_from_lru_cache() Andi Kleen
                   ` (20 subsequent siblings)
  30 siblings, 0 replies; 61+ messages in thread
From: Andi Kleen @ 2009-12-08 21:16 UTC (permalink / raw)
  To: fengguang.wu, fengguang.wu, linux-kernel, linux-mm


From: Wu Fengguang <fengguang.wu@intel.com>

AK: Improve comment

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>

---
 mm/memory.c |    4 ++++
 1 file changed, 4 insertions(+)

Index: linux/mm/memory.c
===================================================================
--- linux.orig/mm/memory.c
+++ linux/mm/memory.c
@@ -2540,6 +2540,10 @@ static int do_swap_page(struct mm_struct
 		ret = VM_FAULT_MAJOR;
 		count_vm_event(PGMAJFAULT);
 	} else if (PageHWPoison(page)) {
+		/*
+		 * hwpoisoned dirty swapcache pages are kept for killing
+		 * owner processes (which may be unknown at hwpoison time)
+		 */
 		ret = VM_FAULT_HWPOISON;
 		delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
 		goto out_release;

^ permalink raw reply	[flat|nested] 61+ messages in thread

* [PATCH] [11/31] HWPOISON: introduce delete_from_lru_cache()
  2009-12-08 21:16 [PATCH] [0/31] HWPOISON 2.6.33 pre-merge posting Andi Kleen
                   ` (9 preceding siblings ...)
  2009-12-08 21:16 ` [PATCH] [10/31] HWPOISON: comment dirty swapcache pages Andi Kleen
@ 2009-12-08 21:16 ` Andi Kleen
  2009-12-08 21:16 ` [PATCH] [12/31] HWPOISON: remove the free buddy page handler Andi Kleen
                   ` (19 subsequent siblings)
  30 siblings, 0 replies; 61+ messages in thread
From: Andi Kleen @ 2009-12-08 21:16 UTC (permalink / raw)
  To: fengguang.wu, fengguang.wu, linux-kernel, linux-mm


From: Wu Fengguang <fengguang.wu@intel.com>

Introduce delete_from_lru_cache() to
- clear PG_active, PG_unevictable to avoid complains at unpoison time
- move the isolate_lru_page() call back to the handlers instead of the
  entrance of __memory_failure(), this is more hwpoison filter friendly

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>

---
 mm/memory-failure.c |   45 +++++++++++++++++++++++++++++++++++++--------
 1 file changed, 37 insertions(+), 8 deletions(-)

Index: linux/mm/memory-failure.c
===================================================================
--- linux.orig/mm/memory-failure.c
+++ linux/mm/memory-failure.c
@@ -350,6 +350,30 @@ static const char *action_name[] = {
 };
 
 /*
+ * XXX: It is possible that a page is isolated from LRU cache,
+ * and then kept in swap cache or failed to remove from page cache.
+ * The page count will stop it from being freed by unpoison.
+ * Stress tests should be aware of this memory leak problem.
+ */
+static int delete_from_lru_cache(struct page *p)
+{
+	if (!isolate_lru_page(p)) {
+		/*
+		 * Clear sensible page flags, so that the buddy system won't
+		 * complain when the page is unpoison-and-freed.
+		 */
+		ClearPageActive(p);
+		ClearPageUnevictable(p);
+		/*
+		 * drop the page count elevated by isolate_lru_page()
+		 */
+		page_cache_release(p);
+		return 0;
+	}
+	return -EIO;
+}
+
+/*
  * Error hit kernel page.
  * Do nothing, try to be lucky and not touch this instead. For a few cases we
  * could be more sophisticated.
@@ -393,6 +417,8 @@ static int me_pagecache_clean(struct pag
 	int ret = FAILED;
 	struct address_space *mapping;
 
+	delete_from_lru_cache(p);
+
 	/*
 	 * For anonymous pages we're done the only reference left
 	 * should be the one m_f() holds.
@@ -522,14 +548,20 @@ static int me_swapcache_dirty(struct pag
 	/* Trigger EIO in shmem: */
 	ClearPageUptodate(p);
 
-	return DELAYED;
+	if (!delete_from_lru_cache(p))
+		return DELAYED;
+	else
+		return FAILED;
 }
 
 static int me_swapcache_clean(struct page *p, unsigned long pfn)
 {
 	delete_from_swap_cache(p);
 
-	return RECOVERED;
+	if (!delete_from_lru_cache(p))
+		return RECOVERED;
+	else
+		return FAILED;
 }
 
 /*
@@ -748,7 +780,6 @@ static int hwpoison_user_mappings(struct
 
 int __memory_failure(unsigned long pfn, int trapno, int flags)
 {
-	unsigned long lru_flag;
 	struct page_state *ps;
 	struct page *p;
 	int res;
@@ -798,13 +829,11 @@ int __memory_failure(unsigned long pfn,
 	 */
 	if (!PageLRU(p))
 		lru_add_drain_all();
-	lru_flag = p->flags & lru;
-	if (isolate_lru_page(p)) {
+	if (!PageLRU(p)) {
 		action_result(pfn, "non LRU", IGNORED);
 		put_page(p);
 		return -EBUSY;
 	}
-	page_cache_release(p);
 
 	/*
 	 * Lock the page and wait for writeback to finish.
@@ -827,7 +856,7 @@ int __memory_failure(unsigned long pfn,
 	/*
 	 * Torn down by someone else?
 	 */
-	if ((lru_flag & lru) && !PageSwapCache(p) && p->mapping == NULL) {
+	if (PageLRU(p) && !PageSwapCache(p) && p->mapping == NULL) {
 		action_result(pfn, "already truncated LRU", IGNORED);
 		res = 0;
 		goto out;
@@ -835,7 +864,7 @@ int __memory_failure(unsigned long pfn,
 
 	res = -EBUSY;
 	for (ps = error_states;; ps++) {
-		if (((p->flags | lru_flag)& ps->mask) == ps->res) {
+		if ((p->flags & ps->mask) == ps->res) {
 			res = page_action(ps, p, pfn);
 			break;
 		}

^ permalink raw reply	[flat|nested] 61+ messages in thread

* [PATCH] [12/31] HWPOISON: remove the free buddy page handler
  2009-12-08 21:16 [PATCH] [0/31] HWPOISON 2.6.33 pre-merge posting Andi Kleen
                   ` (10 preceding siblings ...)
  2009-12-08 21:16 ` [PATCH] [11/31] HWPOISON: introduce delete_from_lru_cache() Andi Kleen
@ 2009-12-08 21:16 ` Andi Kleen
  2009-12-08 21:16 ` [PATCH] [13/31] HWPOISON: detect free buddy pages explicitly Andi Kleen
                   ` (18 subsequent siblings)
  30 siblings, 0 replies; 61+ messages in thread
From: Andi Kleen @ 2009-12-08 21:16 UTC (permalink / raw)
  To: fengguang.wu, fengguang.wu, linux-kernel, linux-mm


From: Wu Fengguang <fengguang.wu@intel.com>

The buddy page has already be handled in the very beginning.
So remove redundant code.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>

---
 mm/memory-failure.c |   14 ++++----------
 1 file changed, 4 insertions(+), 10 deletions(-)

Index: linux/mm/memory-failure.c
===================================================================
--- linux.orig/mm/memory-failure.c
+++ linux/mm/memory-failure.c
@@ -401,14 +401,6 @@ static int me_unknown(struct page *p, un
 }
 
 /*
- * Free memory
- */
-static int me_free(struct page *p, unsigned long pfn)
-{
-	return DELAYED;
-}
-
-/*
  * Clean (or cleaned) page cache page.
  */
 static int me_pagecache_clean(struct page *p, unsigned long pfn)
@@ -604,7 +596,6 @@ static int me_huge_page(struct page *p,
 #define tail		(1UL << PG_tail)
 #define compound	(1UL << PG_compound)
 #define slab		(1UL << PG_slab)
-#define buddy		(1UL << PG_buddy)
 #define reserved	(1UL << PG_reserved)
 
 static struct page_state {
@@ -614,7 +605,10 @@ static struct page_state {
 	int (*action)(struct page *p, unsigned long pfn);
 } error_states[] = {
 	{ reserved,	reserved,	"reserved kernel",	me_ignore },
-	{ buddy,	buddy,		"free kernel",	me_free },
+	/*
+	 * free pages are specially detected outside this table:
+	 * PG_buddy pages only make a small fraction of all free pages.
+	 */
 
 	/*
 	 * Could in theory check if slab page is free or if we can drop

^ permalink raw reply	[flat|nested] 61+ messages in thread

* [PATCH] [13/31] HWPOISON: detect free buddy pages explicitly
  2009-12-08 21:16 [PATCH] [0/31] HWPOISON 2.6.33 pre-merge posting Andi Kleen
                   ` (11 preceding siblings ...)
  2009-12-08 21:16 ` [PATCH] [12/31] HWPOISON: remove the free buddy page handler Andi Kleen
@ 2009-12-08 21:16 ` Andi Kleen
  2009-12-08 21:16 ` [PATCH] [14/31] HWPOISON: Add unpoisoning support Andi Kleen
                   ` (17 subsequent siblings)
  30 siblings, 0 replies; 61+ messages in thread
From: Andi Kleen @ 2009-12-08 21:16 UTC (permalink / raw)
  To: fengguang.wu, npiggin, mel, fengguang.wu, linux-kernel, linux-mm


From: Wu Fengguang <fengguang.wu@intel.com>

Most free pages in the buddy system have no PG_buddy set.
Introduce is_free_buddy_page() for detecting them reliably.

CC: Nick Piggin <npiggin@suse.de> 
CC: Mel Gorman <mel@linux.vnet.ibm.com> 
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>

---
 mm/internal.h       |    3 +++
 mm/memory-failure.c |    9 +++++++--
 mm/page_alloc.c     |   21 +++++++++++++++++++++
 3 files changed, 31 insertions(+), 2 deletions(-)

Index: linux/mm/memory-failure.c
===================================================================
--- linux.orig/mm/memory-failure.c
+++ linux/mm/memory-failure.c
@@ -809,8 +809,13 @@ int __memory_failure(unsigned long pfn,
 	 */
 	if (!(flags & MF_COUNT_INCREASED) &&
 		!get_page_unless_zero(compound_head(p))) {
-		action_result(pfn, "free or high order kernel", IGNORED);
-		return PageBuddy(compound_head(p)) ? 0 : -EBUSY;
+		if (is_free_buddy_page(p)) {
+			action_result(pfn, "free buddy", DELAYED);
+			return 0;
+		} else {
+			action_result(pfn, "high order kernel", IGNORED);
+			return -EBUSY;
+		}
 	}
 
 	/*
Index: linux/mm/internal.h
===================================================================
--- linux.orig/mm/internal.h
+++ linux/mm/internal.h
@@ -50,6 +50,9 @@ extern void putback_lru_page(struct page
  */
 extern void __free_pages_bootmem(struct page *page, unsigned int order);
 extern void prep_compound_page(struct page *page, unsigned long order);
+#ifdef CONFIG_MEMORY_FAILURE
+extern bool is_free_buddy_page(struct page *page);
+#endif
 
 
 /*
Index: linux/mm/page_alloc.c
===================================================================
--- linux.orig/mm/page_alloc.c
+++ linux/mm/page_alloc.c
@@ -5085,3 +5085,24 @@ __offline_isolated_pages(unsigned long s
 	spin_unlock_irqrestore(&zone->lock, flags);
 }
 #endif
+
+#ifdef CONFIG_MEMORY_FAILURE
+bool is_free_buddy_page(struct page *page)
+{
+	struct zone *zone = page_zone(page);
+	unsigned long pfn = page_to_pfn(page);
+	unsigned long flags;
+	int order;
+
+	spin_lock_irqsave(&zone->lock, flags);
+	for (order = 0; order < MAX_ORDER; order++) {
+		struct page *page_head = page - (pfn & ((1 << order) - 1));
+
+		if (PageBuddy(page_head) && page_order(page_head) >= order)
+			break;
+	}
+	spin_unlock_irqrestore(&zone->lock, flags);
+
+	return order < MAX_ORDER;
+}
+#endif

^ permalink raw reply	[flat|nested] 61+ messages in thread

* [PATCH] [14/31] HWPOISON: Add unpoisoning support
  2009-12-08 21:16 [PATCH] [0/31] HWPOISON 2.6.33 pre-merge posting Andi Kleen
                   ` (12 preceding siblings ...)
  2009-12-08 21:16 ` [PATCH] [13/31] HWPOISON: detect free buddy pages explicitly Andi Kleen
@ 2009-12-08 21:16 ` Andi Kleen
  2009-12-08 21:16 ` [PATCH] [15/31] HWPOISON: make semantics of IGNORED/DELAYED clear Andi Kleen
                   ` (16 subsequent siblings)
  30 siblings, 0 replies; 61+ messages in thread
From: Andi Kleen @ 2009-12-08 21:16 UTC (permalink / raw)
  To: fengguang.wu, fengguang.wu, linux-kernel, linux-mm


From: Wu Fengguang <fengguang.wu@intel.com>

The unpoisoning interface is useful for stress testing tools to 
reclaim poisoned pages (to prevent OOM)

There is no hardware level unpoisioning, so this
cannot be used for real memory errors, only for software injected errors.

Note that it may leak pages silently - those who have been removed from
LRU cache, but not isolated from page cache/swap cache at hwpoison time.
Especially the stress test of dirty swap cache pages shall reboot system
before exhausting memory.

AK: Fix comments, add documentation, add printks, rename symbol

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>

---
 Documentation/vm/hwpoison.txt |   16 ++++++++-
 include/linux/mm.h            |    1 
 include/linux/page-flags.h    |    2 -
 mm/hwpoison-inject.c          |   36 ++++++++++++++++++----
 mm/memory-failure.c           |   68 ++++++++++++++++++++++++++++++++++++++++++
 5 files changed, 114 insertions(+), 9 deletions(-)

Index: linux/mm/hwpoison-inject.c
===================================================================
--- linux.orig/mm/hwpoison-inject.c
+++ linux/mm/hwpoison-inject.c
@@ -4,7 +4,7 @@
 #include <linux/kernel.h>
 #include <linux/mm.h>
 
-static struct dentry *hwpoison_dir, *corrupt_pfn;
+static struct dentry *hwpoison_dir;
 
 static int hwpoison_inject(void *data, u64 val)
 {
@@ -14,7 +14,16 @@ static int hwpoison_inject(void *data, u
 	return __memory_failure(val, 18, 0);
 }
 
+static int hwpoison_unpoison(void *data, u64 val)
+{
+	if (!capable(CAP_SYS_ADMIN))
+		return -EPERM;
+
+	return unpoison_memory(val);
+}
+
 DEFINE_SIMPLE_ATTRIBUTE(hwpoison_fops, NULL, hwpoison_inject, "%lli\n");
+DEFINE_SIMPLE_ATTRIBUTE(unpoison_fops, NULL, hwpoison_unpoison, "%lli\n");
 
 static void pfn_inject_exit(void)
 {
@@ -24,16 +33,31 @@ static void pfn_inject_exit(void)
 
 static int pfn_inject_init(void)
 {
+	struct dentry *dentry;
+
 	hwpoison_dir = debugfs_create_dir("hwpoison", NULL);
 	if (hwpoison_dir == NULL)
 		return -ENOMEM;
-	corrupt_pfn = debugfs_create_file("corrupt-pfn", 0600, hwpoison_dir,
+
+	/*
+	 * Note that the below poison/unpoison interfaces do not involve
+	 * hardware status change, hence do not require hardware support.
+	 * They are mainly for testing hwpoison in software level.
+	 */
+	dentry = debugfs_create_file("corrupt-pfn", 0600, hwpoison_dir,
 					  NULL, &hwpoison_fops);
-	if (corrupt_pfn == NULL) {
-		pfn_inject_exit();
-		return -ENOMEM;
-	}
+	if (!dentry)
+		goto fail;
+
+	dentry = debugfs_create_file("unpoison-pfn", 0600, hwpoison_dir,
+				     NULL, &unpoison_fops);
+	if (!dentry)
+		goto fail;
+
 	return 0;
+fail:
+	pfn_inject_exit();
+	return -ENOMEM;
 }
 
 module_init(pfn_inject_init);
Index: linux/include/linux/mm.h
===================================================================
--- linux.orig/include/linux/mm.h
+++ linux/include/linux/mm.h
@@ -1321,6 +1321,7 @@ enum mf_flags {
 };
 extern void memory_failure(unsigned long pfn, int trapno);
 extern int __memory_failure(unsigned long pfn, int trapno, int flags);
+extern int unpoison_memory(unsigned long pfn);
 extern int sysctl_memory_failure_early_kill;
 extern int sysctl_memory_failure_recovery;
 extern void shake_page(struct page *p);
Index: linux/mm/memory-failure.c
===================================================================
--- linux.orig/mm/memory-failure.c
+++ linux/mm/memory-failure.c
@@ -840,6 +840,16 @@ int __memory_failure(unsigned long pfn,
 	 * and in many cases impossible, so we just avoid it here.
 	 */
 	lock_page_nosync(p);
+
+	/*
+	 * unpoison always clear PG_hwpoison inside page lock
+	 */
+	if (!PageHWPoison(p)) {
+		action_result(pfn, "unpoisoned", IGNORED);
+		res = 0;
+		goto out;
+	}
+
 	wait_on_page_writeback(p);
 
 	/*
@@ -895,3 +905,61 @@ void memory_failure(unsigned long pfn, i
 {
 	__memory_failure(pfn, trapno, 0);
 }
+
+/**
+ * unpoison_memory - Unpoison a previously poisoned page
+ * @pfn: Page number of the to be unpoisoned page
+ *
+ * Software-unpoison a page that has been poisoned by
+ * memory_failure() earlier.
+ *
+ * This is only done on the software-level, so it only works
+ * for linux injected failures, not real hardware failures
+ *
+ * Returns 0 for success, otherwise -errno.
+ */
+int unpoison_memory(unsigned long pfn)
+{
+	struct page *page;
+	struct page *p;
+	int freeit = 0;
+
+	if (!pfn_valid(pfn))
+		return -ENXIO;
+
+	p = pfn_to_page(pfn);
+	page = compound_head(p);
+
+	if (!PageHWPoison(p)) {
+		pr_debug("MCE: Page was already unpoisoned %#lx\n", pfn);
+		return 0;
+	}
+
+	if (!get_page_unless_zero(page)) {
+		if (TestClearPageHWPoison(p))
+			atomic_long_dec(&mce_bad_pages);
+		pr_debug("MCE: Software-unpoisoned free page %#lx\n", pfn);
+		return 0;
+	}
+
+	lock_page_nosync(page);
+	/*
+	 * This test is racy because PG_hwpoison is set outside of page lock.
+	 * That's acceptable because that won't trigger kernel panic. Instead,
+	 * the PG_hwpoison page will be caught and isolated on the entrance to
+	 * the free buddy page pool.
+	 */
+	if (TestClearPageHWPoison(p)) {
+		pr_debug("MCE: Software-unpoisoned page %#lx\n", pfn);
+		atomic_long_dec(&mce_bad_pages);
+		freeit = 1;
+	}
+	unlock_page(page);
+
+	put_page(page);
+	if (freeit)
+		put_page(page);
+
+	return 0;
+}
+EXPORT_SYMBOL(unpoison_memory);
Index: linux/include/linux/page-flags.h
===================================================================
--- linux.orig/include/linux/page-flags.h
+++ linux/include/linux/page-flags.h
@@ -277,7 +277,7 @@ PAGEFLAG_FALSE(Uncached)
 
 #ifdef CONFIG_MEMORY_FAILURE
 PAGEFLAG(HWPoison, hwpoison)
-TESTSETFLAG(HWPoison, hwpoison)
+TESTSCFLAG(HWPoison, hwpoison)
 #define __PG_HWPOISON (1UL << PG_hwpoison)
 #else
 PAGEFLAG_FALSE(HWPoison)
Index: linux/Documentation/vm/hwpoison.txt
===================================================================
--- linux.orig/Documentation/vm/hwpoison.txt
+++ linux/Documentation/vm/hwpoison.txt
@@ -98,10 +98,22 @@ madvise(MADV_POISON, ....)
 
 
 hwpoison-inject module through debugfs
-	/sys/debug/hwpoison/corrupt-pfn
 
-Inject hwpoison fault at PFN echoed into this file
+/sys/debug/hwpoison/
 
+corrupt-pfn
+
+Inject hwpoison fault at PFN echoed into this file.
+
+unpoison-pfn
+
+Software-unpoison page at PFN echoed into this file. This
+way a page can be reused again.
+This only works for Linux injected failures, not for real
+memory failures.
+
+Note these injection interfaces are not stable and might change between
+kernel versions
 
 Architecture specific MCE injector
 

^ permalink raw reply	[flat|nested] 61+ messages in thread

* [PATCH] [15/31] HWPOISON: make semantics of IGNORED/DELAYED clear
  2009-12-08 21:16 [PATCH] [0/31] HWPOISON 2.6.33 pre-merge posting Andi Kleen
                   ` (13 preceding siblings ...)
  2009-12-08 21:16 ` [PATCH] [14/31] HWPOISON: Add unpoisoning support Andi Kleen
@ 2009-12-08 21:16 ` Andi Kleen
  2009-12-08 21:16 ` [PATCH] [16/31] HWPOISON: return 0 to indicate success reliably Andi Kleen
                   ` (15 subsequent siblings)
  30 siblings, 0 replies; 61+ messages in thread
From: Andi Kleen @ 2009-12-08 21:16 UTC (permalink / raw)
  To: fengguang.wu, fengguang.wu, linux-kernel, linux-mm


From: Wu Fengguang <fengguang.wu@intel.com>

Change semantics for
- IGNORED: not handled; it may well be _unsafe_
- DELAYED: to be handled later; it is _safe_

With this change,
- IGNORED/FAILED mean (maybe) Error
- DELAYED/RECOVERED mean Success

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>

---
 mm/memory-failure.c |   22 +++++++---------------
 1 file changed, 7 insertions(+), 15 deletions(-)

Index: linux/mm/memory-failure.c
===================================================================
--- linux.orig/mm/memory-failure.c
+++ linux/mm/memory-failure.c
@@ -336,16 +336,16 @@ static void collect_procs(struct page *p
  */
 
 enum outcome {
-	FAILED,		/* Error handling failed */
+	IGNORED,	/* Error: cannot be handled */
+	FAILED,		/* Error: handling failed */
 	DELAYED,	/* Will be handled later */
-	IGNORED,	/* Error safely ignored */
 	RECOVERED,	/* Successfully recovered */
 };
 
 static const char *action_name[] = {
+	[IGNORED] = "Ignored",
 	[FAILED] = "Failed",
 	[DELAYED] = "Delayed",
-	[IGNORED] = "Ignored",
 	[RECOVERED] = "Recovered",
 };
 
@@ -380,14 +380,6 @@ static int delete_from_lru_cache(struct
  */
 static int me_kernel(struct page *p, unsigned long pfn)
 {
-	return DELAYED;
-}
-
-/*
- * Already poisoned page.
- */
-static int me_ignore(struct page *p, unsigned long pfn)
-{
 	return IGNORED;
 }
 
@@ -604,7 +596,7 @@ static struct page_state {
 	char *msg;
 	int (*action)(struct page *p, unsigned long pfn);
 } error_states[] = {
-	{ reserved,	reserved,	"reserved kernel",	me_ignore },
+	{ reserved,	reserved,	"reserved kernel",	me_kernel },
 	/*
 	 * free pages are specially detected outside this table:
 	 * PG_buddy pages only make a small fraction of all free pages.
@@ -790,7 +782,7 @@ int __memory_failure(unsigned long pfn,
 
 	p = pfn_to_page(pfn);
 	if (TestSetPageHWPoison(p)) {
-		action_result(pfn, "already hardware poisoned", IGNORED);
+		printk(KERN_ERR "MCE %#lx: already hardware poisoned\n", pfn);
 		return 0;
 	}
 
@@ -845,7 +837,7 @@ int __memory_failure(unsigned long pfn,
 	 * unpoison always clear PG_hwpoison inside page lock
 	 */
 	if (!PageHWPoison(p)) {
-		action_result(pfn, "unpoisoned", IGNORED);
+		printk(KERN_ERR "MCE %#lx: just unpoisoned\n", pfn);
 		res = 0;
 		goto out;
 	}
@@ -867,7 +859,7 @@ int __memory_failure(unsigned long pfn,
 	 */
 	if (PageLRU(p) && !PageSwapCache(p) && p->mapping == NULL) {
 		action_result(pfn, "already truncated LRU", IGNORED);
-		res = 0;
+		res = -EBUSY;
 		goto out;
 	}
 

^ permalink raw reply	[flat|nested] 61+ messages in thread

* [PATCH] [16/31] HWPOISON: return 0 to indicate success reliably
  2009-12-08 21:16 [PATCH] [0/31] HWPOISON 2.6.33 pre-merge posting Andi Kleen
                   ` (14 preceding siblings ...)
  2009-12-08 21:16 ` [PATCH] [15/31] HWPOISON: make semantics of IGNORED/DELAYED clear Andi Kleen
@ 2009-12-08 21:16 ` Andi Kleen
  2009-12-08 21:16 ` [PATCH] [17/31] HWPOISON: add fs/device filters Andi Kleen
                   ` (14 subsequent siblings)
  30 siblings, 0 replies; 61+ messages in thread
From: Andi Kleen @ 2009-12-08 21:16 UTC (permalink / raw)
  To: fengguang.wu, fengguang.wu, linux-kernel, linux-mm


From: Wu Fengguang <fengguang.wu@intel.com>

Return 0 to indicate success, when
- action result is RECOVERED or DELAYED
- no extra page reference

Note that dirty swapcache pages are kept in swapcache, so can have one
more reference count.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>

---
 mm/memory-failure.c |    8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

Index: linux/mm/memory-failure.c
===================================================================
--- linux.orig/mm/memory-failure.c
+++ linux/mm/memory-failure.c
@@ -656,17 +656,21 @@ static int page_action(struct page_state
 	action_result(pfn, ps->msg, result);
 
 	count = page_count(p) - 1;
-	if (count != 0)
+	if (ps->action == me_swapcache_dirty && result == DELAYED)
+		count--;
+	if (count != 0) {
 		printk(KERN_ERR
 		       "MCE %#lx: %s page still referenced by %d users\n",
 		       pfn, ps->msg, count);
+		result = FAILED;
+	}
 
 	/* Could do more checks here if page looks ok */
 	/*
 	 * Could adjust zone counters here to correct for the missing page.
 	 */
 
-	return result == RECOVERED ? 0 : -EBUSY;
+	return (result == RECOVERED || result == DELAYED) ? 0 : -EBUSY;
 }
 
 #define N_UNMAP_TRIES 5

^ permalink raw reply	[flat|nested] 61+ messages in thread

* [PATCH] [17/31] HWPOISON: add fs/device filters
  2009-12-08 21:16 [PATCH] [0/31] HWPOISON 2.6.33 pre-merge posting Andi Kleen
                   ` (15 preceding siblings ...)
  2009-12-08 21:16 ` [PATCH] [16/31] HWPOISON: return 0 to indicate success reliably Andi Kleen
@ 2009-12-08 21:16 ` Andi Kleen
  2009-12-08 21:16 ` [PATCH] [18/31] HWPOISON: limit hwpoison injector to known page types Andi Kleen
                   ` (13 subsequent siblings)
  30 siblings, 0 replies; 61+ messages in thread
From: Andi Kleen @ 2009-12-08 21:16 UTC (permalink / raw)
  To: fengguang.wu, haicheng.li, npiggin, fengguang.wu, linux-kernel, linux-mm


From: Wu Fengguang <fengguang.wu@intel.com>

Filesystem data/metadata present the most tricky-to-isolate pages.
It requires careful code review and stress testing to get them right.

The fs/device filter helps to target the stress tests to some specific
filesystem pages. The filter condition is block device's major/minor
numbers:
        - corrupt-filter-dev-major
        - corrupt-filter-dev-minor
When specified (non -1), only page cache pages that belong to that
device will be poisoned.

The filters are checked reliably on the locked and refcounted page.

Haicheng: clear PG_hwpoison and drop bad page count if filter not OK
AK: Add documentation

CC: Haicheng Li <haicheng.li@intel.com>
CC: Nick Piggin <npiggin@suse.de>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>

---
 Documentation/vm/hwpoison.txt |    7 +++++
 mm/hwpoison-inject.c          |   11 +++++++++
 mm/internal.h                 |    3 ++
 mm/memory-failure.c           |   51 ++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 72 insertions(+)

Index: linux/mm/memory-failure.c
===================================================================
--- linux.orig/mm/memory-failure.c
+++ linux/mm/memory-failure.c
@@ -48,6 +48,50 @@ int sysctl_memory_failure_recovery __rea
 
 atomic_long_t mce_bad_pages __read_mostly = ATOMIC_LONG_INIT(0);
 
+u32 hwpoison_filter_dev_major = ~0U;
+u32 hwpoison_filter_dev_minor = ~0U;
+EXPORT_SYMBOL_GPL(hwpoison_filter_dev_major);
+EXPORT_SYMBOL_GPL(hwpoison_filter_dev_minor);
+
+static int hwpoison_filter_dev(struct page *p)
+{
+	struct address_space *mapping;
+	dev_t dev;
+
+	if (hwpoison_filter_dev_major == ~0U &&
+	    hwpoison_filter_dev_minor == ~0U)
+		return 0;
+
+	/*
+	 * page_mapping() does not accept slab page
+	 */
+	if (PageSlab(p))
+		return -EINVAL;
+
+	mapping = page_mapping(p);
+	if (mapping == NULL || mapping->host == NULL)
+		return -EINVAL;
+
+	dev = mapping->host->i_sb->s_dev;
+	if (hwpoison_filter_dev_major != ~0U &&
+	    hwpoison_filter_dev_major != MAJOR(dev))
+		return -EINVAL;
+	if (hwpoison_filter_dev_minor != ~0U &&
+	    hwpoison_filter_dev_minor != MINOR(dev))
+		return -EINVAL;
+
+	return 0;
+}
+
+int hwpoison_filter(struct page *p)
+{
+	if (hwpoison_filter_dev(p))
+		return -EINVAL;
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(hwpoison_filter);
+
 /*
  * Send all the processes who have the page mapped an ``action optional''
  * signal.
@@ -845,6 +889,13 @@ int __memory_failure(unsigned long pfn,
 		res = 0;
 		goto out;
 	}
+	if (hwpoison_filter(p)) {
+		if (TestClearPageHWPoison(p))
+			atomic_long_dec(&mce_bad_pages);
+		unlock_page(p);
+		put_page(p);
+		return 0;
+	}
 
 	wait_on_page_writeback(p);
 
Index: linux/mm/hwpoison-inject.c
===================================================================
--- linux.orig/mm/hwpoison-inject.c
+++ linux/mm/hwpoison-inject.c
@@ -3,6 +3,7 @@
 #include <linux/debugfs.h>
 #include <linux/kernel.h>
 #include <linux/mm.h>
+#include "internal.h"
 
 static struct dentry *hwpoison_dir;
 
@@ -54,6 +55,16 @@ static int pfn_inject_init(void)
 	if (!dentry)
 		goto fail;
 
+	dentry = debugfs_create_u32("corrupt-filter-dev-major", 0600,
+				    hwpoison_dir, &hwpoison_filter_dev_major);
+	if (!dentry)
+		goto fail;
+
+	dentry = debugfs_create_u32("corrupt-filter-dev-minor", 0600,
+				    hwpoison_dir, &hwpoison_filter_dev_minor);
+	if (!dentry)
+		goto fail;
+
 	return 0;
 fail:
 	pfn_inject_exit();
Index: linux/mm/internal.h
===================================================================
--- linux.orig/mm/internal.h
+++ linux/mm/internal.h
@@ -263,3 +263,6 @@ int __get_user_pages(struct task_struct
 #define ZONE_RECLAIM_SOME	0
 #define ZONE_RECLAIM_SUCCESS	1
 #endif
+
+extern u32 hwpoison_filter_dev_major;
+extern u32 hwpoison_filter_dev_minor;
Index: linux/Documentation/vm/hwpoison.txt
===================================================================
--- linux.orig/Documentation/vm/hwpoison.txt
+++ linux/Documentation/vm/hwpoison.txt
@@ -115,6 +115,13 @@ memory failures.
 Note these injection interfaces are not stable and might change between
 kernel versions
 
+corrupt-filter-dev-major
+corrupt-filter-dev-minor
+
+Only handle memory failures to pages associated with the file system defined
+by block device major/minor.  -1U is the wildcard value.
+This should be only used for testing with artificial injection.
+
 Architecture specific MCE injector
 
 x86 has mce-inject, mce-test

^ permalink raw reply	[flat|nested] 61+ messages in thread

* [PATCH] [18/31] HWPOISON: limit hwpoison injector to known page types
  2009-12-08 21:16 [PATCH] [0/31] HWPOISON 2.6.33 pre-merge posting Andi Kleen
                   ` (16 preceding siblings ...)
  2009-12-08 21:16 ` [PATCH] [17/31] HWPOISON: add fs/device filters Andi Kleen
@ 2009-12-08 21:16 ` Andi Kleen
  2009-12-08 21:16 ` [PATCH] [19/31] mm: export stable page flags Andi Kleen
                   ` (12 subsequent siblings)
  30 siblings, 0 replies; 61+ messages in thread
From: Andi Kleen @ 2009-12-08 21:16 UTC (permalink / raw)
  To: fengguang.wu, haicheng.li, fengguang.wu, linux-kernel, linux-mm


From: Wu Fengguang <fengguang.wu@intel.com>

__memory_failure()'s workflow is

	set PG_hwpoison
	//...
	unset PG_hwpoison if didn't pass hwpoison filter

That could kill unrelated process if it happens to page fault on the
page with the (temporary) PG_hwpoison. The race should be big enough to
appear in stress tests.

Fix it by grabbing the page and checking filter at inject time.  This
also avoids the very noisy "Injecting memory failure..." messages.

- we don't touch madvise() based injection, because the filters are
  generally not necessary for it.
- if we want to apply the filters to h/w aided injection, we'd better to
  rearrange the logic in __memory_failure() instead of this patch.

AK: fix documentation, use drain all, cleanups

CC: Haicheng Li <haicheng.li@intel.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>

---
 Documentation/vm/hwpoison.txt |    3 ++-
 mm/hwpoison-inject.c          |   41 +++++++++++++++++++++++++++++++++++++++--
 mm/internal.h                 |    2 ++
 3 files changed, 43 insertions(+), 3 deletions(-)

Index: linux/mm/hwpoison-inject.c
===================================================================
--- linux.orig/mm/hwpoison-inject.c
+++ linux/mm/hwpoison-inject.c
@@ -3,16 +3,53 @@
 #include <linux/debugfs.h>
 #include <linux/kernel.h>
 #include <linux/mm.h>
+#include <linux/swap.h>
+#include <linux/pagemap.h>
 #include "internal.h"
 
 static struct dentry *hwpoison_dir;
 
 static int hwpoison_inject(void *data, u64 val)
 {
+	unsigned long pfn = val;
+	struct page *p;
+	int err;
+
 	if (!capable(CAP_SYS_ADMIN))
 		return -EPERM;
-	printk(KERN_INFO "Injecting memory failure at pfn %Lx\n", val);
-	return __memory_failure(val, 18, 0);
+
+	if (!pfn_valid(pfn))
+		return -ENXIO;
+
+	p = pfn_to_page(pfn);
+	/*
+	 * This implies unable to support free buddy pages.
+	 */
+	if (!get_page_unless_zero(p))
+		return 0;
+
+	if (!PageLRU(p))
+		shake_page(p);
+	/*
+	 * This implies unable to support non-LRU pages.
+	 */
+	if (!PageLRU(p))
+		return 0;
+
+	/*
+	 * do a racy check with elevated page count, to make sure PG_hwpoison
+	 * will only be set for the targeted owner (or on a free page).
+	 * We temporarily take page lock for try_get_mem_cgroup_from_page().
+	 * __memory_failure() will redo the check reliably inside page lock.
+	 */
+	lock_page(p);
+	err = hwpoison_filter(p);
+	unlock_page(p);
+	if (err)
+		return 0;
+
+	printk(KERN_INFO "Injecting memory failure at pfn %lx\n", pfn);
+	return __memory_failure(pfn, 18, MF_COUNT_INCREASED);
 }
 
 static int hwpoison_unpoison(void *data, u64 val)
Index: linux/mm/internal.h
===================================================================
--- linux.orig/mm/internal.h
+++ linux/mm/internal.h
@@ -264,5 +264,7 @@ int __get_user_pages(struct task_struct
 #define ZONE_RECLAIM_SUCCESS	1
 #endif
 
+extern int hwpoison_filter(struct page *p);
+
 extern u32 hwpoison_filter_dev_major;
 extern u32 hwpoison_filter_dev_minor;
Index: linux/Documentation/vm/hwpoison.txt
===================================================================
--- linux.orig/Documentation/vm/hwpoison.txt
+++ linux/Documentation/vm/hwpoison.txt
@@ -103,7 +103,8 @@ hwpoison-inject module through debugfs
 
 corrupt-pfn
 
-Inject hwpoison fault at PFN echoed into this file.
+Inject hwpoison fault at PFN echoed into this file. This does
+some early filtering to avoid corrupted unintended pages in test suites.
 
 unpoison-pfn
 

^ permalink raw reply	[flat|nested] 61+ messages in thread

* [PATCH] [19/31] mm: export stable page flags
  2009-12-08 21:16 [PATCH] [0/31] HWPOISON 2.6.33 pre-merge posting Andi Kleen
                   ` (17 preceding siblings ...)
  2009-12-08 21:16 ` [PATCH] [18/31] HWPOISON: limit hwpoison injector to known page types Andi Kleen
@ 2009-12-08 21:16 ` Andi Kleen
  2009-12-08 22:27   ` Matt Mackall
  2009-12-08 21:16 ` [PATCH] [20/31] HWPOISON: add page flags filter Andi Kleen
                   ` (11 subsequent siblings)
  30 siblings, 1 reply; 61+ messages in thread
From: Andi Kleen @ 2009-12-08 21:16 UTC (permalink / raw)
  To: fengguang.wu, mpm, npiggin, cl, andi, fengguang.wu, linux-kernel,
	linux-mm


From: Wu Fengguang <fengguang.wu@intel.com>

Rename get_uflags() to stable_page_flags() and make it a global function
for use in the hwpoison page flags filter, which need to compare user
page flags with the value provided by user space.

Also move KPF_* to kernel-page-flags.h for use by user space tools.

CC: Matt Mackall <mpm@selenic.com>
CC: Nick Piggin <npiggin@suse.de>
CC: Christoph Lameter <cl@linux-foundation.org>
CC: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>

---
 fs/proc/page.c                    |   45 ++-----------------------------------
 include/linux/kernel-page-flags.h |   46 ++++++++++++++++++++++++++++++++++++++
 include/linux/page-flags.h        |    2 +
 3 files changed, 51 insertions(+), 42 deletions(-)

Index: linux/fs/proc/page.c
===================================================================
--- linux.orig/fs/proc/page.c
+++ linux/fs/proc/page.c
@@ -8,6 +8,7 @@
 #include <linux/proc_fs.h>
 #include <linux/seq_file.h>
 #include <linux/hugetlb.h>
+#include <linux/kernel-page-flags.h>
 #include <asm/uaccess.h>
 #include "internal.h"
 
@@ -71,52 +72,12 @@ static const struct file_operations proc
  * physical page flags.
  */
 
-/* These macros are used to decouple internal flags from exported ones */
-
-#define KPF_LOCKED		0
-#define KPF_ERROR		1
-#define KPF_REFERENCED		2
-#define KPF_UPTODATE		3
-#define KPF_DIRTY		4
-#define KPF_LRU			5
-#define KPF_ACTIVE		6
-#define KPF_SLAB		7
-#define KPF_WRITEBACK		8
-#define KPF_RECLAIM		9
-#define KPF_BUDDY		10
-
-/* 11-20: new additions in 2.6.31 */
-#define KPF_MMAP		11
-#define KPF_ANON		12
-#define KPF_SWAPCACHE		13
-#define KPF_SWAPBACKED		14
-#define KPF_COMPOUND_HEAD	15
-#define KPF_COMPOUND_TAIL	16
-#define KPF_HUGE		17
-#define KPF_UNEVICTABLE		18
-#define KPF_HWPOISON		19
-#define KPF_NOPAGE		20
-
-#define KPF_KSM			21
-
-/* kernel hacking assistances
- * WARNING: subject to change, never rely on them!
- */
-#define KPF_RESERVED		32
-#define KPF_MLOCKED		33
-#define KPF_MAPPEDTODISK	34
-#define KPF_PRIVATE		35
-#define KPF_PRIVATE_2		36
-#define KPF_OWNER_PRIVATE	37
-#define KPF_ARCH		38
-#define KPF_UNCACHED		39
-
 static inline u64 kpf_copy_bit(u64 kflags, int ubit, int kbit)
 {
 	return ((kflags >> kbit) & 1) << ubit;
 }
 
-static u64 get_uflags(struct page *page)
+u64 stable_page_flags(struct page *page)
 {
 	u64 k;
 	u64 u;
@@ -219,7 +180,7 @@ static ssize_t kpageflags_read(struct fi
 		else
 			ppage = NULL;
 
-		if (put_user(get_uflags(ppage), out)) {
+		if (put_user(stable_page_flags(ppage), out)) {
 			ret = -EFAULT;
 			break;
 		}
Index: linux/include/linux/page-flags.h
===================================================================
--- linux.orig/include/linux/page-flags.h
+++ linux/include/linux/page-flags.h
@@ -284,6 +284,8 @@ PAGEFLAG_FALSE(HWPoison)
 #define __PG_HWPOISON 0
 #endif
 
+u64 stable_page_flags(struct page *page);
+
 static inline int PageUptodate(struct page *page)
 {
 	int ret = test_bit(PG_uptodate, &(page)->flags);
Index: linux/include/linux/kernel-page-flags.h
===================================================================
--- /dev/null
+++ linux/include/linux/kernel-page-flags.h
@@ -0,0 +1,46 @@
+#ifndef LINUX_KERNEL_PAGE_FLAGS_H
+#define LINUX_KERNEL_PAGE_FLAGS_H
+
+/*
+ * Stable page flag bits exported to user space
+ */
+
+#define KPF_LOCKED		0
+#define KPF_ERROR		1
+#define KPF_REFERENCED		2
+#define KPF_UPTODATE		3
+#define KPF_DIRTY		4
+#define KPF_LRU			5
+#define KPF_ACTIVE		6
+#define KPF_SLAB		7
+#define KPF_WRITEBACK		8
+#define KPF_RECLAIM		9
+#define KPF_BUDDY		10
+
+/* 11-20: new additions in 2.6.31 */
+#define KPF_MMAP		11
+#define KPF_ANON		12
+#define KPF_SWAPCACHE		13
+#define KPF_SWAPBACKED		14
+#define KPF_COMPOUND_HEAD	15
+#define KPF_COMPOUND_TAIL	16
+#define KPF_HUGE		17
+#define KPF_UNEVICTABLE		18
+#define KPF_HWPOISON		19
+#define KPF_NOPAGE		20
+
+#define KPF_KSM			21
+
+/* kernel hacking assistances
+ * WARNING: subject to change, never rely on them!
+ */
+#define KPF_RESERVED		32
+#define KPF_MLOCKED		33
+#define KPF_MAPPEDTODISK	34
+#define KPF_PRIVATE		35
+#define KPF_PRIVATE_2		36
+#define KPF_OWNER_PRIVATE	37
+#define KPF_ARCH		38
+#define KPF_UNCACHED		39
+
+#endif /* LINUX_KERNEL_PAGE_FLAGS_H */

^ permalink raw reply	[flat|nested] 61+ messages in thread

* [PATCH] [20/31] HWPOISON: add page flags filter
  2009-12-08 21:16 [PATCH] [0/31] HWPOISON 2.6.33 pre-merge posting Andi Kleen
                   ` (18 preceding siblings ...)
  2009-12-08 21:16 ` [PATCH] [19/31] mm: export stable page flags Andi Kleen
@ 2009-12-08 21:16 ` Andi Kleen
  2009-12-08 21:16 ` [PATCH] [21/31] memcg: rename and export try_get_mem_cgroup_from_page() Andi Kleen
                   ` (10 subsequent siblings)
  30 siblings, 0 replies; 61+ messages in thread
From: Andi Kleen @ 2009-12-08 21:16 UTC (permalink / raw)
  To: fengguang.wu, npiggin, fengguang.wu, linux-kernel, linux-mm


From: Wu Fengguang <fengguang.wu@intel.com>

When specified, only poison pages if ((page_flags & mask) == value).

-       corrupt-filter-flags-mask
-       corrupt-filter-flags-value

This allows stress testing of many kinds of pages.

Strictly speaking, the buddy pages requires taking zone lock, to avoid
setting PG_hwpoison on a "was buddy but now allocated to someone" page.
However we can just do nothing because we set PG_locked in the beginning,
this prevents the page allocator from allocating it to someone. (It will
BUG() on the unexpected PG_locked, which is fine for hwpoison testing.)

CC: Nick Piggin <npiggin@suse.de>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>

---
 Documentation/vm/hwpoison.txt |   10 ++++++++++
 mm/hwpoison-inject.c          |   10 ++++++++++
 mm/internal.h                 |    2 ++
 mm/memory-failure.c           |   20 ++++++++++++++++++++
 4 files changed, 42 insertions(+)

Index: linux/mm/hwpoison-inject.c
===================================================================
--- linux.orig/mm/hwpoison-inject.c
+++ linux/mm/hwpoison-inject.c
@@ -102,6 +102,16 @@ static int pfn_inject_init(void)
 	if (!dentry)
 		goto fail;
 
+	dentry = debugfs_create_u64("corrupt-filter-flags-mask", 0600,
+				    hwpoison_dir, &hwpoison_filter_flags_mask);
+	if (!dentry)
+		goto fail;
+
+	dentry = debugfs_create_u64("corrupt-filter-flags-value", 0600,
+				    hwpoison_dir, &hwpoison_filter_flags_value);
+	if (!dentry)
+		goto fail;
+
 	return 0;
 fail:
 	pfn_inject_exit();
Index: linux/mm/internal.h
===================================================================
--- linux.orig/mm/internal.h
+++ linux/mm/internal.h
@@ -268,3 +268,5 @@ extern int hwpoison_filter(struct page *
 
 extern u32 hwpoison_filter_dev_major;
 extern u32 hwpoison_filter_dev_minor;
+extern u64 hwpoison_filter_flags_mask;
+extern u64 hwpoison_filter_flags_value;
Index: linux/mm/memory-failure.c
===================================================================
--- linux.orig/mm/memory-failure.c
+++ linux/mm/memory-failure.c
@@ -34,6 +34,7 @@
 #include <linux/kernel.h>
 #include <linux/mm.h>
 #include <linux/page-flags.h>
+#include <linux/kernel-page-flags.h>
 #include <linux/sched.h>
 #include <linux/ksm.h>
 #include <linux/rmap.h>
@@ -50,8 +51,12 @@ atomic_long_t mce_bad_pages __read_mostl
 
 u32 hwpoison_filter_dev_major = ~0U;
 u32 hwpoison_filter_dev_minor = ~0U;
+u64 hwpoison_filter_flags_mask;
+u64 hwpoison_filter_flags_value;
 EXPORT_SYMBOL_GPL(hwpoison_filter_dev_major);
 EXPORT_SYMBOL_GPL(hwpoison_filter_dev_minor);
+EXPORT_SYMBOL_GPL(hwpoison_filter_flags_mask);
+EXPORT_SYMBOL_GPL(hwpoison_filter_flags_value);
 
 static int hwpoison_filter_dev(struct page *p)
 {
@@ -83,11 +88,26 @@ static int hwpoison_filter_dev(struct pa
 	return 0;
 }
 
+static int hwpoison_filter_flags(struct page *p)
+{
+	if (!hwpoison_filter_flags_mask)
+		return 0;
+
+	if ((stable_page_flags(p) & hwpoison_filter_flags_mask) ==
+				    hwpoison_filter_flags_value)
+		return 0;
+	else
+		return -EINVAL;
+}
+
 int hwpoison_filter(struct page *p)
 {
 	if (hwpoison_filter_dev(p))
 		return -EINVAL;
 
+	if (hwpoison_filter_flags(p))
+		return -EINVAL;
+
 	return 0;
 }
 EXPORT_SYMBOL_GPL(hwpoison_filter);
Index: linux/Documentation/vm/hwpoison.txt
===================================================================
--- linux.orig/Documentation/vm/hwpoison.txt
+++ linux/Documentation/vm/hwpoison.txt
@@ -123,6 +123,16 @@ Only handle memory failures to pages ass
 by block device major/minor.  -1U is the wildcard value.
 This should be only used for testing with artificial injection.
 
+
+corrupt-filter-flags-mask
+corrupt-filter-flags-value
+
+When specified, only poison pages if ((page_flags & mask) == value).
+This allows stress testing of many kinds of pages. The page_flags
+are the same as in /proc/kpageflags. The flag bits are defined in
+include/linux/kernel-page-flags.h and documented in
+Documentation/vm/pagemap.txt
+
 Architecture specific MCE injector
 
 x86 has mce-inject, mce-test

^ permalink raw reply	[flat|nested] 61+ messages in thread

* [PATCH] [21/31] memcg: rename and export try_get_mem_cgroup_from_page()
  2009-12-08 21:16 [PATCH] [0/31] HWPOISON 2.6.33 pre-merge posting Andi Kleen
                   ` (19 preceding siblings ...)
  2009-12-08 21:16 ` [PATCH] [20/31] HWPOISON: add page flags filter Andi Kleen
@ 2009-12-08 21:16 ` Andi Kleen
  2009-12-08 21:16 ` [PATCH] [22/31] memcg: add accessor to mem_cgroup.css Andi Kleen
                   ` (9 subsequent siblings)
  30 siblings, 0 replies; 61+ messages in thread
From: Andi Kleen @ 2009-12-08 21:16 UTC (permalink / raw)
  To: fengguang.wu, kosaki.motohiro, hugh.dickins, nishimura, balbir,
	fengguang.wu, linux-kernel, linux-mm


From: Wu Fengguang <fengguang.wu@intel.com>

So that the hwpoison injector can get mem_cgroup for arbitrary page
and thus know whether it is owned by some mem_cgroup task(s).

CC: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
CC: Hugh Dickins <hugh.dickins@tiscali.co.uk>
CC: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
CC: Balbir Singh <balbir@linux.vnet.ibm.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>

---
 include/linux/memcontrol.h |    6 ++++++
 mm/memcontrol.c            |   11 ++++-------
 2 files changed, 10 insertions(+), 7 deletions(-)

Index: linux/mm/memcontrol.c
===================================================================
--- linux.orig/mm/memcontrol.c
+++ linux/mm/memcontrol.c
@@ -1379,25 +1379,22 @@ static struct mem_cgroup *mem_cgroup_loo
 	return container_of(css, struct mem_cgroup, css);
 }
 
-static struct mem_cgroup *try_get_mem_cgroup_from_swapcache(struct page *page)
+struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page)
 {
-	struct mem_cgroup *mem;
+	struct mem_cgroup *mem = NULL;
 	struct page_cgroup *pc;
 	unsigned short id;
 	swp_entry_t ent;
 
 	VM_BUG_ON(!PageLocked(page));
 
-	if (!PageSwapCache(page))
-		return NULL;
-
 	pc = lookup_page_cgroup(page);
 	lock_page_cgroup(pc);
 	if (PageCgroupUsed(pc)) {
 		mem = pc->mem_cgroup;
 		if (mem && !css_tryget(&mem->css))
 			mem = NULL;
-	} else {
+	} else if (PageSwapCache(page)) {
 		ent.val = page_private(page);
 		id = lookup_swap_cgroup(ent);
 		rcu_read_lock();
@@ -1742,7 +1739,7 @@ int mem_cgroup_try_charge_swapin(struct
 	 */
 	if (!PageSwapCache(page))
 		return 0;
-	mem = try_get_mem_cgroup_from_swapcache(page);
+	mem = try_get_mem_cgroup_from_page(page);
 	if (!mem)
 		goto charge_cur_mm;
 	*ptr = mem;
Index: linux/include/linux/memcontrol.h
===================================================================
--- linux.orig/include/linux/memcontrol.h
+++ linux/include/linux/memcontrol.h
@@ -68,6 +68,7 @@ extern unsigned long mem_cgroup_isolate_
 extern void mem_cgroup_out_of_memory(struct mem_cgroup *mem, gfp_t gfp_mask);
 int task_in_mem_cgroup(struct task_struct *task, const struct mem_cgroup *mem);
 
+extern struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page);
 extern struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);
 
 static inline
@@ -189,6 +190,11 @@ mem_cgroup_move_lists(struct page *page,
 {
 }
 
+static inline struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page)
+{
+	return NULL;
+}
+
 static inline int mm_match_cgroup(struct mm_struct *mm, struct mem_cgroup *mem)
 {
 	return 1;

^ permalink raw reply	[flat|nested] 61+ messages in thread

* [PATCH] [22/31] memcg: add accessor to mem_cgroup.css
  2009-12-08 21:16 [PATCH] [0/31] HWPOISON 2.6.33 pre-merge posting Andi Kleen
                   ` (20 preceding siblings ...)
  2009-12-08 21:16 ` [PATCH] [21/31] memcg: rename and export try_get_mem_cgroup_from_page() Andi Kleen
@ 2009-12-08 21:16 ` Andi Kleen
  2009-12-08 21:16 ` [PATCH] [23/31] HWPOISON: add memory cgroup filter Andi Kleen
                   ` (8 subsequent siblings)
  30 siblings, 0 replies; 61+ messages in thread
From: Andi Kleen @ 2009-12-08 21:16 UTC (permalink / raw)
  To: fengguang.wu, kosaki.motohiro, hugh.dickins, nishimura, balbir,
	fengguang.wu, linux-kernel, linux-mm


From: Wu Fengguang <fengguang.wu@intel.com>

So that an outside user can free the reference count grabbed by
try_get_mem_cgroup_from_page().

CC: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
CC: Hugh Dickins <hugh.dickins@tiscali.co.uk>
CC: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
CC: Balbir Singh <balbir@linux.vnet.ibm.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>

---
 include/linux/memcontrol.h |    7 +++++++
 mm/memcontrol.c            |    5 +++++
 2 files changed, 12 insertions(+)

Index: linux/include/linux/memcontrol.h
===================================================================
--- linux.orig/include/linux/memcontrol.h
+++ linux/include/linux/memcontrol.h
@@ -81,6 +81,8 @@ int mm_match_cgroup(const struct mm_stru
 	return cgroup == mem;
 }
 
+extern struct cgroup_subsys_state *mem_cgroup_css(struct mem_cgroup *mem);
+
 extern int
 mem_cgroup_prepare_migration(struct page *page, struct mem_cgroup **ptr);
 extern void mem_cgroup_end_migration(struct mem_cgroup *mem,
@@ -206,6 +208,11 @@ static inline int task_in_mem_cgroup(str
 	return 1;
 }
 
+static inline struct cgroup_subsys_state *mem_cgroup_css(struct mem_cgroup *mem)
+{
+	return NULL;
+}
+
 static inline int
 mem_cgroup_prepare_migration(struct page *page, struct mem_cgroup **ptr)
 {
Index: linux/mm/memcontrol.c
===================================================================
--- linux.orig/mm/memcontrol.c
+++ linux/mm/memcontrol.c
@@ -282,6 +282,11 @@ mem_cgroup_zoneinfo(struct mem_cgroup *m
 	return &mem->info.nodeinfo[nid]->zoneinfo[zid];
 }
 
+struct cgroup_subsys_state *mem_cgroup_css(struct mem_cgroup *mem)
+{
+	return &mem->css;
+}
+
 static struct mem_cgroup_per_zone *
 page_cgroup_zoneinfo(struct page_cgroup *pc)
 {

^ permalink raw reply	[flat|nested] 61+ messages in thread

* [PATCH] [23/31] HWPOISON: add memory cgroup filter
  2009-12-08 21:16 [PATCH] [0/31] HWPOISON 2.6.33 pre-merge posting Andi Kleen
                   ` (21 preceding siblings ...)
  2009-12-08 21:16 ` [PATCH] [22/31] memcg: add accessor to mem_cgroup.css Andi Kleen
@ 2009-12-08 21:16 ` Andi Kleen
  2009-12-09  5:04   ` Li Zefan
  2009-12-09 20:47   ` Paul Menage
  2009-12-08 21:16 ` [PATCH] [24/31] HWPOISON: add an interface to switch off/on all the page filters Andi Kleen
                   ` (7 subsequent siblings)
  30 siblings, 2 replies; 61+ messages in thread
From: Andi Kleen @ 2009-12-08 21:16 UTC (permalink / raw)
  To: kosaki.motohiro, hugh.dickins, nishimura, balbir,
	kamezawa.hiroyu, lizf, menage, npiggin, andi, fengguang.wu,
	linux-kernel, linux-mm


The hwpoison test suite need to inject hwpoison to a collection of
selected task pages, and must not touch pages not owned by them and
thus kill important system processes such as init. (But it's OK to
mis-hwpoison free/unowned pages as well as shared clean pages.
Mis-hwpoison of shared dirty pages will kill all tasks, so the test
suite will target all or non of such tasks in the first place.)

The memory cgroup serves this purpose well. We can put the target
processes under the control of a memory cgroup, and tell the hwpoison
injection code to only kill pages associated with some active memory
cgroup.

The prerequisite for doing hwpoison stress tests with mem_cgroup is,
the mem_cgroup code tracks task pages _accurately_ (unless page is
locked).  Which we believe is/should be true.

The benefits are simplification of hwpoison injector code. Also the
mem_cgroup code will automatically be tested by hwpoison test cases.

The alternative interfaces pin-pfn/unpin-pfn can also delegate the
(process and page flags) filtering functions reliably to user space.
However prototype implementation shows that this scheme adds more
complexity than we wanted.

Example test case:

	mkdir /cgroup/hwpoison

	usemem -m 100 -s 1000 &
	echo `jobs -p` > /cgroup/hwpoison/tasks

	memcg_ino=$(ls -id /cgroup/hwpoison | cut -f1 -d' ')
	echo $memcg_ino > /debug/hwpoison/corrupt-filter-memcg

	page-types -p `pidof init`   --hwpoison  # shall do nothing
	page-types -p `pidof usemem` --hwpoison  # poison its pages

AK: Fix documentation

CC: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
CC: Hugh Dickins <hugh.dickins@tiscali.co.uk>
CC: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
CC: Balbir Singh <balbir@linux.vnet.ibm.com>
CC: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
CC: Li Zefan <lizf@cn.fujitsu.com>
CC: Paul Menage <menage@google.com>
CC: Nick Piggin <npiggin@suse.de> 
CC: Andi Kleen <andi@firstfloor.org> 
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>

---
 Documentation/vm/hwpoison.txt |   16 ++++++++++++++++
 mm/hwpoison-inject.c          |    7 +++++++
 mm/internal.h                 |    1 +
 mm/memory-failure.c           |   42 ++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 66 insertions(+)

Index: linux/mm/memory-failure.c
===================================================================
--- linux.orig/mm/memory-failure.c
+++ linux/mm/memory-failure.c
@@ -100,6 +100,45 @@ static int hwpoison_filter_flags(struct
 		return -EINVAL;
 }
 
+/*
+ * This allows stress tests to limit test scope to a collection of tasks
+ * by putting them under some memcg. This prevents killing unrelated/important
+ * processes such as /sbin/init. Note that the target task may share clean
+ * pages with init (eg. libc text), which is harmless. If the target task
+ * share _dirty_ pages with another task B, the test scheme must make sure B
+ * is also included in the memcg. At last, due to race conditions this filter
+ * can only guarantee that the page either belongs to the memcg tasks, or is
+ * a freed page.
+ */
+#ifdef	CONFIG_CGROUP_MEM_RES_CTLR_SWAP
+u64 hwpoison_filter_memcg;
+EXPORT_SYMBOL_GPL(hwpoison_filter_memcg);
+static int hwpoison_filter_task(struct page *p)
+{
+	struct mem_cgroup *mem;
+	struct cgroup_subsys_state *css;
+	unsigned long ino;
+
+	if (!hwpoison_filter_memcg)
+		return 0;
+
+	mem = try_get_mem_cgroup_from_page(p);
+	if (!mem)
+		return -EINVAL;
+
+	css = mem_cgroup_css(mem);
+	ino = css->cgroup->dentry->d_inode->i_ino;
+	css_put(css);
+
+	if (ino != hwpoison_filter_memcg)
+		return -EINVAL;
+
+	return 0;
+}
+#else
+static int hwpoison_filter_task(struct page *p) { return 0; }
+#endif
+
 int hwpoison_filter(struct page *p)
 {
 	if (hwpoison_filter_dev(p))
@@ -108,6 +147,9 @@ int hwpoison_filter(struct page *p)
 	if (hwpoison_filter_flags(p))
 		return -EINVAL;
 
+	if (hwpoison_filter_task(p))
+		return -EINVAL;
+
 	return 0;
 }
 EXPORT_SYMBOL_GPL(hwpoison_filter);
Index: linux/mm/internal.h
===================================================================
--- linux.orig/mm/internal.h
+++ linux/mm/internal.h
@@ -270,3 +270,4 @@ extern u32 hwpoison_filter_dev_major;
 extern u32 hwpoison_filter_dev_minor;
 extern u64 hwpoison_filter_flags_mask;
 extern u64 hwpoison_filter_flags_value;
+extern u64 hwpoison_filter_memcg;
Index: linux/mm/hwpoison-inject.c
===================================================================
--- linux.orig/mm/hwpoison-inject.c
+++ linux/mm/hwpoison-inject.c
@@ -112,6 +112,13 @@ static int pfn_inject_init(void)
 	if (!dentry)
 		goto fail;
 
+#ifdef	CONFIG_CGROUP_MEM_RES_CTLR_SWAP
+	dentry = debugfs_create_u64("corrupt-filter-memcg", 0600,
+				    hwpoison_dir, &hwpoison_filter_memcg);
+	if (!dentry)
+		goto fail;
+#endif
+
 	return 0;
 fail:
 	pfn_inject_exit();
Index: linux/Documentation/vm/hwpoison.txt
===================================================================
--- linux.orig/Documentation/vm/hwpoison.txt
+++ linux/Documentation/vm/hwpoison.txt
@@ -123,6 +123,22 @@ Only handle memory failures to pages ass
 by block device major/minor.  -1U is the wildcard value.
 This should be only used for testing with artificial injection.
 
+corrupt-filter-memcg
+
+Limit injection to pages owned by memgroup. Specified by inode number
+of the memcg.
+
+Example:
+        mkdir /cgroup/hwpoison
+
+        usemem -m 100 -s 1000 &
+        echo `jobs -p` > /cgroup/hwpoison/tasks
+
+        memcg_ino=$(ls -id /cgroup/hwpoison | cut -f1 -d' ')
+        echo $memcg_ino > /debug/hwpoison/corrupt-filter-memcg
+
+        page-types -p `pidof init`   --hwpoison  # shall do nothing
+        page-types -p `pidof usemem` --hwpoison  # poison its pages
 
 corrupt-filter-flags-mask
 corrupt-filter-flags-value

^ permalink raw reply	[flat|nested] 61+ messages in thread

* [PATCH] [24/31] HWPOISON: add an interface to switch off/on all the page filters
  2009-12-08 21:16 [PATCH] [0/31] HWPOISON 2.6.33 pre-merge posting Andi Kleen
                   ` (22 preceding siblings ...)
  2009-12-08 21:16 ` [PATCH] [23/31] HWPOISON: add memory cgroup filter Andi Kleen
@ 2009-12-08 21:16 ` Andi Kleen
  2009-12-08 21:16 ` [PATCH] [25/31] HWPOISON: Don't do early filtering if filter is disabled Andi Kleen
                   ` (6 subsequent siblings)
  30 siblings, 0 replies; 61+ messages in thread
From: Andi Kleen @ 2009-12-08 21:16 UTC (permalink / raw)
  To: haicheng.li, fengguang.wu, linux-kernel, linux-mm


From: Haicheng Li <haicheng.li@linux.intel.com>

In some use cases, user doesn't need extra filtering. E.g. user program
can inject errors through madvise syscall to its own pages, however it
might not know what the page state exactly is or which inode the page
belongs to.

So introduce an one-off interface "corrupt-filter-enable".

Echo 0 to switch off page filters, and echo 1 to switch on the filters.
[AK: changed default to 0]

Signed-off-by: Haicheng Li <haicheng.li@linux.intel.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>

---
 mm/hwpoison-inject.c |    5 +++++
 mm/internal.h        |    1 +
 mm/memory-failure.c  |    5 +++++
 3 files changed, 11 insertions(+)

Index: linux/mm/hwpoison-inject.c
===================================================================
--- linux.orig/mm/hwpoison-inject.c
+++ linux/mm/hwpoison-inject.c
@@ -92,6 +92,11 @@ static int pfn_inject_init(void)
 	if (!dentry)
 		goto fail;
 
+	dentry = debugfs_create_u32("corrupt-filter-enable", 0600,
+				    hwpoison_dir, &hwpoison_filter_enable);
+	if (!dentry)
+		goto fail;
+
 	dentry = debugfs_create_u32("corrupt-filter-dev-major", 0600,
 				    hwpoison_dir, &hwpoison_filter_dev_major);
 	if (!dentry)
Index: linux/mm/internal.h
===================================================================
--- linux.orig/mm/internal.h
+++ linux/mm/internal.h
@@ -271,3 +271,4 @@ extern u32 hwpoison_filter_dev_minor;
 extern u64 hwpoison_filter_flags_mask;
 extern u64 hwpoison_filter_flags_value;
 extern u64 hwpoison_filter_memcg;
+extern u32 hwpoison_filter_enable;
Index: linux/mm/memory-failure.c
===================================================================
--- linux.orig/mm/memory-failure.c
+++ linux/mm/memory-failure.c
@@ -49,10 +49,12 @@ int sysctl_memory_failure_recovery __rea
 
 atomic_long_t mce_bad_pages __read_mostly = ATOMIC_LONG_INIT(0);
 
+u32 hwpoison_filter_enable = 0;
 u32 hwpoison_filter_dev_major = ~0U;
 u32 hwpoison_filter_dev_minor = ~0U;
 u64 hwpoison_filter_flags_mask;
 u64 hwpoison_filter_flags_value;
+EXPORT_SYMBOL_GPL(hwpoison_filter_enable);
 EXPORT_SYMBOL_GPL(hwpoison_filter_dev_major);
 EXPORT_SYMBOL_GPL(hwpoison_filter_dev_minor);
 EXPORT_SYMBOL_GPL(hwpoison_filter_flags_mask);
@@ -141,6 +143,9 @@ static int hwpoison_filter_task(struct p
 
 int hwpoison_filter(struct page *p)
 {
+	if (!hwpoison_filter_enable)
+		return 0;
+
 	if (hwpoison_filter_dev(p))
 		return -EINVAL;
 

^ permalink raw reply	[flat|nested] 61+ messages in thread

* [PATCH] [25/31] HWPOISON: Don't do early filtering if filter is disabled
  2009-12-08 21:16 [PATCH] [0/31] HWPOISON 2.6.33 pre-merge posting Andi Kleen
                   ` (23 preceding siblings ...)
  2009-12-08 21:16 ` [PATCH] [24/31] HWPOISON: add an interface to switch off/on all the page filters Andi Kleen
@ 2009-12-08 21:16 ` Andi Kleen
  2009-12-08 21:16 ` [PATCH] [26/31] HWPOISON: mention HWPoison in Kconfig entry Andi Kleen
                   ` (5 subsequent siblings)
  30 siblings, 0 replies; 61+ messages in thread
From: Andi Kleen @ 2009-12-08 21:16 UTC (permalink / raw)
  To: fengguang.wu, linux-kernel, linux-mm


Signed-off-by: Andi Kleen <ak@linux.intel.com>

---
 mm/hwpoison-inject.c |    3 +++
 1 file changed, 3 insertions(+)

Index: linux/mm/hwpoison-inject.c
===================================================================
--- linux.orig/mm/hwpoison-inject.c
+++ linux/mm/hwpoison-inject.c
@@ -18,6 +18,8 @@ static int hwpoison_inject(void *data, u
 	if (!capable(CAP_SYS_ADMIN))
 		return -EPERM;
 
+	if (!hwpoison_filter_enable)
+		goto inject;
 	if (!pfn_valid(pfn))
 		return -ENXIO;
 
@@ -48,6 +50,7 @@ static int hwpoison_inject(void *data, u
 	if (err)
 		return 0;
 
+inject:
 	printk(KERN_INFO "Injecting memory failure at pfn %lx\n", pfn);
 	return __memory_failure(pfn, 18, MF_COUNT_INCREASED);
 }

^ permalink raw reply	[flat|nested] 61+ messages in thread

* [PATCH] [26/31] HWPOISON: mention HWPoison in Kconfig entry
  2009-12-08 21:16 [PATCH] [0/31] HWPOISON 2.6.33 pre-merge posting Andi Kleen
                   ` (24 preceding siblings ...)
  2009-12-08 21:16 ` [PATCH] [25/31] HWPOISON: Don't do early filtering if filter is disabled Andi Kleen
@ 2009-12-08 21:16 ` Andi Kleen
  2009-12-08 21:16 ` [PATCH] [27/31] HWPOISON: Use correct name for MADV_HWPOISON in documentation Andi Kleen
                   ` (4 subsequent siblings)
  30 siblings, 0 replies; 61+ messages in thread
From: Andi Kleen @ 2009-12-08 21:16 UTC (permalink / raw)
  To: fengguang.wu, linux-kernel, linux-mm


Signed-off-by: Andi Kleen <ak@linux.intel.com>

---
 mm/Kconfig |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux/mm/Kconfig
===================================================================
--- linux.orig/mm/Kconfig
+++ linux/mm/Kconfig
@@ -257,7 +257,7 @@ config MEMORY_FAILURE
 	  special hardware support and typically ECC memory.
 
 config HWPOISON_INJECT
-	tristate "Poison pages injector"
+	tristate "HWPoison pages injector"
 	depends on MEMORY_FAILURE && DEBUG_KERNEL
 
 config NOMMU_INITIAL_TRIM_EXCESS

^ permalink raw reply	[flat|nested] 61+ messages in thread

* [PATCH] [27/31] HWPOISON: Use correct name for MADV_HWPOISON in documentation
  2009-12-08 21:16 [PATCH] [0/31] HWPOISON 2.6.33 pre-merge posting Andi Kleen
                   ` (25 preceding siblings ...)
  2009-12-08 21:16 ` [PATCH] [26/31] HWPOISON: mention HWPoison in Kconfig entry Andi Kleen
@ 2009-12-08 21:16 ` Andi Kleen
  2009-12-08 21:16 ` [PATCH] [28/31] HWPOISON: Use new shake_page in memory_failure Andi Kleen
                   ` (3 subsequent siblings)
  30 siblings, 0 replies; 61+ messages in thread
From: Andi Kleen @ 2009-12-08 21:16 UTC (permalink / raw)
  To: fengguang.wu, linux-kernel, linux-mm


Signed-off-by: Andi Kleen <ak@linux.intel.com>

---
 Documentation/vm/hwpoison.txt |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux/Documentation/vm/hwpoison.txt
===================================================================
--- linux.orig/Documentation/vm/hwpoison.txt
+++ linux/Documentation/vm/hwpoison.txt
@@ -92,7 +92,7 @@ PR_MCE_KILL_GET
 
 Testing:
 
-madvise(MADV_POISON, ....)
+madvise(MADV_HWPOISON, ....)
 	(as root)
 	Poison a page in the process for testing
 

^ permalink raw reply	[flat|nested] 61+ messages in thread

* [PATCH] [28/31] HWPOISON: Use new shake_page in memory_failure
  2009-12-08 21:16 [PATCH] [0/31] HWPOISON 2.6.33 pre-merge posting Andi Kleen
                   ` (26 preceding siblings ...)
  2009-12-08 21:16 ` [PATCH] [27/31] HWPOISON: Use correct name for MADV_HWPOISON in documentation Andi Kleen
@ 2009-12-08 21:16 ` Andi Kleen
  2009-12-08 21:16 ` [PATCH] [29/31] HWPOISON: Undefine short-hand macros after use to avoid namespace conflict Andi Kleen
                   ` (2 subsequent siblings)
  30 siblings, 0 replies; 61+ messages in thread
From: Andi Kleen @ 2009-12-08 21:16 UTC (permalink / raw)
  To: fengguang.wu, linux-kernel, linux-mm


shake_page handles more types of page caches than
the much simpler lru_add_drain_all:

- slab (quite inefficiently for now)
- any other caches with a shrinker callback
- per cpu page allocator pages
- per CPU LRU

Use this call to try to turn pages into free or LRU pages.
Then handle the case of the page becoming free after drain everything.

Signed-off-by: Andi Kleen <ak@linux.intel.com>

---
 mm/memory-failure.c |    9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

Index: linux/mm/memory-failure.c
===================================================================
--- linux.orig/mm/memory-failure.c
+++ linux/mm/memory-failure.c
@@ -934,8 +934,15 @@ int __memory_failure(unsigned long pfn,
 	 * walked by the page reclaim code, however that's not a big loss.
 	 */
 	if (!PageLRU(p))
-		lru_add_drain_all();
+		shake_page(p);
 	if (!PageLRU(p)) {
+		/*
+		 * shake_page could have turned it free.
+		 */
+		if (is_free_buddy_page(p)) {
+			action_result(pfn, "free buddy, 2nd try", DELAYED);
+			return 0;
+		}
 		action_result(pfn, "non LRU", IGNORED);
 		put_page(p);
 		return -EBUSY;

^ permalink raw reply	[flat|nested] 61+ messages in thread

* [PATCH] [29/31] HWPOISON: Undefine short-hand macros after use to avoid namespace conflict
  2009-12-08 21:16 [PATCH] [0/31] HWPOISON 2.6.33 pre-merge posting Andi Kleen
                   ` (27 preceding siblings ...)
  2009-12-08 21:16 ` [PATCH] [28/31] HWPOISON: Use new shake_page in memory_failure Andi Kleen
@ 2009-12-08 21:16 ` Andi Kleen
  2009-12-08 21:16 ` [PATCH] [30/31] HWPOISON: Add soft page offline support Andi Kleen
  2009-12-08 21:16 ` [PATCH] [31/31] HWPOISON: Add a madvise() injector for soft page offlining Andi Kleen
  30 siblings, 0 replies; 61+ messages in thread
From: Andi Kleen @ 2009-12-08 21:16 UTC (permalink / raw)
  To: fengguang.wu, linux-kernel, linux-mm


Signed-off-by: Andi Kleen <ak@linux.intel.com>

---
 mm/memory-failure.c |   13 +++++++++++++
 1 file changed, 13 insertions(+)

Index: linux/mm/memory-failure.c
===================================================================
--- linux.orig/mm/memory-failure.c
+++ linux/mm/memory-failure.c
@@ -747,6 +747,19 @@ static struct page_state {
 	{ 0,		0,		"unknown page state",	me_unknown },
 };
 
+#undef dirty
+#undef sc
+#undef unevict
+#undef mlock
+#undef writeback
+#undef lru
+#undef swapbacked
+#undef head
+#undef tail
+#undef compound
+#undef slab
+#undef reserved
+
 static void action_result(unsigned long pfn, char *msg, int result)
 {
 	struct page *page = pfn_to_page(pfn);

^ permalink raw reply	[flat|nested] 61+ messages in thread

* [PATCH] [30/31] HWPOISON: Add soft page offline support
  2009-12-08 21:16 [PATCH] [0/31] HWPOISON 2.6.33 pre-merge posting Andi Kleen
                   ` (28 preceding siblings ...)
  2009-12-08 21:16 ` [PATCH] [29/31] HWPOISON: Undefine short-hand macros after use to avoid namespace conflict Andi Kleen
@ 2009-12-08 21:16 ` Andi Kleen
  2009-12-08 21:16 ` [PATCH] [31/31] HWPOISON: Add a madvise() injector for soft page offlining Andi Kleen
  30 siblings, 0 replies; 61+ messages in thread
From: Andi Kleen @ 2009-12-08 21:16 UTC (permalink / raw)
  To: fengguang.wu, linux-kernel, linux-mm


This is a simpler, gentler variant of memory_failure() for soft page 
offlining controlled from user space.  It doesn't kill anything, just
tries to invalidate and if that doesn't work migrate the
page away. 

This is useful for predictive failure analysis, where a page has 
a high rate of corrected errors, but hasn't gone bad yet. Instead 
it can be offlined early and avoided.

The offlining is controlled from sysfs, including a new generic
entry point for hard page offlining for symmetry too.

We use the page isolate facility to prevent re-allocation
race. Normally this is only used by memory hotplug. To avoid
races with memory allocation I am using lock_system_sleep().
This avoids the situation where memory hotplug is about
to isolate a page range and then hwpoison undoes that work.
This is a big hammer currently, but the simplest solution
currently.

When the page is not free or LRU we try to free pages
from slab and other caches. The slab freeing is currently
quite dumb and does not try to focus on the specific slab
cache which might own the page. This could be potentially
improved later.

OPEN: nfs migration hangs (broken low level migration op)

Thanks to Fengguang Wu and Haicheng Li for some fixes.

Signed-off-by: Andi Kleen <ak@linux.intel.com>

---
 Documentation/ABI/testing/sysfs-memory-page-offline |   44 ++++
 drivers/base/memory.c                               |   61 ++++++
 include/linux/mm.h                                  |    3 
 mm/hwpoison-inject.c                                |    2 
 mm/memory-failure.c                                 |  194 +++++++++++++++++++-
 5 files changed, 297 insertions(+), 7 deletions(-)

Index: linux/drivers/base/memory.c
===================================================================
--- linux.orig/drivers/base/memory.c
+++ linux/drivers/base/memory.c
@@ -341,6 +341,64 @@ static inline int memory_probe_init(void
 }
 #endif
 
+#ifdef CONFIG_MEMORY_FAILURE
+/*
+ * Support for offlining pages of memory
+ */
+
+/* Soft offline a page */
+static ssize_t
+store_soft_offline_page(struct class *class, const char *buf, size_t count)
+{
+	int ret;
+	u64 pfn;
+	if (!capable(CAP_SYS_ADMIN))
+		return -EPERM;
+	if (strict_strtoull(buf, 0, &pfn) < 0)
+		return -EINVAL;
+	pfn >>= PAGE_SHIFT;
+	if (!pfn_valid(pfn))
+		return -ENXIO;
+	ret = soft_offline_page(pfn_to_page(pfn), 0);
+	return ret == 0 ? count : ret;
+}
+
+/* Forcibly offline a page, including killing processes. */
+static ssize_t
+store_hard_offline_page(struct class *class, const char *buf, size_t count)
+{
+	int ret;
+	u64 pfn;
+	if (!capable(CAP_SYS_ADMIN))
+		return -EPERM;
+	if (strict_strtoull(buf, 0, &pfn) < 0)
+		return -EINVAL;
+	pfn >>= PAGE_SHIFT;
+	ret = __memory_failure(pfn, 0, 0);
+	return ret ? ret : count;
+}
+
+static CLASS_ATTR(soft_offline_page, 0644, NULL, store_soft_offline_page);
+static CLASS_ATTR(hard_offline_page, 0644, NULL, store_hard_offline_page);
+
+static __init int memory_fail_init(void)
+{
+	int err;
+
+	err = sysfs_create_file(&memory_sysdev_class.kset.kobj,
+				&class_attr_soft_offline_page.attr);
+	if (!err)
+		err = sysfs_create_file(&memory_sysdev_class.kset.kobj,
+				&class_attr_hard_offline_page.attr);
+	return err;
+}
+#else
+static inline int memory_fail_init(void)
+{
+	return 0;
+}
+#endif
+
 /*
  * Note that phys_device is optional.  It is here to allow for
  * differentiation between which *physical* devices each
@@ -473,6 +531,9 @@ int __init memory_dev_init(void)
 	err = memory_probe_init();
 	if (!ret)
 		ret = err;
+	err = memory_fail_init();
+	if (!ret)
+		ret = err;
 	err = block_size_init();
 	if (!ret)
 		ret = err;
Index: linux/mm/memory-failure.c
===================================================================
--- linux.orig/mm/memory-failure.c
+++ linux/mm/memory-failure.c
@@ -41,6 +41,9 @@
 #include <linux/pagemap.h>
 #include <linux/swap.h>
 #include <linux/backing-dev.h>
+#include <linux/migrate.h>
+#include <linux/page-isolation.h>
+#include <linux/suspend.h>
 #include "internal.h"
 
 int sysctl_memory_failure_early_kill __read_mostly = 0;
@@ -197,7 +200,7 @@ static int kill_proc_ao(struct task_stru
  * When a unknown page type is encountered drain as many buffers as possible
  * in the hope to turn the page into a LRU or free page, which we can handle.
  */
-void shake_page(struct page *p)
+void shake_page(struct page *p, int access)
 {
 	if (!PageSlab(p)) {
 		lru_add_drain_all();
@@ -207,11 +210,19 @@ void shake_page(struct page *p)
 		if (PageLRU(p) || is_free_buddy_page(p))
 			return;
 	}
+
 	/*
-	 * Could call shrink_slab here (which would also
-	 * shrink other caches). Unfortunately that might
-	 * also access the corrupted page, which could be fatal.
+	 * Only all shrink_slab here (which would also
+	 * shrink other caches) if access is not potentially fatal.
 	 */
+	if (access) {
+		int nr;
+		do {
+			nr = shrink_slab(1000, GFP_KERNEL, 1000);
+			if (page_count(p) == 0)
+				break;
+		} while (nr > 10);
+	}
 }
 EXPORT_SYMBOL_GPL(shake_page);
 
@@ -947,7 +958,7 @@ int __memory_failure(unsigned long pfn,
 	 * walked by the page reclaim code, however that's not a big loss.
 	 */
 	if (!PageLRU(p))
-		shake_page(p);
+		shake_page(p, 0);
 	if (!PageLRU(p)) {
 		/*
 		 * shake_page could have turned it free.
@@ -1097,3 +1108,176 @@ int unpoison_memory(unsigned long pfn)
 	return 0;
 }
 EXPORT_SYMBOL(unpoison_memory);
+
+static struct page *new_page(struct page *p, unsigned long private, int **x)
+{
+	return alloc_pages(GFP_HIGHUSER_MOVABLE, 0);
+}
+
+/*
+ * Safely get reference count of an arbitrary page.
+ * Returns 0 for a free page, -EIO for a zero refcount page
+ * that is not free, and 1 for any other page type.
+ * For 1 the page is returned with increased page count, otherwise not.
+ */
+static int get_any_page(struct page *p, unsigned long pfn, int flags)
+{
+	int ret;
+
+	if (flags & MF_COUNT_INCREASED)
+		return 1;
+
+	/*
+	 * The lock_system_sleep prevents a race with memory hotplug,
+	 * because the isolation assumes there's only a single user.
+	 * This is a big hammer, a better would be nicer.
+	 */
+	lock_system_sleep();
+
+	/*
+	 * Isolate the page, so that it doesn't get reallocated if it
+	 * was free.
+	 */
+	set_migratetype_isolate(p);
+	if (!get_page_unless_zero(compound_head(p))) {
+		if (is_free_buddy_page(p)) {
+			pr_debug("get_any_page: %#lx free buddy page\n", pfn);
+			/* Set hwpoison bit while page is still isolated */
+			SetPageHWPoison(p);
+			ret = 0;
+		} else {
+			pr_debug("get_any_page: %#lx: unknown zero refcount page type %lx\n",
+				pfn, p->flags);
+			ret = -EIO;
+		}
+	} else {
+		/* Not a free page */
+		ret = 1;
+	}
+	unset_migratetype_isolate(p);
+	unlock_system_sleep();
+	return ret;
+}
+
+/**
+ * soft_offline_page - Soft offline a page.
+ * @page: page to offline
+ * @flags: flags. Same as memory_failure().
+ *
+ * Returns 0 on success, otherwise negated errno.
+ *
+ * Soft offline a page, by migration or invalidation,
+ * without killing anything. This is for the case when
+ * a page is not corrupted yet (so it's still valid to access),
+ * but has had a number of corrected errors and is better taken
+ * out.
+ *
+ * The actual policy on when to do that is maintained by
+ * user space.
+ *
+ * This should never impact any application or cause data loss,
+ * however it might take some time.
+ *
+ * This is not a 100% solution for all memory, but tries to be
+ * ``good enough'' for the majority of memory.
+ */
+int soft_offline_page(struct page *page, int flags)
+{
+	int ret;
+	unsigned long pfn = page_to_pfn(page);
+
+	ret = get_any_page(page, pfn, flags);
+	if (ret < 0)
+		return ret;
+	if (ret == 0)
+		goto done;
+
+	/*
+	 * Page cache page we can handle?
+	 */
+	if (!PageLRU(page)) {
+		/*
+		 * Try to free it.
+		 */
+		put_page(page);
+		shake_page(page, 1);
+
+		/*
+		 * Did it turn free?
+		 */
+		ret = get_any_page(page, pfn, flags);
+		if (ret < 0)
+			return ret;
+		if (ret == 0)
+			goto done;
+	}
+	if (!PageLRU(page)) {
+		pr_debug("soft_offline: %#lx: unknown non LRU page type %lx\n",
+				pfn, page->flags);
+		return -EIO;
+	}
+
+	lock_page(page);
+	wait_on_page_writeback(page);
+
+	/*
+	 * Synchronized using the page lock with memory_failure()
+	 */
+	if (PageHWPoison(page)) {
+		unlock_page(page);
+		put_page(page);
+		pr_debug("soft offline: %#lx page already poisoned\n", pfn);
+		return -EBUSY;
+	}
+
+	/*
+	 * Try to invalidate first. This should work for
+	 * non dirty unmapped page cache pages.
+	 */
+	ret = invalidate_inode_page(page);
+	unlock_page(page);
+
+	/*
+	 * Drop count because page migration doesn't like raised
+	 * counts. The page could get re-allocated, but if it becomes
+	 * LRU the isolation will just fail.
+	 * RED-PEN would be better to keep it isolated here, but we
+	 * would need to fix isolation locking first.
+	 */
+	put_page(page);
+	if (ret == 1) {
+		ret = 0;
+		pr_debug("soft_offline: %#lx: invalidated\n", pfn);
+		goto done;
+	}
+
+	/*
+	 * Simple invalidation didn't work.
+	 * Try to migrate to a new page instead. migrate.c
+	 * handles a large number of cases for us.
+	 */
+	ret = isolate_lru_page(page);
+	if (!ret) {
+		LIST_HEAD(pagelist);
+
+		list_add(&page->lru, &pagelist);
+		ret = migrate_pages(&pagelist, new_page, MPOL_MF_MOVE_ALL);
+		if (ret) {
+			pr_debug("soft offline: %#lx: migration failed %d, type %lx\n",
+				pfn, ret, page->flags);
+			if (ret > 0)
+				ret = -EIO;
+		}
+	} else {
+		pr_debug("soft offline: %#lx: isolation failed: %d, page count %d, type %lx\n",
+				pfn, ret, page_count(page), page->flags);
+	}
+	if (ret)
+		return ret;
+
+done:
+	atomic_long_add(1, &mce_bad_pages);
+	SetPageHWPoison(page);
+	/* keep elevated page count for bad page */
+	return ret;
+}
Index: linux/include/linux/mm.h
===================================================================
--- linux.orig/include/linux/mm.h
+++ linux/include/linux/mm.h
@@ -1324,8 +1324,9 @@ extern int __memory_failure(unsigned lon
 extern int unpoison_memory(unsigned long pfn);
 extern int sysctl_memory_failure_early_kill;
 extern int sysctl_memory_failure_recovery;
-extern void shake_page(struct page *p);
+extern void shake_page(struct page *p, int access);
 extern atomic_long_t mce_bad_pages;
+extern int soft_offline_page(struct page *page, int flags);
 
 #endif /* __KERNEL__ */
 #endif /* _LINUX_MM_H */
Index: linux/Documentation/ABI/testing/sysfs-memory-page-offline
===================================================================
--- /dev/null
+++ linux/Documentation/ABI/testing/sysfs-memory-page-offline
@@ -0,0 +1,44 @@
+What:		/sys/devices/system/memory/soft_offline_page
+Date:		Sep 2009
+KernelVersion:	2.6.33
+Contact:	andi@firstfloor.org
+Description:
+		Soft-offline the memory page containing the physical address
+		written into this file. Input is a hex number specifying the
+		physical address of the page. The kernel will then attempt
+		to soft-offline it, by moving the contents elsewhere or
+		dropping it if possible. The kernel will then be placed
+		on the bad page list and never be reused.
+
+		The offlining is done in kernel specific granuality.
+		Normally it's the base page size of the kernel, but
+		this might change.
+
+		The page must be still accessible, not poisoned. The
+		kernel will never kill anything for this, but rather
+		fail the offline.  Return value is the size of the
+		number, or a error when the offlining failed.  Reading
+		the file is not allowed.
+
+What:		/sys/devices/system/memory/hard_offline_page
+Date:		Sep 2009
+KernelVersion:	2.6.33
+Contact:	andi@firstfloor.org
+Description:
+		Hard-offline the memory page containing the physical
+		address written into this file. Input is a hex number
+		specifying the physical address of the page. The
+		kernel will then attempt to hard-offline the page, by
+		trying to drop the page or killing any owner or
+		triggering IO errors if needed.  Note this may kill
+		any processes owning the page. The kernel will avoid
+		to access this page assuming it's poisoned by the
+		hardware.
+
+		The offlining is done in kernel specific granuality.
+		Normally it's the base page size of the kernel, but
+		this might change.
+
+		Return value is the size of the number, or a error when
+		the offlining failed.
+		Reading the file is not allowed.
Index: linux/mm/hwpoison-inject.c
===================================================================
--- linux.orig/mm/hwpoison-inject.c
+++ linux/mm/hwpoison-inject.c
@@ -29,7 +29,7 @@ static int hwpoison_inject(void *data, u
 		return 0;
 
 	if (!PageLRU(p))
-		shake_page(p);
+		shake_page(p, 0);
 	/*
 	 * This implies unable to support non-LRU pages.
 	 */

^ permalink raw reply	[flat|nested] 61+ messages in thread

* [PATCH] [31/31] HWPOISON: Add a madvise() injector for soft page offlining
  2009-12-08 21:16 [PATCH] [0/31] HWPOISON 2.6.33 pre-merge posting Andi Kleen
                   ` (29 preceding siblings ...)
  2009-12-08 21:16 ` [PATCH] [30/31] HWPOISON: Add soft page offline support Andi Kleen
@ 2009-12-08 21:16 ` Andi Kleen
  2010-06-19 12:36   ` Michael Kerrisk
  30 siblings, 1 reply; 61+ messages in thread
From: Andi Kleen @ 2009-12-08 21:16 UTC (permalink / raw)
  To: fengguang.wu, linux-kernel, linux-mm


Process based injection is much easier to handle for test programs,
who can first bring a page into a specific state and then test.
So add a new MADV_SOFT_OFFLINE to soft offline a page, similar
to the existing hard offline injector.

Signed-off-by: Andi Kleen <ak@linux.intel.com>

---
 include/asm-generic/mman-common.h |    1 +
 mm/madvise.c                      |   15 ++++++++++++---
 2 files changed, 13 insertions(+), 3 deletions(-)

Index: linux/include/asm-generic/mman-common.h
===================================================================
--- linux.orig/include/asm-generic/mman-common.h
+++ linux/include/asm-generic/mman-common.h
@@ -35,6 +35,7 @@
 #define MADV_DONTFORK	10		/* don't inherit across fork */
 #define MADV_DOFORK	11		/* do inherit across fork */
 #define MADV_HWPOISON	100		/* poison a page for testing */
+#define MADV_SOFT_OFFLINE 101		/* soft offline page for testing */
 
 #define MADV_MERGEABLE   12		/* KSM may merge identical pages */
 #define MADV_UNMERGEABLE 13		/* KSM may not merge identical pages */
Index: linux/mm/madvise.c
===================================================================
--- linux.orig/mm/madvise.c
+++ linux/mm/madvise.c
@@ -9,6 +9,7 @@
 #include <linux/pagemap.h>
 #include <linux/syscalls.h>
 #include <linux/mempolicy.h>
+#include <linux/page-isolation.h>
 #include <linux/hugetlb.h>
 #include <linux/sched.h>
 #include <linux/ksm.h>
@@ -222,7 +223,7 @@ static long madvise_remove(struct vm_are
 /*
  * Error injection support for memory error handling.
  */
-static int madvise_hwpoison(unsigned long start, unsigned long end)
+static int madvise_hwpoison(int bhv, unsigned long start, unsigned long end)
 {
 	int ret = 0;
 
@@ -233,6 +234,14 @@ static int madvise_hwpoison(unsigned lon
 		int ret = get_user_pages_fast(start, 1, 0, &p);
 		if (ret != 1)
 			return ret;
+		if (bhv == MADV_SOFT_OFFLINE) {
+			printk(KERN_INFO "Soft offlining page %lx at %lx\n",
+				page_to_pfn(p), start);
+			ret = soft_offline_page(p, MF_COUNT_INCREASED);
+			if (ret)
+				break;
+			continue;
+		}
 		printk(KERN_INFO "Injecting memory failure for page %lx at %lx\n",
 		       page_to_pfn(p), start);
 		/* Ignore return value for now */
@@ -333,8 +342,8 @@ SYSCALL_DEFINE3(madvise, unsigned long,
 	size_t len;
 
 #ifdef CONFIG_MEMORY_FAILURE
-	if (behavior == MADV_HWPOISON)
-		return madvise_hwpoison(start, start+len_in);
+	if (behavior == MADV_HWPOISON || behavior == MADV_SOFT_OFFLINE)
+		return madvise_hwpoison(behavior, start, start+len_in);
 #endif
 	if (!madvise_behavior_valid(behavior))
 		return error;

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH] [19/31] mm: export stable page flags
  2009-12-08 21:16 ` [PATCH] [19/31] mm: export stable page flags Andi Kleen
@ 2009-12-08 22:27   ` Matt Mackall
  2009-12-09  2:00     ` Wu Fengguang
  0 siblings, 1 reply; 61+ messages in thread
From: Matt Mackall @ 2009-12-08 22:27 UTC (permalink / raw)
  To: Andi Kleen; +Cc: fengguang.wu, npiggin, cl, linux-kernel, linux-mm

On Tue, 2009-12-08 at 22:16 +0100, Andi Kleen wrote:
> From: Wu Fengguang <fengguang.wu@intel.com>
> 
> Rename get_uflags() to stable_page_flags() and make it a global function
> for use in the hwpoison page flags filter, which need to compare user
> page flags with the value provided by user space.
> 
> Also move KPF_* to kernel-page-flags.h for use by user space tools.
> 
> CC: Matt Mackall <mpm@selenic.com>
> CC: Nick Piggin <npiggin@suse.de>
> CC: Christoph Lameter <cl@linux-foundation.org>
> CC: Andi Kleen <andi@firstfloor.org>
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> Signed-off-by: Andi Kleen <ak@linux.intel.com>

Acked-by: Matt Mackall <mpm@selenic.com>
-- 
http://selenic.com : development and support for Mercurial and Linux



^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH] [19/31] mm: export stable page flags
  2009-12-08 22:27   ` Matt Mackall
@ 2009-12-09  2:00     ` Wu Fengguang
  2009-12-09 21:38       ` Matt Mackall
  2009-12-10  1:50       ` Andi Kleen
  0 siblings, 2 replies; 61+ messages in thread
From: Wu Fengguang @ 2009-12-09  2:00 UTC (permalink / raw)
  To: Matt Mackall; +Cc: Andi Kleen, npiggin, cl, linux-kernel, linux-mm

On Wed, Dec 09, 2009 at 06:27:31AM +0800, Matt Mackall wrote:
> On Tue, 2009-12-08 at 22:16 +0100, Andi Kleen wrote:
> > From: Wu Fengguang <fengguang.wu@intel.com>
> > 
> > Rename get_uflags() to stable_page_flags() and make it a global function
> > for use in the hwpoison page flags filter, which need to compare user
> > page flags with the value provided by user space.
> > 
> > Also move KPF_* to kernel-page-flags.h for use by user space tools.
> > 
> > CC: Matt Mackall <mpm@selenic.com>
> > CC: Nick Piggin <npiggin@suse.de>
> > CC: Christoph Lameter <cl@linux-foundation.org>
> > CC: Andi Kleen <andi@firstfloor.org>
> > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> > Signed-off-by: Andi Kleen <ak@linux.intel.com>
> 
> Acked-by: Matt Mackall <mpm@selenic.com>

Andi and Matt,

Sorry the stable_page_flags() will be undefined on
!CONFIG_PROC_PAGE_MONITOR (it is almost always on,
except for some embedded systems).

Currently the easy solution is to add a Kconfig dependency to
CONFIG_PROC_PAGE_MONITOR.  When there comes more users (ie. some
ftrace event), we can then always compile in stable_page_flags().

Thanks,
Fengguang
---
 mm/Kconfig          |    1 +
 mm/memory-failure.c |    4 ++++
 2 files changed, 5 insertions(+)

--- linux-mm.orig/mm/Kconfig	2009-12-09 09:47:51.000000000 +0800
+++ linux-mm/mm/Kconfig	2009-12-09 09:58:54.000000000 +0800
@@ -259,6 +259,7 @@ config MEMORY_FAILURE
 config HWPOISON_INJECT
 	tristate "HWPoison pages injector"
 	depends on MEMORY_FAILURE && DEBUG_KERNEL
+	depends on PROC_PAGE_MONITOR
 
 config NOMMU_INITIAL_TRIM_EXCESS
 	int "Turn on mmap() excess space trimming before booting"
--- linux-mm.orig/mm/memory-failure.c	2009-12-09 09:49:13.000000000 +0800
+++ linux-mm/mm/memory-failure.c	2009-12-09 09:55:42.000000000 +0800
@@ -51,6 +51,7 @@ int sysctl_memory_failure_recovery __rea
 
 atomic_long_t mce_bad_pages __read_mostly = ATOMIC_LONG_INIT(0);
 
+#ifdef CONFIG_HWPOISON_INJECT
 u32 hwpoison_filter_enable = 1;
 u32 hwpoison_filter_dev_major = ~0U;
 u32 hwpoison_filter_dev_minor = ~0U;
@@ -160,6 +161,9 @@ int hwpoison_filter(struct page *p)
 	return 0;
 }
 EXPORT_SYMBOL_GPL(hwpoison_filter);
+#else
+int hwpoison_filter(struct page *p) { return 0; }
+#endif
 
 /*
  * Send all the processes who have the page mapped an ``action optional''

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH] [23/31] HWPOISON: add memory cgroup filter
  2009-12-08 21:16 ` [PATCH] [23/31] HWPOISON: add memory cgroup filter Andi Kleen
@ 2009-12-09  5:04   ` Li Zefan
  2009-12-09  5:06     ` KAMEZAWA Hiroyuki
  2009-12-09  9:15     ` Andi Kleen
  2009-12-09 20:47   ` Paul Menage
  1 sibling, 2 replies; 61+ messages in thread
From: Li Zefan @ 2009-12-09  5:04 UTC (permalink / raw)
  To: Andi Kleen
  Cc: kosaki.motohiro, hugh.dickins, nishimura, balbir,
	kamezawa.hiroyu, menage, npiggin, fengguang.wu, linux-kernel,
	linux-mm

> +#ifdef	CONFIG_CGROUP_MEM_RES_CTLR_SWAP
> +u64 hwpoison_filter_memcg;
> +EXPORT_SYMBOL_GPL(hwpoison_filter_memcg);
> +static int hwpoison_filter_task(struct page *p)
> +{
> +	struct mem_cgroup *mem;
> +	struct cgroup_subsys_state *css;
> +	unsigned long ino;
> +
> +	if (!hwpoison_filter_memcg)
> +		return 0;
> +
> +	mem = try_get_mem_cgroup_from_page(p);
> +	if (!mem)
> +		return -EINVAL;
> +
> +	css = mem_cgroup_css(mem);
> +	ino = css->cgroup->dentry->d_inode->i_ino;

I have a question, can try_get_mem_cgroup_from_page() return
root_mem_cgroup?

if it can, then css->cgroup->dentry is NULL, if memcg is
not mounted and there is no subdir in memcg. Because the root
cgroup of an inactive subsystem has no dentry.

> +	css_put(css);
> +
> +	if (ino != hwpoison_filter_memcg)
> +		return -EINVAL;
> +
> +	return 0;
> +}
> +#else
> +static int hwpoison_filter_task(struct page *p) { return 0; }
> +#endif

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH] [23/31] HWPOISON: add memory cgroup filter
  2009-12-09  5:04   ` Li Zefan
@ 2009-12-09  5:06     ` KAMEZAWA Hiroyuki
  2009-12-09  5:33       ` Balbir Singh
  2009-12-09  9:15     ` Andi Kleen
  1 sibling, 1 reply; 61+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-12-09  5:06 UTC (permalink / raw)
  To: Li Zefan
  Cc: Andi Kleen, kosaki.motohiro, hugh.dickins, nishimura, balbir,
	menage, npiggin, fengguang.wu, linux-kernel, linux-mm

On Wed, 09 Dec 2009 13:04:06 +0800
Li Zefan <lizf@cn.fujitsu.com> wrote:

> > +#ifdef	CONFIG_CGROUP_MEM_RES_CTLR_SWAP
> > +u64 hwpoison_filter_memcg;
> > +EXPORT_SYMBOL_GPL(hwpoison_filter_memcg);
> > +static int hwpoison_filter_task(struct page *p)
> > +{
> > +	struct mem_cgroup *mem;
> > +	struct cgroup_subsys_state *css;
> > +	unsigned long ino;
> > +
> > +	if (!hwpoison_filter_memcg)
> > +		return 0;
> > +
> > +	mem = try_get_mem_cgroup_from_page(p);
> > +	if (!mem)
> > +		return -EINVAL;
> > +
> > +	css = mem_cgroup_css(mem);
> > +	ino = css->cgroup->dentry->d_inode->i_ino;
> 
> I have a question, can try_get_mem_cgroup_from_page() return
> root_mem_cgroup?
> 
yes.

> if it can, then css->cgroup->dentry is NULL, if memcg is
> not mounted and there is no subdir in memcg. Because the root
> cgroup of an inactive subsystem has no dentry.
> 

Nice catch. It sounds possible. That should be handled.

Regards,
-Kame


^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH] [23/31] HWPOISON: add memory cgroup filter
  2009-12-09  5:06     ` KAMEZAWA Hiroyuki
@ 2009-12-09  5:33       ` Balbir Singh
  0 siblings, 0 replies; 61+ messages in thread
From: Balbir Singh @ 2009-12-09  5:33 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Li Zefan, Andi Kleen, kosaki.motohiro, hugh.dickins, nishimura,
	menage, npiggin, fengguang.wu, linux-kernel, linux-mm

* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-12-09 14:06:20]:

> On Wed, 09 Dec 2009 13:04:06 +0800
> Li Zefan <lizf@cn.fujitsu.com> wrote:
> 
> > > +#ifdef	CONFIG_CGROUP_MEM_RES_CTLR_SWAP
> > > +u64 hwpoison_filter_memcg;
> > > +EXPORT_SYMBOL_GPL(hwpoison_filter_memcg);
> > > +static int hwpoison_filter_task(struct page *p)
> > > +{
> > > +	struct mem_cgroup *mem;
> > > +	struct cgroup_subsys_state *css;
> > > +	unsigned long ino;
> > > +
> > > +	if (!hwpoison_filter_memcg)
> > > +		return 0;
> > > +
> > > +	mem = try_get_mem_cgroup_from_page(p);
> > > +	if (!mem)
> > > +		return -EINVAL;
> > > +
> > > +	css = mem_cgroup_css(mem);
> > > +	ino = css->cgroup->dentry->d_inode->i_ino;
> > 
> > I have a question, can try_get_mem_cgroup_from_page() return
> > root_mem_cgroup?
> > 
> yes.
> 
> > if it can, then css->cgroup->dentry is NULL, if memcg is
> > not mounted and there is no subdir in memcg. Because the root
> > cgroup of an inactive subsystem has no dentry.
> > 
> 
> Nice catch. It sounds possible. That should be handled.
> 
> 

Yes, agreed, good catch!

-- 
	Balbir

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH] [23/31] HWPOISON: add memory cgroup filter
  2009-12-09  5:04   ` Li Zefan
  2009-12-09  5:06     ` KAMEZAWA Hiroyuki
@ 2009-12-09  9:15     ` Andi Kleen
  1 sibling, 0 replies; 61+ messages in thread
From: Andi Kleen @ 2009-12-09  9:15 UTC (permalink / raw)
  To: Li Zefan
  Cc: Andi Kleen, kosaki.motohiro, hugh.dickins, nishimura, balbir,
	kamezawa.hiroyu, menage, npiggin, fengguang.wu, linux-kernel,
	linux-mm

> I have a question, can try_get_mem_cgroup_from_page() return
> root_mem_cgroup?

It could be called for any page.


> if it can, then css->cgroup->dentry is NULL, if memcg is
> not mounted and there is no subdir in memcg. Because the root
> cgroup of an inactive subsystem has no dentry.

Thanks. I'll just add an return -EINVAL for this case, sounds good?

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH] [23/31] HWPOISON: add memory cgroup filter
  2009-12-08 21:16 ` [PATCH] [23/31] HWPOISON: add memory cgroup filter Andi Kleen
  2009-12-09  5:04   ` Li Zefan
@ 2009-12-09 20:47   ` Paul Menage
  2009-12-09 23:56     ` KAMEZAWA Hiroyuki
  2009-12-10  1:42     ` Andi Kleen
  1 sibling, 2 replies; 61+ messages in thread
From: Paul Menage @ 2009-12-09 20:47 UTC (permalink / raw)
  To: Andi Kleen
  Cc: kosaki.motohiro, hugh.dickins, nishimura, balbir,
	kamezawa.hiroyu, lizf, npiggin, fengguang.wu, linux-kernel,
	linux-mm

On Tue, Dec 8, 2009 at 1:16 PM, Andi Kleen <andi@firstfloor.org> wrote:
>
> The hwpoison test suite need to inject hwpoison to a collection of
> selected task pages, and must not touch pages not owned by them and
> thus kill important system processes such as init. (But it's OK to
> mis-hwpoison free/unowned pages as well as shared clean pages.
> Mis-hwpoison of shared dirty pages will kill all tasks, so the test
> suite will target all or non of such tasks in the first place.)

While the functionality sounds useful, the interface (passing an inode
number) feels a bit ugly to me. Also, if that group is deleted and a
new cgroup created, you could end up reusing the inode number.

How about an approach where you write either the cgroup path (relative
to the memcg mount) or an fd open on the desired cgroup? Then you
could store a (counted) css reference rather than an inode number,
which would make the filter function cleaner too, since it would just
need to compare css objects.

Paul

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH] [19/31] mm: export stable page flags
  2009-12-09  2:00     ` Wu Fengguang
@ 2009-12-09 21:38       ` Matt Mackall
  2009-12-10  1:50       ` Andi Kleen
  1 sibling, 0 replies; 61+ messages in thread
From: Matt Mackall @ 2009-12-09 21:38 UTC (permalink / raw)
  To: Wu Fengguang; +Cc: Andi Kleen, npiggin, cl, linux-kernel, linux-mm

On Wed, 2009-12-09 at 10:00 +0800, Wu Fengguang wrote:
> On Wed, Dec 09, 2009 at 06:27:31AM +0800, Matt Mackall wrote:
> > On Tue, 2009-12-08 at 22:16 +0100, Andi Kleen wrote:
> > > From: Wu Fengguang <fengguang.wu@intel.com>
> > > 
> > > Rename get_uflags() to stable_page_flags() and make it a global function
> > > for use in the hwpoison page flags filter, which need to compare user
> > > page flags with the value provided by user space.
> > > 
> > > Also move KPF_* to kernel-page-flags.h for use by user space tools.
> > > 
> > > CC: Matt Mackall <mpm@selenic.com>
> > > CC: Nick Piggin <npiggin@suse.de>
> > > CC: Christoph Lameter <cl@linux-foundation.org>
> > > CC: Andi Kleen <andi@firstfloor.org>
> > > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> > > Signed-off-by: Andi Kleen <ak@linux.intel.com>
> > 
> > Acked-by: Matt Mackall <mpm@selenic.com>
> 
> Andi and Matt,
> 
> Sorry the stable_page_flags() will be undefined on
> !CONFIG_PROC_PAGE_MONITOR (it is almost always on,
> except for some embedded systems).
> 
> Currently the easy solution is to add a Kconfig dependency to
> CONFIG_PROC_PAGE_MONITOR.  When there comes more users (ie. some
> ftrace event), we can then always compile in stable_page_flags().

No objections.

Acked-by: Matt Mackall <mpm@selenic.com>

-- 
http://selenic.com : development and support for Mercurial and Linux



^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH] [23/31] HWPOISON: add memory cgroup filter
  2009-12-09 20:47   ` Paul Menage
@ 2009-12-09 23:56     ` KAMEZAWA Hiroyuki
  2009-12-10  1:42     ` Andi Kleen
  1 sibling, 0 replies; 61+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-12-09 23:56 UTC (permalink / raw)
  To: Paul Menage
  Cc: Andi Kleen, kosaki.motohiro, hugh.dickins, nishimura, balbir,
	lizf, npiggin, fengguang.wu, linux-kernel, linux-mm

On Wed, 9 Dec 2009 12:47:27 -0800
Paul Menage <menage@google.com> wrote:

> On Tue, Dec 8, 2009 at 1:16 PM, Andi Kleen <andi@firstfloor.org> wrote:
> >
> > The hwpoison test suite need to inject hwpoison to a collection of
> > selected task pages, and must not touch pages not owned by them and
> > thus kill important system processes such as init. (But it's OK to
> > mis-hwpoison free/unowned pages as well as shared clean pages.
> > Mis-hwpoison of shared dirty pages will kill all tasks, so the test
> > suite will target all or non of such tasks in the first place.)
> 
> While the functionality sounds useful, the interface (passing an inode
> number) feels a bit ugly to me. Also, if that group is deleted and a
> new cgroup created, you could end up reusing the inode number.
> 
I agree.

> How about an approach where you write either the cgroup path (relative
> to the memcg mount) or an fd open on the desired cgroup? Then you
> could store a (counted) css reference rather than an inode number,
> which would make the filter function cleaner too, since it would just
> need to compare css objects.
> 
Sounds reasonable.

Thanks,
-Kame



^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH] [23/31] HWPOISON: add memory cgroup filter
  2009-12-09 20:47   ` Paul Menage
  2009-12-09 23:56     ` KAMEZAWA Hiroyuki
@ 2009-12-10  1:42     ` Andi Kleen
  2009-12-10  2:21       ` Balbir Singh
  1 sibling, 1 reply; 61+ messages in thread
From: Andi Kleen @ 2009-12-10  1:42 UTC (permalink / raw)
  To: Paul Menage
  Cc: Andi Kleen, kosaki.motohiro, hugh.dickins, nishimura, balbir,
	kamezawa.hiroyu, lizf, npiggin, fengguang.wu, linux-kernel,
	linux-mm

> While the functionality sounds useful, the interface (passing an inode
> number) feels a bit ugly to me. Also, if that group is deleted and a
> new cgroup created, you could end up reusing the inode number.

Please note this is just a testing interface, doesn't need to be
100% fool-proof. It'll never be used in production.

> 
> How about an approach where you write either the cgroup path (relative
> to the memcg mount) or an fd open on the desired cgroup? Then you
> could store a (counted) css reference rather than an inode number,
> which would make the filter function cleaner too, since it would just
> need to compare css objects.

Sounds complicated, I assume it would be much more code?
I would prefer to keep the testing interfaces as simple as possible.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH] [19/31] mm: export stable page flags
  2009-12-09  2:00     ` Wu Fengguang
  2009-12-09 21:38       ` Matt Mackall
@ 2009-12-10  1:50       ` Andi Kleen
  2009-12-10  2:09         ` Wu Fengguang
  1 sibling, 1 reply; 61+ messages in thread
From: Andi Kleen @ 2009-12-10  1:50 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Matt Mackall, Andi Kleen, npiggin, cl, linux-kernel, linux-mm

> Sorry the stable_page_flags() will be undefined on
> !CONFIG_PROC_PAGE_MONITOR (it is almost always on,
> except for some embedded systems).
> 
> Currently the easy solution is to add a Kconfig dependency to
> CONFIG_PROC_PAGE_MONITOR.  When there comes more users (ie. some
> ftrace event), we can then always compile in stable_page_flags().

I decided to turn it into select instead.

Your original patch didn't handle hwpoison as module btw.

-Andi

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH] [19/31] mm: export stable page flags
  2009-12-10  1:50       ` Andi Kleen
@ 2009-12-10  2:09         ` Wu Fengguang
  2009-12-10 13:42           ` Andi Kleen
  0 siblings, 1 reply; 61+ messages in thread
From: Wu Fengguang @ 2009-12-10  2:09 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Matt Mackall, npiggin, cl, linux-kernel, linux-mm

On Thu, Dec 10, 2009 at 09:50:14AM +0800, Andi Kleen wrote:
> > Sorry the stable_page_flags() will be undefined on
> > !CONFIG_PROC_PAGE_MONITOR (it is almost always on,
> > except for some embedded systems).
> > 
> > Currently the easy solution is to add a Kconfig dependency to
> > CONFIG_PROC_PAGE_MONITOR.  When there comes more users (ie. some
> > ftrace event), we can then always compile in stable_page_flags().
> 
> I decided to turn it into select instead.

OK, that would be good.

> Your original patch didn't handle hwpoison as module btw.

Yes I realized it later..

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH] [23/31] HWPOISON: add memory cgroup filter
  2009-12-10  1:42     ` Andi Kleen
@ 2009-12-10  2:21       ` Balbir Singh
  2009-12-11  2:14         ` Wu Fengguang
  0 siblings, 1 reply; 61+ messages in thread
From: Balbir Singh @ 2009-12-10  2:21 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Paul Menage, kosaki.motohiro, hugh.dickins, nishimura,
	kamezawa.hiroyu, lizf, npiggin, fengguang.wu, linux-kernel,
	linux-mm

* Andi Kleen <andi@firstfloor.org> [2009-12-10 02:42:12]:

> > While the functionality sounds useful, the interface (passing an inode
> > number) feels a bit ugly to me. Also, if that group is deleted and a
> > new cgroup created, you could end up reusing the inode number.
> 
> Please note this is just a testing interface, doesn't need to be
> 100% fool-proof. It'll never be used in production.
> 
> > 
> > How about an approach where you write either the cgroup path (relative
> > to the memcg mount) or an fd open on the desired cgroup? Then you
> > could store a (counted) css reference rather than an inode number,
> > which would make the filter function cleaner too, since it would just
> > need to compare css objects.
> 
> Sounds complicated, I assume it would be much more code?
> I would prefer to keep the testing interfaces as simple as possible.
>

We do this for cgroupstats and the code is not very complicated. In
case you want to look, the user space code is at
Documentation/accounting/getdelays.c and the kernel code is in
kernel/taskstats.c 

-- 
	Balbir

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH] [19/31] mm: export stable page flags
  2009-12-10  2:09         ` Wu Fengguang
@ 2009-12-10 13:42           ` Andi Kleen
  0 siblings, 0 replies; 61+ messages in thread
From: Andi Kleen @ 2009-12-10 13:42 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andi Kleen, Matt Mackall, npiggin, cl, linux-kernel, linux-mm

> > Your original patch didn't handle hwpoison as module btw.
> 
> Yes I realized it later..

It's always a big trap, I've seen lots of people (including myself)
run into it.

I preferred to not have the ifdefs in the core module, simply
because it's very little additional code.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH] [23/31] HWPOISON: add memory cgroup filter
  2009-12-10  2:21       ` Balbir Singh
@ 2009-12-11  2:14         ` Wu Fengguang
  2009-12-14 12:53           ` Andi Kleen
  0 siblings, 1 reply; 61+ messages in thread
From: Wu Fengguang @ 2009-12-11  2:14 UTC (permalink / raw)
  To: Balbir Singh
  Cc: Andi Kleen, Paul Menage, kosaki.motohiro, hugh.dickins,
	nishimura, kamezawa.hiroyu, lizf, npiggin, linux-kernel,
	linux-mm, Li, Haicheng

On Thu, Dec 10, 2009 at 10:21:13AM +0800, Balbir Singh wrote:
> * Andi Kleen <andi@firstfloor.org> [2009-12-10 02:42:12]:
> 
> > > While the functionality sounds useful, the interface (passing an inode
> > > number) feels a bit ugly to me. Also, if that group is deleted and a
> > > new cgroup created, you could end up reusing the inode number.
> > 
> > Please note this is just a testing interface, doesn't need to be
> > 100% fool-proof. It'll never be used in production.
> > 
> > > 
> > > How about an approach where you write either the cgroup path (relative
> > > to the memcg mount) or an fd open on the desired cgroup? Then you
> > > could store a (counted) css reference rather than an inode number,
> > > which would make the filter function cleaner too, since it would just
> > > need to compare css objects.
> > 
> > Sounds complicated, I assume it would be much more code?
> > I would prefer to keep the testing interfaces as simple as possible.
> >
> 
> We do this for cgroupstats and the code is not very complicated. In
> case you want to look, the user space code is at
> Documentation/accounting/getdelays.c and the kernel code is in
> kernel/taskstats.c 

Balbir, thanks for the tip.

We could keep an fd open on the desired cgroup, in user space: 

        #!/bin/bash

        mkdir /cgroup/hwpoison && \
        exec 9<>/cgroup/hwpoison/tasks || exit 1

A bit simpler than an in-kernel fget_light() or CSS refcount :)

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH] [23/31] HWPOISON: add memory cgroup filter
  2009-12-11  2:14         ` Wu Fengguang
@ 2009-12-14 12:53           ` Andi Kleen
  0 siblings, 0 replies; 61+ messages in thread
From: Andi Kleen @ 2009-12-14 12:53 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Balbir Singh, Paul Menage, kosaki.motohiro, hugh.dickins,
	nishimura, kamezawa.hiroyu, lizf, npiggin, linux-kernel,
	linux-mm, Li, Haicheng

Wu Fengguang <fengguang.wu@intel.com> writes:
>
> We could keep an fd open on the desired cgroup, in user space: 
>
>         #!/bin/bash
>
>         mkdir /cgroup/hwpoison && \
>         exec 9<>/cgroup/hwpoison/tasks || exit 1
>
> A bit simpler than an in-kernel fget_light() or CSS refcount :)

FYI, I decided to not do any of this in .33, but just keep the 
ugly-but-working inode hack. We can look at fixing that for .34.
These interfaces are debugfs, so can be changed.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH] [31/31] HWPOISON: Add a madvise() injector for soft page  offlining
  2009-12-08 21:16 ` [PATCH] [31/31] HWPOISON: Add a madvise() injector for soft page offlining Andi Kleen
@ 2010-06-19 12:36   ` Michael Kerrisk
  2010-06-19 13:20     ` Andi Kleen
  0 siblings, 1 reply; 61+ messages in thread
From: Michael Kerrisk @ 2010-06-19 12:36 UTC (permalink / raw)
  To: Andi Kleen; +Cc: fengguang.wu, linux-kernel, linux-mm, Michael Kerrisk

Hi Andi,

On Tue, Dec 8, 2009 at 11:16 PM, Andi Kleen <andi@firstfloor.org> wrote:
>
> Process based injection is much easier to handle for test programs,
> who can first bring a page into a specific state and then test.
> So add a new MADV_SOFT_OFFLINE to soft offline a page, similar
> to the existing hard offline injector.

I see that this made its way into 2.6.33. Could you write a short
piece on it for the madvise.2 man page?

Thanks,

Michael


> Signed-off-by: Andi Kleen <ak@linux.intel.com>
>
> ---
>  include/asm-generic/mman-common.h |    1 +
>  mm/madvise.c                      |   15 ++++++++++++---
>  2 files changed, 13 insertions(+), 3 deletions(-)
>
> Index: linux/include/asm-generic/mman-common.h
> ===================================================================
> --- linux.orig/include/asm-generic/mman-common.h
> +++ linux/include/asm-generic/mman-common.h
> @@ -35,6 +35,7 @@
>  #define MADV_DONTFORK  10              /* don't inherit across fork */
>  #define MADV_DOFORK    11              /* do inherit across fork */
>  #define MADV_HWPOISON  100             /* poison a page for testing */
> +#define MADV_SOFT_OFFLINE 101          /* soft offline page for testing */
>
>  #define MADV_MERGEABLE   12            /* KSM may merge identical pages */
>  #define MADV_UNMERGEABLE 13            /* KSM may not merge identical pages */
> Index: linux/mm/madvise.c
> ===================================================================
> --- linux.orig/mm/madvise.c
> +++ linux/mm/madvise.c
> @@ -9,6 +9,7 @@
>  #include <linux/pagemap.h>
>  #include <linux/syscalls.h>
>  #include <linux/mempolicy.h>
> +#include <linux/page-isolation.h>
>  #include <linux/hugetlb.h>
>  #include <linux/sched.h>
>  #include <linux/ksm.h>
> @@ -222,7 +223,7 @@ static long madvise_remove(struct vm_are
>  /*
>  * Error injection support for memory error handling.
>  */
> -static int madvise_hwpoison(unsigned long start, unsigned long end)
> +static int madvise_hwpoison(int bhv, unsigned long start, unsigned long end)
>  {
>        int ret = 0;
>
> @@ -233,6 +234,14 @@ static int madvise_hwpoison(unsigned lon
>                int ret = get_user_pages_fast(start, 1, 0, &p);
>                if (ret != 1)
>                        return ret;
> +               if (bhv == MADV_SOFT_OFFLINE) {
> +                       printk(KERN_INFO "Soft offlining page %lx at %lx\n",
> +                               page_to_pfn(p), start);
> +                       ret = soft_offline_page(p, MF_COUNT_INCREASED);
> +                       if (ret)
> +                               break;
> +                       continue;
> +               }
>                printk(KERN_INFO "Injecting memory failure for page %lx at %lx\n",
>                       page_to_pfn(p), start);
>                /* Ignore return value for now */
> @@ -333,8 +342,8 @@ SYSCALL_DEFINE3(madvise, unsigned long,
>        size_t len;
>
>  #ifdef CONFIG_MEMORY_FAILURE
> -       if (behavior == MADV_HWPOISON)
> -               return madvise_hwpoison(start, start+len_in);
> +       if (behavior == MADV_HWPOISON || behavior == MADV_SOFT_OFFLINE)
> +               return madvise_hwpoison(behavior, start, start+len_in);
>  #endif
>        if (!madvise_behavior_valid(behavior))
>                return error;
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
>



-- 
Michael Kerrisk Linux man-pages maintainer;
http://www.kernel.org/doc/man-pages/
Author of "The Linux Programming Interface", http://blog.man7.org/

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH] [31/31] HWPOISON: Add a madvise() injector for soft page offlining
  2010-06-19 12:36   ` Michael Kerrisk
@ 2010-06-19 13:20     ` Andi Kleen
  2010-06-19 13:25       ` Michael Kerrisk
  0 siblings, 1 reply; 61+ messages in thread
From: Andi Kleen @ 2010-06-19 13:20 UTC (permalink / raw)
  To: Michael Kerrisk; +Cc: Andi Kleen, fengguang.wu, linux-kernel, linux-mm

On Sat, Jun 19, 2010 at 02:36:28PM +0200, Michael Kerrisk wrote:
> Hi Andi,
> 
> On Tue, Dec 8, 2009 at 11:16 PM, Andi Kleen <andi@firstfloor.org> wrote:
> >
> > Process based injection is much easier to handle for test programs,
> > who can first bring a page into a specific state and then test.
> > So add a new MADV_SOFT_OFFLINE to soft offline a page, similar
> > to the existing hard offline injector.
> 
> I see that this made its way into 2.6.33. Could you write a short
> piece on it for the madvise.2 man page?

Also fixed the previous snippet slightly.


commit edb43354f0ffc04bf4f23f01261f9ea9f43e0d3d
Author: Andi Kleen <ak@linux.intel.com>
Date:   Sat Jun 19 15:19:28 2010 +0200

    MADV_SOFT_OFFLINE
    
    Signed-off-by: Andi Kleen <ak@linux.intel.com>

diff --git a/man2/madvise.2 b/man2/madvise.2
index db29feb..9dccd97 100644
--- a/man2/madvise.2
+++ b/man2/madvise.2
@@ -154,7 +154,15 @@ processes.
 This operation may result in the calling process receiving a
 .B SIGBUS
 and the page being unmapped.
-This feature is intended for memory testing.
+This feature is intended for testing of memory error handling code.
+This feature is only available if the kernel was configured with
+.BR CONFIG_MEMORY_FAILURE .
+.TP
+.BR MADV_SOFT_OFFLINE " (Since Linux 2.6.33)
+Soft offline a page. This will result in the memory of the page
+being copied to a new page and original page be offlined. The operation
+should be transparent to the calling process.
+This feature is intended for testing of memory error handling code.
 This feature is only available if the kernel was configured with
 .BR CONFIG_MEMORY_FAILURE .
 .TP


^ permalink raw reply related	[flat|nested] 61+ messages in thread

* Re: [PATCH] [31/31] HWPOISON: Add a madvise() injector for soft page  offlining
  2010-06-19 13:20     ` Andi Kleen
@ 2010-06-19 13:25       ` Michael Kerrisk
  2010-06-19 13:30         ` Andi Kleen
  0 siblings, 1 reply; 61+ messages in thread
From: Michael Kerrisk @ 2010-06-19 13:25 UTC (permalink / raw)
  To: Andi Kleen; +Cc: fengguang.wu, linux-kernel, linux-mm

Hi Andi,

Thanks for this. Some comments below.

On Sat, Jun 19, 2010 at 3:20 PM, Andi Kleen <andi@firstfloor.org> wrote:
> On Sat, Jun 19, 2010 at 02:36:28PM +0200, Michael Kerrisk wrote:
>> Hi Andi,
>>
>> On Tue, Dec 8, 2009 at 11:16 PM, Andi Kleen <andi@firstfloor.org> wrote:
>> >
>> > Process based injection is much easier to handle for test programs,
>> > who can first bring a page into a specific state and then test.
>> > So add a new MADV_SOFT_OFFLINE to soft offline a page, similar
>> > to the existing hard offline injector.
>>
>> I see that this made its way into 2.6.33. Could you write a short
>> piece on it for the madvise.2 man page?
>
> Also fixed the previous snippet slightly.

(thanks)

> commit edb43354f0ffc04bf4f23f01261f9ea9f43e0d3d
> Author: Andi Kleen <ak@linux.intel.com>
> Date:   Sat Jun 19 15:19:28 2010 +0200
>
>    MADV_SOFT_OFFLINE
>
>    Signed-off-by: Andi Kleen <ak@linux.intel.com>
>
> diff --git a/man2/madvise.2 b/man2/madvise.2
> index db29feb..9dccd97 100644
> --- a/man2/madvise.2
> +++ b/man2/madvise.2
> @@ -154,7 +154,15 @@ processes.
>  This operation may result in the calling process receiving a
>  .B SIGBUS
>  and the page being unmapped.
> -This feature is intended for memory testing.
> +This feature is intended for testing of memory error handling code.
> +This feature is only available if the kernel was configured with
> +.BR CONFIG_MEMORY_FAILURE .
> +.TP
> +.BR MADV_SOFT_OFFLINE " (Since Linux 2.6.33)
> +Soft offline a page. This will result in the memory of the page
> +being copied to a new page and original page be offlined. The operation

Can you explain the term "offlined" please.

> +should be transparent to the calling process.

Does "should be transparent" mean "is normally invisible"?

Thanks,

Michael

> +This feature is intended for testing of memory error handling code.
>  This feature is only available if the kernel was configured with
>  .BR CONFIG_MEMORY_FAILURE .
>  .TP
>
>



-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Author of "The Linux Programming Interface" http://blog.man7.org/

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH] [31/31] HWPOISON: Add a madvise() injector for soft page offlining
  2010-06-19 13:25       ` Michael Kerrisk
@ 2010-06-19 13:30         ` Andi Kleen
  2010-06-19 13:43           ` Michael Kerrisk
  0 siblings, 1 reply; 61+ messages in thread
From: Andi Kleen @ 2010-06-19 13:30 UTC (permalink / raw)
  To: Michael Kerrisk; +Cc: Andi Kleen, fengguang.wu, linux-kernel, linux-mm

On Sat, Jun 19, 2010 at 03:25:16PM +0200, Michael Kerrisk wrote:
> Hi Andi,
> 
> Thanks for this. Some comments below.
> 
> On Sat, Jun 19, 2010 at 3:20 PM, Andi Kleen <andi@firstfloor.org> wrote:
> > On Sat, Jun 19, 2010 at 02:36:28PM +0200, Michael Kerrisk wrote:
> >> Hi Andi,
> >>
> >> On Tue, Dec 8, 2009 at 11:16 PM, Andi Kleen <andi@firstfloor.org> wrote:
> >> >
> >> > Process based injection is much easier to handle for test programs,
> >> > who can first bring a page into a specific state and then test.
> >> > So add a new MADV_SOFT_OFFLINE to soft offline a page, similar
> >> > to the existing hard offline injector.
> >>
> >> I see that this made its way into 2.6.33. Could you write a short
> >> piece on it for the madvise.2 man page?
> >
> > Also fixed the previous snippet slightly.
> 
> (thanks)
> 
> > commit edb43354f0ffc04bf4f23f01261f9ea9f43e0d3d
> > Author: Andi Kleen <ak@linux.intel.com>
> > Date:   Sat Jun 19 15:19:28 2010 +0200
> >
> >    MADV_SOFT_OFFLINE
> >
> >    Signed-off-by: Andi Kleen <ak@linux.intel.com>
> >
> > diff --git a/man2/madvise.2 b/man2/madvise.2
> > index db29feb..9dccd97 100644
> > --- a/man2/madvise.2
> > +++ b/man2/madvise.2
> > @@ -154,7 +154,15 @@ processes.
> >  This operation may result in the calling process receiving a
> >  .B SIGBUS
> >  and the page being unmapped.
> > -This feature is intended for memory testing.
> > +This feature is intended for testing of memory error handling code.
> > +This feature is only available if the kernel was configured with
> > +.BR CONFIG_MEMORY_FAILURE .
> > +.TP
> > +.BR MADV_SOFT_OFFLINE " (Since Linux 2.6.33)
> > +Soft offline a page. This will result in the memory of the page
> > +being copied to a new page and original page be offlined. The operation
> 
> Can you explain the term "offlined" please.

The memory is not used anymore and taken out of normal
memory management (until unpoisoned) 
and the "HardwareCorrupted:" counter in /proc/meminfo increases

(don't put the later in, I'm thinking about changing that)

> 
> > +should be transparent to the calling process.
> 
> Does "should be transparent" mean "is normally invisible"?

Yes. It's similar to being swapped out and swapped in again.

-Andi

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH] [31/31] HWPOISON: Add a madvise() injector for soft page  offlining
  2010-06-19 13:30         ` Andi Kleen
@ 2010-06-19 13:43           ` Michael Kerrisk
  2010-06-19 14:09             ` Andi Kleen
  0 siblings, 1 reply; 61+ messages in thread
From: Michael Kerrisk @ 2010-06-19 13:43 UTC (permalink / raw)
  To: Andi Kleen; +Cc: fengguang.wu, linux-kernel, linux-mm

Hi Andi,

On Sat, Jun 19, 2010 at 3:30 PM, Andi Kleen <andi@firstfloor.org> wrote:
> On Sat, Jun 19, 2010 at 03:25:16PM +0200, Michael Kerrisk wrote:
>> Hi Andi,
>>
>> Thanks for this. Some comments below.
>>
>> On Sat, Jun 19, 2010 at 3:20 PM, Andi Kleen <andi@firstfloor.org> wrote:
>> > On Sat, Jun 19, 2010 at 02:36:28PM +0200, Michael Kerrisk wrote:
>> >> Hi Andi,
>> >>
>> >> On Tue, Dec 8, 2009 at 11:16 PM, Andi Kleen <andi@firstfloor.org> wrote:
>> >> >
>> >> > Process based injection is much easier to handle for test programs,
>> >> > who can first bring a page into a specific state and then test.
>> >> > So add a new MADV_SOFT_OFFLINE to soft offline a page, similar
>> >> > to the existing hard offline injector.
>> >>
>> >> I see that this made its way into 2.6.33. Could you write a short
>> >> piece on it for the madvise.2 man page?
>> >
>> > Also fixed the previous snippet slightly.
>>
>> (thanks)
>>
>> > commit edb43354f0ffc04bf4f23f01261f9ea9f43e0d3d
>> > Author: Andi Kleen <ak@linux.intel.com>
>> > Date:   Sat Jun 19 15:19:28 2010 +0200
>> >
>> >    MADV_SOFT_OFFLINE
>> >
>> >    Signed-off-by: Andi Kleen <ak@linux.intel.com>
>> >
>> > diff --git a/man2/madvise.2 b/man2/madvise.2
>> > index db29feb..9dccd97 100644
>> > --- a/man2/madvise.2
>> > +++ b/man2/madvise.2
>> > @@ -154,7 +154,15 @@ processes.
>> >  This operation may result in the calling process receiving a
>> >  .B SIGBUS
>> >  and the page being unmapped.
>> > -This feature is intended for memory testing.
>> > +This feature is intended for testing of memory error handling code.
>> > +This feature is only available if the kernel was configured with
>> > +.BR CONFIG_MEMORY_FAILURE .
>> > +.TP
>> > +.BR MADV_SOFT_OFFLINE " (Since Linux 2.6.33)
>> > +Soft offline a page. This will result in the memory of the page
>> > +being copied to a new page and original page be offlined. The operation
>>
>> Can you explain the term "offlined" please.
>
> The memory is not used anymore and taken out of normal
> memory management (until unpoisoned)

Is there a userspace operation to unpoison (i.e., reverse MADV_SOFT_OFFLINE)?

I ask because I wondered if there is something additional to be documented.

> and the "HardwareCorrupted:" counter in /proc/meminfo increases
>
> (don't put the later in, I'm thinking about changing that)

Okay.

>>
>> > +should be transparent to the calling process.
>>
>> Does "should be transparent" mean "is normally invisible"?
>
> Yes. It's similar to being swapped out and swapped in again.

Okay.

Thanks,

Michael


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Author of "The Linux Programming Interface" http://blog.man7.org/

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH] [31/31] HWPOISON: Add a madvise() injector for soft page offlining
  2010-06-19 13:43           ` Michael Kerrisk
@ 2010-06-19 14:09             ` Andi Kleen
  2010-06-19 14:17               ` Michael Kerrisk
  0 siblings, 1 reply; 61+ messages in thread
From: Andi Kleen @ 2010-06-19 14:09 UTC (permalink / raw)
  To: Michael Kerrisk; +Cc: Andi Kleen, fengguang.wu, linux-kernel, linux-mm

On Sat, Jun 19, 2010 at 03:43:28PM +0200, Michael Kerrisk wrote:
> Is there a userspace operation to unpoison (i.e., reverse MADV_SOFT_OFFLINE)?

Yes, but it's only a debugfs interface currently.

> I ask because I wondered if there is something additional to be documented.

I don't think debugfs needs manpages atm.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH] [31/31] HWPOISON: Add a madvise() injector for soft page  offlining
  2010-06-19 14:09             ` Andi Kleen
@ 2010-06-19 14:17               ` Michael Kerrisk
  2010-06-19 19:52                 ` Andi Kleen
  0 siblings, 1 reply; 61+ messages in thread
From: Michael Kerrisk @ 2010-06-19 14:17 UTC (permalink / raw)
  To: Andi Kleen; +Cc: fengguang.wu, linux-kernel, linux-mm

Hi Andi,

On Sat, Jun 19, 2010 at 4:09 PM, Andi Kleen <andi@firstfloor.org> wrote:
> On Sat, Jun 19, 2010 at 03:43:28PM +0200, Michael Kerrisk wrote:
>> Is there a userspace operation to unpoison (i.e., reverse MADV_SOFT_OFFLINE)?
>
> Yes, but it's only a debugfs interface currently.

Okay -- thanks.

>> I ask because I wondered if there is something additional to be documented.
>
> I don't think debugfs needs manpages atm.

Okay.

I edited your text somewhat. Could you please review the below.

Cheers,

Michael

.TP
.BR MADV_SOFT_OFFLINE " (Since Linux 2.6.33)
Soft offline the pages in the range specified by
.I addr
and
.IR length .
This memory of each page in the specified range is copied to a new page,
and the original page is offlined
(i.e., no longer used, and taken out of normal memory management).
The effect of the
.B MADV_SOFT_OFFLINE
operation is normally invisible to (i.e., does not change the semantics of)
the calling process.
This feature is intended for testing of memory error-handling code;
it is only available if the kernel was configured with
.BR CONFIG_MEMORY_FAILURE .

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH] [31/31] HWPOISON: Add a madvise() injector for soft page offlining
  2010-06-19 14:17               ` Michael Kerrisk
@ 2010-06-19 19:52                 ` Andi Kleen
  2010-06-20  6:19                   ` Michael Kerrisk
  0 siblings, 1 reply; 61+ messages in thread
From: Andi Kleen @ 2010-06-19 19:52 UTC (permalink / raw)
  To: Michael Kerrisk; +Cc: Andi Kleen, fengguang.wu, linux-kernel, linux-mm

> .TP
> .BR MADV_SOFT_OFFLINE " (Since Linux 2.6.33)
> Soft offline the pages in the range specified by
> .I addr
> and
> .IR length .
> This memory of each page in the specified range is copied to a new page,

Actually there are some cases where it's also dropped if it's cached page.

Perhaps better would be something more fuzzy like

"the contents are preserved"

> and the original page is offlined
> (i.e., no longer used, and taken out of normal memory management).

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH] [31/31] HWPOISON: Add a madvise() injector for soft page  offlining
  2010-06-19 19:52                 ` Andi Kleen
@ 2010-06-20  6:19                   ` Michael Kerrisk
  2010-06-20  7:14                     ` Wu Fengguang
  0 siblings, 1 reply; 61+ messages in thread
From: Michael Kerrisk @ 2010-06-20  6:19 UTC (permalink / raw)
  To: Andi Kleen; +Cc: fengguang.wu, linux-kernel, linux-mm

Hi Andi,
On Sat, Jun 19, 2010 at 9:52 PM, Andi Kleen <andi@firstfloor.org> wrote:
>> .TP
>> .BR MADV_SOFT_OFFLINE " (Since Linux 2.6.33)
>> Soft offline the pages in the range specified by
>> .I addr
>> and
>> .IR length .
>> This memory of each page in the specified range is copied to a new page,
>
> Actually there are some cases where it's also dropped if it's cached page.
>
> Perhaps better would be something more fuzzy like
>
> "the contents are preserved"

The problem to me is that this gets so fuzzy that it's hard to
understand the meaning (I imagine many readers will ask: "What does it
mean that the contents are preserved"?). Would you be able to come up
with a wording that is a little miore detailed?

>> and the original page is offlined
>> (i.e., no longer used, and taken out of normal memory management).

Thanks,

Michael


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Author of "The Linux Programming Interface" http://blog.man7.org/

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH] [31/31] HWPOISON: Add a madvise() injector for soft page  offlining
  2010-06-20  6:19                   ` Michael Kerrisk
@ 2010-06-20  7:14                     ` Wu Fengguang
  2010-06-26 13:18                       ` Michael Kerrisk
  0 siblings, 1 reply; 61+ messages in thread
From: Wu Fengguang @ 2010-06-20  7:14 UTC (permalink / raw)
  To: Michael Kerrisk; +Cc: Andi Kleen, linux-kernel, linux-mm

On Sun, Jun 20, 2010 at 02:19:35PM +0800, Michael Kerrisk wrote:
> Hi Andi,
> On Sat, Jun 19, 2010 at 9:52 PM, Andi Kleen <andi@firstfloor.org> wrote:
> >> .TP
> >> .BR MADV_SOFT_OFFLINE " (Since Linux 2.6.33)
> >> Soft offline the pages in the range specified by
> >> .I addr
> >> and
> >> .IR length .
> >> This memory of each page in the specified range is copied to a new page,
> >
> > Actually there are some cases where it's also dropped if it's cached page.
> >
> > Perhaps better would be something more fuzzy like
> >
> > "the contents are preserved"
> 
> The problem to me is that this gets so fuzzy that it's hard to
> understand the meaning (I imagine many readers will ask: "What does it
> mean that the contents are preserved"?). Would you be able to come up
> with a wording that is a little miore detailed?

That is, MADV_SOFT_OFFLINE won't lose data.

If a process writes "1" to some virtual address and then called
madvice(MADV_SOFT_OFFLINE) on that virtual address, it can continue
to read "1" from that virtual address.

MADV_SOFT_OFFLINE "transparently" replaces the underlying physical page
frame with a new one that contains the same data "1". The original page
frame is offlined, and the new page frame may be installed lazily.

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH] [31/31] HWPOISON: Add a madvise() injector for soft page  offlining
  2010-06-20  7:14                     ` Wu Fengguang
@ 2010-06-26 13:18                       ` Michael Kerrisk
  2010-06-26 23:30                         ` Wu Fengguang
  0 siblings, 1 reply; 61+ messages in thread
From: Michael Kerrisk @ 2010-06-26 13:18 UTC (permalink / raw)
  To: Wu Fengguang; +Cc: Andi Kleen, linux-kernel, linux-mm

Hi Fengguang,

On Sun, Jun 20, 2010 at 9:14 AM, Wu Fengguang <fengguang.wu@intel.com> wrote:
> On Sun, Jun 20, 2010 at 02:19:35PM +0800, Michael Kerrisk wrote:
>> Hi Andi,
>> On Sat, Jun 19, 2010 at 9:52 PM, Andi Kleen <andi@firstfloor.org> wrote:
>> >> .TP
>> >> .BR MADV_SOFT_OFFLINE " (Since Linux 2.6.33)
>> >> Soft offline the pages in the range specified by
>> >> .I addr
>> >> and
>> >> .IR length .
>> >> This memory of each page in the specified range is copied to a new page,
>> >
>> > Actually there are some cases where it's also dropped if it's cached page.
>> >
>> > Perhaps better would be something more fuzzy like
>> >
>> > "the contents are preserved"
>>
>> The problem to me is that this gets so fuzzy that it's hard to
>> understand the meaning (I imagine many readers will ask: "What does it
>> mean that the contents are preserved"?). Would you be able to come up
>> with a wording that is a little miore detailed?
>
> That is, MADV_SOFT_OFFLINE won't lose data.
>
> If a process writes "1" to some virtual address and then called
> madvice(MADV_SOFT_OFFLINE) on that virtual address, it can continue
> to read "1" from that virtual address.
>
> MADV_SOFT_OFFLINE "transparently" replaces the underlying physical page
> frame with a new one that contains the same data "1". The original page
> frame is offlined, and the new page frame may be installed lazily.

Thanks. That helps me come up with a description that is I think a bit clearer:

       MADV_SOFT_OFFLINE (Since Linux 2.6.33)
              Soft offline the pages in the range specified by
              addr and length.  The memory of each page in the
              specified  range  is  preserved (i.e., when next
              accessed, the same content will be visible,  but
              in  a new physical page frame), and the original
              page is offlined  (i.e.,  no  longer  used,  and
              taken  out  of  normal  memory management).  The
              effect of  the  MADV_SOFT_OFFLINE  operation  is
              invisible  to  (i.e., does not change the seman-
              tics of) the calling process. ...

The actual patch for man-pages-3.26 is below.

Cheers,

Michael

--- a/man2/madvise.2
+++ b/man2/madvise.2
@@ -163,12 +163,14 @@ Soft offline the pages in the range specified by
 .I addr
 and
 .IR length .
-The memory of each page in the specified range is copied to a new page,
+The memory of each page in the specified range is preserved
+(i.e., when next accessed, the same content will be visible,
+but in a new physical page frame),
 and the original page is offlined
 (i.e., no longer used, and taken out of normal memory management).
 The effect of the
 .B MADV_SOFT_OFFLINE
-operation is normally invisible to (i.e., does not change the semantics of)
+operation is invisible to (i.e., does not change the semantics of)
 the calling process.
 This feature is intended for testing of memory error-handling code;
 it is only available if the kernel was configured with

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH] [31/31] HWPOISON: Add a madvise() injector for soft page  offlining
  2010-06-26 13:18                       ` Michael Kerrisk
@ 2010-06-26 23:30                         ` Wu Fengguang
  2010-06-27  4:38                           ` Michael Kerrisk
  0 siblings, 1 reply; 61+ messages in thread
From: Wu Fengguang @ 2010-06-26 23:30 UTC (permalink / raw)
  To: Michael Kerrisk; +Cc: Andi Kleen, linux-kernel, linux-mm

On Sat, Jun 26, 2010 at 09:18:52PM +0800, Michael Kerrisk wrote:
> Hi Fengguang,
> 
> On Sun, Jun 20, 2010 at 9:14 AM, Wu Fengguang <fengguang.wu@intel.com> wrote:
> > On Sun, Jun 20, 2010 at 02:19:35PM +0800, Michael Kerrisk wrote:
> >> Hi Andi,
> >> On Sat, Jun 19, 2010 at 9:52 PM, Andi Kleen <andi@firstfloor.org> wrote:
> >> >> .TP
> >> >> .BR MADV_SOFT_OFFLINE " (Since Linux 2.6.33)
> >> >> Soft offline the pages in the range specified by
> >> >> .I addr
> >> >> and
> >> >> .IR length .
> >> >> This memory of each page in the specified range is copied to a new page,
> >> >
> >> > Actually there are some cases where it's also dropped if it's cached page.
> >> >
> >> > Perhaps better would be something more fuzzy like
> >> >
> >> > "the contents are preserved"
> >>
> >> The problem to me is that this gets so fuzzy that it's hard to
> >> understand the meaning (I imagine many readers will ask: "What does it
> >> mean that the contents are preserved"?). Would you be able to come up
> >> with a wording that is a little miore detailed?
> >
> > That is, MADV_SOFT_OFFLINE won't lose data.
> >
> > If a process writes "1" to some virtual address and then called
> > madvice(MADV_SOFT_OFFLINE) on that virtual address, it can continue
> > to read "1" from that virtual address.
> >
> > MADV_SOFT_OFFLINE "transparently" replaces the underlying physical page
> > frame with a new one that contains the same data "1". The original page
> > frame is offlined, and the new page frame may be installed lazily.
> 
> Thanks. That helps me come up with a description that is I think a bit clearer:
> 
>        MADV_SOFT_OFFLINE (Since Linux 2.6.33)
>               Soft offline the pages in the range specified by
>               addr and length.  The memory of each page in the
>               specified  range  is  preserved (i.e., when next
>               accessed, the same content will be visible,  but
>               in  a new physical page frame), and the original
>               page is offlined  (i.e.,  no  longer  used,  and
>               taken  out  of  normal  memory management).  The
>               effect of  the  MADV_SOFT_OFFLINE  operation  is
>               invisible  to  (i.e., does not change the seman-
>               tics of) the calling process. ...
> 
> The actual patch for man-pages-3.26 is below.

Thanks. The change looks good to me.

Note that the other perceivable change may be a little access delay.
The kernel could choose to simply drop the in-memory data when there
is another copy in disk. When accessed again, the content for the new
physical page will be populated from disk IO.

Thanks,
Fengguang

> 
> --- a/man2/madvise.2
> +++ b/man2/madvise.2
> @@ -163,12 +163,14 @@ Soft offline the pages in the range specified by
>  .I addr
>  and
>  .IR length .
> -The memory of each page in the specified range is copied to a new page,
> +The memory of each page in the specified range is preserved
> +(i.e., when next accessed, the same content will be visible,
> +but in a new physical page frame),
>  and the original page is offlined
>  (i.e., no longer used, and taken out of normal memory management).
>  The effect of the
>  .B MADV_SOFT_OFFLINE
> -operation is normally invisible to (i.e., does not change the semantics of)
> +operation is invisible to (i.e., does not change the semantics of)
>  the calling process.
>  This feature is intended for testing of memory error-handling code;
>  it is only available if the kernel was configured with

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH] [31/31] HWPOISON: Add a madvise() injector for soft page  offlining
  2010-06-26 23:30                         ` Wu Fengguang
@ 2010-06-27  4:38                           ` Michael Kerrisk
  0 siblings, 0 replies; 61+ messages in thread
From: Michael Kerrisk @ 2010-06-27  4:38 UTC (permalink / raw)
  To: Wu Fengguang; +Cc: Andi Kleen, linux-kernel, linux-mm

On Sun, Jun 27, 2010 at 1:30 AM, Wu Fengguang <fengguang.wu@intel.com> wrote:
> On Sat, Jun 26, 2010 at 09:18:52PM +0800, Michael Kerrisk wrote:
>> Hi Fengguang,
>>
>> On Sun, Jun 20, 2010 at 9:14 AM, Wu Fengguang <fengguang.wu@intel.com> wrote:
>> > On Sun, Jun 20, 2010 at 02:19:35PM +0800, Michael Kerrisk wrote:
>> >> Hi Andi,
>> >> On Sat, Jun 19, 2010 at 9:52 PM, Andi Kleen <andi@firstfloor.org> wrote:
>> >> >> .TP
>> >> >> .BR MADV_SOFT_OFFLINE " (Since Linux 2.6.33)
>> >> >> Soft offline the pages in the range specified by
>> >> >> .I addr
>> >> >> and
>> >> >> .IR length .
>> >> >> This memory of each page in the specified range is copied to a new page,
>> >> >
>> >> > Actually there are some cases where it's also dropped if it's cached page.
>> >> >
>> >> > Perhaps better would be something more fuzzy like
>> >> >
>> >> > "the contents are preserved"
>> >>
>> >> The problem to me is that this gets so fuzzy that it's hard to
>> >> understand the meaning (I imagine many readers will ask: "What does it
>> >> mean that the contents are preserved"?). Would you be able to come up
>> >> with a wording that is a little miore detailed?
>> >
>> > That is, MADV_SOFT_OFFLINE won't lose data.
>> >
>> > If a process writes "1" to some virtual address and then called
>> > madvice(MADV_SOFT_OFFLINE) on that virtual address, it can continue
>> > to read "1" from that virtual address.
>> >
>> > MADV_SOFT_OFFLINE "transparently" replaces the underlying physical page
>> > frame with a new one that contains the same data "1". The original page
>> > frame is offlined, and the new page frame may be installed lazily.
>>
>> Thanks. That helps me come up with a description that is I think a bit clearer:
>>
>>        MADV_SOFT_OFFLINE (Since Linux 2.6.33)
>>               Soft offline the pages in the range specified by
>>               addr and length.  The memory of each page in the
>>               specified  range  is  preserved (i.e., when next
>>               accessed, the same content will be visible,  but
>>               in  a new physical page frame), and the original
>>               page is offlined  (i.e.,  no  longer  used,  and
>>               taken  out  of  normal  memory management).  The
>>               effect of  the  MADV_SOFT_OFFLINE  operation  is
>>               invisible  to  (i.e., does not change the seman-
>>               tics of) the calling process. ...
>>
>> The actual patch for man-pages-3.26 is below.
>
> Thanks. The change looks good to me.

Thanks for checking it.

> Note that the other perceivable change may be a little access delay.
> The kernel could choose to simply drop the in-memory data when there
> is another copy in disk. When accessed again, the content for the new
> physical page will be populated from disk IO.

Yes, I'd suposed as much, but decided that was a detail that probably
didm\t need to be mentioned in tha man page.

Thanks,

Michael


>>
>> --- a/man2/madvise.2
>> +++ b/man2/madvise.2
>> @@ -163,12 +163,14 @@ Soft offline the pages in the range specified by
>>  .I addr
>>  and
>>  .IR length .
>> -The memory of each page in the specified range is copied to a new page,
>> +The memory of each page in the specified range is preserved
>> +(i.e., when next accessed, the same content will be visible,
>> +but in a new physical page frame),
>>  and the original page is offlined
>>  (i.e., no longer used, and taken out of normal memory management).
>>  The effect of the
>>  .B MADV_SOFT_OFFLINE
>> -operation is normally invisible to (i.e., does not change the semantics of)
>> +operation is invisible to (i.e., does not change the semantics of)
>>  the calling process.
>>  This feature is intended for testing of memory error-handling code;
>>  it is only available if the kernel was configured with
>



-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Author of "The Linux Programming Interface" http://blog.man7.org/

^ permalink raw reply	[flat|nested] 61+ messages in thread

end of thread, other threads:[~2010-06-27  4:39 UTC | newest]

Thread overview: 61+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-12-08 21:16 [PATCH] [0/31] HWPOISON 2.6.33 pre-merge posting Andi Kleen
2009-12-08 21:16 ` [PATCH] [1/31] HWPOISON: Add Andi Kleen as hwpoison maintainer to MAINTAINERS Andi Kleen
2009-12-08 21:16 ` [PATCH] [2/31] HWPOISON: Be more aggressive at freeing non LRU caches Andi Kleen
2009-12-08 21:16 ` [PATCH] [3/31] page-types: add standard GPL license header Andi Kleen
2009-12-08 21:16 ` [PATCH] [4/31] HWPOISON: remove the anonymous entry Andi Kleen
2009-12-08 21:16 ` [PATCH] [5/31] HWPOISON: return ENXIO on invalid page number Andi Kleen
2009-12-08 21:16 ` [PATCH] [6/31] HWPOISON: avoid grabbing the page count multiple times during madvise injection Andi Kleen
2009-12-08 21:16 ` [PATCH] [7/31] HWPOISON: Turn ref argument into flags argument Andi Kleen
2009-12-08 21:16 ` [PATCH] [8/31] HWPOISON: abort on failed unmap Andi Kleen
2009-12-08 21:16 ` [PATCH] [9/31] HWPOISON: comment the possible set_page_dirty() race Andi Kleen
2009-12-08 21:16 ` [PATCH] [10/31] HWPOISON: comment dirty swapcache pages Andi Kleen
2009-12-08 21:16 ` [PATCH] [11/31] HWPOISON: introduce delete_from_lru_cache() Andi Kleen
2009-12-08 21:16 ` [PATCH] [12/31] HWPOISON: remove the free buddy page handler Andi Kleen
2009-12-08 21:16 ` [PATCH] [13/31] HWPOISON: detect free buddy pages explicitly Andi Kleen
2009-12-08 21:16 ` [PATCH] [14/31] HWPOISON: Add unpoisoning support Andi Kleen
2009-12-08 21:16 ` [PATCH] [15/31] HWPOISON: make semantics of IGNORED/DELAYED clear Andi Kleen
2009-12-08 21:16 ` [PATCH] [16/31] HWPOISON: return 0 to indicate success reliably Andi Kleen
2009-12-08 21:16 ` [PATCH] [17/31] HWPOISON: add fs/device filters Andi Kleen
2009-12-08 21:16 ` [PATCH] [18/31] HWPOISON: limit hwpoison injector to known page types Andi Kleen
2009-12-08 21:16 ` [PATCH] [19/31] mm: export stable page flags Andi Kleen
2009-12-08 22:27   ` Matt Mackall
2009-12-09  2:00     ` Wu Fengguang
2009-12-09 21:38       ` Matt Mackall
2009-12-10  1:50       ` Andi Kleen
2009-12-10  2:09         ` Wu Fengguang
2009-12-10 13:42           ` Andi Kleen
2009-12-08 21:16 ` [PATCH] [20/31] HWPOISON: add page flags filter Andi Kleen
2009-12-08 21:16 ` [PATCH] [21/31] memcg: rename and export try_get_mem_cgroup_from_page() Andi Kleen
2009-12-08 21:16 ` [PATCH] [22/31] memcg: add accessor to mem_cgroup.css Andi Kleen
2009-12-08 21:16 ` [PATCH] [23/31] HWPOISON: add memory cgroup filter Andi Kleen
2009-12-09  5:04   ` Li Zefan
2009-12-09  5:06     ` KAMEZAWA Hiroyuki
2009-12-09  5:33       ` Balbir Singh
2009-12-09  9:15     ` Andi Kleen
2009-12-09 20:47   ` Paul Menage
2009-12-09 23:56     ` KAMEZAWA Hiroyuki
2009-12-10  1:42     ` Andi Kleen
2009-12-10  2:21       ` Balbir Singh
2009-12-11  2:14         ` Wu Fengguang
2009-12-14 12:53           ` Andi Kleen
2009-12-08 21:16 ` [PATCH] [24/31] HWPOISON: add an interface to switch off/on all the page filters Andi Kleen
2009-12-08 21:16 ` [PATCH] [25/31] HWPOISON: Don't do early filtering if filter is disabled Andi Kleen
2009-12-08 21:16 ` [PATCH] [26/31] HWPOISON: mention HWPoison in Kconfig entry Andi Kleen
2009-12-08 21:16 ` [PATCH] [27/31] HWPOISON: Use correct name for MADV_HWPOISON in documentation Andi Kleen
2009-12-08 21:16 ` [PATCH] [28/31] HWPOISON: Use new shake_page in memory_failure Andi Kleen
2009-12-08 21:16 ` [PATCH] [29/31] HWPOISON: Undefine short-hand macros after use to avoid namespace conflict Andi Kleen
2009-12-08 21:16 ` [PATCH] [30/31] HWPOISON: Add soft page offline support Andi Kleen
2009-12-08 21:16 ` [PATCH] [31/31] HWPOISON: Add a madvise() injector for soft page offlining Andi Kleen
2010-06-19 12:36   ` Michael Kerrisk
2010-06-19 13:20     ` Andi Kleen
2010-06-19 13:25       ` Michael Kerrisk
2010-06-19 13:30         ` Andi Kleen
2010-06-19 13:43           ` Michael Kerrisk
2010-06-19 14:09             ` Andi Kleen
2010-06-19 14:17               ` Michael Kerrisk
2010-06-19 19:52                 ` Andi Kleen
2010-06-20  6:19                   ` Michael Kerrisk
2010-06-20  7:14                     ` Wu Fengguang
2010-06-26 13:18                       ` Michael Kerrisk
2010-06-26 23:30                         ` Wu Fengguang
2010-06-27  4:38                           ` Michael Kerrisk

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).