* [PATCH 0/6] memory error report/recovery for dirty pagecache v3
@ 2014-03-13 21:39 ` Naoya Horiguchi
  0 siblings, 0 replies; 21+ messages in thread
From: Naoya Horiguchi @ 2014-03-13 21:39 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrew Morton, Andi Kleen, Wu Fengguang, Tony Luck, Wanpeng Li,
	Dave Chinner, Jun'ichi Nomura, linux-mm

This patchset tries to solve the following issues related to handling memory
errors on dirty pagecache:
 1. stickiness of error info: in the current implementation, dirty pagecache
    memory errors are recorded as AS_EIO on page_mapping(page), which is not
    sticky (it is cleared once checked). As a result, there is a race window
    in which the data loss is ignored under concurrent accesses, even if
    your application can handle the error report by itself.
 2. finer granularity: when a memory error hits one page of a file, the
    error is reported on accesses to other, healthy pages of that file too,
    which is confusing for userspace.
 3. overwrite recovery: with problems 1 and 2 fixed, we have a chance to
    recover from the memory error if applications recreate the data on the
    error page, or if they are sure that the data on the error page is not
    important.
These problems are solved by introducing a new pagecache tag to remember
memory errors.

Patch 1 extends some radix_tree operations to support an end parameter,
which is used by the later patches.

Patch 2 introduces PAGECACHE_TAG_HWPOISON and solves problems 1 and 2 with it.

Patch 3 implements overwrite recovery to solve problem 3.

Patches 4-6 add a new interface, /proc/kpagecache, which is helpful when
testing/debugging pagecache-related issues like the ones in this patchset.
Some sample userspace code and documentation are also added.
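
To sketch the intended application-side flow, here is a minimal,
hypothetical example ("datafile" is a made-up name, and it relies on
fsync() returning EHWPOISON as patch 2 implements; the test repository
linked below has fuller programs):

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	char buf[4096];
	int fd = open("datafile", O_RDWR);

	if (fd < 0)
		return 1;
	memset(buf, 'x', sizeof(buf));
	write(fd, buf, sizeof(buf));
	/*
	 * If a memory error hits the dirty page, the report is sticky:
	 * every fsync() fails with EHWPOISON until the error is
	 * resolved, instead of a single EIO cleared on the first check.
	 */
	if (fsync(fd) < 0 && errno == EHWPOISON)
		fprintf(stderr, "dirty pagecache error on datafile\n");
	close(fd);
	return 0;
}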

I think that we can straightforwardly replace the error reporting for
normal IO errors with the pagecache tag, and the finer granularity is a
clear benefit of doing so. Overwrite recovery is also useful, for example,
when dirty data was lost in a write failure. But first I want review and
feedback on the base idea.

Previous discussions are available at the following URLs:
- v1: http://thread.gmane.org/gmane.linux.kernel/1341433
- v2: http://thread.gmane.org/gmane.linux.kernel.mm/84760

Test code:
  https://github.com/Naoya-Horiguchi/test_memory_error_reporting
---
Summary:

Naoya Horiguchi (6):
      radix-tree: add end_index to support ranged iteration
      mm/memory-failure.c: report and recovery for memory error on dirty pagecache
      mm/memory-failure.c: add code to resolve quasi-hwpoisoned page
      fs/proc/page.c: introduce /proc/kpagecache interface
      tools/vm/page-types.c: add file scanning mode
      Documentation: update Documentation/vm/pagemap.txt

 Documentation/vm/pagemap.txt  |  29 ++++++
 drivers/gpu/drm/qxl/qxl_ttm.c |   2 +-
 fs/proc/page.c                | 106 +++++++++++++++++++
 include/linux/fs.h            |  12 ++-
 include/linux/pagemap.h       |  27 +++++
 include/linux/radix-tree.h    |  31 ++++--
 kernel/irq/irqdomain.c        |   2 +-
 lib/radix-tree.c              |   8 +-
 mm/filemap.c                  |  28 ++++-
 mm/memory-failure.c           | 230 +++++++++++++++++++++++++++++++++++-------
 mm/shmem.c                    |   2 +-
 mm/truncate.c                 |   7 ++
 tools/vm/page-types.c         | 117 ++++++++++++++++++---
 13 files changed, 530 insertions(+), 71 deletions(-)

* [PATCH 1/6] radix-tree: add end_index to support ranged iteration
  2014-03-13 21:39 ` Naoya Horiguchi
@ 2014-03-13 21:39   ` Naoya Horiguchi
  -1 siblings, 0 replies; 21+ messages in thread
From: Naoya Horiguchi @ 2014-03-13 21:39 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrew Morton, Andi Kleen, Wu Fengguang, Tony Luck, Wanpeng Li,
	Dave Chinner, Jun'ichi Nomura, linux-mm

It's useful to be able to iterate over only a specific index range of a
radix tree, and this patch enables that. Only radix_tree_for_each_slot()
and radix_tree_for_each_tagged() are changed, because they are the only
iterators that need it for now.
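
As an illustration of the new interface (not part of the patch;
count_slots_in_range() is a made-up helper that assumes
<linux/radix-tree.h>), a caller can now bound a slot walk to an index
window:

/* illustrative only: count populated slots within [start, end] */
static unsigned long count_slots_in_range(struct radix_tree_root *root,
					  unsigned long start,
					  unsigned long end)
{
	struct radix_tree_iter iter;
	void **slot;
	unsigned long count = 0;

	rcu_read_lock();
	/* with this patch, the walk stops once iter.index exceeds @end */
	radix_tree_for_each_slot(slot, root, &iter, start, end)
		count++;
	rcu_read_unlock();
	return count;
}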

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
---
 drivers/gpu/drm/qxl/qxl_ttm.c |  2 +-
 include/linux/radix-tree.h    | 27 ++++++++++++++++++++-------
 kernel/irq/irqdomain.c        |  2 +-
 lib/radix-tree.c              |  8 ++++----
 mm/filemap.c                  |  4 ++--
 mm/shmem.c                    |  2 +-
 6 files changed, 29 insertions(+), 16 deletions(-)

diff --git v3.14-rc6.orig/drivers/gpu/drm/qxl/qxl_ttm.c v3.14-rc6/drivers/gpu/drm/qxl/qxl_ttm.c
index c7e7e6590c2b..ad477307e732 100644
--- v3.14-rc6.orig/drivers/gpu/drm/qxl/qxl_ttm.c
+++ v3.14-rc6/drivers/gpu/drm/qxl/qxl_ttm.c
@@ -398,7 +398,7 @@ static int qxl_sync_obj_wait(void *sync_obj,
 		struct radix_tree_iter iter;
 		int release_id;
 
-		radix_tree_for_each_slot(slot, &qfence->tree, &iter, 0) {
+		radix_tree_for_each_slot(slot, &qfence->tree, &iter, 0, ~0UL) {
 			struct qxl_release *release;
 
 			release_id = iter.index;
diff --git v3.14-rc6.orig/include/linux/radix-tree.h v3.14-rc6/include/linux/radix-tree.h
index 403940787be1..6e14a8e06105 100644
--- v3.14-rc6.orig/include/linux/radix-tree.h
+++ v3.14-rc6/include/linux/radix-tree.h
@@ -265,6 +265,7 @@ static inline void radix_tree_preload_end(void)
  * @index:	index of current slot
  * @next_index:	next-to-last index for this chunk
  * @tags:	bit-mask for tag-iterating
+ * @end_index:  last index to be scanned
  *
  * This radix tree iterator works in terms of "chunks" of slots.  A chunk is a
  * subinterval of slots contained within one radix tree leaf node.  It is
@@ -277,6 +278,7 @@ struct radix_tree_iter {
 	unsigned long	index;
 	unsigned long	next_index;
 	unsigned long	tags;
+	unsigned long	end_index;
 };
 
 #define RADIX_TREE_ITER_TAG_MASK	0x00FF	/* tag index in lower byte */
@@ -288,10 +290,12 @@ struct radix_tree_iter {
  *
  * @iter:	pointer to iterator state
  * @start:	iteration starting index
+ * @end:	iteration ending index
  * Returns:	NULL
  */
 static __always_inline void **
-radix_tree_iter_init(struct radix_tree_iter *iter, unsigned long start)
+radix_tree_iter_init(struct radix_tree_iter *iter, unsigned long start,
+			unsigned long end)
 {
 	/*
 	 * Leave iter->tags uninitialized. radix_tree_next_chunk() will fill it
@@ -303,6 +307,7 @@ radix_tree_iter_init(struct radix_tree_iter *iter, unsigned long start)
 	 */
 	iter->index = 0;
 	iter->next_index = start;
+	iter->end_index = end;
 	return NULL;
 }
 
@@ -352,6 +357,8 @@ radix_tree_next_slot(void **slot, struct radix_tree_iter *iter, unsigned flags)
 		iter->tags >>= 1;
 		if (likely(iter->tags & 1ul)) {
 			iter->index++;
+			if (iter->index > iter->end_index)
+				return NULL;
 			return slot + 1;
 		}
 		if (!(flags & RADIX_TREE_ITER_CONTIG) && likely(iter->tags)) {
@@ -359,6 +366,8 @@ radix_tree_next_slot(void **slot, struct radix_tree_iter *iter, unsigned flags)
 
 			iter->tags >>= offset;
 			iter->index += offset + 1;
+			if (iter->index > iter->end_index)
+				return NULL;
 			return slot + offset + 1;
 		}
 	} else {
@@ -367,6 +376,8 @@ radix_tree_next_slot(void **slot, struct radix_tree_iter *iter, unsigned flags)
 		while (size--) {
 			slot++;
 			iter->index++;
+			if (iter->index > iter->end_index)
+				return NULL;
 			if (likely(*slot))
 				return slot;
 			if (flags & RADIX_TREE_ITER_CONTIG) {
@@ -391,7 +402,7 @@ radix_tree_next_slot(void **slot, struct radix_tree_iter *iter, unsigned flags)
  * Locks can be released and reacquired between iterations.
  */
 #define radix_tree_for_each_chunk(slot, root, iter, start, flags)	\
-	for (slot = radix_tree_iter_init(iter, start) ;			\
+	for (slot = radix_tree_iter_init(iter, start, ~0UL) ;		\
 	      (slot = radix_tree_next_chunk(root, iter, flags)) ;)
 
 /**
@@ -414,11 +425,12 @@ radix_tree_next_slot(void **slot, struct radix_tree_iter *iter, unsigned flags)
  * @root:	the struct radix_tree_root pointer
  * @iter:	the struct radix_tree_iter pointer
  * @start:	iteration starting index
+ * @end:	iteration ending index
  *
  * @slot points to radix tree slot, @iter->index contains its index.
  */
-#define radix_tree_for_each_slot(slot, root, iter, start)		\
-	for (slot = radix_tree_iter_init(iter, start) ;			\
+#define radix_tree_for_each_slot(slot, root, iter, start, end)		\
+	for (slot = radix_tree_iter_init(iter, start, end) ;		\
 	     slot || (slot = radix_tree_next_chunk(root, iter, 0)) ;	\
 	     slot = radix_tree_next_slot(slot, iter, 0))
 
@@ -433,7 +445,7 @@ radix_tree_next_slot(void **slot, struct radix_tree_iter *iter, unsigned flags)
  * @slot points to radix tree slot, @iter->index contains its index.
  */
 #define radix_tree_for_each_contig(slot, root, iter, start)		\
-	for (slot = radix_tree_iter_init(iter, start) ;			\
+	for (slot = radix_tree_iter_init(iter, start, ~0UL) ;		\
 	     slot || (slot = radix_tree_next_chunk(root, iter,		\
 				RADIX_TREE_ITER_CONTIG)) ;		\
 	     slot = radix_tree_next_slot(slot, iter,			\
@@ -446,12 +458,13 @@ radix_tree_next_slot(void **slot, struct radix_tree_iter *iter, unsigned flags)
  * @root:	the struct radix_tree_root pointer
  * @iter:	the struct radix_tree_iter pointer
  * @start:	iteration starting index
+ * @end:	iteration ending index
  * @tag:	tag index
  *
  * @slot points to radix tree slot, @iter->index contains its index.
  */
-#define radix_tree_for_each_tagged(slot, root, iter, start, tag)	\
-	for (slot = radix_tree_iter_init(iter, start) ;			\
+#define radix_tree_for_each_tagged(slot, root, iter, start, end, tag)	\
+	for (slot = radix_tree_iter_init(iter, start, end) ;		\
 	     slot || (slot = radix_tree_next_chunk(root, iter,		\
 			      RADIX_TREE_ITER_TAGGED | tag)) ;		\
 	     slot = radix_tree_next_slot(slot, iter,			\
diff --git v3.14-rc6.orig/kernel/irq/irqdomain.c v3.14-rc6/kernel/irq/irqdomain.c
index f14033700c25..55fc49b412e1 100644
--- v3.14-rc6.orig/kernel/irq/irqdomain.c
+++ v3.14-rc6/kernel/irq/irqdomain.c
@@ -571,7 +571,7 @@ static int virq_debug_show(struct seq_file *m, void *private)
 	mutex_lock(&irq_domain_mutex);
 	list_for_each_entry(domain, &irq_domain_list, link) {
 		int count = 0;
-		radix_tree_for_each_slot(slot, &domain->revmap_tree, &iter, 0)
+		radix_tree_for_each_slot(slot, &domain->revmap_tree, &iter, 0, ~0UL)
 			count++;
 		seq_printf(m, "%c%-16s  %6u  %10u  %10u  %s\n",
 			   domain == irq_default_domain ? '*' : ' ', domain->name,
diff --git v3.14-rc6.orig/lib/radix-tree.c v3.14-rc6/lib/radix-tree.c
index bd4a8dfdf0b8..487ba9c403d2 100644
--- v3.14-rc6.orig/lib/radix-tree.c
+++ v3.14-rc6/lib/radix-tree.c
@@ -1051,7 +1051,7 @@ radix_tree_gang_lookup(struct radix_tree_root *root, void **results,
 	if (unlikely(!max_items))
 		return 0;
 
-	radix_tree_for_each_slot(slot, root, &iter, first_index) {
+	radix_tree_for_each_slot(slot, root, &iter, first_index, ~0UL) {
 		results[ret] = indirect_to_ptr(rcu_dereference_raw(*slot));
 		if (!results[ret])
 			continue;
@@ -1093,7 +1093,7 @@ radix_tree_gang_lookup_slot(struct radix_tree_root *root,
 	if (unlikely(!max_items))
 		return 0;
 
-	radix_tree_for_each_slot(slot, root, &iter, first_index) {
+	radix_tree_for_each_slot(slot, root, &iter, first_index, ~0UL) {
 		results[ret] = slot;
 		if (indices)
 			indices[ret] = iter.index;
@@ -1130,7 +1130,7 @@ radix_tree_gang_lookup_tag(struct radix_tree_root *root, void **results,
 	if (unlikely(!max_items))
 		return 0;
 
-	radix_tree_for_each_tagged(slot, root, &iter, first_index, tag) {
+	radix_tree_for_each_tagged(slot, root, &iter, first_index, ~0UL, tag) {
 		results[ret] = indirect_to_ptr(rcu_dereference_raw(*slot));
 		if (!results[ret])
 			continue;
@@ -1167,7 +1167,7 @@ radix_tree_gang_lookup_tag_slot(struct radix_tree_root *root, void ***results,
 	if (unlikely(!max_items))
 		return 0;
 
-	radix_tree_for_each_tagged(slot, root, &iter, first_index, tag) {
+	radix_tree_for_each_tagged(slot, root, &iter, first_index, ~0UL, tag) {
 		results[ret] = slot;
 		if (++ret == max_items)
 			break;
diff --git v3.14-rc6.orig/mm/filemap.c v3.14-rc6/mm/filemap.c
index 7a13f6ac5421..8c24eda539d8 100644
--- v3.14-rc6.orig/mm/filemap.c
+++ v3.14-rc6/mm/filemap.c
@@ -841,7 +841,7 @@ unsigned find_get_pages(struct address_space *mapping, pgoff_t start,
 
 	rcu_read_lock();
 restart:
-	radix_tree_for_each_slot(slot, &mapping->page_tree, &iter, start) {
+	radix_tree_for_each_slot(slot, &mapping->page_tree, &iter, start, ~0UL) {
 		struct page *page;
 repeat:
 		page = radix_tree_deref_slot(slot);
@@ -985,7 +985,7 @@ unsigned find_get_pages_tag(struct address_space *mapping, pgoff_t *index,
 	rcu_read_lock();
 restart:
 	radix_tree_for_each_tagged(slot, &mapping->page_tree,
-				   &iter, *index, tag) {
+				   &iter, *index, ~0UL, tag) {
 		struct page *page;
 repeat:
 		page = radix_tree_deref_slot(slot);
diff --git v3.14-rc6.orig/mm/shmem.c v3.14-rc6/mm/shmem.c
index 1f18c9d0d93e..973caa10fe1e 100644
--- v3.14-rc6.orig/mm/shmem.c
+++ v3.14-rc6/mm/shmem.c
@@ -346,7 +346,7 @@ static unsigned shmem_find_get_pages_and_swap(struct address_space *mapping,
 
 	rcu_read_lock();
 restart:
-	radix_tree_for_each_slot(slot, &mapping->page_tree, &iter, start) {
+	radix_tree_for_each_slot(slot, &mapping->page_tree, &iter, start, ~0UL) {
 		struct page *page;
 repeat:
 		page = radix_tree_deref_slot(slot);
-- 
1.8.5.3


* [PATCH 2/6] mm/memory-failure.c: report and recovery for memory error on dirty pagecache
  2014-03-13 21:39 ` Naoya Horiguchi
@ 2014-03-13 21:39   ` Naoya Horiguchi
  -1 siblings, 0 replies; 21+ messages in thread
From: Naoya Horiguchi @ 2014-03-13 21:39 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrew Morton, Andi Kleen, Wu Fengguang, Tony Luck, Wanpeng Li,
	Dave Chinner, Jun'ichi Nomura, linux-mm

This patch implements dirty pagecache error handling with a new pagecache
tag, which is set at the error address in the pagecache of the affected file.

Before this patch, memory errors on dirty pagecache were reported only
insufficiently, because AS_EIO is not sticky: it is cleared once checked.
As a result, the newest data on the dirty page might be lost. This could
happen even if the application is written carefully to handle the error
report, because accesses to the error address can happen concurrently.
Besides the stickiness, the granularity of error containment is also
problematic: AS_EIO is a mapping-wide flag, so a whole file is tainted by
a single error, which is not desirable. These problems are solved with a
new pagecache tag.

In the pagecache tag approach, we have to allocate another page and link
it into the pagecache tree at the error address in order to keep the
radix_tree_node for that address in memory, which makes the code more
complex. But it lets us introduce error recovery with a full page
overwrite (added in a later patch).

Unifying error reporting between memory errors and normal IO errors is
the ideal in the long run, but let's first solve the memory error side
separately. I hope that some code in this patch will be helpful when
thinking about the unification.
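
As a hypothetical userspace illustration of the finer granularity
(assuming 4kB pages, a made-up file "datafile" whose second page is
quasi-hwpoisoned, and EHWPOISON from <errno.h>), only reads touching the
error page fail, while the rest of the file stays readable:

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	char buf[4096];
	int fd = open("datafile", O_RDONLY);

	if (fd < 0)
		return 1;
	/* a read overlapping the error page fails by itself ... */
	if (pread(fd, buf, sizeof(buf), 4096) < 0 && errno == EHWPOISON)
		fprintf(stderr, "page at offset 4096 reports the error\n");
	/* ... while healthy pages of the same file stay readable */
	if (pread(fd, buf, sizeof(buf), 0) == 4096)
		fprintf(stderr, "page at offset 0 reads fine\n");
	close(fd);
	return 0;
}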

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
---
 include/linux/fs.h         |   3 +
 include/linux/pagemap.h    |  11 +++
 include/linux/radix-tree.h |   4 ++
 mm/filemap.c               |  14 ++++
 mm/memory-failure.c        | 170 +++++++++++++++++++++++++++++++++++----------
 5 files changed, 167 insertions(+), 35 deletions(-)

diff --git v3.14-rc6.orig/include/linux/fs.h v3.14-rc6/include/linux/fs.h
index 60829565e552..1e8966919044 100644
--- v3.14-rc6.orig/include/linux/fs.h
+++ v3.14-rc6/include/linux/fs.h
@@ -475,6 +475,9 @@ struct block_device {
 #define PAGECACHE_TAG_DIRTY	0
 #define PAGECACHE_TAG_WRITEBACK	1
 #define PAGECACHE_TAG_TOWRITE	2
+#ifdef CONFIG_MEMORY_FAILURE
+#define PAGECACHE_TAG_HWPOISON	3
+#endif
 
 int mapping_tagged(struct address_space *mapping, int tag);
 
diff --git v3.14-rc6.orig/include/linux/pagemap.h v3.14-rc6/include/linux/pagemap.h
index 70adf09a4cfc..5e234d0d0baf 100644
--- v3.14-rc6.orig/include/linux/pagemap.h
+++ v3.14-rc6/include/linux/pagemap.h
@@ -586,4 +586,15 @@ static inline int add_to_page_cache(struct page *page,
 	return error;
 }
 
+#ifdef CONFIG_MEMORY_FAILURE
+bool mapping_hwpoisoned_range(struct address_space *mapping,
+				loff_t start_byte, loff_t end_byte);
+#else
+static inline bool mapping_hwpoisoned_range(struct address_space *mapping,
+				loff_t start_byte, loff_t end_byte)
+{
+	return false;
+}
+#endif /* CONFIG_MEMORY_FAILURE */
+
 #endif /* _LINUX_PAGEMAP_H */
diff --git v3.14-rc6.orig/include/linux/radix-tree.h v3.14-rc6/include/linux/radix-tree.h
index 6e14a8e06105..9bbc36eb5fc5 100644
--- v3.14-rc6.orig/include/linux/radix-tree.h
+++ v3.14-rc6/include/linux/radix-tree.h
@@ -58,7 +58,11 @@ static inline int radix_tree_is_indirect_ptr(void *ptr)
 
 /*** radix-tree API starts here ***/
 
+#ifdef CONFIG_MEMORY_FAILURE
+#define RADIX_TREE_MAX_TAGS 4
+#else
 #define RADIX_TREE_MAX_TAGS 3
+#endif
 
 /* root tags are stored in gfp_mask, shifted by __GFP_BITS_SHIFT */
 struct radix_tree_root {
diff --git v3.14-rc6.orig/mm/filemap.c v3.14-rc6/mm/filemap.c
index 8c24eda539d8..887f2dfaf185 100644
--- v3.14-rc6.orig/mm/filemap.c
+++ v3.14-rc6/mm/filemap.c
@@ -285,6 +285,12 @@ int filemap_fdatawait_range(struct address_space *mapping, loff_t start_byte,
 	if (end_byte < start_byte)
 		goto out;
 
+	if (unlikely(mapping_hwpoisoned_range(mapping, start_byte,
+					      end_byte + 1))) {
+		ret = -EHWPOISON;
+		goto out;
+	}
+
 	pagevec_init(&pvec, 0);
 	while ((index <= end) &&
 			(nr_pages = pagevec_lookup_tag(&pvec, mapping, &index,
@@ -1133,6 +1139,10 @@ static void do_generic_file_read(struct file *filp, loff_t *ppos,
 			if (unlikely(page == NULL))
 				goto no_cached_page;
 		}
+		if (unlikely(PageHWPoison(page))) {
+			error = -EHWPOISON;
+			goto readpage_error;
+		}
 		if (PageReadahead(page)) {
 			page_cache_async_readahead(mapping,
 					ra, filp, page,
@@ -2100,6 +2110,10 @@ inline int generic_write_checks(struct file *file, loff_t *pos, size_t *count, i
         if (unlikely(*pos < 0))
                 return -EINVAL;
 
+	if (unlikely(mapping_hwpoisoned_range(file->f_mapping, *pos,
+					      *pos + *count)))
+		return -EHWPOISON;
+
 	if (!isblk) {
 		/* FIXME: this is for backwards compatibility with 2.4 */
 		if (file->f_flags & O_APPEND)
diff --git v3.14-rc6.orig/mm/memory-failure.c v3.14-rc6/mm/memory-failure.c
index 1feeff9770cd..34f2c046af22 100644
--- v3.14-rc6.orig/mm/memory-failure.c
+++ v3.14-rc6/mm/memory-failure.c
@@ -55,6 +55,7 @@
 #include <linux/memory_hotplug.h>
 #include <linux/mm_inline.h>
 #include <linux/kfifo.h>
+#include <linux/pagevec.h>
 #include "internal.h"
 
 int sysctl_memory_failure_early_kill __read_mostly = 0;
@@ -611,55 +612,154 @@ static int me_pagecache_clean(struct page *p, unsigned long pfn)
 }
 
 /*
+ * Check PAGECACHE_TAG_HWPOISON within a given address range, and return
+ * true if we find at least one page with the tag set.
+ */
+bool mapping_hwpoisoned_range(struct address_space *mapping,
+				loff_t start_byte, loff_t end_byte)
+{
+	void **slot;
+	struct radix_tree_iter iter;
+	pgoff_t start_index;
+	pgoff_t end_index = 0;
+	bool hwpoisoned = false;
+	if (!sysctl_memory_failure_recovery)
+		return false;
+	start_index = start_byte >> PAGE_CACHE_SHIFT;
+	if (end_byte > 0)
+		end_index = (end_byte - 1) >> PAGE_CACHE_SHIFT;
+	rcu_read_lock();
+	radix_tree_for_each_tagged(slot, &mapping->page_tree, &iter,
+			start_index, end_index, PAGECACHE_TAG_HWPOISON) {
+		hwpoisoned = true;
+		break;
+	}
+	rcu_read_unlock();
+	return hwpoisoned;
+}
+
+static bool get_pagecache_tag_hwpoison(struct address_space *mapping,
+					pgoff_t index)
+{
+	bool tag;
+	rcu_read_lock();
+	tag = radix_tree_tag_get(&mapping->page_tree, index,
+				 PAGECACHE_TAG_HWPOISON);
+	rcu_read_unlock();
+	return tag;
+}
+
+static void set_pagecache_tag_hwpoison(struct address_space *mapping,
+					pgoff_t idx)
+{
+	spin_lock_irq(&mapping->tree_lock);
+	radix_tree_tag_set(&mapping->page_tree, idx, PAGECACHE_TAG_HWPOISON);
+	spin_unlock_irq(&mapping->tree_lock);
+}
+
+static void clear_pagecache_tag_hwpoison(struct address_space *mapping,
+					pgoff_t idx)
+{
+	spin_lock_irq(&mapping->tree_lock);
+	radix_tree_tag_clear(&mapping->page_tree, idx, PAGECACHE_TAG_HWPOISON);
+	spin_unlock_irq(&mapping->tree_lock);
+}
+
+/*
  * Dirty pagecache page
+ *
+ * Memory error reporting (important especially on dirty pagecache error
+ * because dirty data is lost) with AS_EIO flag has some problems:
+ *  1) AS_EIO is not sticky, so when a thread receives an error report and
+ *     fails to take proper action on it, the error flag is lost, and
+ *     other threads read/write old data from storage and use it
+ *     as if no memory error had happened.
+ *  2) mapping->flags is file-wide information, while the memory error is
+ *     an event on a single page, so we lose the info about where in the
+ *     file the corruption is.
+ *  3) Even a dirty pagecache error can be recoverable if there is a copy
+ *     of the newest data in user processes' buffers, but with AS_EIO
+ *     we can't handle that case.
+ *
+ * To solve these, we handle dirty pagecache errors by replacing the error
+ * page with an alternative one which has PAGECACHE_TAG_HWPOISON set at its
+ * page index in mapping->page_tree. Although setting PAGECACHE_TAG_HWPOISON
+ * is enough for that purpose, we also set PG_HWPoison for users to find the
+ * page easily (for example with tools/vm/page-types.c). The page looks
+ * similar to a normal hwpoisoned page, but it is not isolated (it stays in
+ * the pagecache), and the memory at the physical address is not corrupted.
+ *
+ * This quasi-hwpoisoned page works to keep reporting the error for all
+ * processes which try to access the error address until it is resolved
+ * or the system reboots.
+ *
  * Issues: when the error hit a hole page the error is not properly
  * propagated.
  */
 static int me_pagecache_dirty(struct page *p, unsigned long pfn)
 {
+	int ret;
 	struct address_space *mapping = page_mapping(p);
+	pgoff_t index;
+	struct inode *inode = NULL;
+	struct page *new;
 
 	SetPageError(p);
-	/* TBD: print more information about the file. */
 	if (mapping) {
+		index = page_index(p);
+		/*
+		 * We take an inode refcount to keep its pagecache (mapping)
+		 * in memory until the error is resolved.
+		 */
+		inode = igrab(mapping->host);
+		pr_info("MCE %#lx: memory error on dirty pagecache (page offset:%lu, inode:%lu, dev:%s)\n",
+			page_to_pfn(p), index, inode->i_ino, inode->i_sb->s_id);
+	}
+
+	ret = me_pagecache_clean(p, pfn);
+
+	if (inode) {
+		/*
+		 * There's a potential race where some other thread can
+		 * allocate another page and add it at the error address of
+		 * the mapping, before the below code adds an alternative
+		 * (quasi-hwpoisoned) page. In that case, we detect it by
+		 * the failure of add_to_page_cache_lru(), and we give up
+		 * error containment (falling back to the old AS_EIO path).
+		 */
+		new = page_cache_alloc_cold(mapping);
+		if (!new)
+			goto out_iput;
+		ret = add_to_page_cache_lru(new, mapping, page_index(p),
+					    GFP_KERNEL);
+		if (ret)
+			goto out_put_page;
 		/*
-		 * IO error will be reported by write(), fsync(), etc.
-		 * who check the mapping.
-		 * This way the application knows that something went
-		 * wrong with its dirty file data.
-		 *
-		 * There's one open issue:
-		 *
-		 * The EIO will be only reported on the next IO
-		 * operation and then cleared through the IO map.
-		 * Normally Linux has two mechanisms to pass IO error
-		 * first through the AS_EIO flag in the address space
-		 * and then through the PageError flag in the page.
-		 * Since we drop pages on memory failure handling the
-		 * only mechanism open to use is through AS_AIO.
-		 *
-		 * This has the disadvantage that it gets cleared on
-		 * the first operation that returns an error, while
-		 * the PageError bit is more sticky and only cleared
-		 * when the page is reread or dropped.  If an
-		 * application assumes it will always get error on
-		 * fsync, but does other operations on the fd before
-		 * and the page is dropped between then the error
-		 * will not be properly reported.
-		 *
-		 * This can already happen even without hwpoisoned
-		 * pages: first on metadata IO errors (which only
-		 * report through AS_EIO) or when the page is dropped
-		 * at the wrong time.
-		 *
-		 * So right now we assume that the application DTRT on
-		 * the first EIO, but we're not worse than other parts
-		 * of the kernel.
+		 * The newly allocated page can remain on a pagevec; without
+		 * draining it, the subsequent isolation doesn't work.
 		 */
-		mapping_set_error(mapping, EIO);
+		lru_add_drain_all();
+		if (isolate_lru_page(new))
+			goto out;
+		inc_zone_page_state(new, NR_ISOLATED_ANON +
+				    page_is_file_cache(new));
+		SetPageHWPoison(new);
+		page_cache_release(new);
+		set_pagecache_tag_hwpoison(mapping, page_index(p));
+		unlock_page(new);
+		ret = RECOVERED;
 	}
+	return ret;
 
-	return me_pagecache_clean(p, pfn);
+out:
+	delete_from_page_cache(new);
+	unlock_page(new);
+out_put_page:
+	page_cache_release(new);
+out_iput:
+	iput(mapping->host);
+	mapping_set_error(mapping, EIO);
+	return FAILED;
 }
 
 /*
-- 
1.8.5.3


* [PATCH 3/6] mm/memory-failure.c: add code to resolve quasi-hwpoisoned page
  2014-03-13 21:39 ` Naoya Horiguchi
@ 2014-03-13 21:39   ` Naoya Horiguchi
  -1 siblings, 0 replies; 21+ messages in thread
From: Naoya Horiguchi @ 2014-03-13 21:39 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrew Morton, Andi Kleen, Wu Fengguang, Tony Luck, Wanpeng Li,
	Dave Chinner, Jun'ichi Nomura, linux-mm

This patch introduces three ways to resolve quasi-hwpoisoned pages (a
userspace sketch follows the list):
 1. unpoison: this is a testing feature, but if users accept the data loss
    (and then continue by rereading the old data from storage), it could be
    tolerable.
 2. truncate: if discarding the part of the file which includes the memory
    error is OK for your applications, this could be reasonable too.
 3. full page overwrite: if your application is prepared for dirty
    pagecache errors and has a copy of the data (or can recreate the proper
    data), it can overwrite the page-sized address range around the error
    and continue to run without caring about the error.
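
From userspace, resolutions 2 and 3 could look like this minimal,
hypothetical sketch (it assumes 4kB pages, and that err_off, the byte
offset of the reported error, is already known, e.g. located with the
file scanning mode added to tools/vm/page-types.c in patch 5):

#include <sys/types.h>
#include <unistd.h>

#define PAGESZ 4096UL	/* assumption: 4kB pages */

/* resolution 3: recreate the lost data and overwrite the whole page */
static int overwrite_recover(int fd, off_t err_off, const char *good_page)
{
	off_t page_start = err_off & ~(off_t)(PAGESZ - 1);

	/*
	 * A write that only partially covers the error page is rejected
	 * with EHWPOISON; it must span the full page so that the
	 * quasi-hwpoisoned page can be resolved during the write.
	 */
	return pwrite(fd, good_page, PAGESZ, page_start) == PAGESZ ? 0 : -1;
}

/* resolution 2: discard the tail of the file including the error page */
static int truncate_recover(int fd, off_t err_off)
{
	return ftruncate(fd, err_off & ~(off_t)(PAGESZ - 1));
}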

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
---
 include/linux/pagemap.h | 16 +++++++++++++
 mm/filemap.c            | 14 ++++++++---
 mm/memory-failure.c     | 62 ++++++++++++++++++++++++++++++++++++++++++++++++-
 mm/truncate.c           |  7 ++++++
 4 files changed, 95 insertions(+), 4 deletions(-)

diff --git v3.14-rc6.orig/include/linux/pagemap.h v3.14-rc6/include/linux/pagemap.h
index 5e234d0d0baf..715962f7ea7a 100644
--- v3.14-rc6.orig/include/linux/pagemap.h
+++ v3.14-rc6/include/linux/pagemap.h
@@ -589,12 +589,28 @@ static inline int add_to_page_cache(struct page *page,
 #ifdef CONFIG_MEMORY_FAILURE
 bool mapping_hwpoisoned_range(struct address_space *mapping,
 				loff_t start_byte, loff_t end_byte);
+bool page_quasi_hwpoisoned(struct address_space *mapping, struct page *page);
+void hwpoison_resolve_pagecache_error(struct address_space *mapping,
+				struct page *page, bool free);
+bool hwpoison_partial_overwrite(struct address_space *mapping,
+				loff_t pos, size_t count);
 #else
 static inline bool mapping_hwpoisoned_range(struct address_space *mapping,
 				loff_t start_byte, loff_t end_byte)
 {
 	return false;
 }
+static inline bool page_quasi_hwpoisoned(struct address_space *mapping,
+					struct page *page)
+{
+	return false;
+}
+#define hwpoison_resolve_pagecache_error(mapping, page, free) do {} while (0)
+static inline bool hwpoison_partial_overwrite(struct address_space *mapping,
+				loff_t pos, size_t count)
+{
+	return false;
+}
 #endif /* CONFIG_MEMORY_FAILURE */
 
 #endif /* _LINUX_PAGEMAP_H */
diff --git v3.14-rc6.orig/mm/filemap.c v3.14-rc6/mm/filemap.c
index 887f2dfaf185..f58b36e313ad 100644
--- v3.14-rc6.orig/mm/filemap.c
+++ v3.14-rc6/mm/filemap.c
@@ -2110,8 +2110,7 @@ inline int generic_write_checks(struct file *file, loff_t *pos, size_t *count, i
         if (unlikely(*pos < 0))
                 return -EINVAL;
 
-	if (unlikely(mapping_hwpoisoned_range(file->f_mapping, *pos,
-					      *pos + *count)))
+	if (unlikely(hwpoison_partial_overwrite(file->f_mapping, *pos, *count)))
 		return -EHWPOISON;
 
 	if (!isblk) {
@@ -2222,7 +2221,13 @@ generic_file_direct_write(struct kiocb *iocb, const struct iovec *iov,
 	end = (pos + write_len - 1) >> PAGE_CACHE_SHIFT;
 
 	written = filemap_write_and_wait_range(mapping, pos, pos + write_len - 1);
-	if (written)
+	/*
+	 * When the write range includes a hwpoisoned region (written is
+	 * then -EHWPOISON), we have already confirmed in
+	 * generic_write_checks() that it is a full page overwrite and can
+	 * safely invalidate the error, so the write doesn't have to fail.
+	 */
+	if (written && written != -EHWPOISON)
 		goto out;
 
 	/*
@@ -2362,6 +2367,9 @@ static ssize_t generic_perform_write(struct file *file,
 		if (mapping_writably_mapped(mapping))
 			flush_dcache_page(page);
 
+		if (page_quasi_hwpoisoned(mapping, page))
+			hwpoison_resolve_pagecache_error(mapping, page, false);
+
 		pagefault_disable();
 		copied = iov_iter_copy_from_user_atomic(page, i, offset, bytes);
 		pagefault_enable();
diff --git v3.14-rc6.orig/mm/memory-failure.c v3.14-rc6/mm/memory-failure.c
index 34f2c046af22..0eca5449d251 100644
--- v3.14-rc6.orig/mm/memory-failure.c
+++ v3.14-rc6/mm/memory-failure.c
@@ -665,6 +665,57 @@ static void clear_pagecache_tag_hwpoison(struct address_space *mapping,
 	spin_unlock_irq(&mapping->tree_lock);
 }
 
+inline bool page_quasi_hwpoisoned(struct address_space *mapping,
+					struct page *page)
+{
+	if (!sysctl_memory_failure_recovery)
+		return false;
+	return unlikely(get_pagecache_tag_hwpoison(mapping, page_index(page)));
+}
+
+/*
+ * This function clears a quasi-hwpoisoned page and turns it into a normal
+ * LRU page. Callers should check that @page is really quasi-hwpoisoned,
+ * and must not call this for real error pages.
+ */
+void hwpoison_resolve_pagecache_error(struct address_space *mapping,
+				      struct page *page, bool free)
+{
+	VM_BUG_ON(PageLRU(page));
+	VM_BUG_ON(!PageLocked(page));
+
+	ClearPageHWPoison(page);
+	clear_pagecache_tag_hwpoison(mapping, page_index(page));
+	dec_zone_page_state(page, NR_ISOLATED_ANON + page_is_file_cache(page));
+	putback_lru_page(page);
+	if (free) {
+		lru_add_drain_all();
+		delete_from_page_cache(page);
+	}
+	iput(mapping->host);
+}
+
+/*
+ * Return true if a given range [pos, pos+count) *partially* overlaps with
+ * a hwpoisoned page. Effectively it checks only the boundary pages.
+ */
+bool hwpoison_partial_overwrite(struct address_space *mapping,
+				loff_t pos, size_t count)
+{
+	if (!sysctl_memory_failure_recovery)
+		return false;
+	if (!mapping_hwpoisoned_range(mapping, pos, pos + count))
+		return false;
+
+	if (!PAGE_ALIGNED(pos) &&
+	    get_pagecache_tag_hwpoison(mapping, pos >> PAGE_SHIFT))
+		return true;
+	if (!PAGE_ALIGNED(pos + count) &&
+	    get_pagecache_tag_hwpoison(mapping, (pos + count) >> PAGE_SHIFT))
+		return true;
+	return false;
+}
+
 /*
  * Dirty pagecache page
  *
@@ -691,7 +742,10 @@ static void clear_pagecache_tag_hwpoison(struct address_space *mapping,
  *
  * This quasi-hwpoisoned page works to keep reporting the error for all
  * processes which try to access to the error address until it is resolved
- * or the system reboots.
+ * or the system reboots. Quasi-hwpoisoned pages can be resolved by unpoison,
+ * truncate, and full page overwrite. With a full page overwrite, the quasi-
+ * hwpoisoned page safely turns back into a normal LRU page, so we expect
+ * userspace to do this, when possible, upon receiving the error report.
  *
  * Issues: when the error hit a hole page the error is not properly
  * propagated.
@@ -1496,12 +1550,18 @@ int unpoison_memory(unsigned long pfn)
 	 * the free buddy page pool.
 	 */
 	if (TestClearPageHWPoison(page)) {
+		struct address_space *mapping = page_mapping(page);
+		if (mapping && page_quasi_hwpoisoned(mapping, page)) {
+			hwpoison_resolve_pagecache_error(mapping, page, true);
+			goto unlock;
+		}
 		pr_info("MCE: Software-unpoisoned page %#lx\n", pfn);
 		atomic_long_sub(nr_pages, &num_poisoned_pages);
 		freeit = 1;
 		if (PageHuge(page))
 			clear_page_hwpoison_huge_page(page);
 	}
+unlock:
 	unlock_page(page);
 
 	put_page(page);
diff --git v3.14-rc6.orig/mm/truncate.c v3.14-rc6/mm/truncate.c
index 353b683afd6e..92d7097dfc6d 100644
--- v3.14-rc6.orig/mm/truncate.c
+++ v3.14-rc6/mm/truncate.c
@@ -103,6 +103,10 @@ truncate_complete_page(struct address_space *mapping, struct page *page)
 	cancel_dirty_page(page, PAGE_CACHE_SIZE);
 
 	ClearPageMappedToDisk(page);
+
+	if (page_quasi_hwpoisoned(mapping, page))
+		hwpoison_resolve_pagecache_error(mapping, page, false);
+
 	delete_from_page_cache(page);
 	return 0;
 }
@@ -439,6 +443,9 @@ invalidate_complete_page2(struct address_space *mapping, struct page *page)
 	if (page_has_private(page) && !try_to_release_page(page, GFP_KERNEL))
 		return 0;
 
+	if (page_quasi_hwpoisoned(mapping, page))
+		hwpoison_resolve_pagecache_error(mapping, page, false);
+
 	spin_lock_irq(&mapping->tree_lock);
 	if (PageDirty(page))
 		goto failed;
-- 
1.8.5.3


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 4/6] fs/proc/page.c: introduce /proc/kpagecache interface
  2014-03-13 21:39 ` Naoya Horiguchi
@ 2014-03-13 21:39   ` Naoya Horiguchi
  -1 siblings, 0 replies; 21+ messages in thread
From: Naoya Horiguchi @ 2014-03-13 21:39 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrew Morton, Andi Kleen, Wu Fengguang, Tony Luck, Wanpeng Li,
	Dave Chinner, Jun'ichi Nomura, linux-mm

/proc/pid/pagemap is a powerful feature for analyzing and testing page
mappings. Combined with /proc/kpageflags or /proc/kpagecount, it is also
useful for inspecting page status. What is missing is a similar interface
to scan over the pagecache of a given file without opening it or mapping it
into a virtual address space, which could disturb other workloads. This
patch provides such an interface.

Usage is simple: 1) write a file path to be scanned into the interface,
and 2) read 64-bit entries, each of which is associated with the page at
each page index.

A good in-tree example is tools/vm/page-types.c (code is added to it in a
later patch); a standalone sketch follows.
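
For illustration, a minimal standalone sketch of the intended usage (this
program is not part of the patch; it assumes the 64-bit entry layout
documented in patch 6 and omits most error handling):

#include <fcntl.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	uint64_t entry;
	off_t index = 0;
	int fd;

	if (argc < 2)
		return 1;
	fd = open("/proc/kpagecache", O_RDWR);
	if (fd < 0)
		return 1;
	/* 1) register the file to be scanned by writing its path */
	if (write(fd, argv[1], strlen(argv[1])) < 0)
		return 1;
	/* 2) read one 64-bit entry per page index (8-byte aligned reads) */
	while (pread(fd, &entry, sizeof(entry), index * sizeof(entry)) ==
	       (ssize_t)sizeof(entry)) {
		printf("pgoff %jd: %016" PRIx64 "\n", (intmax_t)index, entry);
		index++;
	}
	close(fd);
	return 0;
}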

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
---
 fs/proc/page.c     | 106 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/fs.h |  11 ++++--
 2 files changed, 113 insertions(+), 4 deletions(-)

diff --git v3.14-rc6.orig/fs/proc/page.c v3.14-rc6/fs/proc/page.c
index e647c55275d9..4be6f72783d3 100644
--- v3.14-rc6.orig/fs/proc/page.c
+++ v3.14-rc6/fs/proc/page.c
@@ -9,6 +9,8 @@
 #include <linux/seq_file.h>
 #include <linux/hugetlb.h>
 #include <linux/kernel-page-flags.h>
+#include <linux/path.h>
+#include <linux/namei.h>
 #include <asm/uaccess.h>
 #include "internal.h"
 
@@ -212,10 +214,114 @@ static const struct file_operations proc_kpageflags_operations = {
 	.read = kpageflags_read,
 };
 
+static struct path kpagecache_path;
+
+#define KPC_TAGS_BITS	__NR_PAGECACHE_TAGS
+#define KPC_TAGS_OFFSET	(64 - KPC_TAGS_BITS)
+#define KPC_TAGS_MASK	(((1LL << KPC_TAGS_BITS) - 1) << KPC_TAGS_OFFSET)
+#define KPC_TAGS(bits)	(((bits) << KPC_TAGS_OFFSET) & KPC_TAGS_MASK)
+/* a few bits remaining between two fields. */
+#define KPC_PFN_BITS	(64 - PAGE_CACHE_SHIFT)
+#define KPC_PFN_MASK	((1LL << KPC_PFN_BITS) - 1)
+#define KPC_PFN(pfn)	((pfn) & KPC_PFN_MASK)
+
+static u64 get_pagecache_tags(struct radix_tree_root *root, unsigned long index)
+{
+	int i;
+	unsigned long tags = 0;
+	for (i = 0; i < __NR_PAGECACHE_TAGS; i++)
+		if (radix_tree_tag_get(root, index, i))
+			tags |=  1 << i;
+	return KPC_TAGS(tags);
+}
+
+static ssize_t kpagecache_read(struct file *file, char __user *buf,
+				size_t count, loff_t *ppos)
+{
+	u64 __user *out = (u64 __user *)buf;
+	unsigned long src = *ppos;
+	struct address_space *mapping;
+	loff_t size;
+	pgoff_t index;
+	struct radix_tree_iter iter;
+	void **slot;
+	ssize_t ret = 0;
+
+	if (!kpagecache_path.dentry)
+		return 0;
+	if (src & KPMMASK || count & KPMMASK)
+		return -EINVAL;
+	mapping = kpagecache_path.dentry->d_inode->i_mapping;
+	size = i_size_read(mapping->host);
+	if (!size)
+		return 0;
+	size = (size - 1) >> PAGE_CACHE_SHIFT;
+	index = src / KPMSIZE;
+	count = min_t(unsigned long, count, ((size + 1) * KPMSIZE) - src);
+
+	rcu_read_lock();
+	radix_tree_for_each_slot(slot, &mapping->page_tree,
+				 &iter, index, index + count / KPMSIZE - 1) {
+		struct page *page = radix_tree_deref_slot(slot);
+		u64 entry;
+		if (unlikely(!page))
+			continue;
+		entry = get_pagecache_tags(&mapping->page_tree, iter.index);
+		entry |= KPC_PFN(page_to_pfn(page));
+		count = (iter.index - index + 1) * KPMSIZE;
+		if (put_user(entry, out + iter.index - index))
+			break;
+	}
+	rcu_read_unlock();
+	*ppos += count;
+	if (!ret)
+		ret = count;
+	return ret;
+}
+
+static ssize_t kpagecache_write(struct file *file, const char __user *pathname,
+			       size_t count, loff_t *ppos)
+{
+	struct path path;
+	int err;
+	struct address_space *mapping;
+
+	if (!capable(CAP_SYS_ADMIN))
+		return -EPERM;
+
+	if (!pathname) {
+		if (kpagecache_path.dentry) {
+			path_put(&kpagecache_path);
+			kpagecache_path.mnt = NULL;
+			kpagecache_path.dentry = NULL;
+		}
+		return count;
+	}
+
+	err = user_path_at(AT_FDCWD, pathname, LOOKUP_FOLLOW, &path);
+	if (err)
+		return -EINVAL;
+	if (kpagecache_path.dentry != path.dentry) {
+		path_put(&kpagecache_path);
+		kpagecache_path.mnt = path.mnt;
+		kpagecache_path.dentry = path.dentry;
+	} else
+		path_put(&path);
+	return count;
+}
+
+static const struct file_operations proc_kpagecache_operations = {
+	.llseek		= mem_lseek,
+	.read		= kpagecache_read,
+	.write		= kpagecache_write,
+};
+
 static int __init proc_page_init(void)
 {
 	proc_create("kpagecount", S_IRUSR, NULL, &proc_kpagecount_operations);
 	proc_create("kpageflags", S_IRUSR, NULL, &proc_kpageflags_operations);
+	proc_create("kpagecache", S_IRUSR|S_IWUSR, NULL,
+			&proc_kpagecache_operations);
 	return 0;
 }
 fs_initcall(proc_page_init);
diff --git v3.14-rc6.orig/include/linux/fs.h v3.14-rc6/include/linux/fs.h
index 1e8966919044..6bf7ddcfc138 100644
--- v3.14-rc6.orig/include/linux/fs.h
+++ v3.14-rc6/include/linux/fs.h
@@ -472,12 +472,15 @@ struct block_device {
  * Radix-tree tags, for tagging dirty and writeback pages within the pagecache
  * radix trees
  */
-#define PAGECACHE_TAG_DIRTY	0
-#define PAGECACHE_TAG_WRITEBACK	1
-#define PAGECACHE_TAG_TOWRITE	2
+enum {
+	PAGECACHE_TAG_DIRTY,
+	PAGECACHE_TAG_WRITEBACK,
+	PAGECACHE_TAG_TOWRITE,
 #ifdef CONFIG_MEMORY_FAILURE
-#define PAGECACHE_TAG_HWPOISON	3
+	PAGECACHE_TAG_HWPOISON,
 #endif
+	__NR_PAGECACHE_TAGS,
+};
 
 int mapping_tagged(struct address_space *mapping, int tag);
 
-- 
1.8.5.3


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 5/6] tools/vm/page-types.c: add file scanning mode
  2014-03-13 21:39 ` Naoya Horiguchi
@ 2014-03-13 21:39   ` Naoya Horiguchi
  -1 siblings, 0 replies; 21+ messages in thread
From: Naoya Horiguchi @ 2014-03-13 21:39 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrew Morton, Andi Kleen, Wu Fengguang, Tony Luck, Wanpeng Li,
	Dave Chinner, Jun'ichi Nomura, linux-mm

This patch introduces a new file scanning mode: when page-types is called
with -f <filepath>, it registers the given file with /proc/kpagecache and
scans the pages in that file's pagecache.
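
For example (the file path here is hypothetical; root is needed because
/proc/kpagecache is created with mode 0600):

  $ sudo tools/vm/page-types -f /var/log/messages -l

This registers /var/log/messages via /proc/kpagecache and prints one
pgoff/pfn/tags/flags line per page resident in the pagecache.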

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
---
 tools/vm/page-types.c | 117 +++++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 101 insertions(+), 16 deletions(-)

diff --git v3.14-rc6.orig/tools/vm/page-types.c v3.14-rc6/tools/vm/page-types.c
index f9be24d9efac..e9f1882378c7 100644
--- v3.14-rc6.orig/tools/vm/page-types.c
+++ v3.14-rc6/tools/vm/page-types.c
@@ -33,6 +33,7 @@
 #include <sys/errno.h>
 #include <sys/fcntl.h>
 #include <sys/mount.h>
+#include <sys/stat.h>
 #include <sys/statfs.h>
 #include "../../include/uapi/linux/magic.h"
 #include "../../include/uapi/linux/kernel-page-flags.h"
@@ -75,6 +76,7 @@
 
 #define KPF_BYTES		8
 #define PROC_KPAGEFLAGS		"/proc/kpageflags"
+#define PROC_KPAGECACHE		"/proc/kpagecache"
 
 /* [32-] kernel hacking assistances */
 #define KPF_RESERVED		32
@@ -158,6 +160,7 @@ static int		opt_raw;	/* for kernel developers */
 static int		opt_list;	/* list pages (in ranges) */
 static int		opt_no_summary;	/* don't show summary */
 static pid_t		opt_pid;	/* process to walk */
+static int		opt_file;	/* walk over pagecache of file */
 
 #define MAX_ADDR_RANGES	1024
 static int		nr_addr_ranges;
@@ -178,6 +181,7 @@ static int		page_size;
 
 static int		pagemap_fd;
 static int		kpageflags_fd;
+static int		kpagecache_fd;
 
 static int		opt_hwpoison;
 static int		opt_unpoison;
@@ -276,6 +280,13 @@ static unsigned long kpageflags_read(uint64_t *buf,
 	return do_u64_read(kpageflags_fd, PROC_KPAGEFLAGS, buf, index, pages);
 }
 
+static unsigned long kpagecache_read(uint64_t *buf,
+				     unsigned long index,
+				     unsigned long pages)
+{
+	return do_u64_read(kpagecache_fd, PROC_KPAGECACHE, buf, index, pages);
+}
+
 static unsigned long pagemap_read(uint64_t *buf,
 				  unsigned long index,
 				  unsigned long pages)
@@ -358,7 +369,7 @@ static void show_page_range(unsigned long voffset,
 	}
 
 	if (count) {
-		if (opt_pid)
+		if (opt_pid || opt_file)
 			printf("%lx\t", voff);
 		printf("%lx\t%lx\t%s\n",
 				index, count, page_flag_name(flags0));
@@ -378,6 +389,19 @@ static void show_page(unsigned long voffset,
 	printf("%lx\t%s\n", offset, page_flag_name(flags));
 }
 
+#define __NR_PAGECACHE_TAGS	4
+#define KPC_TAGS_BITS	__NR_PAGECACHE_TAGS
+#define KPC_TAGS_OFFSET	(64 - KPC_TAGS_BITS)
+#define KPC_TAGS_MASK	(((1ULL << KPC_TAGS_BITS) - 1) << KPC_TAGS_OFFSET)
+#define KPC_TAGS(entry)	((entry & KPC_TAGS_MASK) >> KPC_TAGS_OFFSET)
+
+static void show_file_page(unsigned long voffset,
+			   unsigned long offset, uint64_t flags, uint64_t entry)
+{
+	printf("%lx\t%lx\t%llx\t%s\n",
+	       voffset, offset, KPC_TAGS(entry), page_flag_name(flags));
+}
+
 static void show_summary(void)
 {
 	size_t i;
@@ -564,10 +588,15 @@ static void add_page(unsigned long voffset,
 	if (opt_unpoison)
 		unpoison_page(offset);
 
-	if (opt_list == 1)
-		show_page_range(voffset, offset, flags);
-	else if (opt_list == 2)
-		show_page(voffset, offset, flags);
+	if (opt_pid || !opt_file) {
+		if (opt_list == 1)
+			show_page_range(voffset, offset, flags);
+		else if (opt_list == 2)
+			show_page(voffset, offset, flags);
+	} else {
+		if (opt_list)
+			show_file_page(voffset, offset, flags, pme);
+	}
 
 	nr_pages[hash_slot(flags)]++;
 	total_pages++;
@@ -646,6 +675,41 @@ static void walk_task(unsigned long index, unsigned long count)
 	}
 }
 
+char *kpagecache_path;
+struct stat kpagecache_stat;
+
+#define KPAGECACHE_BATCH	(64 << 10)	/* 64k pages */
+static void walk_file(unsigned long index, unsigned long count)
+{
+	uint64_t buf[KPAGECACHE_BATCH];
+	unsigned long batch;
+	unsigned long pages;
+	unsigned long pfn;
+	unsigned long i;
+	unsigned long end_index = count;
+	unsigned long size;
+
+	stat(kpagecache_path, &kpagecache_stat);
+	size = kpagecache_stat.st_size;
+	if (size > 0)
+		size = (size - 1) / 4096;
+	end_index = min_t(unsigned long, index + count - 1, size);
+	while (index <= end_index) {
+		batch = min_t(unsigned long, count, PAGEMAP_BATCH);
+		pages = kpagecache_read(buf, index, batch);
+		if (pages == 0)
+			break;
+		for (i = 0; i < pages; i++) {
+			pfn = buf[i] & ((1UL << 52) - 1UL);
+			if (pfn)
+				walk_pfn(index + i, pfn, 1, buf[i]);
+		}
+
+		index += pages;
+		count -= pages;
+	}
+}
+
 static void add_addr_range(unsigned long offset, unsigned long size)
 {
 	if (nr_addr_ranges >= MAX_ADDR_RANGES)
@@ -666,10 +730,12 @@ static void walk_addr_ranges(void)
 		add_addr_range(0, ULONG_MAX);
 
 	for (i = 0; i < nr_addr_ranges; i++)
-		if (!opt_pid)
-			walk_pfn(0, opt_offset[i], opt_size[i], 0);
-		else
+		if (opt_pid)
 			walk_task(opt_offset[i], opt_size[i]);
+		else if (opt_file)
+			walk_file(opt_offset[i], opt_size[i]);
+		else
+			walk_pfn(0, opt_offset[i], opt_size[i], 0);
 
 	close(kpageflags_fd);
 }
@@ -699,9 +765,7 @@ static void usage(void)
 "            -a|--addr    addr-spec     Walk a range of pages\n"
 "            -b|--bits    bits-spec     Walk pages with specified bits\n"
 "            -p|--pid     pid           Walk process address space\n"
-#if 0 /* planned features */
 "            -f|--file    filename      Walk file address space\n"
-#endif
 "            -l|--list                  Show page details in ranges\n"
 "            -L|--list-each             Show page details one by one\n"
 "            -N|--no-summary            Don't show summary info\n"
@@ -801,6 +865,18 @@ static void parse_pid(const char *str)
 
 static void parse_file(const char *name)
 {
+	int ret;
+	kpagecache_path = (char *)name;
+	kpagecache_fd = checked_open(PROC_KPAGECACHE, O_RDWR);
+	ret = write(kpagecache_fd, name, strlen(name));
+	if (ret != (int)strlen(name))
+		fatal("Failed to set file on %s\n", PROC_KPAGECACHE);
+}
+
+static void close_kpagecache(void)
+{
+	write(kpagecache_fd, NULL, 1);
+	close(kpagecache_fd);
 }
 
 static void parse_addr_range(const char *optarg)
@@ -953,6 +1029,7 @@ int main(int argc, char *argv[])
 			break;
 		case 'f':
 			parse_file(optarg);
+			opt_file = 1;
 			break;
 		case 'a':
 			parse_addr_range(optarg);
@@ -989,18 +1066,26 @@ int main(int argc, char *argv[])
 		}
 	}
 
-	if (opt_list && opt_pid)
-		printf("voffset\t");
-	if (opt_list == 1)
-		printf("offset\tlen\tflags\n");
-	if (opt_list == 2)
-		printf("offset\tflags\n");
+	if (opt_pid || !opt_file) {
+		if (opt_pid)
+			printf("voffset\t");
+		if (opt_list == 1)
+			printf("offset\tlen\tflags\n");
+		if (opt_list == 2)
+			printf("offset\tflags\n");
+	} else {
+		if (opt_list)
+			printf("pgoff\tpfn\ttags\tflags\n");
+	}
 
 	walk_addr_ranges();
 
 	if (opt_list == 1)
 		show_page_range(0, 0, 0);  /* drain the buffer */
 
+	if (opt_file == 1)
+		close_kpagecache();
+
 	if (opt_no_summary)
 		return 0;
 
-- 
1.8.5.3


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 6/6] Documentation: update Documentation/vm/pagemap.txt
  2014-03-13 21:39 ` Naoya Horiguchi
@ 2014-03-13 21:39   ` Naoya Horiguchi
  -1 siblings, 0 replies; 21+ messages in thread
From: Naoya Horiguchi @ 2014-03-13 21:39 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrew Morton, Andi Kleen, Wu Fengguang, Tony Luck, Wanpeng Li,
	Dave Chinner, Jun'ichi Nomura, linux-mm

This patch adds a chapter about the kpagecache interface.
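
For reference, one entry can be decoded along these lines (a sketch that
assumes the bit layout documented in this chapter: bits 0-49 PFN, bits
60-63 pagecache tags):

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* decode one /proc/kpagecache entry per the documented layout */
static void kpc_decode(uint64_t entry)
{
	uint64_t pfn  = entry & ((UINT64_C(1) << 50) - 1);	/* bits 0-49 */
	uint64_t tags = entry >> 60;				/* bits 60-63 */

	printf("pfn %" PRIx64 " tags %" PRIx64 "\n", pfn, tags);
}

int main(void)
{
	/* arbitrary test value: pfn 0x3305f with tag bit 0 (DIRTY) set */
	kpc_decode((UINT64_C(1) << 60) | 0x3305f);
	return 0;
}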

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
---
 Documentation/vm/pagemap.txt | 29 +++++++++++++++++++++++++++++
 1 file changed, 29 insertions(+)

diff --git v3.14-rc6.orig/Documentation/vm/pagemap.txt v3.14-rc6/Documentation/vm/pagemap.txt
index 5948e455c4d2..c8039263fc45 100644
--- v3.14-rc6.orig/Documentation/vm/pagemap.txt
+++ v3.14-rc6/Documentation/vm/pagemap.txt
@@ -150,3 +150,32 @@ once.
 Reading from any of the files will return -EINVAL if you are not starting
 the read on an 8-byte boundary (e.g., if you sought an odd number of bytes
 into the file), or if the size of the read is not a multiple of 8 bytes.
+
+
+kpagecache, from file perspective
+---------------------------------
+
+Similarly to pagemap, we have an interface /proc/kpagecache to let userspace
+know the pagecache profile of a given file. Unlike the pagemap interface,
+we don't have to mmap() and fault in the target file, so the impact on other
+workloads (which may be the profiling targets) is minimal.
+
+To use this interface, first open it and write the path of the target file
+to it for setup. Then the pagecache info of the file can be read from it.
+The file contains an array of 64-bit entries, one for each page offset. The
+data format is as follows:
+
+    * Bits  0-49  page frame number (PFN) if present
+    * Bits 50-59  zero (reserved)
+    * Bits 60-63  pagecache tags
+
+A good example is tools/vm/page-types.c, with which we can get the list of
+pages belonging to the file like below:
+
+  $ dd if=/dev/urandom of=file bs=4096 count=2
+  $ date >> file
+  $ tools/vm/page-types -f file -Nl
+  pgoff	pfn	tags	flags
+  0	3305f	0	__RU_l______________________________
+  1	374bb	0	__RU_l______________________________
+  2	6c5ac	1	___UDlA_____________________________
-- 
1.8.5.3


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 3/6] mm/memory-failure.c: add code to resolve quasi-hwpoisoned page
  2014-03-13 21:39 ` Naoya Horiguchi
                   ` (6 preceding siblings ...)
  (?)
@ 2014-03-13 22:28 ` Naoya Horiguchi
  -1 siblings, 0 replies; 21+ messages in thread
From: Naoya Horiguchi @ 2014-03-13 22:28 UTC (permalink / raw)
  To: linux-kernel
  Cc: akpm, andi, fengguang.wu, tony.luck, liwanp, david, j-nomura, linux-mm

# patch 3 might be lost, so I resend it. Thanks Tony.

This patch introduces three ways to resolve quasi-hwpoisoned pages:
 1. unpoison: this is a test feature, but if users accept the data loss (and
    then continue by rereading the old data from storage,) this could be
    tolerable.
 2. truncate: if discarding the part of a file which includes the memory
    error is OK for your applications, this could be reasonable too.
 3. full page overwrite: if your application is prepared for dirty pagecache
    errors and has a copy of the data (or can recreate the proper data,)
    it can overwrite the page-sized address range covering the error and
    continue to run without caring about the error (see the sketch below).
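
A rough userspace sketch of option 3 (this helper is hypothetical and not
part of the patch; it assumes 4kB pages and that the application can
regenerate the page's contents):

#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define PAGE_SZ 4096

/*
 * Hypothetical recovery helper: after a write failed with EHWPOISON,
 * rewrite the whole page containing 'offset' with regenerated data.
 * A partial overwrite would be rejected with -EHWPOISON again, so
 * exactly one aligned page is written.
 */
static int overwrite_error_page(int fd, off_t offset,
				const char data[PAGE_SZ])
{
	off_t page_start = offset - (offset % PAGE_SZ);

	return pwrite(fd, data, PAGE_SZ, page_start) == PAGE_SZ ? 0 : -1;
}

int main(void)
{
	char buf[PAGE_SZ];
	int fd = open("testfile", O_RDWR);	/* hypothetical file */

	if (fd < 0)
		return 1;
	memset(buf, 0, sizeof(buf));	/* stands in for regenerated data */
	return overwrite_error_page(fd, 12345, buf) ? 1 : 0;
}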

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* RE: [PATCH 4/6] fs/proc/page.c: introduce /proc/kpagecache interface
  2014-03-13 21:39   ` Naoya Horiguchi
@ 2014-03-13 23:09     ` Luck, Tony
  -1 siblings, 0 replies; 21+ messages in thread
From: Luck, Tony @ 2014-03-13 23:09 UTC (permalink / raw)
  To: Naoya Horiguchi, linux-kernel
  Cc: Andrew Morton, Andi Kleen, Wu, Fengguang, Wanpeng Li,
	Dave Chinner, Jun'ichi Nomura, linux-mm

> Usage is simple: 1) write a file path to be scanned into the interface,
> and 2) read 64-bit entries, each of which is associated with the page at
> each page index.

Do we have other interfaces that work like that?  I suppose this file is only open
to "root", so it may be safe to assume that applications using this won't stomp on
each other.

-Tony

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 4/6] fs/proc/page.c: introduce /proc/kpagecache interface
  2014-03-13 23:09     ` Luck, Tony
  (?)
@ 2014-03-13 23:44     ` Naoya Horiguchi
  -1 siblings, 0 replies; 21+ messages in thread
From: Naoya Horiguchi @ 2014-03-13 23:44 UTC (permalink / raw)
  To: tony.luck
  Cc: linux-kernel, akpm, andi, fengguang.wu, liwanp, david, j-nomura,
	linux-mm

On Thu, Mar 13, 2014 at 11:09:10PM +0000, Luck, Tony wrote:
> > Usage is simple: 1) write a file path to be scanned into the interface,
> > and 2) read 64-bit entries, each of which is associated with the page at
> > each page index.
> 
> Do we have other interfaces that work like that?

No, we don't. At first I thought of doing this under /proc/pid, but that did
not work because we want to scan files that no process has open.

> I suppose this is file is only open
> to "root", so it may be safe to assume that applications using this won't stomp on
> each other.

Right, this is only for testing/debugging purposes (at least for now), so
limiting access is safe.

Thanks,
Naoya

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 2/6] mm/memory-failure.c: report and recovery for memory error on dirty pagecache
  2014-03-13 21:39   ` Naoya Horiguchi
@ 2014-03-15  3:17     ` Andi Kleen
  -1 siblings, 0 replies; 21+ messages in thread
From: Andi Kleen @ 2014-03-15  3:17 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: linux-kernel, Andrew Morton, Andi Kleen, Wu Fengguang, Tony Luck,
	Wanpeng Li, Dave Chinner, Jun'ichi Nomura, linux-mm

On Thu, Mar 13, 2014 at 05:39:42PM -0400, Naoya Horiguchi wrote:
> Unifying error reporting between memory error and normal IO errors is ideal
> in a long run, but at first let's solve it separately. I hope that some code
> in this patch will be helpful when thinking of the unification.

The mechanisms should be very similar, right? 

It may be better to do both at the same time.

> index 60829565e552..1e8966919044 100644
> --- v3.14-rc6.orig/include/linux/fs.h
> +++ v3.14-rc6/include/linux/fs.h
> @@ -475,6 +475,9 @@ struct block_device {
>  #define PAGECACHE_TAG_DIRTY	0
>  #define PAGECACHE_TAG_WRITEBACK	1
>  #define PAGECACHE_TAG_TOWRITE	2
> +#ifdef CONFIG_MEMORY_FAILURE
> +#define PAGECACHE_TAG_HWPOISON	3
> +#endif

No need to ifdef defines

> @@ -1133,6 +1139,10 @@ static void do_generic_file_read(struct file *filp, loff_t *ppos,
>  			if (unlikely(page == NULL))
>  				goto no_cached_page;
>  		}
> +		if (unlikely(PageHWPoison(page))) {
> +			error = -EHWPOISON;
> +			goto readpage_error;
> +		}

Didn't we need this check before, independent of the rest of the patch?

>  		if (PageReadahead(page)) {
>  			page_cache_async_readahead(mapping,
>  					ra, filp, page,
> @@ -2100,6 +2110,10 @@ inline int generic_write_checks(struct file *file, loff_t *pos, size_t *count, i
>          if (unlikely(*pos < 0))
>                  return -EINVAL;
>  
> +	if (unlikely(mapping_hwpoisoned_range(file->f_mapping, *pos,
> +					      *pos + *count)))
> +		return -EHWPOISON;

How expensive is that check? This will happen on every write.
Can it be somehow combined with the normal page cache lookup?

>   * Dirty pagecache page
> + *
> + * Memory error reporting (important especially on dirty pagecache error
> + * because dirty data is lost) with AS_EIO flag has some problems:

It doesn't make sense to have changelogs in comments. That is what
git is for.  At some point no one will care about the previous code.

> + * To solve these, we handle dirty pagecache errors by replacing the error

This part of the comment is good.

> +	pgoff_t index;
> +	struct inode *inode = NULL;
> +	struct page *new;
>  
>  	SetPageError(p);
> -	/* TBD: print more information about the file. */
>  	if (mapping) {
> +		index = page_index(p);
> +		/*
> +		 * we take inode refcount to keep it's pagecache or mapping
> +		 * on the memory until the error is resolved.

How does that work? Who "resolves" the error? 

> +		 */
> +		inode = igrab(mapping->host);
> +		pr_info("MCE %#lx: memory error on dirty pagecache (page offset:%lu, inode:%lu, dev:%s)\n",

Add the word "file" somewhere; you need to explain this in terms that
normal sysadmins, and not only kernel hackers, can understand.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 2/6] mm/memory-failure.c: report and recovery for memory error on dirty pagecache
  2014-03-15  3:17     ` Andi Kleen
  (?)
@ 2014-03-15  6:23     ` Naoya Horiguchi
  -1 siblings, 0 replies; 21+ messages in thread
From: Naoya Horiguchi @ 2014-03-15  6:23 UTC (permalink / raw)
  To: andi
  Cc: linux-kernel, akpm, fengguang.wu, tony.luck, liwanp, david,
	j-nomura, linux-mm

On Sat, Mar 15, 2014 at 04:17:59AM +0100, Andi Kleen wrote:
> On Thu, Mar 13, 2014 at 05:39:42PM -0400, Naoya Horiguchi wrote:
> > Unifying error reporting between memory errors and normal IO errors is ideal
> > in the long run, but at first let's solve them separately. I hope that some
> > code in this patch will be helpful when thinking about the unification.
> 
> The mechanisms should be very similar, right? 

Yes.

> It may be better to do both at the same time.

Yes, that would be better, but it's not trivial to test and confirm that the
patches work correctly (and I need to learn more about IO error handling.)
Anyway, I'll try this, maybe in the next post.

> > index 60829565e552..1e8966919044 100644
> > --- v3.14-rc6.orig/include/linux/fs.h
> > +++ v3.14-rc6/include/linux/fs.h
> > @@ -475,6 +475,9 @@ struct block_device {
> >  #define PAGECACHE_TAG_DIRTY	0
> >  #define PAGECACHE_TAG_WRITEBACK	1
> >  #define PAGECACHE_TAG_TOWRITE	2
> > +#ifdef CONFIG_MEMORY_FAILURE
> > +#define PAGECACHE_TAG_HWPOISON	3
> > +#endif
> 
> No need to ifdef defines

OK. I found that if CONFIG_MEMORY_FAILURE=n, no one sets or checks this flag,
so it's not a problem that the number of PAGECACHE_TAG_* values exceeds
RADIX_TREE_MAX_TAGS (which is 3 if !CONFIG_MEMORY_FAILURE). I'll remove the
ifdef.
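
i.e. the resulting defines would simply be (a sketch, assuming the ifdef
is dropped as-is):

	#define PAGECACHE_TAG_DIRTY	0
	#define PAGECACHE_TAG_WRITEBACK	1
	#define PAGECACHE_TAG_TOWRITE	2
	#define PAGECACHE_TAG_HWPOISON	3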

> > @@ -1133,6 +1139,10 @@ static void do_generic_file_read(struct file *filp, loff_t *ppos,
> >  			if (unlikely(page == NULL))
> >  				goto no_cached_page;
> >  		}
> > +		if (unlikely(PageHWPoison(page))) {
> > +			error = -EHWPOISON;
> > +			goto readpage_error;
> > +		}
> 
> Didn't we need this check before independent of the rest of the patch?

I think this check should come with the rest of this patch. Before this
patchset, no PageHWPoison page stays in the pagecache (memory_failure()
removes it via me_pagecache_clean()), so the above check could never detect
an error-affected address. The dummy hwpoison page introduced by this patch
is what makes it detectable.
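
From userspace, the effect would look something like this (an illustrative
sketch, not code from the patchset; "testfile" is a hypothetical file whose
dirty pagecache took the error at the read offset):

	#include <errno.h>
	#include <fcntl.h>
	#include <stdio.h>
	#include <unistd.h>

	int main(void)
	{
		char buf[4096];
		int fd = open("testfile", O_RDONLY);
		ssize_t n = read(fd, buf, sizeof(buf));

		/* with the dummy hwpoison page in the pagecache, a read()
		 * covering the bad offset fails instead of silently
		 * returning stale or zeroed data */
		if (n < 0 && errno == EHWPOISON)
			fprintf(stderr, "dirty data at this offset was lost\n");
		close(fd);
		return 0;
	}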

> >  		if (PageReadahead(page)) {
> >  			page_cache_async_readahead(mapping,
> >  					ra, filp, page,
> > @@ -2100,6 +2110,10 @@ inline int generic_write_checks(struct file *file, loff_t *pos, size_t *count, i
> >          if (unlikely(*pos < 0))
> >                  return -EINVAL;
> >  
> > +	if (unlikely(mapping_hwpoisoned_range(file->f_mapping, *pos,
> > +					      *pos + *count)))
> > +		return -EHWPOISON;
> 
> How expensive is that check? This will happen on every write.
> Can it be somehow combined with the normal page cache lookup?

OK, so it would be better to put this check just after a_ops->write_begin in
generic_perform_write(). If we find PageHWPoison there, we break out of the
do-while loop; that way the write still completes correctly for the healthy
addresses before the error address.
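
Roughly like this (an untested sketch against the generic_perform_write()
loop in v3.14-rc6; the ->write_end call on the error path is illustrative
cleanup, not taken from the patchset):

	status = a_ops->write_begin(file, mapping, pos, bytes, flags,
				    &page, &fsdata);
	if (unlikely(status))
		break;

	/*
	 * With the page already in hand, the poison check becomes a
	 * plain page-flag test instead of a radix-tree range scan
	 * on every write.
	 */
	if (unlikely(PageHWPoison(page))) {
		/* drop the reference taken by ->write_begin */
		a_ops->write_end(file, mapping, pos, bytes, 0,
				 page, fsdata);
		status = -EHWPOISON;
		break;
	}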

> >   * Dirty pagecache page
> > + *
> > + * Memory error reporting (important especially on dirty pagecache error
> > + * because dirty data is lost) with AS_EIO flag has some problems:
> 
> It doesn't make sense to have changelogs in comments. That is what
> git is for.  At some point noone will care about the previous code.

Right, I'll remove this.

> > + * To solve these, we handle dirty pagecache errors by replacing the error
> 
> This part of the comment is good.
> 
> > +	pgoff_t index;
> > +	struct inode *inode = NULL;
> > +	struct page *new;
> >  
> >  	SetPageError(p);
> > -	/* TBD: print more information about the file. */
> >  	if (mapping) {
> > +		index = page_index(p);
> > +		/*
> > +		 * we take inode refcount to keep it's pagecache or mapping
> > +		 * on the memory until the error is resolved.
> 
> How does that work? Who "resolves" the error? 

This comment should have come with patch 3, which adds the resolver.
# I originally wrote patches 2 and 3 as a single patch and split them
# afterward, and the split was poor. I'll fix this.

> > +		 */
> > +		inode = igrab(mapping->host);
> > +		pr_info("MCE %#lx: memory error on dirty pagecache (page offset:%lu, inode:%lu, dev:%s)\n",
> 
> Add the word file somewhere, you need to explain this in terms normal
> sysadmins and not only kernel hackers can understand.

OK, so "MCE %#lx: memory error on dirty file cache (page offset:%lu, inode:%lu, dev:%s)\n"
looks better to me.

Thank you very much for the close review.
Naoya Horiguchi


^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2014-03-15  6:24 UTC | newest]

Thread overview: 21+ messages
2014-03-13 21:39 [PATCH 0/6] memory error report/recovery for dirty pagecache v3 Naoya Horiguchi
2014-03-13 21:39 ` Naoya Horiguchi
2014-03-13 21:39 ` [PATCH 1/6] radix-tree: add end_index to support ranged iteration Naoya Horiguchi
2014-03-13 21:39   ` Naoya Horiguchi
2014-03-13 21:39 ` [PATCH 2/6] mm/memory-failure.c: report and recovery for memory error on dirty pagecache Naoya Horiguchi
2014-03-13 21:39   ` Naoya Horiguchi
2014-03-15  3:17   ` Andi Kleen
2014-03-15  3:17     ` Andi Kleen
2014-03-15  6:23     ` Naoya Horiguchi
2014-03-13 21:39 ` [PATCH 3/6] mm/memory-failure.c: add code to resolve quasi-hwpoisoned page Naoya Horiguchi
2014-03-13 21:39   ` Naoya Horiguchi
2014-03-13 21:39 ` [PATCH 4/6] fs/proc/page.c: introduce /proc/kpagecache interface Naoya Horiguchi
2014-03-13 21:39   ` Naoya Horiguchi
2014-03-13 23:09   ` Luck, Tony
2014-03-13 23:09     ` Luck, Tony
2014-03-13 23:44     ` Naoya Horiguchi
2014-03-13 21:39 ` [PATCH 5/6] tools/vm/page-types.c: add file scanning mode Naoya Horiguchi
2014-03-13 21:39   ` Naoya Horiguchi
2014-03-13 21:39 ` [PATCH 6/6] Documentation: update Documentation/vm/pagemap.txt Naoya Horiguchi
2014-03-13 21:39   ` Naoya Horiguchi
2014-03-13 22:28 ` [PATCH 3/6] mm/memory-failure.c: add code to resolve quasi-hwpoisoned page Naoya Horiguchi
