* [PATCH 0/3 v2] HWPOISON: improve dirty pagecache error reporting @ 2012-08-22 15:17 Naoya Horiguchi 2012-08-22 15:17 ` [PATCH 1/3] HWPOISON: fix action_result() to print out dirty/clean Naoya Horiguchi ` (3 more replies) 0 siblings, 4 replies; 22+ messages in thread From: Naoya Horiguchi @ 2012-08-22 15:17 UTC (permalink / raw) To: Andi Kleen, Wu Fengguang, Andrew Morton Cc: Tony Luck, Rik van Riel, Jun'ichi Nomura, linux-mm, linux-kernel Hi, Based on the previous discussion, in this version I propose only error reporting fix ("overwrite recovery" is sparated out from this series.) I think Fengguang's patch (patch 2 in this series) has a corner case about inode cache drop, so I added patch 3 for it. Shortlog and diffstat Naoya Horiguchi (2): HWPOISON: fix action_result() to print out dirty/clean HWPOISON: prevent inode cache removal to keep AS_HWPOISON sticky Wu Fengguang (1): HWPOISON: report sticky EIO for poisoned file fs/inode.c | 12 ++++++++++++ include/linux/pagemap.h | 24 ++++++++++++++++++++++++ mm/filemap.c | 11 +++++++++++ mm/memory-failure.c | 24 ++++++++++-------------- mm/truncate.c | 2 ++ 5 files changed, 59 insertions(+), 14 deletions(-) Thanks, Naoya ^ permalink raw reply [flat|nested] 22+ messages in thread
* [PATCH 1/3] HWPOISON: fix action_result() to print out dirty/clean 2012-08-22 15:17 [PATCH 0/3 v2] HWPOISON: improve dirty pagecache error reporting Naoya Horiguchi @ 2012-08-22 15:17 ` Naoya Horiguchi 2012-08-23 9:33 ` Fengguang Wu 2012-08-22 15:17 ` [PATCH 2/3] HWPOISON: report sticky EIO for poisoned file Naoya Horiguchi ` (2 subsequent siblings) 3 siblings, 1 reply; 22+ messages in thread From: Naoya Horiguchi @ 2012-08-22 15:17 UTC (permalink / raw) To: Andi Kleen, Wu Fengguang, Andrew Morton Cc: Tony Luck, Rik van Riel, Jun'ichi Nomura, linux-mm, linux-kernel action_result() fails to print out "dirty" even if an error occurred on a dirty pagecache, because when we check PageDirty in action_result() it was cleared after page isolation even if it's dirty before error handling. This can break some applications that monitor this message, so should be fixed. There are several callers of action_result() except page_action(), but either of them are not for LRU pages but for free pages or kernel pages, so we don't have to consider dirty or not for them. Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Reviewed-by: Andi Kleen <ak@linux.intel.com> --- mm/memory-failure.c | 22 +++++++++------------- 1 file changed, 9 insertions(+), 13 deletions(-) diff --git v3.6-rc1.orig/mm/memory-failure.c v3.6-rc1/mm/memory-failure.c index a6e2141..79dfb2f 100644 --- v3.6-rc1.orig/mm/memory-failure.c +++ v3.6-rc1/mm/memory-failure.c @@ -779,16 +779,16 @@ static struct page_state { { compound, compound, "huge", me_huge_page }, #endif - { sc|dirty, sc|dirty, "swapcache", me_swapcache_dirty }, - { sc|dirty, sc, "swapcache", me_swapcache_clean }, + { sc|dirty, sc|dirty, "dirty swapcache", me_swapcache_dirty }, + { sc|dirty, sc, "clean swapcache", me_swapcache_clean }, - { unevict|dirty, unevict|dirty, "unevictable LRU", me_pagecache_dirty}, - { unevict, unevict, "unevictable LRU", me_pagecache_clean}, + { unevict|dirty, unevict|dirty, "dirty unevictable LRU", me_pagecache_dirty }, + { unevict, unevict, "clean unevictable LRU", me_pagecache_clean }, - { mlock|dirty, mlock|dirty, "mlocked LRU", me_pagecache_dirty }, - { mlock, mlock, "mlocked LRU", me_pagecache_clean }, + { mlock|dirty, mlock|dirty, "dirty mlocked LRU", me_pagecache_dirty }, + { mlock, mlock, "clean mlocked LRU", me_pagecache_clean }, - { lru|dirty, lru|dirty, "LRU", me_pagecache_dirty }, + { lru|dirty, lru|dirty, "dirty LRU", me_pagecache_dirty }, { lru|dirty, lru, "clean LRU", me_pagecache_clean }, /* @@ -812,12 +812,8 @@ static struct page_state { static void action_result(unsigned long pfn, char *msg, int result) { - struct page *page = pfn_to_page(pfn); - - printk(KERN_ERR "MCE %#lx: %s%s page recovery: %s\n", - pfn, - PageDirty(page) ? "dirty " : "", - msg, action_name[result]); + pr_err("MCE %#lx: %s page recovery: %s\n", + pfn, msg, action_name[result]); } static int page_action(struct page_state *ps, struct page *p, -- 1.7.11.4 ^ permalink raw reply related [flat|nested] 22+ messages in thread
* Re: [PATCH 1/3] HWPOISON: fix action_result() to print out dirty/clean 2012-08-22 15:17 ` [PATCH 1/3] HWPOISON: fix action_result() to print out dirty/clean Naoya Horiguchi @ 2012-08-23 9:33 ` Fengguang Wu 2012-08-23 20:31 ` Naoya Horiguchi 0 siblings, 1 reply; 22+ messages in thread From: Fengguang Wu @ 2012-08-23 9:33 UTC (permalink / raw) To: Naoya Horiguchi Cc: Andi Kleen, Andrew Morton, Tony Luck, Rik van Riel, Jun'ichi Nomura, linux-mm, linux-kernel On Wed, Aug 22, 2012 at 11:17:33AM -0400, Naoya Horiguchi wrote: > action_result() fails to print out "dirty" even if an error occurred on a > dirty pagecache, because when we check PageDirty in action_result() it was > cleared after page isolation even if it's dirty before error handling. This > can break some applications that monitor this message, so should be fixed. > > There are several callers of action_result() except page_action(), but > either of them are not for LRU pages but for free pages or kernel pages, > so we don't have to consider dirty or not for them. > > Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> > Reviewed-by: Andi Kleen <ak@linux.intel.com> > --- > mm/memory-failure.c | 22 +++++++++------------- > 1 file changed, 9 insertions(+), 13 deletions(-) > > diff --git v3.6-rc1.orig/mm/memory-failure.c v3.6-rc1/mm/memory-failure.c > index a6e2141..79dfb2f 100644 > --- v3.6-rc1.orig/mm/memory-failure.c > +++ v3.6-rc1/mm/memory-failure.c > @@ -779,16 +779,16 @@ static struct page_state { > { compound, compound, "huge", me_huge_page }, > #endif > > - { sc|dirty, sc|dirty, "swapcache", me_swapcache_dirty }, > - { sc|dirty, sc, "swapcache", me_swapcache_clean }, > + { sc|dirty, sc|dirty, "dirty swapcache", me_swapcache_dirty }, > + { sc|dirty, sc, "clean swapcache", me_swapcache_clean }, > > - { unevict|dirty, unevict|dirty, "unevictable LRU", me_pagecache_dirty}, > - { unevict, unevict, "unevictable LRU", me_pagecache_clean}, > + { unevict|dirty, unevict|dirty, "dirty unevictable LRU", me_pagecache_dirty }, > + { unevict, unevict, "clean unevictable LRU", me_pagecache_clean }, > > - { mlock|dirty, mlock|dirty, "mlocked LRU", me_pagecache_dirty }, > - { mlock, mlock, "mlocked LRU", me_pagecache_clean }, > + { mlock|dirty, mlock|dirty, "dirty mlocked LRU", me_pagecache_dirty }, > + { mlock, mlock, "clean mlocked LRU", me_pagecache_clean }, > > - { lru|dirty, lru|dirty, "LRU", me_pagecache_dirty }, > + { lru|dirty, lru|dirty, "dirty LRU", me_pagecache_dirty }, > { lru|dirty, lru, "clean LRU", me_pagecache_clean }, According to the set_page_dirty() comment, the dirty bit might be set outside the page lock (however I don't know any concrete examples). That means the word "clean" is not 100% right. That's probably why we only report "dirty LRU" and didn't say "clean LRU". Thanks, Fengguang > /* > @@ -812,12 +812,8 @@ static struct page_state { > > static void action_result(unsigned long pfn, char *msg, int result) > { > - struct page *page = pfn_to_page(pfn); > - > - printk(KERN_ERR "MCE %#lx: %s%s page recovery: %s\n", > - pfn, > - PageDirty(page) ? "dirty " : "", > - msg, action_name[result]); > + pr_err("MCE %#lx: %s page recovery: %s\n", > + pfn, msg, action_name[result]); > } > > static int page_action(struct page_state *ps, struct page *p, > -- > 1.7.11.4 ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 1/3] HWPOISON: fix action_result() to print out dirty/clean 2012-08-23 9:33 ` Fengguang Wu @ 2012-08-23 20:31 ` Naoya Horiguchi 0 siblings, 0 replies; 22+ messages in thread From: Naoya Horiguchi @ 2012-08-23 20:31 UTC (permalink / raw) To: Wu Fengguang Cc: Naoya Horiguchi, Andi Kleen, Andrew Morton, Tony Luck, Rik van Riel, Jun'ichi Nomura, linux-mm, linux-kernel Hello, Thank you for your review. On Thu, Aug 23, 2012 at 05:33:30PM +0800, Fengguang Wu wrote: > On Wed, Aug 22, 2012 at 11:17:33AM -0400, Naoya Horiguchi wrote: > > action_result() fails to print out "dirty" even if an error occurred on a > > dirty pagecache, because when we check PageDirty in action_result() it was > > cleared after page isolation even if it's dirty before error handling. This > > can break some applications that monitor this message, so should be fixed. > > > > There are several callers of action_result() except page_action(), but > > either of them are not for LRU pages but for free pages or kernel pages, > > so we don't have to consider dirty or not for them. > > > > Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> > > Reviewed-by: Andi Kleen <ak@linux.intel.com> > > --- > > mm/memory-failure.c | 22 +++++++++------------- > > 1 file changed, 9 insertions(+), 13 deletions(-) > > > > diff --git v3.6-rc1.orig/mm/memory-failure.c v3.6-rc1/mm/memory-failure.c > > index a6e2141..79dfb2f 100644 > > --- v3.6-rc1.orig/mm/memory-failure.c > > +++ v3.6-rc1/mm/memory-failure.c > > @@ -779,16 +779,16 @@ static struct page_state { > > { compound, compound, "huge", me_huge_page }, > > #endif > > > > - { sc|dirty, sc|dirty, "swapcache", me_swapcache_dirty }, > > - { sc|dirty, sc, "swapcache", me_swapcache_clean }, > > + { sc|dirty, sc|dirty, "dirty swapcache", me_swapcache_dirty }, > > + { sc|dirty, sc, "clean swapcache", me_swapcache_clean }, > > > > - { unevict|dirty, unevict|dirty, "unevictable LRU", me_pagecache_dirty}, > > - { unevict, unevict, "unevictable LRU", me_pagecache_clean}, > > + { unevict|dirty, unevict|dirty, "dirty unevictable LRU", me_pagecache_dirty }, > > + { unevict, unevict, "clean unevictable LRU", me_pagecache_clean }, > > > > - { mlock|dirty, mlock|dirty, "mlocked LRU", me_pagecache_dirty }, > > - { mlock, mlock, "mlocked LRU", me_pagecache_clean }, > > + { mlock|dirty, mlock|dirty, "dirty mlocked LRU", me_pagecache_dirty }, > > + { mlock, mlock, "clean mlocked LRU", me_pagecache_clean }, > > > > - { lru|dirty, lru|dirty, "LRU", me_pagecache_dirty }, > > + { lru|dirty, lru|dirty, "dirty LRU", me_pagecache_dirty }, > > { lru|dirty, lru, "clean LRU", me_pagecache_clean }, > > According to the set_page_dirty() comment, the dirty bit might be set > outside the page lock (however I don't know any concrete examples). > That means the word "clean" is not 100% right. That's probably why we > only report "dirty LRU" and didn't say "clean LRU". So this doesn't seem to be just a messaging problem. If PageDirty is set outside page lock, we can handle the dirty page only with me_pagecache_clean(), without me_pagecache_dirty(). It might be a good idea to add some check code to detect such kind of race and give up error isolation if it does. I'll dig into who sets dirty flags outside/inside page locks, and look for a workaround. (But it will be in another patch...) Thanks, Naoya ^ permalink raw reply [flat|nested] 22+ messages in thread
* [PATCH 2/3] HWPOISON: report sticky EIO for poisoned file 2012-08-22 15:17 [PATCH 0/3 v2] HWPOISON: improve dirty pagecache error reporting Naoya Horiguchi 2012-08-22 15:17 ` [PATCH 1/3] HWPOISON: fix action_result() to print out dirty/clean Naoya Horiguchi @ 2012-08-22 15:17 ` Naoya Horiguchi 2012-08-23 9:22 ` Fengguang Wu 2012-08-22 15:17 ` [PATCH 3/3] HWPOISON: prevent inode cache removal to keep AS_HWPOISON sticky Naoya Horiguchi 2012-08-22 20:22 ` [PATCH 0/3 v2] HWPOISON: improve dirty pagecache error reporting Andi Kleen 3 siblings, 1 reply; 22+ messages in thread From: Naoya Horiguchi @ 2012-08-22 15:17 UTC (permalink / raw) To: Andi Kleen, Wu Fengguang, Andrew Morton Cc: Tony Luck, Rik van Riel, Jun'ichi Nomura, linux-mm, linux-kernel From: Wu Fengguang <fengguang.wu@intel.com> This makes the EIO reports on write(), fsync(), or the NFS close() sticky enough. The only way to get rid of it may be echo 3 > /proc/sys/vm/drop_caches Note that the impacted process will only be killed if it mapped the page. XXX via read()/write()/fsync() instead of memory mapped reads/writes, simply because it's very hard to find them. Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> --- include/linux/pagemap.h | 13 +++++++++++++ mm/filemap.c | 11 +++++++++++ mm/memory-failure.c | 2 +- 3 files changed, 25 insertions(+), 1 deletion(-) diff --git v3.6-rc1.orig/include/linux/pagemap.h v3.6-rc1/include/linux/pagemap.h index e42c762..4d8d821 100644 --- v3.6-rc1.orig/include/linux/pagemap.h +++ v3.6-rc1/include/linux/pagemap.h @@ -24,6 +24,7 @@ enum mapping_flags { AS_ENOSPC = __GFP_BITS_SHIFT + 1, /* ENOSPC on async write */ AS_MM_ALL_LOCKS = __GFP_BITS_SHIFT + 2, /* under mm_take_all_locks() */ AS_UNEVICTABLE = __GFP_BITS_SHIFT + 3, /* e.g., ramdisk, SHM_LOCK */ + AS_HWPOISON = __GFP_BITS_SHIFT + 4, /* hardware memory corruption */ }; static inline void mapping_set_error(struct address_space *mapping, int error) @@ -53,6 +54,18 @@ static inline int mapping_unevictable(struct address_space *mapping) return !!mapping; } +#ifdef CONFIG_MEMORY_FAILURE +static inline int mapping_hwpoison(struct address_space *mapping) +{ + return test_bit(AS_HWPOISON, &mapping->flags); +} +#else +static inline int mapping_hwpoison(struct address_space *mapping) +{ + return 0; +} +#endif + static inline gfp_t mapping_gfp_mask(struct address_space * mapping) { return (__force gfp_t)mapping->flags & __GFP_BITS_MASK; diff --git v3.6-rc1.orig/mm/filemap.c v3.6-rc1/mm/filemap.c index fa5ca30..8bdaf57 100644 --- v3.6-rc1.orig/mm/filemap.c +++ v3.6-rc1/mm/filemap.c @@ -297,6 +297,8 @@ int filemap_fdatawait_range(struct address_space *mapping, loff_t start_byte, ret = -ENOSPC; if (test_and_clear_bit(AS_EIO, &mapping->flags)) ret = -EIO; + if (mapping_hwpoison(mapping)) + ret = -EIO; return ret; } @@ -447,6 +449,15 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping, VM_BUG_ON(!PageLocked(page)); VM_BUG_ON(PageSwapBacked(page)); + /* + * Hardware corrupted page will be removed from mapping, + * so we want to deny (possibly) reloading the old data. + */ + if (unlikely(mapping_hwpoison(mapping))) { + error = -EIO; + goto out; + } + error = mem_cgroup_cache_charge(page, current->mm, gfp_mask & GFP_RECLAIM_MASK); if (error) diff --git v3.6-rc1.orig/mm/memory-failure.c v3.6-rc1/mm/memory-failure.c index 79dfb2f..a1e7e00 100644 --- v3.6-rc1.orig/mm/memory-failure.c +++ v3.6-rc1/mm/memory-failure.c @@ -652,7 +652,7 @@ static int me_pagecache_dirty(struct page *p, unsigned long pfn) * the first EIO, but we're not worse than other parts * of the kernel. */ - mapping_set_error(mapping, EIO); + set_bit(AS_HWPOISON, &mapping->flags); } return me_pagecache_clean(p, pfn); -- 1.7.11.4 ^ permalink raw reply related [flat|nested] 22+ messages in thread
* Re: [PATCH 2/3] HWPOISON: report sticky EIO for poisoned file 2012-08-22 15:17 ` [PATCH 2/3] HWPOISON: report sticky EIO for poisoned file Naoya Horiguchi @ 2012-08-23 9:22 ` Fengguang Wu 2012-08-23 20:31 ` Naoya Horiguchi 0 siblings, 1 reply; 22+ messages in thread From: Fengguang Wu @ 2012-08-23 9:22 UTC (permalink / raw) To: Naoya Horiguchi Cc: Andi Kleen, Andrew Morton, Tony Luck, Rik van Riel, Jun'ichi Nomura, linux-mm, linux-kernel On Wed, Aug 22, 2012 at 11:17:34AM -0400, Naoya Horiguchi wrote: > From: Wu Fengguang <fengguang.wu@intel.com> > > This makes the EIO reports on write(), fsync(), or the NFS close() > sticky enough. The only way to get rid of it may be > > echo 3 > /proc/sys/vm/drop_caches That's no longer valid with your next patch. If I understand it right, the EIO will only go away after truncate. So it may also need to update comments in memory-failure.c and Documentation/vm/hwpoison.txt > Note that the impacted process will only be killed if it mapped the page. > XXX > via read()/write()/fsync() instead of memory mapped reads/writes, simply > because it's very hard to find them. Please remove the above scratched texts. > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> > --- > include/linux/pagemap.h | 13 +++++++++++++ > mm/filemap.c | 11 +++++++++++ > mm/memory-failure.c | 2 +- > 3 files changed, 25 insertions(+), 1 deletion(-) > > diff --git v3.6-rc1.orig/include/linux/pagemap.h v3.6-rc1/include/linux/pagemap.h > index e42c762..4d8d821 100644 > --- v3.6-rc1.orig/include/linux/pagemap.h > +++ v3.6-rc1/include/linux/pagemap.h > @@ -24,6 +24,7 @@ enum mapping_flags { > AS_ENOSPC = __GFP_BITS_SHIFT + 1, /* ENOSPC on async write */ > AS_MM_ALL_LOCKS = __GFP_BITS_SHIFT + 2, /* under mm_take_all_locks() */ > AS_UNEVICTABLE = __GFP_BITS_SHIFT + 3, /* e.g., ramdisk, SHM_LOCK */ > + AS_HWPOISON = __GFP_BITS_SHIFT + 4, /* hardware memory corruption */ > }; > > static inline void mapping_set_error(struct address_space *mapping, int error) > @@ -53,6 +54,18 @@ static inline int mapping_unevictable(struct address_space *mapping) > return !!mapping; > } > > +#ifdef CONFIG_MEMORY_FAILURE > +static inline int mapping_hwpoison(struct address_space *mapping) > +{ > + return test_bit(AS_HWPOISON, &mapping->flags); > +} > +#else > +static inline int mapping_hwpoison(struct address_space *mapping) > +{ > + return 0; > +} > +#endif > + > static inline gfp_t mapping_gfp_mask(struct address_space * mapping) > { > return (__force gfp_t)mapping->flags & __GFP_BITS_MASK; > diff --git v3.6-rc1.orig/mm/filemap.c v3.6-rc1/mm/filemap.c > index fa5ca30..8bdaf57 100644 > --- v3.6-rc1.orig/mm/filemap.c > +++ v3.6-rc1/mm/filemap.c > @@ -297,6 +297,8 @@ int filemap_fdatawait_range(struct address_space *mapping, loff_t start_byte, > ret = -ENOSPC; > if (test_and_clear_bit(AS_EIO, &mapping->flags)) > ret = -EIO; > + if (mapping_hwpoison(mapping)) > + ret = -EIO; > > return ret; > } > @@ -447,6 +449,15 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping, > VM_BUG_ON(!PageLocked(page)); > VM_BUG_ON(PageSwapBacked(page)); > > + /* > + * Hardware corrupted page will be removed from mapping, > + * so we want to deny (possibly) reloading the old data. > + */ > + if (unlikely(mapping_hwpoison(mapping))) { > + error = -EIO; > + goto out; > + } > + > error = mem_cgroup_cache_charge(page, current->mm, > gfp_mask & GFP_RECLAIM_MASK); > if (error) > diff --git v3.6-rc1.orig/mm/memory-failure.c v3.6-rc1/mm/memory-failure.c > index 79dfb2f..a1e7e00 100644 > --- v3.6-rc1.orig/mm/memory-failure.c > +++ v3.6-rc1/mm/memory-failure.c > @@ -652,7 +652,7 @@ static int me_pagecache_dirty(struct page *p, unsigned long pfn) > * the first EIO, but we're not worse than other parts > * of the kernel. > */ > - mapping_set_error(mapping, EIO); > + set_bit(AS_HWPOISON, &mapping->flags); > } > > return me_pagecache_clean(p, pfn); > -- > 1.7.11.4 ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 2/3] HWPOISON: report sticky EIO for poisoned file 2012-08-23 9:22 ` Fengguang Wu @ 2012-08-23 20:31 ` Naoya Horiguchi 0 siblings, 0 replies; 22+ messages in thread From: Naoya Horiguchi @ 2012-08-23 20:31 UTC (permalink / raw) To: Wu Fengguang Cc: Naoya Horiguchi, Andi Kleen, Andrew Morton, Tony Luck, Rik van Riel, Jun'ichi Nomura, linux-mm, linux-kernel On Thu, Aug 23, 2012 at 05:22:11PM +0800, Fengguang Wu wrote: > On Wed, Aug 22, 2012 at 11:17:34AM -0400, Naoya Horiguchi wrote: > > From: Wu Fengguang <fengguang.wu@intel.com> > > > > This makes the EIO reports on write(), fsync(), or the NFS close() > > sticky enough. The only way to get rid of it may be > > > > echo 3 > /proc/sys/vm/drop_caches > > That's no longer valid with your next patch. If I understand it right, > the EIO will only go away after truncate. yes, that's right. > So it may also need to > update comments in memory-failure.c and Documentation/vm/hwpoison.txt OK. I'll add it in the next version. > > Note that the impacted process will only be killed if it mapped the page. > > XXX > > via read()/write()/fsync() instead of memory mapped reads/writes, simply > > because it's very hard to find them. > > Please remove the above scratched texts. Yes. Thanks, Naoya > > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> > > --- > > include/linux/pagemap.h | 13 +++++++++++++ > > mm/filemap.c | 11 +++++++++++ > > mm/memory-failure.c | 2 +- > > 3 files changed, 25 insertions(+), 1 deletion(-) > > > > diff --git v3.6-rc1.orig/include/linux/pagemap.h v3.6-rc1/include/linux/pagemap.h > > index e42c762..4d8d821 100644 > > --- v3.6-rc1.orig/include/linux/pagemap.h > > +++ v3.6-rc1/include/linux/pagemap.h > > @@ -24,6 +24,7 @@ enum mapping_flags { > > AS_ENOSPC = __GFP_BITS_SHIFT + 1, /* ENOSPC on async write */ > > AS_MM_ALL_LOCKS = __GFP_BITS_SHIFT + 2, /* under mm_take_all_locks() */ > > AS_UNEVICTABLE = __GFP_BITS_SHIFT + 3, /* e.g., ramdisk, SHM_LOCK */ > > + AS_HWPOISON = __GFP_BITS_SHIFT + 4, /* hardware memory corruption */ > > }; > > > > static inline void mapping_set_error(struct address_space *mapping, int error) > > @@ -53,6 +54,18 @@ static inline int mapping_unevictable(struct address_space *mapping) > > return !!mapping; > > } > > > > +#ifdef CONFIG_MEMORY_FAILURE > > +static inline int mapping_hwpoison(struct address_space *mapping) > > +{ > > + return test_bit(AS_HWPOISON, &mapping->flags); > > +} > > +#else > > +static inline int mapping_hwpoison(struct address_space *mapping) > > +{ > > + return 0; > > +} > > +#endif > > + > > static inline gfp_t mapping_gfp_mask(struct address_space * mapping) > > { > > return (__force gfp_t)mapping->flags & __GFP_BITS_MASK; > > diff --git v3.6-rc1.orig/mm/filemap.c v3.6-rc1/mm/filemap.c > > index fa5ca30..8bdaf57 100644 > > --- v3.6-rc1.orig/mm/filemap.c > > +++ v3.6-rc1/mm/filemap.c > > @@ -297,6 +297,8 @@ int filemap_fdatawait_range(struct address_space *mapping, loff_t start_byte, > > ret = -ENOSPC; > > if (test_and_clear_bit(AS_EIO, &mapping->flags)) > > ret = -EIO; > > + if (mapping_hwpoison(mapping)) > > + ret = -EIO; > > > > return ret; > > } > > @@ -447,6 +449,15 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping, > > VM_BUG_ON(!PageLocked(page)); > > VM_BUG_ON(PageSwapBacked(page)); > > > > + /* > > + * Hardware corrupted page will be removed from mapping, > > + * so we want to deny (possibly) reloading the old data. > > + */ > > + if (unlikely(mapping_hwpoison(mapping))) { > > + error = -EIO; > > + goto out; > > + } > > + > > error = mem_cgroup_cache_charge(page, current->mm, > > gfp_mask & GFP_RECLAIM_MASK); > > if (error) > > diff --git v3.6-rc1.orig/mm/memory-failure.c v3.6-rc1/mm/memory-failure.c > > index 79dfb2f..a1e7e00 100644 > > --- v3.6-rc1.orig/mm/memory-failure.c > > +++ v3.6-rc1/mm/memory-failure.c > > @@ -652,7 +652,7 @@ static int me_pagecache_dirty(struct page *p, unsigned long pfn) > > * the first EIO, but we're not worse than other parts > > * of the kernel. > > */ > > - mapping_set_error(mapping, EIO); > > + set_bit(AS_HWPOISON, &mapping->flags); > > } > > > > return me_pagecache_clean(p, pfn); > > -- > > 1.7.11.4 ^ permalink raw reply [flat|nested] 22+ messages in thread
* [PATCH 3/3] HWPOISON: prevent inode cache removal to keep AS_HWPOISON sticky 2012-08-22 15:17 [PATCH 0/3 v2] HWPOISON: improve dirty pagecache error reporting Naoya Horiguchi 2012-08-22 15:17 ` [PATCH 1/3] HWPOISON: fix action_result() to print out dirty/clean Naoya Horiguchi 2012-08-22 15:17 ` [PATCH 2/3] HWPOISON: report sticky EIO for poisoned file Naoya Horiguchi @ 2012-08-22 15:17 ` Naoya Horiguchi 2012-08-23 9:11 ` Fengguang Wu 2012-08-24 1:31 ` Dave Chinner 2012-08-22 20:22 ` [PATCH 0/3 v2] HWPOISON: improve dirty pagecache error reporting Andi Kleen 3 siblings, 2 replies; 22+ messages in thread From: Naoya Horiguchi @ 2012-08-22 15:17 UTC (permalink / raw) To: Andi Kleen, Wu Fengguang, Andrew Morton Cc: Tony Luck, Rik van Riel, Jun'ichi Nomura, linux-mm, linux-kernel "HWPOISON: report sticky EIO for poisoned file" still has a corner case where we have possibilities of data lost. This is because in this fix AS_HWPOISON is cleared when the inode cache is dropped. For example, consider an application in which a process periodically (every 10 minutes) writes some logs on a file (and closes it after each writes,) and at the end of each day some batch programs run using the log file. If a memory error hits on dirty pagecache of this log file just after periodic write/close and the inode cache is cleared before the next write, then this application is not aware of the error and the batch programs will work wrongly. To avoid this, this patch makes us pin the hwpoisoned inode on memory until we remove or completely truncate the hwpoisoned file. Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> --- fs/inode.c | 12 ++++++++++++ include/linux/pagemap.h | 11 +++++++++++ mm/memory-failure.c | 2 +- mm/truncate.c | 2 ++ 4 files changed, 26 insertions(+), 1 deletion(-) diff --git v3.6-rc1.orig/fs/inode.c v3.6-rc1/fs/inode.c index ac8d904..8742397 100644 --- v3.6-rc1.orig/fs/inode.c +++ v3.6-rc1/fs/inode.c @@ -717,6 +717,15 @@ void prune_icache_sb(struct super_block *sb, int nr_to_scan) } /* + * Keep inode caches on memory for user processes to certainly + * be aware of memory errors. + */ + if (unlikely(mapping_hwpoison(inode->i_mapping))) { + spin_unlock(&inode->i_lock); + continue; + } + + /* * Referenced or dirty inodes are still in use. Give them * another pass through the LRU as we canot reclaim them now. */ @@ -1405,6 +1414,9 @@ static void iput_final(struct inode *inode) inode->i_state &= ~I_WILL_FREE; } + if (unlikely(mapping_hwpoison(inode->i_mapping) && drop)) + mapping_clear_hwpoison(inode->i_mapping); + inode->i_state |= I_FREEING; if (!list_empty(&inode->i_lru)) inode_lru_list_del(inode); diff --git v3.6-rc1.orig/include/linux/pagemap.h v3.6-rc1/include/linux/pagemap.h index 4d8d821..9fce4e4 100644 --- v3.6-rc1.orig/include/linux/pagemap.h +++ v3.6-rc1/include/linux/pagemap.h @@ -59,11 +59,22 @@ static inline int mapping_hwpoison(struct address_space *mapping) { return test_bit(AS_HWPOISON, &mapping->flags); } +static inline void mapping_set_hwpoison(struct address_space *mapping) +{ + set_bit(AS_HWPOISON, &mapping->flags); +} +static inline void mapping_clear_hwpoison(struct address_space *mapping) +{ + clear_bit(AS_HWPOISON, &mapping->flags); +} #else static inline int mapping_hwpoison(struct address_space *mapping) { return 0; } +static inline void mapping_clear_hwpoison(struct address_space *mapping) +{ +} #endif static inline gfp_t mapping_gfp_mask(struct address_space * mapping) diff --git v3.6-rc1.orig/mm/memory-failure.c v3.6-rc1/mm/memory-failure.c index a1e7e00..ca064c6 100644 --- v3.6-rc1.orig/mm/memory-failure.c +++ v3.6-rc1/mm/memory-failure.c @@ -652,7 +652,7 @@ static int me_pagecache_dirty(struct page *p, unsigned long pfn) * the first EIO, but we're not worse than other parts * of the kernel. */ - set_bit(AS_HWPOISON, &mapping->flags); + mapping_set_hwpoison(mapping); } return me_pagecache_clean(p, pfn); diff --git v3.6-rc1.orig/mm/truncate.c v3.6-rc1/mm/truncate.c index 75801ac..82a994f 100644 --- v3.6-rc1.orig/mm/truncate.c +++ v3.6-rc1/mm/truncate.c @@ -574,6 +574,8 @@ void truncate_setsize(struct inode *inode, loff_t newsize) oldsize = inode->i_size; i_size_write(inode, newsize); + if (unlikely(mapping_hwpoison(inode->i_mapping) && !newsize)) + mapping_clear_hwpoison(inode->i_mapping); truncate_pagecache(inode, oldsize, newsize); } -- 1.7.11.4 ^ permalink raw reply related [flat|nested] 22+ messages in thread
* Re: [PATCH 3/3] HWPOISON: prevent inode cache removal to keep AS_HWPOISON sticky 2012-08-22 15:17 ` [PATCH 3/3] HWPOISON: prevent inode cache removal to keep AS_HWPOISON sticky Naoya Horiguchi @ 2012-08-23 9:11 ` Fengguang Wu 2012-08-23 20:31 ` Naoya Horiguchi 2012-08-24 1:31 ` Dave Chinner 1 sibling, 1 reply; 22+ messages in thread From: Fengguang Wu @ 2012-08-23 9:11 UTC (permalink / raw) To: Naoya Horiguchi Cc: Andi Kleen, Andrew Morton, Tony Luck, Rik van Riel, Jun'ichi Nomura, linux-mm, linux-kernel On Wed, Aug 22, 2012 at 11:17:35AM -0400, Naoya Horiguchi wrote: > "HWPOISON: report sticky EIO for poisoned file" still has a corner case > where we have possibilities of data lost. This is because in this fix > AS_HWPOISON is cleared when the inode cache is dropped. > > For example, consider an application in which a process periodically > (every 10 minutes) writes some logs on a file (and closes it after > each writes,) and at the end of each day some batch programs run using > the log file. If a memory error hits on dirty pagecache of this log file > just after periodic write/close and the inode cache is cleared before the > next write, then this application is not aware of the error and the batch > programs will work wrongly. > > To avoid this, this patch makes us pin the hwpoisoned inode on memory > until we remove or completely truncate the hwpoisoned file. Good point! > Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> > --- > fs/inode.c | 12 ++++++++++++ > include/linux/pagemap.h | 11 +++++++++++ > mm/memory-failure.c | 2 +- > mm/truncate.c | 2 ++ > 4 files changed, 26 insertions(+), 1 deletion(-) > > diff --git v3.6-rc1.orig/fs/inode.c v3.6-rc1/fs/inode.c > index ac8d904..8742397 100644 > --- v3.6-rc1.orig/fs/inode.c > +++ v3.6-rc1/fs/inode.c > @@ -717,6 +717,15 @@ void prune_icache_sb(struct super_block *sb, int nr_to_scan) > } > > /* > + * Keep inode caches on memory for user processes to certainly > + * be aware of memory errors. > + */ > + if (unlikely(mapping_hwpoison(inode->i_mapping))) { > + spin_unlock(&inode->i_lock); > + continue; > + } That chunk prevents reclaiming all the cached pages. However the intention is only to keep the struct inode together with the hwpoison bit? > + /* > * Referenced or dirty inodes are still in use. Give them > * another pass through the LRU as we canot reclaim them now. > */ > @@ -1405,6 +1414,9 @@ static void iput_final(struct inode *inode) > inode->i_state &= ~I_WILL_FREE; > } > > + if (unlikely(mapping_hwpoison(inode->i_mapping) && drop)) > + mapping_clear_hwpoison(inode->i_mapping); Is that clear necessary? Because the bit will be gone with the inode struct: it's going to be de-allocated anyway. > inode->i_state |= I_FREEING; > if (!list_empty(&inode->i_lru)) > inode_lru_list_del(inode); > diff --git v3.6-rc1.orig/include/linux/pagemap.h v3.6-rc1/include/linux/pagemap.h > index 4d8d821..9fce4e4 100644 > --- v3.6-rc1.orig/include/linux/pagemap.h > +++ v3.6-rc1/include/linux/pagemap.h > @@ -59,11 +59,22 @@ static inline int mapping_hwpoison(struct address_space *mapping) > { > return test_bit(AS_HWPOISON, &mapping->flags); > } > +static inline void mapping_set_hwpoison(struct address_space *mapping) > +{ > + set_bit(AS_HWPOISON, &mapping->flags); > +} > +static inline void mapping_clear_hwpoison(struct address_space *mapping) > +{ > + clear_bit(AS_HWPOISON, &mapping->flags); > +} > #else > static inline int mapping_hwpoison(struct address_space *mapping) > { > return 0; > } > +static inline void mapping_clear_hwpoison(struct address_space *mapping) > +{ > +} > #endif > > static inline gfp_t mapping_gfp_mask(struct address_space * mapping) > diff --git v3.6-rc1.orig/mm/memory-failure.c v3.6-rc1/mm/memory-failure.c > index a1e7e00..ca064c6 100644 > --- v3.6-rc1.orig/mm/memory-failure.c > +++ v3.6-rc1/mm/memory-failure.c > @@ -652,7 +652,7 @@ static int me_pagecache_dirty(struct page *p, unsigned long pfn) > * the first EIO, but we're not worse than other parts > * of the kernel. > */ > - set_bit(AS_HWPOISON, &mapping->flags); > + mapping_set_hwpoison(mapping); > } > > return me_pagecache_clean(p, pfn); > diff --git v3.6-rc1.orig/mm/truncate.c v3.6-rc1/mm/truncate.c > index 75801ac..82a994f 100644 > --- v3.6-rc1.orig/mm/truncate.c > +++ v3.6-rc1/mm/truncate.c > @@ -574,6 +574,8 @@ void truncate_setsize(struct inode *inode, loff_t newsize) > > oldsize = inode->i_size; > i_size_write(inode, newsize); > + if (unlikely(mapping_hwpoison(inode->i_mapping) && !newsize)) It might be a bit better to test !newsize first. > + mapping_clear_hwpoison(inode->i_mapping); > > truncate_pagecache(inode, oldsize, newsize); > } > -- > 1.7.11.4 ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 3/3] HWPOISON: prevent inode cache removal to keep AS_HWPOISON sticky 2012-08-23 9:11 ` Fengguang Wu @ 2012-08-23 20:31 ` Naoya Horiguchi 2012-08-24 21:52 ` Naoya Horiguchi 0 siblings, 1 reply; 22+ messages in thread From: Naoya Horiguchi @ 2012-08-23 20:31 UTC (permalink / raw) To: Wu Fengguang Cc: Naoya Horiguchi, Andi Kleen, Andrew Morton, Tony Luck, Rik van Riel, Jun'ichi Nomura, linux-mm, linux-kernel On Thu, Aug 23, 2012 at 05:11:25PM +0800, Fengguang Wu wrote: > On Wed, Aug 22, 2012 at 11:17:35AM -0400, Naoya Horiguchi wrote: > > "HWPOISON: report sticky EIO for poisoned file" still has a corner case > > where we have possibilities of data lost. This is because in this fix > > AS_HWPOISON is cleared when the inode cache is dropped. > > > > For example, consider an application in which a process periodically > > (every 10 minutes) writes some logs on a file (and closes it after > > each writes,) and at the end of each day some batch programs run using > > the log file. If a memory error hits on dirty pagecache of this log file > > just after periodic write/close and the inode cache is cleared before the > > next write, then this application is not aware of the error and the batch > > programs will work wrongly. > > > > To avoid this, this patch makes us pin the hwpoisoned inode on memory > > until we remove or completely truncate the hwpoisoned file. > > Good point! > > > Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> > > --- > > fs/inode.c | 12 ++++++++++++ > > include/linux/pagemap.h | 11 +++++++++++ > > mm/memory-failure.c | 2 +- > > mm/truncate.c | 2 ++ > > 4 files changed, 26 insertions(+), 1 deletion(-) > > > > diff --git v3.6-rc1.orig/fs/inode.c v3.6-rc1/fs/inode.c > > index ac8d904..8742397 100644 > > --- v3.6-rc1.orig/fs/inode.c > > +++ v3.6-rc1/fs/inode.c > > @@ -717,6 +717,15 @@ void prune_icache_sb(struct super_block *sb, int nr_to_scan) > > } > > > > /* > > + * Keep inode caches on memory for user processes to certainly > > + * be aware of memory errors. > > + */ > > + if (unlikely(mapping_hwpoison(inode->i_mapping))) { > > + spin_unlock(&inode->i_lock); > > + continue; > > + } > > That chunk prevents reclaiming all the cached pages. However the intention > is only to keep the struct inode together with the hwpoison bit? Yes, we can not reclaim pagecaches from shrink_slab(), but we can do from shrink_zone(). So it shouldn't happen that cached pages on hwpoisoned file remain for long under high memory pressure. > > + /* > > * Referenced or dirty inodes are still in use. Give them > > * another pass through the LRU as we canot reclaim them now. > > */ > > @@ -1405,6 +1414,9 @@ static void iput_final(struct inode *inode) > > inode->i_state &= ~I_WILL_FREE; > > } > > > > + if (unlikely(mapping_hwpoison(inode->i_mapping) && drop)) > > + mapping_clear_hwpoison(inode->i_mapping); > > Is that clear necessary? Because the bit will be gone with the inode > struct: it's going to be de-allocated anyway. With the chunk in prune_icache_sb() we keep the inode struct with AS_HWPOISON set on memory, so in order to remove it, we need explicitly clear the bit. Without this clear, the inode remains until system reboot. > > inode->i_state |= I_FREEING; > > if (!list_empty(&inode->i_lru)) > > inode_lru_list_del(inode); > > diff --git v3.6-rc1.orig/include/linux/pagemap.h v3.6-rc1/include/linux/pagemap.h > > index 4d8d821..9fce4e4 100644 > > --- v3.6-rc1.orig/include/linux/pagemap.h > > +++ v3.6-rc1/include/linux/pagemap.h > > @@ -59,11 +59,22 @@ static inline int mapping_hwpoison(struct address_space *mapping) > > { > > return test_bit(AS_HWPOISON, &mapping->flags); > > } > > +static inline void mapping_set_hwpoison(struct address_space *mapping) > > +{ > > + set_bit(AS_HWPOISON, &mapping->flags); > > +} > > +static inline void mapping_clear_hwpoison(struct address_space *mapping) > > +{ > > + clear_bit(AS_HWPOISON, &mapping->flags); > > +} > > #else > > static inline int mapping_hwpoison(struct address_space *mapping) > > { > > return 0; > > } > > +static inline void mapping_clear_hwpoison(struct address_space *mapping) > > +{ > > +} > > #endif > > > > static inline gfp_t mapping_gfp_mask(struct address_space * mapping) > > diff --git v3.6-rc1.orig/mm/memory-failure.c v3.6-rc1/mm/memory-failure.c > > index a1e7e00..ca064c6 100644 > > --- v3.6-rc1.orig/mm/memory-failure.c > > +++ v3.6-rc1/mm/memory-failure.c > > @@ -652,7 +652,7 @@ static int me_pagecache_dirty(struct page *p, unsigned long pfn) > > * the first EIO, but we're not worse than other parts > > * of the kernel. > > */ > > - set_bit(AS_HWPOISON, &mapping->flags); > > + mapping_set_hwpoison(mapping); > > } > > > > return me_pagecache_clean(p, pfn); > > diff --git v3.6-rc1.orig/mm/truncate.c v3.6-rc1/mm/truncate.c > > index 75801ac..82a994f 100644 > > --- v3.6-rc1.orig/mm/truncate.c > > +++ v3.6-rc1/mm/truncate.c > > @@ -574,6 +574,8 @@ void truncate_setsize(struct inode *inode, loff_t newsize) > > > > oldsize = inode->i_size; > > i_size_write(inode, newsize); > > + if (unlikely(mapping_hwpoison(inode->i_mapping) && !newsize)) > > It might be a bit better to test !newsize first. Ah, OK. Thanks, Naoya ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 3/3] HWPOISON: prevent inode cache removal to keep AS_HWPOISON sticky 2012-08-23 20:31 ` Naoya Horiguchi @ 2012-08-24 21:52 ` Naoya Horiguchi 0 siblings, 0 replies; 22+ messages in thread From: Naoya Horiguchi @ 2012-08-24 21:52 UTC (permalink / raw) To: Naoya Horiguchi Cc: Wu Fengguang, Andi Kleen, Andrew Morton, Tony Luck, Rik van Riel, Jun'ichi Nomura, linux-mm, linux-kernel Hello, On Thu, Aug 23, 2012 at 04:31:43PM -0400, Naoya Horiguchi wrote: > On Thu, Aug 23, 2012 at 05:11:25PM +0800, Fengguang Wu wrote: > > On Wed, Aug 22, 2012 at 11:17:35AM -0400, Naoya Horiguchi wrote: ... > > > diff --git v3.6-rc1.orig/fs/inode.c v3.6-rc1/fs/inode.c > > > index ac8d904..8742397 100644 > > > --- v3.6-rc1.orig/fs/inode.c > > > +++ v3.6-rc1/fs/inode.c > > > @@ -717,6 +717,15 @@ void prune_icache_sb(struct super_block *sb, int nr_to_scan) > > > } > > > > > > /* > > > + * Keep inode caches on memory for user processes to certainly > > > + * be aware of memory errors. > > > + */ > > > + if (unlikely(mapping_hwpoison(inode->i_mapping))) { > > > + spin_unlock(&inode->i_lock); > > > + continue; > > > + } > > > > That chunk prevents reclaiming all the cached pages. However the intention > > is only to keep the struct inode together with the hwpoison bit? > > Yes, we can not reclaim pagecaches from shrink_slab(), but we can do from > shrink_zone(). So it shouldn't happen that cached pages on hwpoisoned file > remain for long under high memory pressure. I might lose your point. Are you suggesting this chunk should come after if (inode_has_buffers(inode) || inode->i_data.nrpages) { ... } block, aren't you? I think that's right, so I'll try and test it this weekend. > > > + /* > > > * Referenced or dirty inodes are still in use. Give them > > > * another pass through the LRU as we canot reclaim them now. > > > */ > > > @@ -1405,6 +1414,9 @@ static void iput_final(struct inode *inode) > > > inode->i_state &= ~I_WILL_FREE; > > > } > > > > > > + if (unlikely(mapping_hwpoison(inode->i_mapping) && drop)) > > > + mapping_clear_hwpoison(inode->i_mapping); > > > > Is that clear necessary? Because the bit will be gone with the inode > > struct: it's going to be de-allocated anyway. > > With the chunk in prune_icache_sb() we keep the inode struct with > AS_HWPOISON set on memory, so in order to remove it, we need explicitly > clear the bit. > Without this clear, the inode remains until system reboot. And again, you are right here. Without this clear, this inode will be cleared in destroy_inode(). Thanks, Naoya ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 3/3] HWPOISON: prevent inode cache removal to keep AS_HWPOISON sticky 2012-08-22 15:17 ` [PATCH 3/3] HWPOISON: prevent inode cache removal to keep AS_HWPOISON sticky Naoya Horiguchi 2012-08-23 9:11 ` Fengguang Wu @ 2012-08-24 1:31 ` Dave Chinner 2012-08-24 2:39 ` Naoya Horiguchi 1 sibling, 1 reply; 22+ messages in thread From: Dave Chinner @ 2012-08-24 1:31 UTC (permalink / raw) To: Naoya Horiguchi Cc: Andi Kleen, Wu Fengguang, Andrew Morton, Tony Luck, Rik van Riel, Jun'ichi Nomura, linux-mm, linux-kernel On Wed, Aug 22, 2012 at 11:17:35AM -0400, Naoya Horiguchi wrote: > "HWPOISON: report sticky EIO for poisoned file" still has a corner case > where we have possibilities of data lost. This is because in this fix > AS_HWPOISON is cleared when the inode cache is dropped. > > For example, consider an application in which a process periodically > (every 10 minutes) writes some logs on a file (and closes it after > each writes,) and at the end of each day some batch programs run using > the log file. If a memory error hits on dirty pagecache of this log file > just after periodic write/close and the inode cache is cleared before the > next write, then this application is not aware of the error and the batch > programs will work wrongly. > > To avoid this, this patch makes us pin the hwpoisoned inode on memory > until we remove or completely truncate the hwpoisoned file. > > Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> > --- > fs/inode.c | 12 ++++++++++++ > include/linux/pagemap.h | 11 +++++++++++ > mm/memory-failure.c | 2 +- > mm/truncate.c | 2 ++ > 4 files changed, 26 insertions(+), 1 deletion(-) > > diff --git v3.6-rc1.orig/fs/inode.c v3.6-rc1/fs/inode.c > index ac8d904..8742397 100644 > --- v3.6-rc1.orig/fs/inode.c > +++ v3.6-rc1/fs/inode.c > @@ -717,6 +717,15 @@ void prune_icache_sb(struct super_block *sb, int nr_to_scan) > } > > /* > + * Keep inode caches on memory for user processes to certainly > + * be aware of memory errors. > + */ > + if (unlikely(mapping_hwpoison(inode->i_mapping))) { > + spin_unlock(&inode->i_lock); > + continue; > + } > + > + /* > * Referenced or dirty inodes are still in use. Give them > * another pass through the LRU as we canot reclaim them now. > */ I don't think you tested this at all. Have a look at what the loop does more closely - inodes with poisoned mappings will get stuck and reclaim doesn't make progress past them. I think you also need to document this inode lifecycle change.... > diff --git v3.6-rc1.orig/mm/truncate.c v3.6-rc1/mm/truncate.c > index 75801ac..82a994f 100644 > --- v3.6-rc1.orig/mm/truncate.c > +++ v3.6-rc1/mm/truncate.c > @@ -574,6 +574,8 @@ void truncate_setsize(struct inode *inode, loff_t newsize) > > oldsize = inode->i_size; > i_size_write(inode, newsize); > + if (unlikely(mapping_hwpoison(inode->i_mapping) && !newsize)) > + mapping_clear_hwpoison(inode->i_mapping); So only a truncate to zero size will clear the poison flag? What happens if it is the last page in the mapping that is poisoned, and we truncate that away? Shouldn't that clear the poisoned bit? What about a hole punch over the poisoned range? Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 3/3] HWPOISON: prevent inode cache removal to keep AS_HWPOISON sticky 2012-08-24 1:31 ` Dave Chinner @ 2012-08-24 2:39 ` Naoya Horiguchi 2012-08-24 4:39 ` Dave Chinner 0 siblings, 1 reply; 22+ messages in thread From: Naoya Horiguchi @ 2012-08-24 2:39 UTC (permalink / raw) To: david Cc: Naoya Horiguchi, Andi Kleen, Wu Fengguang, Andrew Morton, Tony Luck, Rik van Riel, Jun'ichi Nomura, linux-mm, linux-kernel On Fri, Aug 24, 2012 at 11:31:18AM +1000, Dave Chinner wrote: > On Wed, Aug 22, 2012 at 11:17:35AM -0400, Naoya Horiguchi wrote: > > "HWPOISON: report sticky EIO for poisoned file" still has a corner case > > where we have possibilities of data lost. This is because in this fix > > AS_HWPOISON is cleared when the inode cache is dropped. > > > > For example, consider an application in which a process periodically > > (every 10 minutes) writes some logs on a file (and closes it after > > each writes,) and at the end of each day some batch programs run using > > the log file. If a memory error hits on dirty pagecache of this log file > > just after periodic write/close and the inode cache is cleared before the > > next write, then this application is not aware of the error and the batch > > programs will work wrongly. > > > > To avoid this, this patch makes us pin the hwpoisoned inode on memory > > until we remove or completely truncate the hwpoisoned file. > > > > Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> > > --- > > fs/inode.c | 12 ++++++++++++ > > include/linux/pagemap.h | 11 +++++++++++ > > mm/memory-failure.c | 2 +- > > mm/truncate.c | 2 ++ > > 4 files changed, 26 insertions(+), 1 deletion(-) > > > > diff --git v3.6-rc1.orig/fs/inode.c v3.6-rc1/fs/inode.c > > index ac8d904..8742397 100644 > > --- v3.6-rc1.orig/fs/inode.c > > +++ v3.6-rc1/fs/inode.c > > @@ -717,6 +717,15 @@ void prune_icache_sb(struct super_block *sb, int nr_to_scan) > > } > > > > /* > > + * Keep inode caches on memory for user processes to certainly > > + * be aware of memory errors. > > + */ > > + if (unlikely(mapping_hwpoison(inode->i_mapping))) { > > + spin_unlock(&inode->i_lock); > > + continue; > > + } > > + > > + /* > > * Referenced or dirty inodes are still in use. Give them > > * another pass through the LRU as we canot reclaim them now. > > */ > > I don't think you tested this at all. Have a look at what the loop > does more closely - inodes with poisoned mappings will get stuck > and reclaim doesn't make progress past them. Sorry, I overlooked something important in my testing. I'll correct it. Maybe we need list_move_tail() in this block. > I think you also need to document this inode lifecycle change.... OK, I'll do it. > > diff --git v3.6-rc1.orig/mm/truncate.c v3.6-rc1/mm/truncate.c > > index 75801ac..82a994f 100644 > > --- v3.6-rc1.orig/mm/truncate.c > > +++ v3.6-rc1/mm/truncate.c > > @@ -574,6 +574,8 @@ void truncate_setsize(struct inode *inode, loff_t newsize) > > > > oldsize = inode->i_size; > > i_size_write(inode, newsize); > > + if (unlikely(mapping_hwpoison(inode->i_mapping) && !newsize)) > > + mapping_clear_hwpoison(inode->i_mapping); > > So only a truncate to zero size will clear the poison flag? Yes, this is because we only know if the file is affected by hwpoison, but not where the hwpoisoned page is in the file. We could remember it, but I did not do it for simplicity. > What happens if it is the last page in the mapping that is poisoned, > and we truncate that away? Shouldn't that clear the poisoned bit? When we handle the hwpoisoned inode, the error page should already be removed from pagecache, so the remaining pages are not related to the error and we need not care about them when we consider bit clearing. > What about a hole punch over the poisoned range? For the same reason, this is also not related to when to clear the bit. Thanks, Naoya ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 3/3] HWPOISON: prevent inode cache removal to keep AS_HWPOISON sticky 2012-08-24 2:39 ` Naoya Horiguchi @ 2012-08-24 4:39 ` Dave Chinner 2012-08-24 17:24 ` Naoya Horiguchi 0 siblings, 1 reply; 22+ messages in thread From: Dave Chinner @ 2012-08-24 4:39 UTC (permalink / raw) To: Naoya Horiguchi Cc: Andi Kleen, Wu Fengguang, Andrew Morton, Tony Luck, Rik van Riel, Jun'ichi Nomura, linux-mm, linux-kernel On Thu, Aug 23, 2012 at 10:39:32PM -0400, Naoya Horiguchi wrote: > On Fri, Aug 24, 2012 at 11:31:18AM +1000, Dave Chinner wrote: > > On Wed, Aug 22, 2012 at 11:17:35AM -0400, Naoya Horiguchi wrote: > > > "HWPOISON: report sticky EIO for poisoned file" still has a corner case > > > where we have possibilities of data lost. This is because in this fix > > > AS_HWPOISON is cleared when the inode cache is dropped. .... > > > --- v3.6-rc1.orig/mm/truncate.c > > > +++ v3.6-rc1/mm/truncate.c > > > @@ -574,6 +574,8 @@ void truncate_setsize(struct inode *inode, loff_t newsize) > > > > > > oldsize = inode->i_size; > > > i_size_write(inode, newsize); > > > + if (unlikely(mapping_hwpoison(inode->i_mapping) && !newsize)) > > > + mapping_clear_hwpoison(inode->i_mapping); > > > > So only a truncate to zero size will clear the poison flag? > > Yes, this is because we only know if the file is affected by hwpoison, > but not where the hwpoisoned page is in the file. We could remember it, > but I did not do it for simplicity. Surely the page has flags on it to say it is poisoned? That is, after truncating the page cache, if the address space is poisoned, then you can do a pass across the mapping tree checking if there are any poisoned pages left? Or perhaps adding a new mapping tree tag so that the poisoned status is automatically determined by the presence of the poisoned page in the mapping tree? > > What happens if it is the last page in the mapping that is poisoned, > > and we truncate that away? Shouldn't that clear the poisoned bit? > > When we handle the hwpoisoned inode, the error page should already > be removed from pagecache, so the remaining pages are not related > to the error and we need not care about them when we consider bit > clearing. Sorry, I don't follow. What removed the page from the page cache? The truncate_page_cache() call that follows the above code hunk is what does that, so I don't see how it can already be removed from the page cache.... > > What about a hole punch over the poisoned range? > > For the same reason, this is also not related to when to clear the bit. Sure it is - if you remove the poisoned pages from the mapping when the hole is punched, then the mapping is no longer poisoned. Hence the bit should be cleared at that time as nothing else will ever clear it. Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 3/3] HWPOISON: prevent inode cache removal to keep AS_HWPOISON sticky 2012-08-24 4:39 ` Dave Chinner @ 2012-08-24 17:24 ` Naoya Horiguchi 2012-08-26 22:26 ` Dave Chinner 0 siblings, 1 reply; 22+ messages in thread From: Naoya Horiguchi @ 2012-08-24 17:24 UTC (permalink / raw) To: Dave Chinner Cc: Naoya Horiguchi, Andi Kleen, Wu Fengguang, Andrew Morton, Tony Luck, Rik van Riel, Jun'ichi Nomura, linux-mm, linux-kernel On Fri, Aug 24, 2012 at 02:39:17PM +1000, Dave Chinner wrote: > On Thu, Aug 23, 2012 at 10:39:32PM -0400, Naoya Horiguchi wrote: > > On Fri, Aug 24, 2012 at 11:31:18AM +1000, Dave Chinner wrote: > > > On Wed, Aug 22, 2012 at 11:17:35AM -0400, Naoya Horiguchi wrote: > > > > "HWPOISON: report sticky EIO for poisoned file" still has a corner case > > > > where we have possibilities of data lost. This is because in this fix > > > > AS_HWPOISON is cleared when the inode cache is dropped. > .... > > > > --- v3.6-rc1.orig/mm/truncate.c > > > > +++ v3.6-rc1/mm/truncate.c > > > > @@ -574,6 +574,8 @@ void truncate_setsize(struct inode *inode, loff_t newsize) > > > > > > > > oldsize = inode->i_size; > > > > i_size_write(inode, newsize); > > > > + if (unlikely(mapping_hwpoison(inode->i_mapping) && !newsize)) > > > > + mapping_clear_hwpoison(inode->i_mapping); > > > > > > So only a truncate to zero size will clear the poison flag? > > > > Yes, this is because we only know if the file is affected by hwpoison, > > but not where the hwpoisoned page is in the file. We could remember it, > > but I did not do it for simplicity. > > Surely the page has flags on it to say it is poisoned? That is, > after truncating the page cache, if the address space is poisoned, > then you can do a pass across the mapping tree checking if there are > any poisoned pages left? Or perhaps adding a new mapping tree tag so > that the poisoned status is automatically determined by the presence > of the poisoned page in the mapping tree? The answer for the first question is yes. And for the second question, I don't think it's easy because the mapping tree has no reference to the error page (I explain more about this below, please see also it,) and it can cost too much to search poisoned pages over page cache in each request. And for the third question, I think we could do this, but to do it we need an additional space (8 bytes) in struct radix_tree_node. Considering that this new tag is not used so frequently, so I'm not sure that it's worth the cost. > > > What happens if it is the last page in the mapping that is poisoned, > > > and we truncate that away? Shouldn't that clear the poisoned bit? > > > > When we handle the hwpoisoned inode, the error page should already > > be removed from pagecache, so the remaining pages are not related > > to the error and we need not care about them when we consider bit > > clearing. > > Sorry, I don't follow. What removed the page from the page cache? > The truncate_page_cache() call that follows the above code hunk is > what does that, so I don't see how it can already be removed from > the page cache.... Memory error handler (memory_failure() in mm/memory-failure.c) has removed the error page from the page cache. And truncate_page_cache() that follows this hunk removes all pages belonging to the page cache of the poisoned file (where the error page itself is not included in them.) Let me explain more to clarify my whole scenario. If a memory error hits on a dirty pagecache, kernel works like below: 1. handles a MCE interrupt (logging MCE events,) 2. calls memory error handler (doing 3 to 6,) 3. sets PageHWPoison flag on the error page, 4. unmaps all mappings to processes' virtual addresses, 5. sets AS_HWPOISON on mappings to which the error page belongs 6. invalidates the error page (unlinks it from LRU list and removes it from pagecache,) (memory error handler finished) 7. later accesses to the file returns -EIO, 8. AS_HWPOISON is cleared when the file is removed or completely truncated. This patchset tries to fix the problem in 7 and add a new behavior 8, where I have an assumption that 1-6 has already worked out. You may think it strange that the condition of clearing AS_HWPOISON is checked with file granularity. This is because currently userspace applications know the memory errors only with file granularity for simplicity, when they access via read(), write() and/or fsync(). (Only for access via mmap(), error reporting is with page granularity.) We can do it with page granularity, and I tried 2 approaches: a. "adding a new tag in mapping tree" approach as I explained above, this needs an additional space on heavily used data structure, b. "adding a new data structure specific to dirty pagecache error management" approach I did this in the 1st version of this patchset, but found that it's too complicated as the first step. But I think both of these are some difficulty to be accepted, so at first I try to start with simpler one. > > > What about a hole punch over the poisoned range? > > > > For the same reason, this is also not related to when to clear the bit. > > Sure it is - if you remove the poisoned pages from the mapping when > the hole is punched, then the mapping is no longer poisoned. Hence > the bit should be cleared at that time as nothing else will ever > clear it. If we achieve error reporting with the page granularity, I hope we can also do this easily. Do I answer all your questions? If not, please let me know. Thanks, Naoya ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 3/3] HWPOISON: prevent inode cache removal to keep AS_HWPOISON sticky 2012-08-24 17:24 ` Naoya Horiguchi @ 2012-08-26 22:26 ` Dave Chinner 2012-08-27 22:05 ` Naoya Horiguchi 0 siblings, 1 reply; 22+ messages in thread From: Dave Chinner @ 2012-08-26 22:26 UTC (permalink / raw) To: Naoya Horiguchi Cc: Andi Kleen, Wu Fengguang, Andrew Morton, Tony Luck, Rik van Riel, Jun'ichi Nomura, linux-mm, linux-kernel On Fri, Aug 24, 2012 at 01:24:16PM -0400, Naoya Horiguchi wrote: > On Fri, Aug 24, 2012 at 02:39:17PM +1000, Dave Chinner wrote: > > On Thu, Aug 23, 2012 at 10:39:32PM -0400, Naoya Horiguchi wrote: > > > On Fri, Aug 24, 2012 at 11:31:18AM +1000, Dave Chinner wrote: > > > > On Wed, Aug 22, 2012 at 11:17:35AM -0400, Naoya Horiguchi wrote: > > > > > "HWPOISON: report sticky EIO for poisoned file" still has a corner case > > > > > where we have possibilities of data lost. This is because in this fix > > > > > AS_HWPOISON is cleared when the inode cache is dropped. > > .... > > > > > --- v3.6-rc1.orig/mm/truncate.c > > > > > +++ v3.6-rc1/mm/truncate.c > > > > > @@ -574,6 +574,8 @@ void truncate_setsize(struct inode *inode, loff_t newsize) > > > > > > > > > > oldsize = inode->i_size; > > > > > i_size_write(inode, newsize); > > > > > + if (unlikely(mapping_hwpoison(inode->i_mapping) && !newsize)) > > > > > + mapping_clear_hwpoison(inode->i_mapping); > > > > > > > > So only a truncate to zero size will clear the poison flag? > > > > > > Yes, this is because we only know if the file is affected by hwpoison, > > > but not where the hwpoisoned page is in the file. We could remember it, > > > but I did not do it for simplicity. > > > > Surely the page has flags on it to say it is poisoned? That is, > > after truncating the page cache, if the address space is poisoned, > > then you can do a pass across the mapping tree checking if there are > > any poisoned pages left? Or perhaps adding a new mapping tree tag so > > that the poisoned status is automatically determined by the presence > > of the poisoned page in the mapping tree? > > The answer for the first question is yes. And for the second question, > I don't think it's easy because the mapping tree has no reference to > the error page (I explain more about this below, please see also it,) > and it can cost too much to search poisoned pages over page cache in > each request. Which is my point about a radix tree tag - that's very efficient. > And for the third question, I think we could do this, but to do it > we need an additional space (8 bytes) in struct radix_tree_node. > Considering that this new tag is not used so frequently, so I'm not > sure that it's worth the cost. A radix tree node is currently 560 bytes on x86_64, packed 7 to a page. i.e. using 3920 bytes. We can add another 8 bytes to it without increasing memory usage at all. So, no cost at all. > > > > What happens if it is the last page in the mapping that is poisoned, > > > > and we truncate that away? Shouldn't that clear the poisoned bit? > > > > > > When we handle the hwpoisoned inode, the error page should already > > > be removed from pagecache, so the remaining pages are not related > > > to the error and we need not care about them when we consider bit > > > clearing. > > > > Sorry, I don't follow. What removed the page from the page cache? > > The truncate_page_cache() call that follows the above code hunk is > > what does that, so I don't see how it can already be removed from > > the page cache.... > > Memory error handler (memory_failure() in mm/memory-failure.c) has > removed the error page from the page cache. > And truncate_page_cache() that follows this hunk removes all pages > belonging to the page cache of the poisoned file (where the error > page itself is not included in them.) > > Let me explain more to clarify my whole scenario. If a memory error > hits on a dirty pagecache, kernel works like below: > > 1. handles a MCE interrupt (logging MCE events,) > 2. calls memory error handler (doing 3 to 6,) > 3. sets PageHWPoison flag on the error page, > 4. unmaps all mappings to processes' virtual addresses, So nothing in userspace sees the bad page after this. > 5. sets AS_HWPOISON on mappings to which the error page belongs > 6. invalidates the error page (unlinks it from LRU list and removes > it from pagecache,) > (memory error handler finished) Ok, so the moment a memory error is handled, the page has been removed from the inode's mapping, and it will never be seen by aplications again. It's a transient error.... > 7. later accesses to the file returns -EIO, > 8. AS_HWPOISON is cleared when the file is removed or completely > truncated. .... so why do we have to keep an EIO on the inode forever? If the page is not dirty, then just tossing it from the cache (as is already done) and rereading it from disk next time it is accessed removes the need for any error to be reported at all. It's effectively a transient error at this point, and as such no errors should be visible from userspace. If the page is dirty, then it needs to be treated just like any other failed page write - the page is invalidated and the address space is marked with AS_EIO, and that is reported to the next operation that waits on IO on that file (i.e. fsync) If you have a second application that reads the files that depends on a guarantee of good data, then the first step in that process is that application that writes it needs to use fsync to check the data was written correctly. That ensures that you only have clean pages in the cache before the writer closes the file, and any h/w error then devolves to the above transient clean page invalidation case. Hence I fail to see why this type of IO error needs to be sticky. The error on the mapping is transient - it is gone as soon as the page is removed from the mapping. Hence the error can be dropped as soon as it is reported to userspace because the mapping is now error free. > You may think it strange that the condition of clearing AS_HWPOISON > is checked with file granularity. I don't think it is strange, I think it is *wrong*. > This is because currently userspace > applications know the memory errors only with file granularity for > simplicity, when they access via read(), write() and/or fsync(). Trying to report this error to every potential future access regardless of whether the error still exists or the access is to the poisoned range with magical sticky errors on inode address spaces and hacks to memory the reclaim subsystems smells to me like a really bad hack to work around applications that use bad data integrity practices. As such, I think you probably need to rethink the approach you are taking to handling this error. The error is transient w.r.t. to the mapping and page cache, and needs to be addressed consistently compared to other transient IO errors that are reported through the mapping.... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 3/3] HWPOISON: prevent inode cache removal to keep AS_HWPOISON sticky 2012-08-26 22:26 ` Dave Chinner @ 2012-08-27 22:05 ` Naoya Horiguchi 2012-08-29 2:59 ` Dave Chinner 0 siblings, 1 reply; 22+ messages in thread From: Naoya Horiguchi @ 2012-08-27 22:05 UTC (permalink / raw) To: Dave Chinner Cc: Naoya Horiguchi, Andi Kleen, Wu Fengguang, Andrew Morton, Tony Luck, Rik van Riel, Jun'ichi Nomura, linux-mm, linux-kernel On Mon, Aug 27, 2012 at 08:26:07AM +1000, Dave Chinner wrote: > On Fri, Aug 24, 2012 at 01:24:16PM -0400, Naoya Horiguchi wrote: > > On Fri, Aug 24, 2012 at 02:39:17PM +1000, Dave Chinner wrote: > > > On Thu, Aug 23, 2012 at 10:39:32PM -0400, Naoya Horiguchi wrote: > > > > On Fri, Aug 24, 2012 at 11:31:18AM +1000, Dave Chinner wrote: > > > > > On Wed, Aug 22, 2012 at 11:17:35AM -0400, Naoya Horiguchi wrote: > > > > > > "HWPOISON: report sticky EIO for poisoned file" still has a corner case > > > > > > where we have possibilities of data lost. This is because in this fix > > > > > > AS_HWPOISON is cleared when the inode cache is dropped. > > > .... > > > > > > --- v3.6-rc1.orig/mm/truncate.c > > > > > > +++ v3.6-rc1/mm/truncate.c > > > > > > @@ -574,6 +574,8 @@ void truncate_setsize(struct inode *inode, loff_t newsize) > > > > > > > > > > > > oldsize = inode->i_size; > > > > > > i_size_write(inode, newsize); > > > > > > + if (unlikely(mapping_hwpoison(inode->i_mapping) && !newsize)) > > > > > > + mapping_clear_hwpoison(inode->i_mapping); > > > > > > > > > > So only a truncate to zero size will clear the poison flag? > > > > > > > > Yes, this is because we only know if the file is affected by hwpoison, > > > > but not where the hwpoisoned page is in the file. We could remember it, > > > > but I did not do it for simplicity. > > > > > > Surely the page has flags on it to say it is poisoned? That is, > > > after truncating the page cache, if the address space is poisoned, > > > then you can do a pass across the mapping tree checking if there are > > > any poisoned pages left? Or perhaps adding a new mapping tree tag so > > > that the poisoned status is automatically determined by the presence > > > of the poisoned page in the mapping tree? > > > > The answer for the first question is yes. And for the second question, > > I don't think it's easy because the mapping tree has no reference to > > the error page (I explain more about this below, please see also it,) > > and it can cost too much to search poisoned pages over page cache in > > each request. > > Which is my point about a radix tree tag - that's very efficient. > > > And for the third question, I think we could do this, but to do it > > we need an additional space (8 bytes) in struct radix_tree_node. > > Considering that this new tag is not used so frequently, so I'm not > > sure that it's worth the cost. > > A radix tree node is currently 560 bytes on x86_64, packed 7 to a > page. i.e. using 3920 bytes. We can add another 8 bytes to it > without increasing memory usage at all. So, no cost at all. OK. > > > > > What happens if it is the last page in the mapping that is poisoned, > > > > > and we truncate that away? Shouldn't that clear the poisoned bit? > > > > > > > > When we handle the hwpoisoned inode, the error page should already > > > > be removed from pagecache, so the remaining pages are not related > > > > to the error and we need not care about them when we consider bit > > > > clearing. > > > > > > Sorry, I don't follow. What removed the page from the page cache? > > > The truncate_page_cache() call that follows the above code hunk is > > > what does that, so I don't see how it can already be removed from > > > the page cache.... > > > > Memory error handler (memory_failure() in mm/memory-failure.c) has > > removed the error page from the page cache. > > And truncate_page_cache() that follows this hunk removes all pages > > belonging to the page cache of the poisoned file (where the error > > page itself is not included in them.) > > > > Let me explain more to clarify my whole scenario. If a memory error > > hits on a dirty pagecache, kernel works like below: > > > > 1. handles a MCE interrupt (logging MCE events,) > > 2. calls memory error handler (doing 3 to 6,) > > 3. sets PageHWPoison flag on the error page, > > 4. unmaps all mappings to processes' virtual addresses, > > So nothing in userspace sees the bad page after this. > > > 5. sets AS_HWPOISON on mappings to which the error page belongs > > 6. invalidates the error page (unlinks it from LRU list and removes > > it from pagecache,) > > (memory error handler finished) > > Ok, so the moment a memory error is handled, the page has been > removed from the inode's mapping, and it will never be seen by > aplications again. It's a transient error.... > > > 7. later accesses to the file returns -EIO, > > 8. AS_HWPOISON is cleared when the file is removed or completely > > truncated. > > .... so why do we have to keep an EIO on the inode forever? > > If the page is not dirty, then just tossing it from the cache (as > is already done) and rereading it from disk next time it is accessed > removes the need for any error to be reported at all. It's > effectively a transient error at this point, and as such no errors > should be visible from userspace. > > If the page is dirty, then it needs to be treated just like any > other failed page write - the page is invalidated and the address > space is marked with AS_EIO, and that is reported to the next > operation that waits on IO on that file (i.e. fsync) > > If you have a second application that reads the files that depends > on a guarantee of good data, then the first step in that process is > that application that writes it needs to use fsync to check the data > was written correctly. That ensures that you only have clean pages > in the cache before the writer closes the file, and any h/w error > then devolves to the above transient clean page invalidation case. Thank you for detailed explanations. And yes, I understand it's ideal, but many applications choose not to do that for performance reason. So I think it's helpful if we can surely report to such applications. > Hence I fail to see why this type of IO error needs to be sticky. > The error on the mapping is transient - it is gone as soon as the > page is removed from the mapping. Hence the error can be dropped as > soon as it is reported to userspace because the mapping is now error > free. It's error free only for the applications which do fsync check in each write, but not for the applications which don't do. I think the penalty for the latters (ignore dirty data lost and get wrong results) is too big to consider it as a reasonable trade-off. > > You may think it strange that the condition of clearing AS_HWPOISON > > is checked with file granularity. > > I don't think it is strange, I think it is *wrong*. OK, we can move to a new tag approach, and we can avoid this. > > This is because currently userspace > > applications know the memory errors only with file granularity for > > simplicity, when they access via read(), write() and/or fsync(). > > Trying to report this error to every potential future access > regardless of whether the error still exists or the access is to the > poisoned range with magical sticky errors on inode address spaces > and hacks to memory the reclaim subsystems smells to me like a really > bad hack to work around applications that use bad data integrity > practices. > > As such, I think you probably need to rethink the approach you are > taking to handling this error. The error is transient w.r.t. to the > mapping and page cache, and needs to be addressed consistently > compared to other transient IO errors that are reported through the > mapping.... Agreed. We can handle this error without a controversial flag on the mapping, in new pagecache tag approach. I'll try that one. Thanks, Naoya ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 3/3] HWPOISON: prevent inode cache removal to keep AS_HWPOISON sticky 2012-08-27 22:05 ` Naoya Horiguchi @ 2012-08-29 2:59 ` Dave Chinner 2012-08-29 5:32 ` Jun'ichi Nomura 0 siblings, 1 reply; 22+ messages in thread From: Dave Chinner @ 2012-08-29 2:59 UTC (permalink / raw) To: Naoya Horiguchi Cc: Andi Kleen, Wu Fengguang, Andrew Morton, Tony Luck, Rik van Riel, Jun'ichi Nomura, linux-mm, linux-kernel On Mon, Aug 27, 2012 at 06:05:06PM -0400, Naoya Horiguchi wrote: > On Mon, Aug 27, 2012 at 08:26:07AM +1000, Dave Chinner wrote: > > On Fri, Aug 24, 2012 at 01:24:16PM -0400, Naoya Horiguchi wrote: > > > Let me explain more to clarify my whole scenario. If a memory error > > > hits on a dirty pagecache, kernel works like below: > > > > > > 1. handles a MCE interrupt (logging MCE events,) > > > 2. calls memory error handler (doing 3 to 6,) > > > 3. sets PageHWPoison flag on the error page, > > > 4. unmaps all mappings to processes' virtual addresses, > > > > So nothing in userspace sees the bad page after this. > > > > > 5. sets AS_HWPOISON on mappings to which the error page belongs > > > 6. invalidates the error page (unlinks it from LRU list and removes > > > it from pagecache,) > > > (memory error handler finished) > > > > Ok, so the moment a memory error is handled, the page has been > > removed from the inode's mapping, and it will never be seen by > > aplications again. It's a transient error.... > > > > > 7. later accesses to the file returns -EIO, > > > 8. AS_HWPOISON is cleared when the file is removed or completely > > > truncated. > > > > .... so why do we have to keep an EIO on the inode forever? > > > > If the page is not dirty, then just tossing it from the cache (as > > is already done) and rereading it from disk next time it is accessed > > removes the need for any error to be reported at all. It's > > effectively a transient error at this point, and as such no errors > > should be visible from userspace. > > > > If the page is dirty, then it needs to be treated just like any > > other failed page write - the page is invalidated and the address > > space is marked with AS_EIO, and that is reported to the next > > operation that waits on IO on that file (i.e. fsync) > > > > If you have a second application that reads the files that depends > > on a guarantee of good data, then the first step in that process is > > that application that writes it needs to use fsync to check the data > > was written correctly. That ensures that you only have clean pages > > in the cache before the writer closes the file, and any h/w error > > then devolves to the above transient clean page invalidation case. > > Thank you for detailed explanations. > And yes, I understand it's ideal, but many applications choose not to > do that for performance reason. You choose: data integrity or performance. > So I think it's helpful if we can surely report to such applications. If performance is chosen over data integrity, we are under no obligation to keep the error around indefinitely. Fundamentally, ensuring a write completes successfully is the reponsibility of the application, not the kernel. There are so many different filesytem and storage errors that can be lost right now because data is not fsync()d, adding another one to them really doesn't change anything. IOWs, a memory error is no different to a disk failing or the system crashing when it comes to data integrity. If you care, you use fsync(). > > Hence I fail to see why this type of IO error needs to be sticky. > > The error on the mapping is transient - it is gone as soon as the > > page is removed from the mapping. Hence the error can be dropped as > > soon as it is reported to userspace because the mapping is now error > > free. > > It's error free only for the applications which do fsync check in > each write, but not for the applications which don't do. > I think the penalty for the latters (ignore dirty data lost and get > wrong results) is too big to consider it as a reasonable trade-off. I'm guessing that you don't deal with data integrity issues very often. What you are suggesting is not a reasonable tradeoff - either applications are coded correctly for data integrity, or they give up any expectation that errors will be detected and reported reliably. Hoping that we might be able to report an error somewhere to someone who didn't care to avoid or collect in the first place does not improve the situation for anyone.... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 3/3] HWPOISON: prevent inode cache removal to keep AS_HWPOISON sticky 2012-08-29 2:59 ` Dave Chinner @ 2012-08-29 5:32 ` Jun'ichi Nomura 2012-09-03 0:39 ` Dave Chinner 0 siblings, 1 reply; 22+ messages in thread From: Jun'ichi Nomura @ 2012-08-29 5:32 UTC (permalink / raw) To: Dave Chinner Cc: Naoya Horiguchi, Andi Kleen, Wu Fengguang, Andrew Morton, Tony Luck, Rik van Riel, linux-mm, linux-kernel On 08/29/12 11:59, Dave Chinner wrote: > On Mon, Aug 27, 2012 at 06:05:06PM -0400, Naoya Horiguchi wrote: >> And yes, I understand it's ideal, but many applications choose not to >> do that for performance reason. >> So I think it's helpful if we can surely report to such applications. I suspect "performance vs. integrity" is not a correct description of the problem. > If performance is chosen over data integrity, we are under no > obligation to keep the error around indefinitely. Fundamentally, > ensuring a write completes successfully is the reponsibility of the > application, not the kernel. There are so many different filesytem > and storage errors that can be lost right now because data is not > fsync()d, adding another one to them really doesn't change anything. > IOWs, a memory error is no different to a disk failing or the system > crashing when it comes to data integrity. If you care, you use > fsync(). I agree that applications should fsync() or O_SYNC when it wants to make sure the written data in on disk. AFAIU, what Naoya is going to address is the case where fsync() is not necessarily needed. For example, if someone do: $ patch -p1 < ../a.patch $ tar cf . > ../a.tar and disk failure occurred between "patch" and "tar", "tar" will either see uptodate data or I/O error. OTOH, if the failure was detected on dirty pagecache, the current memory failure handler invalidates the dirty page and the "tar" command will re-read old contents from disk without error. (Well, the failures above are permanent failures. IOW, the current memory failure handler turns permanent failure into transient error, which is often more difficult to handle, I think.) Naoya's patch will keep the failure information and allows the reader to get I/O error when it reads from broken pagecache. -- Jun'ichi Nomura, NEC Corporation ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 3/3] HWPOISON: prevent inode cache removal to keep AS_HWPOISON sticky 2012-08-29 5:32 ` Jun'ichi Nomura @ 2012-09-03 0:39 ` Dave Chinner 0 siblings, 0 replies; 22+ messages in thread From: Dave Chinner @ 2012-09-03 0:39 UTC (permalink / raw) To: Jun'ichi Nomura Cc: Naoya Horiguchi, Andi Kleen, Wu Fengguang, Andrew Morton, Tony Luck, Rik van Riel, linux-mm, linux-kernel On Wed, Aug 29, 2012 at 02:32:04PM +0900, Jun'ichi Nomura wrote: > On 08/29/12 11:59, Dave Chinner wrote: > > On Mon, Aug 27, 2012 at 06:05:06PM -0400, Naoya Horiguchi wrote: > >> And yes, I understand it's ideal, but many applications choose not to > >> do that for performance reason. > >> So I think it's helpful if we can surely report to such applications. > > I suspect "performance vs. integrity" is not a correct > description of the problem. Right, to be more precise, it's a "eat my data" vs "integrity" problem. And in almost all cases I've seen over the years, "eat my data" is done for performance reasons... > > If performance is chosen over data integrity, we are under no > > obligation to keep the error around indefinitely. Fundamentally, > > ensuring a write completes successfully is the reponsibility of the > > application, not the kernel. There are so many different filesytem > > and storage errors that can be lost right now because data is not > > fsync()d, adding another one to them really doesn't change anything. > > IOWs, a memory error is no different to a disk failing or the system > > crashing when it comes to data integrity. If you care, you use > > fsync(). > > I agree that applications should fsync() or O_SYNC > when it wants to make sure the written data in on disk. > > AFAIU, what Naoya is going to address is the case where > fsync() is not necessarily needed. > > For example, if someone do: > $ patch -p1 < ../a.patch > $ tar cf . > ../a.tar > > and disk failure occurred between "patch" and "tar", > "tar" will either see uptodate data or I/O error. No, it won't. The only place AS_EIO is tested is in filemap_fdatawait_range(), which is only called in the fsync() path. The only way to report async write IO errors is to use fsync() - subsequent reads of the file do *not* see the write error. IOWs, tar will be oblivious of any IO error that preceeded it reading the files it is copying. > OTOH, if the failure was detected on dirty pagecache, the current memory > failure handler invalidates the dirty page and the "tar" command will > re-read old contents from disk without error. After an IO error, the dirty page is no longer uptodate - that gets cleared - so when the page is read the data will be re-read from disk just like if a memory error occurred. So tar will behave the same regardless of whether it is a memory error or an IO error (i.e. reread old data from disk) > (Well, the failures above are permanent failures. Write IO errors can also be transient or permanent. Transient, for example, when a path failure occurs and multipathing then detects this and fails over to a good path. A subsequent write will then succeed. Permanent, for example, when someone unplugs a USB drive. > IOW, the current > memory failure handler turns permanent failure into transient error, > which is often more difficult to handle, I think.) The patch I commented on is turning a transient error (error in a page that is then poisoned and never used again) into a permanent error (error on an address space that is reported on every future operation that tries to insert a page into the page cache). > Naoya's patch will keep the failure information and allows the reader > to get I/O error when it reads from broken pagecache. It only adds a hwposion check in add_to_page_cache_locked(). If the page is already in the cache, then no error will be sent to the reader because it never gets to add_to_page_cache_locked(). So there's no guarantee that the reader is even going to see the error, or that they see the error on the page that actually caused it - access to any missing page in the page cache will trigger it. And as memory reclaim clears pages off the inode, more and more of the range of the inode's data will return an error, even though there is nothing wrong with the data in most of the file. Indeed, what happens if the application calls fsync, gets the error and tries to rewrite the page? i.e. it does everything correctly to handle the write error? With this patch set, the application cannot insert a replacement page into the page cache, so all subsequent writes fail! IOWs, it makes it impossible for applications to recover from a detected and handled memory failure. I have no issue with reporting the problem to userspace - that needs t am I saying that the current IO reporting is wonderful and can't be improved. What I am saying, though, is that I really don't think this patch set has been well thought through from either an IO path or userspace error handling point of view. The problems with this patch set are quite significant: - permanent, unclearable error except by complete removal of all data on the file. (forcing the removal of all good data to recover from a transient error on a small amount of data). - while the error is present, bad data cannot be overwritten (applications cannot recover even if they catch the error). - while the error is present, no new data can be written (applications can't continue even if they don't care about the error). - while the error is present, no valid data can be read from the file (applications can't access good data they need to run even after the error has been handled). - memory reclaim will slowly remove uptodate cached pages and so re-reading good cached pages can suddenly return errors even though no new error has been encountered. (error gets worse over time until all good data is removed or the system is rebooted). The first thing that you need to do is make sure applications can recover from a detected (via fsync) hwpoisoning event on a dirty page in the page cache. Once you have that working, then handle the case of errors on clean pages (e.g. hwpoison a libc dso page and see if the machine continues to operate). Once you have a system resilient to clean page errors and dirty page errors for applications that care about data integrity, then you can start thinking about making stuff better for applications that don't care about data integrity.... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 0/3 v2] HWPOISON: improve dirty pagecache error reporting 2012-08-22 15:17 [PATCH 0/3 v2] HWPOISON: improve dirty pagecache error reporting Naoya Horiguchi ` (2 preceding siblings ...) 2012-08-22 15:17 ` [PATCH 3/3] HWPOISON: prevent inode cache removal to keep AS_HWPOISON sticky Naoya Horiguchi @ 2012-08-22 20:22 ` Andi Kleen 2012-08-22 21:14 ` Naoya Horiguchi 3 siblings, 1 reply; 22+ messages in thread From: Andi Kleen @ 2012-08-22 20:22 UTC (permalink / raw) To: Naoya Horiguchi Cc: Andi Kleen, Wu Fengguang, Andrew Morton, Tony Luck, Rik van Riel, Jun'ichi Nomura, linux-mm, linux-kernel Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> writes: > Hi, > > Based on the previous discussion, in this version I propose only error > reporting fix ("overwrite recovery" is sparated out from this series.) > > I think Fengguang's patch (patch 2 in this series) has a corner case > about inode cache drop, so I added patch 3 for it. New patchkit looks very reasonable to me. I haven't gone through all the corner cases the inode pinning may have though. You probably want an review from Al Viro on this. -Andi -- ak@linux.intel.com -- Speaking for myself only ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 0/3 v2] HWPOISON: improve dirty pagecache error reporting 2012-08-22 20:22 ` [PATCH 0/3 v2] HWPOISON: improve dirty pagecache error reporting Andi Kleen @ 2012-08-22 21:14 ` Naoya Horiguchi 0 siblings, 0 replies; 22+ messages in thread From: Naoya Horiguchi @ 2012-08-22 21:14 UTC (permalink / raw) To: Andi Kleen, Al Viro Cc: Naoya Horiguchi, Andi Kleen, Wu Fengguang, Andrew Morton, Tony Luck, Rik van Riel, Jun'ichi Nomura, linux-mm, linux-kernel On Wed, Aug 22, 2012 at 01:22:36PM -0700, Andi Kleen wrote: > Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> writes: > > > Hi, > > > > Based on the previous discussion, in this version I propose only error > > reporting fix ("overwrite recovery" is sparated out from this series.) > > > > I think Fengguang's patch (patch 2 in this series) has a corner case > > about inode cache drop, so I added patch 3 for it. > > New patchkit looks very reasonable to me. OK, thanks. > I haven't gone through all the corner cases the inode pinning may > have though. You probably want an review from Al Viro on this. Yes. Al, can I have your review on patch 3 in this series? https://lkml.org/lkml/2012/8/22/400 Naoya ^ permalink raw reply [flat|nested] 22+ messages in thread
end of thread, other threads:[~2012-09-03 0:40 UTC | newest] Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2012-08-22 15:17 [PATCH 0/3 v2] HWPOISON: improve dirty pagecache error reporting Naoya Horiguchi 2012-08-22 15:17 ` [PATCH 1/3] HWPOISON: fix action_result() to print out dirty/clean Naoya Horiguchi 2012-08-23 9:33 ` Fengguang Wu 2012-08-23 20:31 ` Naoya Horiguchi 2012-08-22 15:17 ` [PATCH 2/3] HWPOISON: report sticky EIO for poisoned file Naoya Horiguchi 2012-08-23 9:22 ` Fengguang Wu 2012-08-23 20:31 ` Naoya Horiguchi 2012-08-22 15:17 ` [PATCH 3/3] HWPOISON: prevent inode cache removal to keep AS_HWPOISON sticky Naoya Horiguchi 2012-08-23 9:11 ` Fengguang Wu 2012-08-23 20:31 ` Naoya Horiguchi 2012-08-24 21:52 ` Naoya Horiguchi 2012-08-24 1:31 ` Dave Chinner 2012-08-24 2:39 ` Naoya Horiguchi 2012-08-24 4:39 ` Dave Chinner 2012-08-24 17:24 ` Naoya Horiguchi 2012-08-26 22:26 ` Dave Chinner 2012-08-27 22:05 ` Naoya Horiguchi 2012-08-29 2:59 ` Dave Chinner 2012-08-29 5:32 ` Jun'ichi Nomura 2012-09-03 0:39 ` Dave Chinner 2012-08-22 20:22 ` [PATCH 0/3 v2] HWPOISON: improve dirty pagecache error reporting Andi Kleen 2012-08-22 21:14 ` Naoya Horiguchi
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).