[PATCH] mm: Fix XFS oops due to dirty pages without buffers on s390

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH] mm: Fix XFS oops due to dirty pages without buffers on s390
@ 2012-10-01 16:26 ` Jan Kara
  0 siblings, 0 replies; 61+ messages in thread
From: Jan Kara @ 2012-10-01 16:26 UTC (permalink / raw)
  To: linux-mm; +Cc: LKML, xfs, Jan Kara, Martin Schwidefsky, Mel Gorman, linux-s390

On s390 any write to a page (even from kernel itself) sets architecture
specific page dirty bit. Thus when a page is written to via standard write, HW
dirty bit gets set and when we later map and unmap the page, page_remove_rmap()
finds the dirty bit and calls set_page_dirty().

Dirtying of a page which shouldn't be dirty can cause all sorts of problems to
filesystems. The bug we observed in practice is that buffers from the page get
freed, so when the page gets later marked as dirty and writeback writes it, XFS
crashes due to an assertion BUG_ON(!PagePrivate(page)) in page_buffers() called
from xfs_count_page_state().

Similar problem can also happen when zero_user_segment() call from
xfs_vm_writepage() (or block_write_full_page() for that matter) set the
hardware dirty bit during writeback, later buffers get freed, and then page
unmapped.

Fix the issue by ignoring s390 HW dirty bit for page cache pages in
page_mkclean() and page_remove_rmap(). This is safe because when a page gets
marked as writeable in PTE it is also marked dirty in do_wp_page() or
do_page_fault(). When the dirty bit is cleared by clear_page_dirty_for_io(),
the page gets writeprotected in page_mkclean(). So pagecache page is writeable
if and only if it is dirty.

CC: Martin Schwidefsky <schwidefsky@de.ibm.com>
CC: Mel Gorman <mgorman@suse.de>
CC: linux-s390@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
---
 mm/rmap.c |   16 ++++++++++++++--
 1 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index 0f3b7cd..6ce8ddb 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -973,7 +973,15 @@ int page_mkclean(struct page *page)
 		struct address_space *mapping = page_mapping(page);
 		if (mapping) {
 			ret = page_mkclean_file(mapping, page);
-			if (page_test_and_clear_dirty(page_to_pfn(page), 1))
+			/*
+			 * We ignore dirty bit for pagecache pages. It is safe
+			 * as page is marked dirty iff it is writeable (page is
+			 * marked as dirty when it is made writeable and
+			 * clear_page_dirty_for_io() writeprotects the page
+			 * again).
+			 */
+			if (PageSwapCache(page) &&
+			    page_test_and_clear_dirty(page_to_pfn(page), 1))
 				ret = 1;
 		}
 	}
@@ -1183,8 +1191,12 @@ void page_remove_rmap(struct page *page)
 	 * this if the page is anon, so about to be freed; but perhaps
 	 * not if it's in swapcache - there might be another pte slot
 	 * containing the swap entry, but page not yet written to swap.
+	 * For pagecache pages, we don't care about dirty bit in storage
+	 * key because the page is writeable iff it is dirty (page is marked
+	 * as dirty when it is made writeable and clear_page_dirty_for_io()
+	 * writeprotects the page again).
 	 */
-	if ((!anon || PageSwapCache(page)) &&
+	if (PageSwapCache(page) &&
 	    page_test_and_clear_dirty(page_to_pfn(page), 1))
 		set_page_dirty(page);
 	/*
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH] mm: Fix XFS oops due to dirty pages without buffers on s390
@ 2012-10-01 16:26 ` Jan Kara
  0 siblings, 0 replies; 61+ messages in thread
From: Jan Kara @ 2012-10-01 16:26 UTC (permalink / raw)
  To: linux-mm; +Cc: linux-s390, Jan Kara, LKML, xfs, Mel Gorman, Martin Schwidefsky

On s390 any write to a page (even from kernel itself) sets architecture
specific page dirty bit. Thus when a page is written to via standard write, HW
dirty bit gets set and when we later map and unmap the page, page_remove_rmap()
finds the dirty bit and calls set_page_dirty().

Dirtying of a page which shouldn't be dirty can cause all sorts of problems to
filesystems. The bug we observed in practice is that buffers from the page get
freed, so when the page gets later marked as dirty and writeback writes it, XFS
crashes due to an assertion BUG_ON(!PagePrivate(page)) in page_buffers() called
from xfs_count_page_state().

Similar problem can also happen when zero_user_segment() call from
xfs_vm_writepage() (or block_write_full_page() for that matter) set the
hardware dirty bit during writeback, later buffers get freed, and then page
unmapped.

Fix the issue by ignoring s390 HW dirty bit for page cache pages in
page_mkclean() and page_remove_rmap(). This is safe because when a page gets
marked as writeable in PTE it is also marked dirty in do_wp_page() or
do_page_fault(). When the dirty bit is cleared by clear_page_dirty_for_io(),
the page gets writeprotected in page_mkclean(). So pagecache page is writeable
if and only if it is dirty.

CC: Martin Schwidefsky <schwidefsky@de.ibm.com>
CC: Mel Gorman <mgorman@suse.de>
CC: linux-s390@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
---
 mm/rmap.c |   16 ++++++++++++++--
 1 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index 0f3b7cd..6ce8ddb 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -973,7 +973,15 @@ int page_mkclean(struct page *page)
 		struct address_space *mapping = page_mapping(page);
 		if (mapping) {
 			ret = page_mkclean_file(mapping, page);
-			if (page_test_and_clear_dirty(page_to_pfn(page), 1))
+			/*
+			 * We ignore dirty bit for pagecache pages. It is safe
+			 * as page is marked dirty iff it is writeable (page is
+			 * marked as dirty when it is made writeable and
+			 * clear_page_dirty_for_io() writeprotects the page
+			 * again).
+			 */
+			if (PageSwapCache(page) &&
+			    page_test_and_clear_dirty(page_to_pfn(page), 1))
 				ret = 1;
 		}
 	}
@@ -1183,8 +1191,12 @@ void page_remove_rmap(struct page *page)
 	 * this if the page is anon, so about to be freed; but perhaps
 	 * not if it's in swapcache - there might be another pte slot
 	 * containing the swap entry, but page not yet written to swap.
+	 * For pagecache pages, we don't care about dirty bit in storage
+	 * key because the page is writeable iff it is dirty (page is marked
+	 * as dirty when it is made writeable and clear_page_dirty_for_io()
+	 * writeprotects the page again).
 	 */
-	if ((!anon || PageSwapCache(page)) &&
+	if (PageSwapCache(page) &&
 	    page_test_and_clear_dirty(page_to_pfn(page), 1))
 		set_page_dirty(page);
 	/*
-- 
1.7.1

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH] mm: Fix XFS oops due to dirty pages without buffers on s390
@ 2012-10-01 16:26 ` Jan Kara
  0 siblings, 0 replies; 61+ messages in thread
From: Jan Kara @ 2012-10-01 16:26 UTC (permalink / raw)
  To: linux-mm; +Cc: LKML, xfs, Jan Kara, Martin Schwidefsky, Mel Gorman, linux-s390

On s390 any write to a page (even from kernel itself) sets architecture
specific page dirty bit. Thus when a page is written to via standard write, HW
dirty bit gets set and when we later map and unmap the page, page_remove_rmap()
finds the dirty bit and calls set_page_dirty().

Dirtying of a page which shouldn't be dirty can cause all sorts of problems to
filesystems. The bug we observed in practice is that buffers from the page get
freed, so when the page gets later marked as dirty and writeback writes it, XFS
crashes due to an assertion BUG_ON(!PagePrivate(page)) in page_buffers() called
from xfs_count_page_state().

Similar problem can also happen when zero_user_segment() call from
xfs_vm_writepage() (or block_write_full_page() for that matter) set the
hardware dirty bit during writeback, later buffers get freed, and then page
unmapped.

Fix the issue by ignoring s390 HW dirty bit for page cache pages in
page_mkclean() and page_remove_rmap(). This is safe because when a page gets
marked as writeable in PTE it is also marked dirty in do_wp_page() or
do_page_fault(). When the dirty bit is cleared by clear_page_dirty_for_io(),
the page gets writeprotected in page_mkclean(). So pagecache page is writeable
if and only if it is dirty.

CC: Martin Schwidefsky <schwidefsky@de.ibm.com>
CC: Mel Gorman <mgorman@suse.de>
CC: linux-s390@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
---
 mm/rmap.c |   16 ++++++++++++++--
 1 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index 0f3b7cd..6ce8ddb 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -973,7 +973,15 @@ int page_mkclean(struct page *page)
 		struct address_space *mapping = page_mapping(page);
 		if (mapping) {
 			ret = page_mkclean_file(mapping, page);
-			if (page_test_and_clear_dirty(page_to_pfn(page), 1))
+			/*
+			 * We ignore dirty bit for pagecache pages. It is safe
+			 * as page is marked dirty iff it is writeable (page is
+			 * marked as dirty when it is made writeable and
+			 * clear_page_dirty_for_io() writeprotects the page
+			 * again).
+			 */
+			if (PageSwapCache(page) &&
+			    page_test_and_clear_dirty(page_to_pfn(page), 1))
 				ret = 1;
 		}
 	}
@@ -1183,8 +1191,12 @@ void page_remove_rmap(struct page *page)
 	 * this if the page is anon, so about to be freed; but perhaps
 	 * not if it's in swapcache - there might be another pte slot
 	 * containing the swap entry, but page not yet written to swap.
+	 * For pagecache pages, we don't care about dirty bit in storage
+	 * key because the page is writeable iff it is dirty (page is marked
+	 * as dirty when it is made writeable and clear_page_dirty_for_io()
+	 * writeprotects the page again).
 	 */
-	if ((!anon || PageSwapCache(page)) &&
+	if (PageSwapCache(page) &&
 	    page_test_and_clear_dirty(page_to_pfn(page), 1))
 		set_page_dirty(page);
 	/*
-- 
1.7.1

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 61+ messages in thread

* Re: [PATCH] mm: Fix XFS oops due to dirty pages without buffers on s390
  2012-10-01 16:26 ` Jan Kara
  (?)
@ 2012-10-08 14:28   ` Mel Gorman
  -1 siblings, 0 replies; 61+ messages in thread
From: Mel Gorman @ 2012-10-08 14:28 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-mm, LKML, xfs, Martin Schwidefsky, linux-s390

On Mon, Oct 01, 2012 at 06:26:36PM +0200, Jan Kara wrote:
> On s390 any write to a page (even from kernel itself) sets architecture
> specific page dirty bit. Thus when a page is written to via standard write, HW
> dirty bit gets set and when we later map and unmap the page, page_remove_rmap()
> finds the dirty bit and calls set_page_dirty().
> 
> Dirtying of a page which shouldn't be dirty can cause all sorts of problems to
> filesystems. The bug we observed in practice is that buffers from the page get
> freed, so when the page gets later marked as dirty and writeback writes it, XFS
> crashes due to an assertion BUG_ON(!PagePrivate(page)) in page_buffers() called
> from xfs_count_page_state().
> 
> Similar problem can also happen when zero_user_segment() call from
> xfs_vm_writepage() (or block_write_full_page() for that matter) set the
> hardware dirty bit during writeback, later buffers get freed, and then page
> unmapped.
> 
> Fix the issue by ignoring s390 HW dirty bit for page cache pages in
> page_mkclean() and page_remove_rmap(). This is safe because when a page gets
> marked as writeable in PTE it is also marked dirty in do_wp_page() or
> do_page_fault(). When the dirty bit is cleared by clear_page_dirty_for_io(),
> the page gets writeprotected in page_mkclean(). So pagecache page is writeable
> if and only if it is dirty.
> 
> CC: Martin Schwidefsky <schwidefsky@de.ibm.com>
> CC: Mel Gorman <mgorman@suse.de>
> CC: linux-s390@vger.kernel.org
> Signed-off-by: Jan Kara <jack@suse.cz>

Acked-by: Mel Gorman <mgorman@suse.de>

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH] mm: Fix XFS oops due to dirty pages without buffers on s390
@ 2012-10-08 14:28   ` Mel Gorman
  0 siblings, 0 replies; 61+ messages in thread
From: Mel Gorman @ 2012-10-08 14:28 UTC (permalink / raw)
  To: Jan Kara; +Cc: Martin Schwidefsky, linux-mm, linux-s390, LKML, xfs

On Mon, Oct 01, 2012 at 06:26:36PM +0200, Jan Kara wrote:
> On s390 any write to a page (even from kernel itself) sets architecture
> specific page dirty bit. Thus when a page is written to via standard write, HW
> dirty bit gets set and when we later map and unmap the page, page_remove_rmap()
> finds the dirty bit and calls set_page_dirty().
> 
> Dirtying of a page which shouldn't be dirty can cause all sorts of problems to
> filesystems. The bug we observed in practice is that buffers from the page get
> freed, so when the page gets later marked as dirty and writeback writes it, XFS
> crashes due to an assertion BUG_ON(!PagePrivate(page)) in page_buffers() called
> from xfs_count_page_state().
> 
> Similar problem can also happen when zero_user_segment() call from
> xfs_vm_writepage() (or block_write_full_page() for that matter) set the
> hardware dirty bit during writeback, later buffers get freed, and then page
> unmapped.
> 
> Fix the issue by ignoring s390 HW dirty bit for page cache pages in
> page_mkclean() and page_remove_rmap(). This is safe because when a page gets
> marked as writeable in PTE it is also marked dirty in do_wp_page() or
> do_page_fault(). When the dirty bit is cleared by clear_page_dirty_for_io(),
> the page gets writeprotected in page_mkclean(). So pagecache page is writeable
> if and only if it is dirty.
> 
> CC: Martin Schwidefsky <schwidefsky@de.ibm.com>
> CC: Mel Gorman <mgorman@suse.de>
> CC: linux-s390@vger.kernel.org
> Signed-off-by: Jan Kara <jack@suse.cz>

Acked-by: Mel Gorman <mgorman@suse.de>

-- 
Mel Gorman
SUSE Labs

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH] mm: Fix XFS oops due to dirty pages without buffers on s390
@ 2012-10-08 14:28   ` Mel Gorman
  0 siblings, 0 replies; 61+ messages in thread
From: Mel Gorman @ 2012-10-08 14:28 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-mm, LKML, xfs, Martin Schwidefsky, linux-s390

On Mon, Oct 01, 2012 at 06:26:36PM +0200, Jan Kara wrote:
> On s390 any write to a page (even from kernel itself) sets architecture
> specific page dirty bit. Thus when a page is written to via standard write, HW
> dirty bit gets set and when we later map and unmap the page, page_remove_rmap()
> finds the dirty bit and calls set_page_dirty().
> 
> Dirtying of a page which shouldn't be dirty can cause all sorts of problems to
> filesystems. The bug we observed in practice is that buffers from the page get
> freed, so when the page gets later marked as dirty and writeback writes it, XFS
> crashes due to an assertion BUG_ON(!PagePrivate(page)) in page_buffers() called
> from xfs_count_page_state().
> 
> Similar problem can also happen when zero_user_segment() call from
> xfs_vm_writepage() (or block_write_full_page() for that matter) set the
> hardware dirty bit during writeback, later buffers get freed, and then page
> unmapped.
> 
> Fix the issue by ignoring s390 HW dirty bit for page cache pages in
> page_mkclean() and page_remove_rmap(). This is safe because when a page gets
> marked as writeable in PTE it is also marked dirty in do_wp_page() or
> do_page_fault(). When the dirty bit is cleared by clear_page_dirty_for_io(),
> the page gets writeprotected in page_mkclean(). So pagecache page is writeable
> if and only if it is dirty.
> 
> CC: Martin Schwidefsky <schwidefsky@de.ibm.com>
> CC: Mel Gorman <mgorman@suse.de>
> CC: linux-s390@vger.kernel.org
> Signed-off-by: Jan Kara <jack@suse.cz>

Acked-by: Mel Gorman <mgorman@suse.de>

-- 
Mel Gorman
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH] mm: Fix XFS oops due to dirty pages without buffers on s390
  2012-10-01 16:26 ` Jan Kara
  (?)
@ 2012-10-09  4:24   ` Hugh Dickins
  -1 siblings, 0 replies; 61+ messages in thread
From: Hugh Dickins @ 2012-10-09  4:24 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-mm, LKML, xfs, Martin Schwidefsky, Mel Gorman, linux-s390

On Mon, 1 Oct 2012, Jan Kara wrote:

> On s390 any write to a page (even from kernel itself) sets architecture
> specific page dirty bit. Thus when a page is written to via standard write, HW
> dirty bit gets set and when we later map and unmap the page, page_remove_rmap()
> finds the dirty bit and calls set_page_dirty().
> 
> Dirtying of a page which shouldn't be dirty can cause all sorts of problems to
> filesystems. The bug we observed in practice is that buffers from the page get
> freed, so when the page gets later marked as dirty and writeback writes it, XFS
> crashes due to an assertion BUG_ON(!PagePrivate(page)) in page_buffers() called
> from xfs_count_page_state().

What changed recently?  Was XFS hardly used on s390 until now?

> 
> Similar problem can also happen when zero_user_segment() call from
> xfs_vm_writepage() (or block_write_full_page() for that matter) set the
> hardware dirty bit during writeback, later buffers get freed, and then page
> unmapped.
> 
> Fix the issue by ignoring s390 HW dirty bit for page cache pages in
> page_mkclean() and page_remove_rmap(). This is safe because when a page gets
> marked as writeable in PTE it is also marked dirty in do_wp_page() or
> do_page_fault(). When the dirty bit is cleared by clear_page_dirty_for_io(),
> the page gets writeprotected in page_mkclean(). So pagecache page is writeable
> if and only if it is dirty.

Very interesting patch...

> 
> CC: Martin Schwidefsky <schwidefsky@de.ibm.com>

which I'd very much like Martin's opinion on...

> CC: Mel Gorman <mgorman@suse.de>

and I'm grateful to Mel's ack for reawakening me to it...

> CC: linux-s390@vger.kernel.org
> Signed-off-by: Jan Kara <jack@suse.cz>

but I think it's wrong.

> ---
>  mm/rmap.c |   16 ++++++++++++++--
>  1 files changed, 14 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 0f3b7cd..6ce8ddb 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -973,7 +973,15 @@ int page_mkclean(struct page *page)
>  		struct address_space *mapping = page_mapping(page);
>  		if (mapping) {
>  			ret = page_mkclean_file(mapping, page);
> -			if (page_test_and_clear_dirty(page_to_pfn(page), 1))
> +			/*
> +			 * We ignore dirty bit for pagecache pages. It is safe
> +			 * as page is marked dirty iff it is writeable (page is
> +			 * marked as dirty when it is made writeable and
> +			 * clear_page_dirty_for_io() writeprotects the page
> +			 * again).
> +			 */
> +			if (PageSwapCache(page) &&
> +			    page_test_and_clear_dirty(page_to_pfn(page), 1))
>  				ret = 1;

This part you could cut out: page_mkclean() is not used on SwapCache pages.
I believe you are safe to remove the page_test_and_clear_dirty() from here.

>  		}
>  	}
> @@ -1183,8 +1191,12 @@ void page_remove_rmap(struct page *page)
>  	 * this if the page is anon, so about to be freed; but perhaps
>  	 * not if it's in swapcache - there might be another pte slot
>  	 * containing the swap entry, but page not yet written to swap.
> +	 * For pagecache pages, we don't care about dirty bit in storage
> +	 * key because the page is writeable iff it is dirty (page is marked
> +	 * as dirty when it is made writeable and clear_page_dirty_for_io()
> +	 * writeprotects the page again).
>  	 */
> -	if ((!anon || PageSwapCache(page)) &&
> +	if (PageSwapCache(page) &&
>  	    page_test_and_clear_dirty(page_to_pfn(page), 1))
>  		set_page_dirty(page);

But here's where I think the problem is.  You're assuming that all
filesystems go the same mapping_cap_account_writeback_dirty() (yeah,
there's no such function, just a confusing maze of three) route as XFS.

But filesystems like tmpfs and ramfs (perhaps they're the only two
that matter here) don't participate in that, and wait for an mmap'ed
page to be seen modified by the user (usually via pte_dirty, but that's
a no-op on s390) before page is marked dirty; and page reclaim throws
away undirtied pages.

So, if I'm understanding right, with this change s390 would be in danger
of discarding shm, and mmap'ed tmpfs and ramfs pages - whereas pages
written with the write system call would already be PageDirty and secure.

You mention above that even the kernel writing to the page would mark
the s390 storage key dirty.  I think that means that these shm and
tmpfs and ramfs pages would all have dirty storage keys just from the
clear_highpage() used to prepare them originally, and so would have
been found dirty anyway by the existing code here in page_remove_rmap(),
even though other architectures would regard them as clean and removable.

If that's the case, then maybe we'd do better just to mark them dirty
when faulted in the s390 case.  Then your patch above should (I think)
be safe.  Though I'd then be VERY tempted to adjust the SwapCache case
too (I've not thought through exactly what that patch would be, just
one or two suitably placed SetPageDirtys, I think), and eliminate
page_test_and_clear_dirty() altogether - no tears shed by any of us!

A separate worry came to mind as I thought about your patch: where
in page migration is s390's dirty storage key migrated from old page
to new?  And if there is a problem there, that too should be fixed
by what I propose in the previous paragraph.

Hugh

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH] mm: Fix XFS oops due to dirty pages without buffers on s390
@ 2012-10-09  4:24   ` Hugh Dickins
  0 siblings, 0 replies; 61+ messages in thread
From: Hugh Dickins @ 2012-10-09  4:24 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-s390, LKML, xfs, linux-mm, Mel Gorman, Martin Schwidefsky

On Mon, 1 Oct 2012, Jan Kara wrote:

> On s390 any write to a page (even from kernel itself) sets architecture
> specific page dirty bit. Thus when a page is written to via standard write, HW
> dirty bit gets set and when we later map and unmap the page, page_remove_rmap()
> finds the dirty bit and calls set_page_dirty().
> 
> Dirtying of a page which shouldn't be dirty can cause all sorts of problems to
> filesystems. The bug we observed in practice is that buffers from the page get
> freed, so when the page gets later marked as dirty and writeback writes it, XFS
> crashes due to an assertion BUG_ON(!PagePrivate(page)) in page_buffers() called
> from xfs_count_page_state().

What changed recently?  Was XFS hardly used on s390 until now?

> 
> Similar problem can also happen when zero_user_segment() call from
> xfs_vm_writepage() (or block_write_full_page() for that matter) set the
> hardware dirty bit during writeback, later buffers get freed, and then page
> unmapped.
> 
> Fix the issue by ignoring s390 HW dirty bit for page cache pages in
> page_mkclean() and page_remove_rmap(). This is safe because when a page gets
> marked as writeable in PTE it is also marked dirty in do_wp_page() or
> do_page_fault(). When the dirty bit is cleared by clear_page_dirty_for_io(),
> the page gets writeprotected in page_mkclean(). So pagecache page is writeable
> if and only if it is dirty.

Very interesting patch...

> 
> CC: Martin Schwidefsky <schwidefsky@de.ibm.com>

which I'd very much like Martin's opinion on...

> CC: Mel Gorman <mgorman@suse.de>

and I'm grateful to Mel's ack for reawakening me to it...

> CC: linux-s390@vger.kernel.org
> Signed-off-by: Jan Kara <jack@suse.cz>

but I think it's wrong.

> ---
>  mm/rmap.c |   16 ++++++++++++++--
>  1 files changed, 14 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 0f3b7cd..6ce8ddb 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -973,7 +973,15 @@ int page_mkclean(struct page *page)
>  		struct address_space *mapping = page_mapping(page);
>  		if (mapping) {
>  			ret = page_mkclean_file(mapping, page);
> -			if (page_test_and_clear_dirty(page_to_pfn(page), 1))
> +			/*
> +			 * We ignore dirty bit for pagecache pages. It is safe
> +			 * as page is marked dirty iff it is writeable (page is
> +			 * marked as dirty when it is made writeable and
> +			 * clear_page_dirty_for_io() writeprotects the page
> +			 * again).
> +			 */
> +			if (PageSwapCache(page) &&
> +			    page_test_and_clear_dirty(page_to_pfn(page), 1))
>  				ret = 1;

This part you could cut out: page_mkclean() is not used on SwapCache pages.
I believe you are safe to remove the page_test_and_clear_dirty() from here.

>  		}
>  	}
> @@ -1183,8 +1191,12 @@ void page_remove_rmap(struct page *page)
>  	 * this if the page is anon, so about to be freed; but perhaps
>  	 * not if it's in swapcache - there might be another pte slot
>  	 * containing the swap entry, but page not yet written to swap.
> +	 * For pagecache pages, we don't care about dirty bit in storage
> +	 * key because the page is writeable iff it is dirty (page is marked
> +	 * as dirty when it is made writeable and clear_page_dirty_for_io()
> +	 * writeprotects the page again).
>  	 */
> -	if ((!anon || PageSwapCache(page)) &&
> +	if (PageSwapCache(page) &&
>  	    page_test_and_clear_dirty(page_to_pfn(page), 1))
>  		set_page_dirty(page);

But here's where I think the problem is.  You're assuming that all
filesystems go the same mapping_cap_account_writeback_dirty() (yeah,
there's no such function, just a confusing maze of three) route as XFS.

But filesystems like tmpfs and ramfs (perhaps they're the only two
that matter here) don't participate in that, and wait for an mmap'ed
page to be seen modified by the user (usually via pte_dirty, but that's
a no-op on s390) before page is marked dirty; and page reclaim throws
away undirtied pages.

So, if I'm understanding right, with this change s390 would be in danger
of discarding shm, and mmap'ed tmpfs and ramfs pages - whereas pages
written with the write system call would already be PageDirty and secure.

You mention above that even the kernel writing to the page would mark
the s390 storage key dirty.  I think that means that these shm and
tmpfs and ramfs pages would all have dirty storage keys just from the
clear_highpage() used to prepare them originally, and so would have
been found dirty anyway by the existing code here in page_remove_rmap(),
even though other architectures would regard them as clean and removable.

If that's the case, then maybe we'd do better just to mark them dirty
when faulted in the s390 case.  Then your patch above should (I think)
be safe.  Though I'd then be VERY tempted to adjust the SwapCache case
too (I've not thought through exactly what that patch would be, just
one or two suitably placed SetPageDirtys, I think), and eliminate
page_test_and_clear_dirty() altogether - no tears shed by any of us!

A separate worry came to mind as I thought about your patch: where
in page migration is s390's dirty storage key migrated from old page
to new?  And if there is a problem there, that too should be fixed
by what I propose in the previous paragraph.

Hugh

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH] mm: Fix XFS oops due to dirty pages without buffers on s390
@ 2012-10-09  4:24   ` Hugh Dickins
  0 siblings, 0 replies; 61+ messages in thread
From: Hugh Dickins @ 2012-10-09  4:24 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-mm, LKML, xfs, Martin Schwidefsky, Mel Gorman, linux-s390

On Mon, 1 Oct 2012, Jan Kara wrote:

> On s390 any write to a page (even from kernel itself) sets architecture
> specific page dirty bit. Thus when a page is written to via standard write, HW
> dirty bit gets set and when we later map and unmap the page, page_remove_rmap()
> finds the dirty bit and calls set_page_dirty().
> 
> Dirtying of a page which shouldn't be dirty can cause all sorts of problems to
> filesystems. The bug we observed in practice is that buffers from the page get
> freed, so when the page gets later marked as dirty and writeback writes it, XFS
> crashes due to an assertion BUG_ON(!PagePrivate(page)) in page_buffers() called
> from xfs_count_page_state().

What changed recently?  Was XFS hardly used on s390 until now?

> 
> Similar problem can also happen when zero_user_segment() call from
> xfs_vm_writepage() (or block_write_full_page() for that matter) set the
> hardware dirty bit during writeback, later buffers get freed, and then page
> unmapped.
> 
> Fix the issue by ignoring s390 HW dirty bit for page cache pages in
> page_mkclean() and page_remove_rmap(). This is safe because when a page gets
> marked as writeable in PTE it is also marked dirty in do_wp_page() or
> do_page_fault(). When the dirty bit is cleared by clear_page_dirty_for_io(),
> the page gets writeprotected in page_mkclean(). So pagecache page is writeable
> if and only if it is dirty.

Very interesting patch...

> 
> CC: Martin Schwidefsky <schwidefsky@de.ibm.com>

which I'd very much like Martin's opinion on...

> CC: Mel Gorman <mgorman@suse.de>

and I'm grateful to Mel's ack for reawakening me to it...

> CC: linux-s390@vger.kernel.org
> Signed-off-by: Jan Kara <jack@suse.cz>

but I think it's wrong.

> ---
>  mm/rmap.c |   16 ++++++++++++++--
>  1 files changed, 14 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 0f3b7cd..6ce8ddb 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -973,7 +973,15 @@ int page_mkclean(struct page *page)
>  		struct address_space *mapping = page_mapping(page);
>  		if (mapping) {
>  			ret = page_mkclean_file(mapping, page);
> -			if (page_test_and_clear_dirty(page_to_pfn(page), 1))
> +			/*
> +			 * We ignore dirty bit for pagecache pages. It is safe
> +			 * as page is marked dirty iff it is writeable (page is
> +			 * marked as dirty when it is made writeable and
> +			 * clear_page_dirty_for_io() writeprotects the page
> +			 * again).
> +			 */
> +			if (PageSwapCache(page) &&
> +			    page_test_and_clear_dirty(page_to_pfn(page), 1))
>  				ret = 1;

This part you could cut out: page_mkclean() is not used on SwapCache pages.
I believe you are safe to remove the page_test_and_clear_dirty() from here.

>  		}
>  	}
> @@ -1183,8 +1191,12 @@ void page_remove_rmap(struct page *page)
>  	 * this if the page is anon, so about to be freed; but perhaps
>  	 * not if it's in swapcache - there might be another pte slot
>  	 * containing the swap entry, but page not yet written to swap.
> +	 * For pagecache pages, we don't care about dirty bit in storage
> +	 * key because the page is writeable iff it is dirty (page is marked
> +	 * as dirty when it is made writeable and clear_page_dirty_for_io()
> +	 * writeprotects the page again).
>  	 */
> -	if ((!anon || PageSwapCache(page)) &&
> +	if (PageSwapCache(page) &&
>  	    page_test_and_clear_dirty(page_to_pfn(page), 1))
>  		set_page_dirty(page);

But here's where I think the problem is.  You're assuming that all
filesystems go the same mapping_cap_account_writeback_dirty() (yeah,
there's no such function, just a confusing maze of three) route as XFS.

But filesystems like tmpfs and ramfs (perhaps they're the only two
that matter here) don't participate in that, and wait for an mmap'ed
page to be seen modified by the user (usually via pte_dirty, but that's
a no-op on s390) before page is marked dirty; and page reclaim throws
away undirtied pages.

So, if I'm understanding right, with this change s390 would be in danger
of discarding shm, and mmap'ed tmpfs and ramfs pages - whereas pages
written with the write system call would already be PageDirty and secure.

You mention above that even the kernel writing to the page would mark
the s390 storage key dirty.  I think that means that these shm and
tmpfs and ramfs pages would all have dirty storage keys just from the
clear_highpage() used to prepare them originally, and so would have
been found dirty anyway by the existing code here in page_remove_rmap(),
even though other architectures would regard them as clean and removable.

If that's the case, then maybe we'd do better just to mark them dirty
when faulted in the s390 case.  Then your patch above should (I think)
be safe.  Though I'd then be VERY tempted to adjust the SwapCache case
too (I've not thought through exactly what that patch would be, just
one or two suitably placed SetPageDirtys, I think), and eliminate
page_test_and_clear_dirty() altogether - no tears shed by any of us!

A separate worry came to mind as I thought about your patch: where
in page migration is s390's dirty storage key migrated from old page
to new?  And if there is a problem there, that too should be fixed
by what I propose in the previous paragraph.

Hugh

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH] mm: Fix XFS oops due to dirty pages without buffers on s390
  2012-10-09  4:24   ` Hugh Dickins
  (?)
@ 2012-10-09  8:18     ` Martin Schwidefsky
  -1 siblings, 0 replies; 61+ messages in thread
From: Martin Schwidefsky @ 2012-10-09  8:18 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Jan Kara, linux-mm, LKML, xfs, Mel Gorman, linux-s390

On Mon, 8 Oct 2012 21:24:40 -0700 (PDT)
Hugh Dickins <hughd@google.com> wrote:

> On Mon, 1 Oct 2012, Jan Kara wrote:
> 
> > On s390 any write to a page (even from kernel itself) sets architecture
> > specific page dirty bit. Thus when a page is written to via standard write, HW
> > dirty bit gets set and when we later map and unmap the page, page_remove_rmap()
> > finds the dirty bit and calls set_page_dirty().
> > 
> > Dirtying of a page which shouldn't be dirty can cause all sorts of problems to
> > filesystems. The bug we observed in practice is that buffers from the page get
> > freed, so when the page gets later marked as dirty and writeback writes it, XFS
> > crashes due to an assertion BUG_ON(!PagePrivate(page)) in page_buffers() called
> > from xfs_count_page_state().
> 
> What changed recently?  Was XFS hardly used on s390 until now?

One thing that changed is that the zero_user_segment for the remaining bytes between
i_size and the end of the page has been moved to block_write_full_page_endio, see
git commit eebd2aa355692afa. That changed the timing of the race window in regard
to map/unmap of the page by user space. And yes XFS is in use on s390.
 
> > 
> > Similar problem can also happen when zero_user_segment() call from
> > xfs_vm_writepage() (or block_write_full_page() for that matter) set the
> > hardware dirty bit during writeback, later buffers get freed, and then page
> > unmapped.
> > 
> > Fix the issue by ignoring s390 HW dirty bit for page cache pages in
> > page_mkclean() and page_remove_rmap(). This is safe because when a page gets
> > marked as writeable in PTE it is also marked dirty in do_wp_page() or
> > do_page_fault(). When the dirty bit is cleared by clear_page_dirty_for_io(),
> > the page gets writeprotected in page_mkclean(). So pagecache page is writeable
> > if and only if it is dirty.
> 
> Very interesting patch...

Yes, it is an interesting idea. I really like the part that we'll use less storage
key operations, as these are freaking expensive.

> > 
> > CC: Martin Schwidefsky <schwidefsky@de.ibm.com>
> 
> which I'd very much like Martin's opinion on...

Until you pointed out the short-comings of the patch I really liked it ..

> > ---
> >  mm/rmap.c |   16 ++++++++++++++--
> >  1 files changed, 14 insertions(+), 2 deletions(-)
> > 
> > diff --git a/mm/rmap.c b/mm/rmap.c
> > index 0f3b7cd..6ce8ddb 100644
> > --- a/mm/rmap.c
> > +++ b/mm/rmap.c
> > @@ -973,7 +973,15 @@ int page_mkclean(struct page *page)
> >  		struct address_space *mapping = page_mapping(page);
> >  		if (mapping) {
> >  			ret = page_mkclean_file(mapping, page);
> > -			if (page_test_and_clear_dirty(page_to_pfn(page), 1))
> > +			/*
> > +			 * We ignore dirty bit for pagecache pages. It is safe
> > +			 * as page is marked dirty iff it is writeable (page is
> > +			 * marked as dirty when it is made writeable and
> > +			 * clear_page_dirty_for_io() writeprotects the page
> > +			 * again).
> > +			 */
> > +			if (PageSwapCache(page) &&
> > +			    page_test_and_clear_dirty(page_to_pfn(page), 1))
> >  				ret = 1;
> 
> This part you could cut out: page_mkclean() is not used on SwapCache pages.
> I believe you are safe to remove the page_test_and_clear_dirty() from here.

Hmm, who guarantees that page_mkclean won't be used for SwapCache in the
future? At least we should add a comment there.

> >  		}
> >  	}
> > @@ -1183,8 +1191,12 @@ void page_remove_rmap(struct page *page)
> >  	 * this if the page is anon, so about to be freed; but perhaps
> >  	 * not if it's in swapcache - there might be another pte slot
> >  	 * containing the swap entry, but page not yet written to swap.
> > +	 * For pagecache pages, we don't care about dirty bit in storage
> > +	 * key because the page is writeable iff it is dirty (page is marked
> > +	 * as dirty when it is made writeable and clear_page_dirty_for_io()
> > +	 * writeprotects the page again).
> >  	 */
> > -	if ((!anon || PageSwapCache(page)) &&
> > +	if (PageSwapCache(page) &&
> >  	    page_test_and_clear_dirty(page_to_pfn(page), 1))
> >  		set_page_dirty(page);
> 
> But here's where I think the problem is.  You're assuming that all
> filesystems go the same mapping_cap_account_writeback_dirty() (yeah,
> there's no such function, just a confusing maze of three) route as XFS.
> 
> But filesystems like tmpfs and ramfs (perhaps they're the only two
> that matter here) don't participate in that, and wait for an mmap'ed
> page to be seen modified by the user (usually via pte_dirty, but that's
> a no-op on s390) before page is marked dirty; and page reclaim throws
> away undirtied pages.
>
> So, if I'm understanding right, with this change s390 would be in danger
> of discarding shm, and mmap'ed tmpfs and ramfs pages - whereas pages
> written with the write system call would already be PageDirty and secure.

The patch relies on the software dirty bit tracking for file backed pages,
if dirty bit tracking is not done for tmpfs and ramfs we are borked.
 
> You mention above that even the kernel writing to the page would mark
> the s390 storage key dirty.  I think that means that these shm and
> tmpfs and ramfs pages would all have dirty storage keys just from the
> clear_highpage() used to prepare them originally, and so would have
> been found dirty anyway by the existing code here in page_remove_rmap(),
> even though other architectures would regard them as clean and removable.

No, the clear_highpage() will set the dirty bit in the storage key but
the SetPageUptodate will clear the complete storage key including the
dirty bit.
 
> If that's the case, then maybe we'd do better just to mark them dirty
> when faulted in the s390 case.  Then your patch above should (I think)
> be safe.  Though I'd then be VERY tempted to adjust the SwapCache case
> too (I've not thought through exactly what that patch would be, just
> one or two suitably placed SetPageDirtys, I think), and eliminate
> page_test_and_clear_dirty() altogether - no tears shed by any of us!

I am seriously tempted to switch to pure software dirty bits by using
page protection for writable but clean pages. The worry is the number of
additional protection faults we would get. But as we do software dirty
bit tracking for the most part anyway this might not be as bad as it
used to be.

> A separate worry came to mind as I thought about your patch: where
> in page migration is s390's dirty storage key migrated from old page
> to new?  And if there is a problem there, that too should be fixed
> by what I propose in the previous paragraph.

That is covered by the SetPageUptodate() in migrate_page_copy().

-- 
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.


^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH] mm: Fix XFS oops due to dirty pages without buffers on s390
@ 2012-10-09  8:18     ` Martin Schwidefsky
  0 siblings, 0 replies; 61+ messages in thread
From: Martin Schwidefsky @ 2012-10-09  8:18 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: linux-s390, Jan Kara, LKML, xfs, linux-mm, Mel Gorman

On Mon, 8 Oct 2012 21:24:40 -0700 (PDT)
Hugh Dickins <hughd@google.com> wrote:

> On Mon, 1 Oct 2012, Jan Kara wrote:
> 
> > On s390 any write to a page (even from kernel itself) sets architecture
> > specific page dirty bit. Thus when a page is written to via standard write, HW
> > dirty bit gets set and when we later map and unmap the page, page_remove_rmap()
> > finds the dirty bit and calls set_page_dirty().
> > 
> > Dirtying of a page which shouldn't be dirty can cause all sorts of problems to
> > filesystems. The bug we observed in practice is that buffers from the page get
> > freed, so when the page gets later marked as dirty and writeback writes it, XFS
> > crashes due to an assertion BUG_ON(!PagePrivate(page)) in page_buffers() called
> > from xfs_count_page_state().
> 
> What changed recently?  Was XFS hardly used on s390 until now?

One thing that changed is that the zero_user_segment for the remaining bytes between
i_size and the end of the page has been moved to block_write_full_page_endio, see
git commit eebd2aa355692afa. That changed the timing of the race window in regard
to map/unmap of the page by user space. And yes XFS is in use on s390.
 
> > 
> > Similar problem can also happen when zero_user_segment() call from
> > xfs_vm_writepage() (or block_write_full_page() for that matter) set the
> > hardware dirty bit during writeback, later buffers get freed, and then page
> > unmapped.
> > 
> > Fix the issue by ignoring s390 HW dirty bit for page cache pages in
> > page_mkclean() and page_remove_rmap(). This is safe because when a page gets
> > marked as writeable in PTE it is also marked dirty in do_wp_page() or
> > do_page_fault(). When the dirty bit is cleared by clear_page_dirty_for_io(),
> > the page gets writeprotected in page_mkclean(). So pagecache page is writeable
> > if and only if it is dirty.
> 
> Very interesting patch...

Yes, it is an interesting idea. I really like the part that we'll use less storage
key operations, as these are freaking expensive.

> > 
> > CC: Martin Schwidefsky <schwidefsky@de.ibm.com>
> 
> which I'd very much like Martin's opinion on...

Until you pointed out the short-comings of the patch I really liked it ..

> > ---
> >  mm/rmap.c |   16 ++++++++++++++--
> >  1 files changed, 14 insertions(+), 2 deletions(-)
> > 
> > diff --git a/mm/rmap.c b/mm/rmap.c
> > index 0f3b7cd..6ce8ddb 100644
> > --- a/mm/rmap.c
> > +++ b/mm/rmap.c
> > @@ -973,7 +973,15 @@ int page_mkclean(struct page *page)
> >  		struct address_space *mapping = page_mapping(page);
> >  		if (mapping) {
> >  			ret = page_mkclean_file(mapping, page);
> > -			if (page_test_and_clear_dirty(page_to_pfn(page), 1))
> > +			/*
> > +			 * We ignore dirty bit for pagecache pages. It is safe
> > +			 * as page is marked dirty iff it is writeable (page is
> > +			 * marked as dirty when it is made writeable and
> > +			 * clear_page_dirty_for_io() writeprotects the page
> > +			 * again).
> > +			 */
> > +			if (PageSwapCache(page) &&
> > +			    page_test_and_clear_dirty(page_to_pfn(page), 1))
> >  				ret = 1;
> 
> This part you could cut out: page_mkclean() is not used on SwapCache pages.
> I believe you are safe to remove the page_test_and_clear_dirty() from here.

Hmm, who guarantees that page_mkclean won't be used for SwapCache in the
future? At least we should add a comment there.

> >  		}
> >  	}
> > @@ -1183,8 +1191,12 @@ void page_remove_rmap(struct page *page)
> >  	 * this if the page is anon, so about to be freed; but perhaps
> >  	 * not if it's in swapcache - there might be another pte slot
> >  	 * containing the swap entry, but page not yet written to swap.
> > +	 * For pagecache pages, we don't care about dirty bit in storage
> > +	 * key because the page is writeable iff it is dirty (page is marked
> > +	 * as dirty when it is made writeable and clear_page_dirty_for_io()
> > +	 * writeprotects the page again).
> >  	 */
> > -	if ((!anon || PageSwapCache(page)) &&
> > +	if (PageSwapCache(page) &&
> >  	    page_test_and_clear_dirty(page_to_pfn(page), 1))
> >  		set_page_dirty(page);
> 
> But here's where I think the problem is.  You're assuming that all
> filesystems go the same mapping_cap_account_writeback_dirty() (yeah,
> there's no such function, just a confusing maze of three) route as XFS.
> 
> But filesystems like tmpfs and ramfs (perhaps they're the only two
> that matter here) don't participate in that, and wait for an mmap'ed
> page to be seen modified by the user (usually via pte_dirty, but that's
> a no-op on s390) before page is marked dirty; and page reclaim throws
> away undirtied pages.
>
> So, if I'm understanding right, with this change s390 would be in danger
> of discarding shm, and mmap'ed tmpfs and ramfs pages - whereas pages
> written with the write system call would already be PageDirty and secure.

The patch relies on the software dirty bit tracking for file backed pages,
if dirty bit tracking is not done for tmpfs and ramfs we are borked.
 
> You mention above that even the kernel writing to the page would mark
> the s390 storage key dirty.  I think that means that these shm and
> tmpfs and ramfs pages would all have dirty storage keys just from the
> clear_highpage() used to prepare them originally, and so would have
> been found dirty anyway by the existing code here in page_remove_rmap(),
> even though other architectures would regard them as clean and removable.

No, the clear_highpage() will set the dirty bit in the storage key but
the SetPageUptodate will clear the complete storage key including the
dirty bit.
 
> If that's the case, then maybe we'd do better just to mark them dirty
> when faulted in the s390 case.  Then your patch above should (I think)
> be safe.  Though I'd then be VERY tempted to adjust the SwapCache case
> too (I've not thought through exactly what that patch would be, just
> one or two suitably placed SetPageDirtys, I think), and eliminate
> page_test_and_clear_dirty() altogether - no tears shed by any of us!

I am seriously tempted to switch to pure software dirty bits by using
page protection for writable but clean pages. The worry is the number of
additional protection faults we would get. But as we do software dirty
bit tracking for the most part anyway this might not be as bad as it
used to be.

> A separate worry came to mind as I thought about your patch: where
> in page migration is s390's dirty storage key migrated from old page
> to new?  And if there is a problem there, that too should be fixed
> by what I propose in the previous paragraph.

That is covered by the SetPageUptodate() in migrate_page_copy().

-- 
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH] mm: Fix XFS oops due to dirty pages without buffers on s390
@ 2012-10-09  8:18     ` Martin Schwidefsky
  0 siblings, 0 replies; 61+ messages in thread
From: Martin Schwidefsky @ 2012-10-09  8:18 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Jan Kara, linux-mm, LKML, xfs, Mel Gorman, linux-s390

On Mon, 8 Oct 2012 21:24:40 -0700 (PDT)
Hugh Dickins <hughd@google.com> wrote:

> On Mon, 1 Oct 2012, Jan Kara wrote:
> 
> > On s390 any write to a page (even from kernel itself) sets architecture
> > specific page dirty bit. Thus when a page is written to via standard write, HW
> > dirty bit gets set and when we later map and unmap the page, page_remove_rmap()
> > finds the dirty bit and calls set_page_dirty().
> > 
> > Dirtying of a page which shouldn't be dirty can cause all sorts of problems to
> > filesystems. The bug we observed in practice is that buffers from the page get
> > freed, so when the page gets later marked as dirty and writeback writes it, XFS
> > crashes due to an assertion BUG_ON(!PagePrivate(page)) in page_buffers() called
> > from xfs_count_page_state().
> 
> What changed recently?  Was XFS hardly used on s390 until now?

One thing that changed is that the zero_user_segment for the remaining bytes between
i_size and the end of the page has been moved to block_write_full_page_endio, see
git commit eebd2aa355692afa. That changed the timing of the race window in regard
to map/unmap of the page by user space. And yes XFS is in use on s390.
 
> > 
> > Similar problem can also happen when zero_user_segment() call from
> > xfs_vm_writepage() (or block_write_full_page() for that matter) set the
> > hardware dirty bit during writeback, later buffers get freed, and then page
> > unmapped.
> > 
> > Fix the issue by ignoring s390 HW dirty bit for page cache pages in
> > page_mkclean() and page_remove_rmap(). This is safe because when a page gets
> > marked as writeable in PTE it is also marked dirty in do_wp_page() or
> > do_page_fault(). When the dirty bit is cleared by clear_page_dirty_for_io(),
> > the page gets writeprotected in page_mkclean(). So pagecache page is writeable
> > if and only if it is dirty.
> 
> Very interesting patch...

Yes, it is an interesting idea. I really like the part that we'll use less storage
key operations, as these are freaking expensive.

> > 
> > CC: Martin Schwidefsky <schwidefsky@de.ibm.com>
> 
> which I'd very much like Martin's opinion on...

Until you pointed out the short-comings of the patch I really liked it ..

> > ---
> >  mm/rmap.c |   16 ++++++++++++++--
> >  1 files changed, 14 insertions(+), 2 deletions(-)
> > 
> > diff --git a/mm/rmap.c b/mm/rmap.c
> > index 0f3b7cd..6ce8ddb 100644
> > --- a/mm/rmap.c
> > +++ b/mm/rmap.c
> > @@ -973,7 +973,15 @@ int page_mkclean(struct page *page)
> >  		struct address_space *mapping = page_mapping(page);
> >  		if (mapping) {
> >  			ret = page_mkclean_file(mapping, page);
> > -			if (page_test_and_clear_dirty(page_to_pfn(page), 1))
> > +			/*
> > +			 * We ignore dirty bit for pagecache pages. It is safe
> > +			 * as page is marked dirty iff it is writeable (page is
> > +			 * marked as dirty when it is made writeable and
> > +			 * clear_page_dirty_for_io() writeprotects the page
> > +			 * again).
> > +			 */
> > +			if (PageSwapCache(page) &&
> > +			    page_test_and_clear_dirty(page_to_pfn(page), 1))
> >  				ret = 1;
> 
> This part you could cut out: page_mkclean() is not used on SwapCache pages.
> I believe you are safe to remove the page_test_and_clear_dirty() from here.

Hmm, who guarantees that page_mkclean won't be used for SwapCache in the
future? At least we should add a comment there.

> >  		}
> >  	}
> > @@ -1183,8 +1191,12 @@ void page_remove_rmap(struct page *page)
> >  	 * this if the page is anon, so about to be freed; but perhaps
> >  	 * not if it's in swapcache - there might be another pte slot
> >  	 * containing the swap entry, but page not yet written to swap.
> > +	 * For pagecache pages, we don't care about dirty bit in storage
> > +	 * key because the page is writeable iff it is dirty (page is marked
> > +	 * as dirty when it is made writeable and clear_page_dirty_for_io()
> > +	 * writeprotects the page again).
> >  	 */
> > -	if ((!anon || PageSwapCache(page)) &&
> > +	if (PageSwapCache(page) &&
> >  	    page_test_and_clear_dirty(page_to_pfn(page), 1))
> >  		set_page_dirty(page);
> 
> But here's where I think the problem is.  You're assuming that all
> filesystems go the same mapping_cap_account_writeback_dirty() (yeah,
> there's no such function, just a confusing maze of three) route as XFS.
> 
> But filesystems like tmpfs and ramfs (perhaps they're the only two
> that matter here) don't participate in that, and wait for an mmap'ed
> page to be seen modified by the user (usually via pte_dirty, but that's
> a no-op on s390) before page is marked dirty; and page reclaim throws
> away undirtied pages.
>
> So, if I'm understanding right, with this change s390 would be in danger
> of discarding shm, and mmap'ed tmpfs and ramfs pages - whereas pages
> written with the write system call would already be PageDirty and secure.

The patch relies on the software dirty bit tracking for file backed pages,
if dirty bit tracking is not done for tmpfs and ramfs we are borked.
 
> You mention above that even the kernel writing to the page would mark
> the s390 storage key dirty.  I think that means that these shm and
> tmpfs and ramfs pages would all have dirty storage keys just from the
> clear_highpage() used to prepare them originally, and so would have
> been found dirty anyway by the existing code here in page_remove_rmap(),
> even though other architectures would regard them as clean and removable.

No, the clear_highpage() will set the dirty bit in the storage key but
the SetPageUptodate will clear the complete storage key including the
dirty bit.
 
> If that's the case, then maybe we'd do better just to mark them dirty
> when faulted in the s390 case.  Then your patch above should (I think)
> be safe.  Though I'd then be VERY tempted to adjust the SwapCache case
> too (I've not thought through exactly what that patch would be, just
> one or two suitably placed SetPageDirtys, I think), and eliminate
> page_test_and_clear_dirty() altogether - no tears shed by any of us!

I am seriously tempted to switch to pure software dirty bits by using
page protection for writable but clean pages. The worry is the number of
additional protection faults we would get. But as we do software dirty
bit tracking for the most part anyway this might not be as bad as it
used to be.

> A separate worry came to mind as I thought about your patch: where
> in page migration is s390's dirty storage key migrated from old page
> to new?  And if there is a problem there, that too should be fixed
> by what I propose in the previous paragraph.

That is covered by the SetPageUptodate() in migrate_page_copy().

-- 
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH] mm: Fix XFS oops due to dirty pages without buffers on s390
  2012-10-09  4:24   ` Hugh Dickins
  (?)
@ 2012-10-09  9:32     ` Mel Gorman
  -1 siblings, 0 replies; 61+ messages in thread
From: Mel Gorman @ 2012-10-09  9:32 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Jan Kara, linux-mm, LKML, xfs, Martin Schwidefsky, linux-s390

On Mon, Oct 08, 2012 at 09:24:40PM -0700, Hugh Dickins wrote:
> > <SNIP>
> > CC: Mel Gorman <mgorman@suse.de>
> 
> and I'm grateful to Mel's ack for reawakening me to it...
> 
> > CC: linux-s390@vger.kernel.org
> > Signed-off-by: Jan Kara <jack@suse.cz>
> 
> but I think it's wrong.
> 

Dang.

> > ---
> >  mm/rmap.c |   16 ++++++++++++++--
> >  1 files changed, 14 insertions(+), 2 deletions(-)
> > 
> > diff --git a/mm/rmap.c b/mm/rmap.c
> > index 0f3b7cd..6ce8ddb 100644
> > --- a/mm/rmap.c
> > +++ b/mm/rmap.c
> > @@ -973,7 +973,15 @@ int page_mkclean(struct page *page)
> >  		struct address_space *mapping = page_mapping(page);
> >  		if (mapping) {
> >  			ret = page_mkclean_file(mapping, page);
> > -			if (page_test_and_clear_dirty(page_to_pfn(page), 1))
> > +			/*
> > +			 * We ignore dirty bit for pagecache pages. It is safe
> > +			 * as page is marked dirty iff it is writeable (page is
> > +			 * marked as dirty when it is made writeable and
> > +			 * clear_page_dirty_for_io() writeprotects the page
> > +			 * again).
> > +			 */
> > +			if (PageSwapCache(page) &&
> > +			    page_test_and_clear_dirty(page_to_pfn(page), 1))
> >  				ret = 1;
> 
> This part you could cut out: page_mkclean() is not used on SwapCache pages.
> I believe you are safe to remove the page_test_and_clear_dirty() from here.
> 
> >  		}
> >  	}
> > @@ -1183,8 +1191,12 @@ void page_remove_rmap(struct page *page)
> >  	 * this if the page is anon, so about to be freed; but perhaps
> >  	 * not if it's in swapcache - there might be another pte slot
> >  	 * containing the swap entry, but page not yet written to swap.
> > +	 * For pagecache pages, we don't care about dirty bit in storage
> > +	 * key because the page is writeable iff it is dirty (page is marked
> > +	 * as dirty when it is made writeable and clear_page_dirty_for_io()
> > +	 * writeprotects the page again).
> >  	 */
> > -	if ((!anon || PageSwapCache(page)) &&
> > +	if (PageSwapCache(page) &&
> >  	    page_test_and_clear_dirty(page_to_pfn(page), 1))
> >  		set_page_dirty(page);
> 
> But here's where I think the problem is.  You're assuming that all
> filesystems go the same mapping_cap_account_writeback_dirty() (yeah,
> there's no such function, just a confusing maze of three) route as XFS.
> 
> But filesystems like tmpfs and ramfs (perhaps they're the only two
> that matter here) don't participate in that, and wait for an mmap'ed
> page to be seen modified by the user (usually via pte_dirty, but that's
> a no-op on s390) before page is marked dirty; and page reclaim throws
> away undirtied pages.
> 
> So, if I'm understanding right, with this change s390 would be in danger
> of discarding shm, and mmap'ed tmpfs and ramfs pages - whereas pages
> written with the write system call would already be PageDirty and secure.
> 

In the case of ramfs, what marks the page clean so it could be discarded? It
does not participate in dirty accounting so it's not going to clear the
dirty flag in clear_page_dirty_for_io(). It doesn't have a writepage
handler that would use an end_io handler to clear the page after "IO"
completes. I am not seeing how a ramfs page can get discarded at the moment.

shm and tmpfs are indeed different and I did not take them into account
(ba dum tisch) when reviewing. For those pages would it be sufficient to
check the following?

PageSwapCache(page) || (page->mapping && !bdi_cap_account_dirty(page->mapping)

The problem the patch dealt with involved buffers associated with the page
and that shouldn't be a problem for tmpfs, right? I recognise that this
might work just because of co-incidence and set off your "Yuck" detector
and you'll prefer the proposed solution below.

> You mention above that even the kernel writing to the page would mark
> the s390 storage key dirty.  I think that means that these shm and
> tmpfs and ramfs pages would all have dirty storage keys just from the
> clear_highpage() used to prepare them originally, and so would have
> been found dirty anyway by the existing code here in page_remove_rmap(),
> even though other architectures would regard them as clean and removable.
> 
> If that's the case, then maybe we'd do better just to mark them dirty
> when faulted in the s390 case.  Then your patch above should (I think)
> be safe.  Though I'd then be VERY tempted to adjust the SwapCache case
> too (I've not thought through exactly what that patch would be, just
> one or two suitably placed SetPageDirtys, I think), and eliminate
> page_test_and_clear_dirty() altogether - no tears shed by any of us!
>  

Do you mean something like this?

diff --git a/mm/memory.c b/mm/memory.c
index 5736170..c66166f 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3316,7 +3316,20 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 		} else {
 			inc_mm_counter_fast(mm, MM_FILEPAGES);
 			page_add_file_rmap(page);
-			if (flags & FAULT_FLAG_WRITE) {
+
+			/*
+			 * s390 depends on the dirty flag from the storage key
+			 * being propagated when the page is unmapped from the
+			 * page tables. For dirty-accounted mapping, we instead
+			 * depend on the page being marked dirty on writes and
+			 * being write-protected on clear_page_dirty_for_io.
+			 * The same protection does not apply for tmpfs pages
+			 * that do not participate in dirty accounting so mark
+			 * them dirty at fault time to avoid the data being
+			 * lost
+			 */
+			if (flags & FAULT_FLAG_WRITE ||
+			    !bdi_cap_account_dirty(page->mapping)) {
 				dirty_page = page;
 				get_page(dirty_page);
 			}

Could something like this result in more writes to swap? Lets say there
is an unmapped tmpfs file with data on it -- a process maps it, reads the
entire mapping and exits. The page is now dirty and potentially will have
to be rewritten to swap. That seems bad. Did I miss your point?

> A separate worry came to mind as I thought about your patch: where
> in page migration is s390's dirty storage key migrated from old page
> to new?  And if there is a problem there, that too should be fixed
> by what I propose in the previous paragraph.
> 

hmm, very good question. It should have been checked in
migrate_page_copy() where it could be done under the page lock before
the PageDirty check. Martin?

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply related	[flat|nested] 61+ messages in thread

* Re: [PATCH] mm: Fix XFS oops due to dirty pages without buffers on s390
@ 2012-10-09  9:32     ` Mel Gorman
  0 siblings, 0 replies; 61+ messages in thread
From: Mel Gorman @ 2012-10-09  9:32 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: linux-s390, Jan Kara, LKML, xfs, linux-mm, Martin Schwidefsky

On Mon, Oct 08, 2012 at 09:24:40PM -0700, Hugh Dickins wrote:
> > <SNIP>
> > CC: Mel Gorman <mgorman@suse.de>
> 
> and I'm grateful to Mel's ack for reawakening me to it...
> 
> > CC: linux-s390@vger.kernel.org
> > Signed-off-by: Jan Kara <jack@suse.cz>
> 
> but I think it's wrong.
> 

Dang.

> > ---
> >  mm/rmap.c |   16 ++++++++++++++--
> >  1 files changed, 14 insertions(+), 2 deletions(-)
> > 
> > diff --git a/mm/rmap.c b/mm/rmap.c
> > index 0f3b7cd..6ce8ddb 100644
> > --- a/mm/rmap.c
> > +++ b/mm/rmap.c
> > @@ -973,7 +973,15 @@ int page_mkclean(struct page *page)
> >  		struct address_space *mapping = page_mapping(page);
> >  		if (mapping) {
> >  			ret = page_mkclean_file(mapping, page);
> > -			if (page_test_and_clear_dirty(page_to_pfn(page), 1))
> > +			/*
> > +			 * We ignore dirty bit for pagecache pages. It is safe
> > +			 * as page is marked dirty iff it is writeable (page is
> > +			 * marked as dirty when it is made writeable and
> > +			 * clear_page_dirty_for_io() writeprotects the page
> > +			 * again).
> > +			 */
> > +			if (PageSwapCache(page) &&
> > +			    page_test_and_clear_dirty(page_to_pfn(page), 1))
> >  				ret = 1;
> 
> This part you could cut out: page_mkclean() is not used on SwapCache pages.
> I believe you are safe to remove the page_test_and_clear_dirty() from here.
> 
> >  		}
> >  	}
> > @@ -1183,8 +1191,12 @@ void page_remove_rmap(struct page *page)
> >  	 * this if the page is anon, so about to be freed; but perhaps
> >  	 * not if it's in swapcache - there might be another pte slot
> >  	 * containing the swap entry, but page not yet written to swap.
> > +	 * For pagecache pages, we don't care about dirty bit in storage
> > +	 * key because the page is writeable iff it is dirty (page is marked
> > +	 * as dirty when it is made writeable and clear_page_dirty_for_io()
> > +	 * writeprotects the page again).
> >  	 */
> > -	if ((!anon || PageSwapCache(page)) &&
> > +	if (PageSwapCache(page) &&
> >  	    page_test_and_clear_dirty(page_to_pfn(page), 1))
> >  		set_page_dirty(page);
> 
> But here's where I think the problem is.  You're assuming that all
> filesystems go the same mapping_cap_account_writeback_dirty() (yeah,
> there's no such function, just a confusing maze of three) route as XFS.
> 
> But filesystems like tmpfs and ramfs (perhaps they're the only two
> that matter here) don't participate in that, and wait for an mmap'ed
> page to be seen modified by the user (usually via pte_dirty, but that's
> a no-op on s390) before page is marked dirty; and page reclaim throws
> away undirtied pages.
> 
> So, if I'm understanding right, with this change s390 would be in danger
> of discarding shm, and mmap'ed tmpfs and ramfs pages - whereas pages
> written with the write system call would already be PageDirty and secure.
> 

In the case of ramfs, what marks the page clean so it could be discarded? It
does not participate in dirty accounting so it's not going to clear the
dirty flag in clear_page_dirty_for_io(). It doesn't have a writepage
handler that would use an end_io handler to clear the page after "IO"
completes. I am not seeing how a ramfs page can get discarded at the moment.

shm and tmpfs are indeed different and I did not take them into account
(ba dum tisch) when reviewing. For those pages would it be sufficient to
check the following?

PageSwapCache(page) || (page->mapping && !bdi_cap_account_dirty(page->mapping)

The problem the patch dealt with involved buffers associated with the page
and that shouldn't be a problem for tmpfs, right? I recognise that this
might work just because of co-incidence and set off your "Yuck" detector
and you'll prefer the proposed solution below.

> You mention above that even the kernel writing to the page would mark
> the s390 storage key dirty.  I think that means that these shm and
> tmpfs and ramfs pages would all have dirty storage keys just from the
> clear_highpage() used to prepare them originally, and so would have
> been found dirty anyway by the existing code here in page_remove_rmap(),
> even though other architectures would regard them as clean and removable.
> 
> If that's the case, then maybe we'd do better just to mark them dirty
> when faulted in the s390 case.  Then your patch above should (I think)
> be safe.  Though I'd then be VERY tempted to adjust the SwapCache case
> too (I've not thought through exactly what that patch would be, just
> one or two suitably placed SetPageDirtys, I think), and eliminate
> page_test_and_clear_dirty() altogether - no tears shed by any of us!
>  

Do you mean something like this?

diff --git a/mm/memory.c b/mm/memory.c
index 5736170..c66166f 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3316,7 +3316,20 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 		} else {
 			inc_mm_counter_fast(mm, MM_FILEPAGES);
 			page_add_file_rmap(page);
-			if (flags & FAULT_FLAG_WRITE) {
+
+			/*
+			 * s390 depends on the dirty flag from the storage key
+			 * being propagated when the page is unmapped from the
+			 * page tables. For dirty-accounted mapping, we instead
+			 * depend on the page being marked dirty on writes and
+			 * being write-protected on clear_page_dirty_for_io.
+			 * The same protection does not apply for tmpfs pages
+			 * that do not participate in dirty accounting so mark
+			 * them dirty at fault time to avoid the data being
+			 * lost
+			 */
+			if (flags & FAULT_FLAG_WRITE ||
+			    !bdi_cap_account_dirty(page->mapping)) {
 				dirty_page = page;
 				get_page(dirty_page);
 			}

Could something like this result in more writes to swap? Lets say there
is an unmapped tmpfs file with data on it -- a process maps it, reads the
entire mapping and exits. The page is now dirty and potentially will have
to be rewritten to swap. That seems bad. Did I miss your point?

> A separate worry came to mind as I thought about your patch: where
> in page migration is s390's dirty storage key migrated from old page
> to new?  And if there is a problem there, that too should be fixed
> by what I propose in the previous paragraph.
> 

hmm, very good question. It should have been checked in
migrate_page_copy() where it could be done under the page lock before
the PageDirty check. Martin?

-- 
Mel Gorman
SUSE Labs

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 61+ messages in thread

* Re: [PATCH] mm: Fix XFS oops due to dirty pages without buffers on s390
@ 2012-10-09  9:32     ` Mel Gorman
  0 siblings, 0 replies; 61+ messages in thread
From: Mel Gorman @ 2012-10-09  9:32 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Jan Kara, linux-mm, LKML, xfs, Martin Schwidefsky, linux-s390

On Mon, Oct 08, 2012 at 09:24:40PM -0700, Hugh Dickins wrote:
> > <SNIP>
> > CC: Mel Gorman <mgorman@suse.de>
> 
> and I'm grateful to Mel's ack for reawakening me to it...
> 
> > CC: linux-s390@vger.kernel.org
> > Signed-off-by: Jan Kara <jack@suse.cz>
> 
> but I think it's wrong.
> 

Dang.

> > ---
> >  mm/rmap.c |   16 ++++++++++++++--
> >  1 files changed, 14 insertions(+), 2 deletions(-)
> > 
> > diff --git a/mm/rmap.c b/mm/rmap.c
> > index 0f3b7cd..6ce8ddb 100644
> > --- a/mm/rmap.c
> > +++ b/mm/rmap.c
> > @@ -973,7 +973,15 @@ int page_mkclean(struct page *page)
> >  		struct address_space *mapping = page_mapping(page);
> >  		if (mapping) {
> >  			ret = page_mkclean_file(mapping, page);
> > -			if (page_test_and_clear_dirty(page_to_pfn(page), 1))
> > +			/*
> > +			 * We ignore dirty bit for pagecache pages. It is safe
> > +			 * as page is marked dirty iff it is writeable (page is
> > +			 * marked as dirty when it is made writeable and
> > +			 * clear_page_dirty_for_io() writeprotects the page
> > +			 * again).
> > +			 */
> > +			if (PageSwapCache(page) &&
> > +			    page_test_and_clear_dirty(page_to_pfn(page), 1))
> >  				ret = 1;
> 
> This part you could cut out: page_mkclean() is not used on SwapCache pages.
> I believe you are safe to remove the page_test_and_clear_dirty() from here.
> 
> >  		}
> >  	}
> > @@ -1183,8 +1191,12 @@ void page_remove_rmap(struct page *page)
> >  	 * this if the page is anon, so about to be freed; but perhaps
> >  	 * not if it's in swapcache - there might be another pte slot
> >  	 * containing the swap entry, but page not yet written to swap.
> > +	 * For pagecache pages, we don't care about dirty bit in storage
> > +	 * key because the page is writeable iff it is dirty (page is marked
> > +	 * as dirty when it is made writeable and clear_page_dirty_for_io()
> > +	 * writeprotects the page again).
> >  	 */
> > -	if ((!anon || PageSwapCache(page)) &&
> > +	if (PageSwapCache(page) &&
> >  	    page_test_and_clear_dirty(page_to_pfn(page), 1))
> >  		set_page_dirty(page);
> 
> But here's where I think the problem is.  You're assuming that all
> filesystems go the same mapping_cap_account_writeback_dirty() (yeah,
> there's no such function, just a confusing maze of three) route as XFS.
> 
> But filesystems like tmpfs and ramfs (perhaps they're the only two
> that matter here) don't participate in that, and wait for an mmap'ed
> page to be seen modified by the user (usually via pte_dirty, but that's
> a no-op on s390) before page is marked dirty; and page reclaim throws
> away undirtied pages.
> 
> So, if I'm understanding right, with this change s390 would be in danger
> of discarding shm, and mmap'ed tmpfs and ramfs pages - whereas pages
> written with the write system call would already be PageDirty and secure.
> 

In the case of ramfs, what marks the page clean so it could be discarded? It
does not participate in dirty accounting so it's not going to clear the
dirty flag in clear_page_dirty_for_io(). It doesn't have a writepage
handler that would use an end_io handler to clear the page after "IO"
completes. I am not seeing how a ramfs page can get discarded at the moment.

shm and tmpfs are indeed different and I did not take them into account
(ba dum tisch) when reviewing. For those pages would it be sufficient to
check the following?

PageSwapCache(page) || (page->mapping && !bdi_cap_account_dirty(page->mapping)

The problem the patch dealt with involved buffers associated with the page
and that shouldn't be a problem for tmpfs, right? I recognise that this
might work just because of co-incidence and set off your "Yuck" detector
and you'll prefer the proposed solution below.

> You mention above that even the kernel writing to the page would mark
> the s390 storage key dirty.  I think that means that these shm and
> tmpfs and ramfs pages would all have dirty storage keys just from the
> clear_highpage() used to prepare them originally, and so would have
> been found dirty anyway by the existing code here in page_remove_rmap(),
> even though other architectures would regard them as clean and removable.
> 
> If that's the case, then maybe we'd do better just to mark them dirty
> when faulted in the s390 case.  Then your patch above should (I think)
> be safe.  Though I'd then be VERY tempted to adjust the SwapCache case
> too (I've not thought through exactly what that patch would be, just
> one or two suitably placed SetPageDirtys, I think), and eliminate
> page_test_and_clear_dirty() altogether - no tears shed by any of us!
>  

Do you mean something like this?

diff --git a/mm/memory.c b/mm/memory.c
index 5736170..c66166f 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3316,7 +3316,20 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 		} else {
 			inc_mm_counter_fast(mm, MM_FILEPAGES);
 			page_add_file_rmap(page);
-			if (flags & FAULT_FLAG_WRITE) {
+
+			/*
+			 * s390 depends on the dirty flag from the storage key
+			 * being propagated when the page is unmapped from the
+			 * page tables. For dirty-accounted mapping, we instead
+			 * depend on the page being marked dirty on writes and
+			 * being write-protected on clear_page_dirty_for_io.
+			 * The same protection does not apply for tmpfs pages
+			 * that do not participate in dirty accounting so mark
+			 * them dirty at fault time to avoid the data being
+			 * lost
+			 */
+			if (flags & FAULT_FLAG_WRITE ||
+			    !bdi_cap_account_dirty(page->mapping)) {
 				dirty_page = page;
 				get_page(dirty_page);
 			}

Could something like this result in more writes to swap? Lets say there
is an unmapped tmpfs file with data on it -- a process maps it, reads the
entire mapping and exits. The page is now dirty and potentially will have
to be rewritten to swap. That seems bad. Did I miss your point?

> A separate worry came to mind as I thought about your patch: where
> in page migration is s390's dirty storage key migrated from old page
> to new?  And if there is a problem there, that too should be fixed
> by what I propose in the previous paragraph.
> 

hmm, very good question. It should have been checked in
migrate_page_copy() where it could be done under the page lock before
the PageDirty check. Martin?

-- 
Mel Gorman
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 61+ messages in thread

* Re: [PATCH] mm: Fix XFS oops due to dirty pages without buffers on s390
  2012-10-09  4:24   ` Hugh Dickins
  (?)
@ 2012-10-09 16:21     ` Jan Kara
  -1 siblings, 0 replies; 61+ messages in thread
From: Jan Kara @ 2012-10-09 16:21 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Jan Kara, linux-mm, LKML, xfs, Martin Schwidefsky, Mel Gorman,
	linux-s390

On Mon 08-10-12 21:24:40, Hugh Dickins wrote:
> On Mon, 1 Oct 2012, Jan Kara wrote:
> 
> > On s390 any write to a page (even from kernel itself) sets architecture
> > specific page dirty bit. Thus when a page is written to via standard write, HW
> > dirty bit gets set and when we later map and unmap the page, page_remove_rmap()
> > finds the dirty bit and calls set_page_dirty().
> > 
> > Dirtying of a page which shouldn't be dirty can cause all sorts of problems to
> > filesystems. The bug we observed in practice is that buffers from the page get
> > freed, so when the page gets later marked as dirty and writeback writes it, XFS
> > crashes due to an assertion BUG_ON(!PagePrivate(page)) in page_buffers() called
> > from xfs_count_page_state().
> 
> What changed recently?  Was XFS hardly used on s390 until now?
  The problem was originally hit on SLE11-SP2 which is 3.0 based after
migration of our s390 build machines from SLE11-SP1 (2.6.32 based). I think
XFS just started to be more peevish about what pages it gets between these
two releases ;) (e.g. ext3 or ext4 just says "oh, well" and fixes things
up).

> > Similar problem can also happen when zero_user_segment() call from
> > xfs_vm_writepage() (or block_write_full_page() for that matter) set the
> > hardware dirty bit during writeback, later buffers get freed, and then page
> > unmapped.
> > 
> > Fix the issue by ignoring s390 HW dirty bit for page cache pages in
> > page_mkclean() and page_remove_rmap(). This is safe because when a page gets
> > marked as writeable in PTE it is also marked dirty in do_wp_page() or
> > do_page_fault(). When the dirty bit is cleared by clear_page_dirty_for_io(),
> > the page gets writeprotected in page_mkclean(). So pagecache page is writeable
> > if and only if it is dirty.
> 
> Very interesting patch...
  Originally, I even wanted to rip out pte dirty bit handling for shared
file pages but in the end that seemed too bold and unnecessary for my
problem ;)

> > CC: linux-s390@vger.kernel.org
> > Signed-off-by: Jan Kara <jack@suse.cz>
> 
> but I think it's wrong.
  Thanks for having a look.

> > ---
> >  mm/rmap.c |   16 ++++++++++++++--
> >  1 files changed, 14 insertions(+), 2 deletions(-)
> > 
> > diff --git a/mm/rmap.c b/mm/rmap.c
> > index 0f3b7cd..6ce8ddb 100644
> > --- a/mm/rmap.c
> > +++ b/mm/rmap.c
> > @@ -973,7 +973,15 @@ int page_mkclean(struct page *page)
> >  		struct address_space *mapping = page_mapping(page);
> >  		if (mapping) {
> >  			ret = page_mkclean_file(mapping, page);
> > -			if (page_test_and_clear_dirty(page_to_pfn(page), 1))
> > +			/*
> > +			 * We ignore dirty bit for pagecache pages. It is safe
> > +			 * as page is marked dirty iff it is writeable (page is
> > +			 * marked as dirty when it is made writeable and
> > +			 * clear_page_dirty_for_io() writeprotects the page
> > +			 * again).
> > +			 */
> > +			if (PageSwapCache(page) &&
> > +			    page_test_and_clear_dirty(page_to_pfn(page), 1))
> >  				ret = 1;
> 
> This part you could cut out: page_mkclean() is not used on SwapCache pages.
> I believe you are safe to remove the page_test_and_clear_dirty() from here.
  OK, will do.

> >  		}
> >  	}
> > @@ -1183,8 +1191,12 @@ void page_remove_rmap(struct page *page)
> >  	 * this if the page is anon, so about to be freed; but perhaps
> >  	 * not if it's in swapcache - there might be another pte slot
> >  	 * containing the swap entry, but page not yet written to swap.
> > +	 * For pagecache pages, we don't care about dirty bit in storage
> > +	 * key because the page is writeable iff it is dirty (page is marked
> > +	 * as dirty when it is made writeable and clear_page_dirty_for_io()
> > +	 * writeprotects the page again).
> >  	 */
> > -	if ((!anon || PageSwapCache(page)) &&
> > +	if (PageSwapCache(page) &&
> >  	    page_test_and_clear_dirty(page_to_pfn(page), 1))
> >  		set_page_dirty(page);
> 
> But here's where I think the problem is.  You're assuming that all
> filesystems go the same mapping_cap_account_writeback_dirty() (yeah,
> there's no such function, just a confusing maze of three) route as XFS.
> 
> But filesystems like tmpfs and ramfs (perhaps they're the only two
> that matter here) don't participate in that, and wait for an mmap'ed
> page to be seen modified by the user (usually via pte_dirty, but that's
> a no-op on s390) before page is marked dirty; and page reclaim throws
> away undirtied pages.
  I admit I haven't thought of tmpfs and similar. After some discussion Mel
pointed me to the code in mmap which makes a difference. So if I get it
right, the difference which causes us problems is that on tmpfs we map the
page writeably even during read-only fault. OK, then if I make the above
code in page_remove_rmap():
	if ((PageSwapCache(page) ||
	     (!anon && !mapping_cap_account_dirty(page->mapping))) &&
	    page_test_and_clear_dirty(page_to_pfn(page), 1))
		set_page_dirty(page);

  Things should be ok (modulo the ugliness of this condition), right?

> So, if I'm understanding right, with this change s390 would be in danger
> of discarding shm, and mmap'ed tmpfs and ramfs pages - whereas pages
> written with the write system call would already be PageDirty and secure.
> 
> You mention above that even the kernel writing to the page would mark
> the s390 storage key dirty.  I think that means that these shm and
> tmpfs and ramfs pages would all have dirty storage keys just from the
> clear_highpage() used to prepare them originally, and so would have
> been found dirty anyway by the existing code here in page_remove_rmap(),
> even though other architectures would regard them as clean and removable.
  Yes, except as Martin notes, SetPageUptodate() clears them again so that
doesn't work for us.

> If that's the case, then maybe we'd do better just to mark them dirty
> when faulted in the s390 case.  Then your patch above should (I think)
> be safe.  Though I'd then be VERY tempted to adjust the SwapCache case
> too (I've not thought through exactly what that patch would be, just
> one or two suitably placed SetPageDirtys, I think), and eliminate
> page_test_and_clear_dirty() altogether - no tears shed by any of us!
  If we want to get rid of page_test_and_clear_dirty() completely (and a
hack in SetPageUptodate()) it should be possible. But we would have to
change mmap to map pages read-only for read-only faults of tmpfs pages at
least on s390 and then somehow fix the SwapCache handling...

> A separate worry came to mind as I thought about your patch: where
> in page migration is s390's dirty storage key migrated from old page
> to new?  And if there is a problem there, that too should be fixed
> by what I propose in the previous paragraph.
  I'd think so but I'll let Martin comment on this.

								Honza

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH] mm: Fix XFS oops due to dirty pages without buffers on s390
@ 2012-10-09 16:21     ` Jan Kara
  0 siblings, 0 replies; 61+ messages in thread
From: Jan Kara @ 2012-10-09 16:21 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: linux-s390, Jan Kara, LKML, xfs, linux-mm, Mel Gorman,
	Martin Schwidefsky

On Mon 08-10-12 21:24:40, Hugh Dickins wrote:
> On Mon, 1 Oct 2012, Jan Kara wrote:
> 
> > On s390 any write to a page (even from kernel itself) sets architecture
> > specific page dirty bit. Thus when a page is written to via standard write, HW
> > dirty bit gets set and when we later map and unmap the page, page_remove_rmap()
> > finds the dirty bit and calls set_page_dirty().
> > 
> > Dirtying of a page which shouldn't be dirty can cause all sorts of problems to
> > filesystems. The bug we observed in practice is that buffers from the page get
> > freed, so when the page gets later marked as dirty and writeback writes it, XFS
> > crashes due to an assertion BUG_ON(!PagePrivate(page)) in page_buffers() called
> > from xfs_count_page_state().
> 
> What changed recently?  Was XFS hardly used on s390 until now?
  The problem was originally hit on SLE11-SP2 which is 3.0 based after
migration of our s390 build machines from SLE11-SP1 (2.6.32 based). I think
XFS just started to be more peevish about what pages it gets between these
two releases ;) (e.g. ext3 or ext4 just says "oh, well" and fixes things
up).

> > Similar problem can also happen when zero_user_segment() call from
> > xfs_vm_writepage() (or block_write_full_page() for that matter) set the
> > hardware dirty bit during writeback, later buffers get freed, and then page
> > unmapped.
> > 
> > Fix the issue by ignoring s390 HW dirty bit for page cache pages in
> > page_mkclean() and page_remove_rmap(). This is safe because when a page gets
> > marked as writeable in PTE it is also marked dirty in do_wp_page() or
> > do_page_fault(). When the dirty bit is cleared by clear_page_dirty_for_io(),
> > the page gets writeprotected in page_mkclean(). So pagecache page is writeable
> > if and only if it is dirty.
> 
> Very interesting patch...
  Originally, I even wanted to rip out pte dirty bit handling for shared
file pages but in the end that seemed too bold and unnecessary for my
problem ;)

> > CC: linux-s390@vger.kernel.org
> > Signed-off-by: Jan Kara <jack@suse.cz>
> 
> but I think it's wrong.
  Thanks for having a look.

> > ---
> >  mm/rmap.c |   16 ++++++++++++++--
> >  1 files changed, 14 insertions(+), 2 deletions(-)
> > 
> > diff --git a/mm/rmap.c b/mm/rmap.c
> > index 0f3b7cd..6ce8ddb 100644
> > --- a/mm/rmap.c
> > +++ b/mm/rmap.c
> > @@ -973,7 +973,15 @@ int page_mkclean(struct page *page)
> >  		struct address_space *mapping = page_mapping(page);
> >  		if (mapping) {
> >  			ret = page_mkclean_file(mapping, page);
> > -			if (page_test_and_clear_dirty(page_to_pfn(page), 1))
> > +			/*
> > +			 * We ignore dirty bit for pagecache pages. It is safe
> > +			 * as page is marked dirty iff it is writeable (page is
> > +			 * marked as dirty when it is made writeable and
> > +			 * clear_page_dirty_for_io() writeprotects the page
> > +			 * again).
> > +			 */
> > +			if (PageSwapCache(page) &&
> > +			    page_test_and_clear_dirty(page_to_pfn(page), 1))
> >  				ret = 1;
> 
> This part you could cut out: page_mkclean() is not used on SwapCache pages.
> I believe you are safe to remove the page_test_and_clear_dirty() from here.
  OK, will do.

> >  		}
> >  	}
> > @@ -1183,8 +1191,12 @@ void page_remove_rmap(struct page *page)
> >  	 * this if the page is anon, so about to be freed; but perhaps
> >  	 * not if it's in swapcache - there might be another pte slot
> >  	 * containing the swap entry, but page not yet written to swap.
> > +	 * For pagecache pages, we don't care about dirty bit in storage
> > +	 * key because the page is writeable iff it is dirty (page is marked
> > +	 * as dirty when it is made writeable and clear_page_dirty_for_io()
> > +	 * writeprotects the page again).
> >  	 */
> > -	if ((!anon || PageSwapCache(page)) &&
> > +	if (PageSwapCache(page) &&
> >  	    page_test_and_clear_dirty(page_to_pfn(page), 1))
> >  		set_page_dirty(page);
> 
> But here's where I think the problem is.  You're assuming that all
> filesystems go the same mapping_cap_account_writeback_dirty() (yeah,
> there's no such function, just a confusing maze of three) route as XFS.
> 
> But filesystems like tmpfs and ramfs (perhaps they're the only two
> that matter here) don't participate in that, and wait for an mmap'ed
> page to be seen modified by the user (usually via pte_dirty, but that's
> a no-op on s390) before page is marked dirty; and page reclaim throws
> away undirtied pages.
  I admit I haven't thought of tmpfs and similar. After some discussion Mel
pointed me to the code in mmap which makes a difference. So if I get it
right, the difference which causes us problems is that on tmpfs we map the
page writeably even during read-only fault. OK, then if I make the above
code in page_remove_rmap():
	if ((PageSwapCache(page) ||
	     (!anon && !mapping_cap_account_dirty(page->mapping))) &&
	    page_test_and_clear_dirty(page_to_pfn(page), 1))
		set_page_dirty(page);

  Things should be ok (modulo the ugliness of this condition), right?

> So, if I'm understanding right, with this change s390 would be in danger
> of discarding shm, and mmap'ed tmpfs and ramfs pages - whereas pages
> written with the write system call would already be PageDirty and secure.
> 
> You mention above that even the kernel writing to the page would mark
> the s390 storage key dirty.  I think that means that these shm and
> tmpfs and ramfs pages would all have dirty storage keys just from the
> clear_highpage() used to prepare them originally, and so would have
> been found dirty anyway by the existing code here in page_remove_rmap(),
> even though other architectures would regard them as clean and removable.
  Yes, except as Martin notes, SetPageUptodate() clears them again so that
doesn't work for us.

> If that's the case, then maybe we'd do better just to mark them dirty
> when faulted in the s390 case.  Then your patch above should (I think)
> be safe.  Though I'd then be VERY tempted to adjust the SwapCache case
> too (I've not thought through exactly what that patch would be, just
> one or two suitably placed SetPageDirtys, I think), and eliminate
> page_test_and_clear_dirty() altogether - no tears shed by any of us!
  If we want to get rid of page_test_and_clear_dirty() completely (and a
hack in SetPageUptodate()) it should be possible. But we would have to
change mmap to map pages read-only for read-only faults of tmpfs pages at
least on s390 and then somehow fix the SwapCache handling...

> A separate worry came to mind as I thought about your patch: where
> in page migration is s390's dirty storage key migrated from old page
> to new?  And if there is a problem there, that too should be fixed
> by what I propose in the previous paragraph.
  I'd think so but I'll let Martin comment on this.

								Honza

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH] mm: Fix XFS oops due to dirty pages without buffers on s390
@ 2012-10-09 16:21     ` Jan Kara
  0 siblings, 0 replies; 61+ messages in thread
From: Jan Kara @ 2012-10-09 16:21 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Jan Kara, linux-mm, LKML, xfs, Martin Schwidefsky, Mel Gorman,
	linux-s390

On Mon 08-10-12 21:24:40, Hugh Dickins wrote:
> On Mon, 1 Oct 2012, Jan Kara wrote:
> 
> > On s390 any write to a page (even from kernel itself) sets architecture
> > specific page dirty bit. Thus when a page is written to via standard write, HW
> > dirty bit gets set and when we later map and unmap the page, page_remove_rmap()
> > finds the dirty bit and calls set_page_dirty().
> > 
> > Dirtying of a page which shouldn't be dirty can cause all sorts of problems to
> > filesystems. The bug we observed in practice is that buffers from the page get
> > freed, so when the page gets later marked as dirty and writeback writes it, XFS
> > crashes due to an assertion BUG_ON(!PagePrivate(page)) in page_buffers() called
> > from xfs_count_page_state().
> 
> What changed recently?  Was XFS hardly used on s390 until now?
  The problem was originally hit on SLE11-SP2 which is 3.0 based after
migration of our s390 build machines from SLE11-SP1 (2.6.32 based). I think
XFS just started to be more peevish about what pages it gets between these
two releases ;) (e.g. ext3 or ext4 just says "oh, well" and fixes things
up).

> > Similar problem can also happen when zero_user_segment() call from
> > xfs_vm_writepage() (or block_write_full_page() for that matter) set the
> > hardware dirty bit during writeback, later buffers get freed, and then page
> > unmapped.
> > 
> > Fix the issue by ignoring s390 HW dirty bit for page cache pages in
> > page_mkclean() and page_remove_rmap(). This is safe because when a page gets
> > marked as writeable in PTE it is also marked dirty in do_wp_page() or
> > do_page_fault(). When the dirty bit is cleared by clear_page_dirty_for_io(),
> > the page gets writeprotected in page_mkclean(). So pagecache page is writeable
> > if and only if it is dirty.
> 
> Very interesting patch...
  Originally, I even wanted to rip out pte dirty bit handling for shared
file pages but in the end that seemed too bold and unnecessary for my
problem ;)

> > CC: linux-s390@vger.kernel.org
> > Signed-off-by: Jan Kara <jack@suse.cz>
> 
> but I think it's wrong.
  Thanks for having a look.

> > ---
> >  mm/rmap.c |   16 ++++++++++++++--
> >  1 files changed, 14 insertions(+), 2 deletions(-)
> > 
> > diff --git a/mm/rmap.c b/mm/rmap.c
> > index 0f3b7cd..6ce8ddb 100644
> > --- a/mm/rmap.c
> > +++ b/mm/rmap.c
> > @@ -973,7 +973,15 @@ int page_mkclean(struct page *page)
> >  		struct address_space *mapping = page_mapping(page);
> >  		if (mapping) {
> >  			ret = page_mkclean_file(mapping, page);
> > -			if (page_test_and_clear_dirty(page_to_pfn(page), 1))
> > +			/*
> > +			 * We ignore dirty bit for pagecache pages. It is safe
> > +			 * as page is marked dirty iff it is writeable (page is
> > +			 * marked as dirty when it is made writeable and
> > +			 * clear_page_dirty_for_io() writeprotects the page
> > +			 * again).
> > +			 */
> > +			if (PageSwapCache(page) &&
> > +			    page_test_and_clear_dirty(page_to_pfn(page), 1))
> >  				ret = 1;
> 
> This part you could cut out: page_mkclean() is not used on SwapCache pages.
> I believe you are safe to remove the page_test_and_clear_dirty() from here.
  OK, will do.

> >  		}
> >  	}
> > @@ -1183,8 +1191,12 @@ void page_remove_rmap(struct page *page)
> >  	 * this if the page is anon, so about to be freed; but perhaps
> >  	 * not if it's in swapcache - there might be another pte slot
> >  	 * containing the swap entry, but page not yet written to swap.
> > +	 * For pagecache pages, we don't care about dirty bit in storage
> > +	 * key because the page is writeable iff it is dirty (page is marked
> > +	 * as dirty when it is made writeable and clear_page_dirty_for_io()
> > +	 * writeprotects the page again).
> >  	 */
> > -	if ((!anon || PageSwapCache(page)) &&
> > +	if (PageSwapCache(page) &&
> >  	    page_test_and_clear_dirty(page_to_pfn(page), 1))
> >  		set_page_dirty(page);
> 
> But here's where I think the problem is.  You're assuming that all
> filesystems go the same mapping_cap_account_writeback_dirty() (yeah,
> there's no such function, just a confusing maze of three) route as XFS.
> 
> But filesystems like tmpfs and ramfs (perhaps they're the only two
> that matter here) don't participate in that, and wait for an mmap'ed
> page to be seen modified by the user (usually via pte_dirty, but that's
> a no-op on s390) before page is marked dirty; and page reclaim throws
> away undirtied pages.
  I admit I haven't thought of tmpfs and similar. After some discussion Mel
pointed me to the code in mmap which makes a difference. So if I get it
right, the difference which causes us problems is that on tmpfs we map the
page writeably even during read-only fault. OK, then if I make the above
code in page_remove_rmap():
	if ((PageSwapCache(page) ||
	     (!anon && !mapping_cap_account_dirty(page->mapping))) &&
	    page_test_and_clear_dirty(page_to_pfn(page), 1))
		set_page_dirty(page);

  Things should be ok (modulo the ugliness of this condition), right?

> So, if I'm understanding right, with this change s390 would be in danger
> of discarding shm, and mmap'ed tmpfs and ramfs pages - whereas pages
> written with the write system call would already be PageDirty and secure.
> 
> You mention above that even the kernel writing to the page would mark
> the s390 storage key dirty.  I think that means that these shm and
> tmpfs and ramfs pages would all have dirty storage keys just from the
> clear_highpage() used to prepare them originally, and so would have
> been found dirty anyway by the existing code here in page_remove_rmap(),
> even though other architectures would regard them as clean and removable.
  Yes, except as Martin notes, SetPageUptodate() clears them again so that
doesn't work for us.

> If that's the case, then maybe we'd do better just to mark them dirty
> when faulted in the s390 case.  Then your patch above should (I think)
> be safe.  Though I'd then be VERY tempted to adjust the SwapCache case
> too (I've not thought through exactly what that patch would be, just
> one or two suitably placed SetPageDirtys, I think), and eliminate
> page_test_and_clear_dirty() altogether - no tears shed by any of us!
  If we want to get rid of page_test_and_clear_dirty() completely (and a
hack in SetPageUptodate()) it should be possible. But we would have to
change mmap to map pages read-only for read-only faults of tmpfs pages at
least on s390 and then somehow fix the SwapCache handling...

> A separate worry came to mind as I thought about your patch: where
> in page migration is s390's dirty storage key migrated from old page
> to new?  And if there is a problem there, that too should be fixed
> by what I propose in the previous paragraph.
  I'd think so but I'll let Martin comment on this.

								Honza

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH] mm: Fix XFS oops due to dirty pages without buffers on s390
  2012-10-09  9:32     ` Mel Gorman
  (?)
@ 2012-10-09 23:00       ` Hugh Dickins
  -1 siblings, 0 replies; 61+ messages in thread
From: Hugh Dickins @ 2012-10-09 23:00 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Jan Kara, linux-mm, LKML, xfs, Martin Schwidefsky, linux-s390

On Tue, 9 Oct 2012, Mel Gorman wrote:
> On Mon, Oct 08, 2012 at 09:24:40PM -0700, Hugh Dickins wrote:
> > 
> > So, if I'm understanding right, with this change s390 would be in danger
> > of discarding shm, and mmap'ed tmpfs and ramfs pages - whereas pages
> > written with the write system call would already be PageDirty and secure.
> > 
> 
> In the case of ramfs, what marks the page clean so it could be discarded? It
> does not participate in dirty accounting so it's not going to clear the
> dirty flag in clear_page_dirty_for_io(). It doesn't have a writepage
> handler that would use an end_io handler to clear the page after "IO"
> completes. I am not seeing how a ramfs page can get discarded at the moment.

But we don't have a page clean bit: we have a page dirty bit, and where
is that set in the ramfs read-fault case?  I've not experimented to check,
maybe you're right and ramfs is exempt from the issue.  I thought it was
__do_fault() which does the set_page_dirty, but only if FAULT_FLAG_WRITE.
Ah, you quote almost the very place further down.

> 
> shm and tmpfs are indeed different and I did not take them into account
> (ba dum tisch) when reviewing. For those pages would it be sufficient to
> check the following?
> 
> PageSwapCache(page) || (page->mapping && !bdi_cap_account_dirty(page->mapping)

Something like that, yes: I've a possible patch I'll put in reply to Jan.

> 
> The problem the patch dealt with involved buffers associated with the page
> and that shouldn't be a problem for tmpfs, right?

Right, though I'm now beginning to wonder what the underlying bug is.
It seems to me that we have a bug and an optimization on our hands,
and have rushed into the optimization which would avoid the bug,
without considering what the actual bug is.  More in reply to Jan.

> I recognise that this
> might work just because of co-incidence and set off your "Yuck" detector
> and you'll prefer the proposed solution below.

No, I was mistaken to think that s390 would have dirty pages where
others had clean, Martin has now explained that SetPageUptodate cleans.
I didn't mind continuing an (imagined) inefficiency in s390, but I don't
want to make it more inefficient.

> 
> > You mention above that even the kernel writing to the page would mark
> > the s390 storage key dirty.  I think that means that these shm and
> > tmpfs and ramfs pages would all have dirty storage keys just from the
> > clear_highpage() used to prepare them originally, and so would have
> > been found dirty anyway by the existing code here in page_remove_rmap(),
> > even though other architectures would regard them as clean and removable.
> > 
> > If that's the case, then maybe we'd do better just to mark them dirty
> > when faulted in the s390 case.  Then your patch above should (I think)
> > be safe.  Though I'd then be VERY tempted to adjust the SwapCache case
> > too (I've not thought through exactly what that patch would be, just
> > one or two suitably placed SetPageDirtys, I think), and eliminate
> > page_test_and_clear_dirty() altogether - no tears shed by any of us!

So that fantasy was all wrong: appealing, but wrong.

> >  
> 
> Do you mean something like this?
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index 5736170..c66166f 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3316,7 +3316,20 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>  		} else {
>  			inc_mm_counter_fast(mm, MM_FILEPAGES);
>  			page_add_file_rmap(page);
> -			if (flags & FAULT_FLAG_WRITE) {
> +
> +			/*
> +			 * s390 depends on the dirty flag from the storage key
> +			 * being propagated when the page is unmapped from the
> +			 * page tables. For dirty-accounted mapping, we instead
> +			 * depend on the page being marked dirty on writes and
> +			 * being write-protected on clear_page_dirty_for_io.
> +			 * The same protection does not apply for tmpfs pages
> +			 * that do not participate in dirty accounting so mark
> +			 * them dirty at fault time to avoid the data being
> +			 * lost
> +			 */
> +			if (flags & FAULT_FLAG_WRITE ||
> +			    !bdi_cap_account_dirty(page->mapping)) {
>  				dirty_page = page;
>  				get_page(dirty_page);
>  			}
> 
> Could something like this result in more writes to swap? Lets say there
> is an unmapped tmpfs file with data on it -- a process maps it, reads the
> entire mapping and exits. The page is now dirty and potentially will have
> to be rewritten to swap. That seems bad. Did I miss your point?

My point was that I mistakenly thought s390 must already be behaving
like that, so wanted it to continue that way, but with cleaner source.

But the CONFIG_S390 in SetPageUptodate makes sure that the zeroed page
starts out storage-key-clean: so you're exactly right, my suggestion
would result in more writes to swap for it, which is not acceptable.

(Plus, having insisted that ramfs is also affected, I went on
to forget that, and was imagining a simple change in mm/shmem.c.)

Hugh

> 
> > A separate worry came to mind as I thought about your patch: where
> > in page migration is s390's dirty storage key migrated from old page
> > to new?  And if there is a problem there, that too should be fixed
> > by what I propose in the previous paragraph.
> > 
> 
> hmm, very good question. It should have been checked in
> migrate_page_copy() where it could be done under the page lock before
> the PageDirty check. Martin?
> 
> -- 
> Mel Gorman
> SUSE Labs

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH] mm: Fix XFS oops due to dirty pages without buffers on s390
@ 2012-10-09 23:00       ` Hugh Dickins
  0 siblings, 0 replies; 61+ messages in thread
From: Hugh Dickins @ 2012-10-09 23:00 UTC (permalink / raw)
  To: Mel Gorman; +Cc: linux-s390, Jan Kara, LKML, xfs, linux-mm, Martin Schwidefsky

On Tue, 9 Oct 2012, Mel Gorman wrote:
> On Mon, Oct 08, 2012 at 09:24:40PM -0700, Hugh Dickins wrote:
> > 
> > So, if I'm understanding right, with this change s390 would be in danger
> > of discarding shm, and mmap'ed tmpfs and ramfs pages - whereas pages
> > written with the write system call would already be PageDirty and secure.
> > 
> 
> In the case of ramfs, what marks the page clean so it could be discarded? It
> does not participate in dirty accounting so it's not going to clear the
> dirty flag in clear_page_dirty_for_io(). It doesn't have a writepage
> handler that would use an end_io handler to clear the page after "IO"
> completes. I am not seeing how a ramfs page can get discarded at the moment.

But we don't have a page clean bit: we have a page dirty bit, and where
is that set in the ramfs read-fault case?  I've not experimented to check,
maybe you're right and ramfs is exempt from the issue.  I thought it was
__do_fault() which does the set_page_dirty, but only if FAULT_FLAG_WRITE.
Ah, you quote almost the very place further down.

> 
> shm and tmpfs are indeed different and I did not take them into account
> (ba dum tisch) when reviewing. For those pages would it be sufficient to
> check the following?
> 
> PageSwapCache(page) || (page->mapping && !bdi_cap_account_dirty(page->mapping)

Something like that, yes: I've a possible patch I'll put in reply to Jan.

> 
> The problem the patch dealt with involved buffers associated with the page
> and that shouldn't be a problem for tmpfs, right?

Right, though I'm now beginning to wonder what the underlying bug is.
It seems to me that we have a bug and an optimization on our hands,
and have rushed into the optimization which would avoid the bug,
without considering what the actual bug is.  More in reply to Jan.

> I recognise that this
> might work just because of co-incidence and set off your "Yuck" detector
> and you'll prefer the proposed solution below.

No, I was mistaken to think that s390 would have dirty pages where
others had clean, Martin has now explained that SetPageUptodate cleans.
I didn't mind continuing an (imagined) inefficiency in s390, but I don't
want to make it more inefficient.

> 
> > You mention above that even the kernel writing to the page would mark
> > the s390 storage key dirty.  I think that means that these shm and
> > tmpfs and ramfs pages would all have dirty storage keys just from the
> > clear_highpage() used to prepare them originally, and so would have
> > been found dirty anyway by the existing code here in page_remove_rmap(),
> > even though other architectures would regard them as clean and removable.
> > 
> > If that's the case, then maybe we'd do better just to mark them dirty
> > when faulted in the s390 case.  Then your patch above should (I think)
> > be safe.  Though I'd then be VERY tempted to adjust the SwapCache case
> > too (I've not thought through exactly what that patch would be, just
> > one or two suitably placed SetPageDirtys, I think), and eliminate
> > page_test_and_clear_dirty() altogether - no tears shed by any of us!

So that fantasy was all wrong: appealing, but wrong.

> >  
> 
> Do you mean something like this?
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index 5736170..c66166f 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3316,7 +3316,20 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>  		} else {
>  			inc_mm_counter_fast(mm, MM_FILEPAGES);
>  			page_add_file_rmap(page);
> -			if (flags & FAULT_FLAG_WRITE) {
> +
> +			/*
> +			 * s390 depends on the dirty flag from the storage key
> +			 * being propagated when the page is unmapped from the
> +			 * page tables. For dirty-accounted mapping, we instead
> +			 * depend on the page being marked dirty on writes and
> +			 * being write-protected on clear_page_dirty_for_io.
> +			 * The same protection does not apply for tmpfs pages
> +			 * that do not participate in dirty accounting so mark
> +			 * them dirty at fault time to avoid the data being
> +			 * lost
> +			 */
> +			if (flags & FAULT_FLAG_WRITE ||
> +			    !bdi_cap_account_dirty(page->mapping)) {
>  				dirty_page = page;
>  				get_page(dirty_page);
>  			}
> 
> Could something like this result in more writes to swap? Lets say there
> is an unmapped tmpfs file with data on it -- a process maps it, reads the
> entire mapping and exits. The page is now dirty and potentially will have
> to be rewritten to swap. That seems bad. Did I miss your point?

My point was that I mistakenly thought s390 must already be behaving
like that, so wanted it to continue that way, but with cleaner source.

But the CONFIG_S390 in SetPageUptodate makes sure that the zeroed page
starts out storage-key-clean: so you're exactly right, my suggestion
would result in more writes to swap for it, which is not acceptable.

(Plus, having insisted that ramfs is also affected, I went on
to forget that, and was imagining a simple change in mm/shmem.c.)

Hugh

> 
> > A separate worry came to mind as I thought about your patch: where
> > in page migration is s390's dirty storage key migrated from old page
> > to new?  And if there is a problem there, that too should be fixed
> > by what I propose in the previous paragraph.
> > 
> 
> hmm, very good question. It should have been checked in
> migrate_page_copy() where it could be done under the page lock before
> the PageDirty check. Martin?
> 
> -- 
> Mel Gorman
> SUSE Labs

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH] mm: Fix XFS oops due to dirty pages without buffers on s390
@ 2012-10-09 23:00       ` Hugh Dickins
  0 siblings, 0 replies; 61+ messages in thread
From: Hugh Dickins @ 2012-10-09 23:00 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Jan Kara, linux-mm, LKML, xfs, Martin Schwidefsky, linux-s390

On Tue, 9 Oct 2012, Mel Gorman wrote:
> On Mon, Oct 08, 2012 at 09:24:40PM -0700, Hugh Dickins wrote:
> > 
> > So, if I'm understanding right, with this change s390 would be in danger
> > of discarding shm, and mmap'ed tmpfs and ramfs pages - whereas pages
> > written with the write system call would already be PageDirty and secure.
> > 
> 
> In the case of ramfs, what marks the page clean so it could be discarded? It
> does not participate in dirty accounting so it's not going to clear the
> dirty flag in clear_page_dirty_for_io(). It doesn't have a writepage
> handler that would use an end_io handler to clear the page after "IO"
> completes. I am not seeing how a ramfs page can get discarded at the moment.

But we don't have a page clean bit: we have a page dirty bit, and where
is that set in the ramfs read-fault case?  I've not experimented to check,
maybe you're right and ramfs is exempt from the issue.  I thought it was
__do_fault() which does the set_page_dirty, but only if FAULT_FLAG_WRITE.
Ah, you quote almost the very place further down.

> 
> shm and tmpfs are indeed different and I did not take them into account
> (ba dum tisch) when reviewing. For those pages would it be sufficient to
> check the following?
> 
> PageSwapCache(page) || (page->mapping && !bdi_cap_account_dirty(page->mapping)

Something like that, yes: I've a possible patch I'll put in reply to Jan.

> 
> The problem the patch dealt with involved buffers associated with the page
> and that shouldn't be a problem for tmpfs, right?

Right, though I'm now beginning to wonder what the underlying bug is.
It seems to me that we have a bug and an optimization on our hands,
and have rushed into the optimization which would avoid the bug,
without considering what the actual bug is.  More in reply to Jan.

> I recognise that this
> might work just because of co-incidence and set off your "Yuck" detector
> and you'll prefer the proposed solution below.

No, I was mistaken to think that s390 would have dirty pages where
others had clean, Martin has now explained that SetPageUptodate cleans.
I didn't mind continuing an (imagined) inefficiency in s390, but I don't
want to make it more inefficient.

> 
> > You mention above that even the kernel writing to the page would mark
> > the s390 storage key dirty.  I think that means that these shm and
> > tmpfs and ramfs pages would all have dirty storage keys just from the
> > clear_highpage() used to prepare them originally, and so would have
> > been found dirty anyway by the existing code here in page_remove_rmap(),
> > even though other architectures would regard them as clean and removable.
> > 
> > If that's the case, then maybe we'd do better just to mark them dirty
> > when faulted in the s390 case.  Then your patch above should (I think)
> > be safe.  Though I'd then be VERY tempted to adjust the SwapCache case
> > too (I've not thought through exactly what that patch would be, just
> > one or two suitably placed SetPageDirtys, I think), and eliminate
> > page_test_and_clear_dirty() altogether - no tears shed by any of us!

So that fantasy was all wrong: appealing, but wrong.

> >  
> 
> Do you mean something like this?
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index 5736170..c66166f 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3316,7 +3316,20 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>  		} else {
>  			inc_mm_counter_fast(mm, MM_FILEPAGES);
>  			page_add_file_rmap(page);
> -			if (flags & FAULT_FLAG_WRITE) {
> +
> +			/*
> +			 * s390 depends on the dirty flag from the storage key
> +			 * being propagated when the page is unmapped from the
> +			 * page tables. For dirty-accounted mapping, we instead
> +			 * depend on the page being marked dirty on writes and
> +			 * being write-protected on clear_page_dirty_for_io.
> +			 * The same protection does not apply for tmpfs pages
> +			 * that do not participate in dirty accounting so mark
> +			 * them dirty at fault time to avoid the data being
> +			 * lost
> +			 */
> +			if (flags & FAULT_FLAG_WRITE ||
> +			    !bdi_cap_account_dirty(page->mapping)) {
>  				dirty_page = page;
>  				get_page(dirty_page);
>  			}
> 
> Could something like this result in more writes to swap? Lets say there
> is an unmapped tmpfs file with data on it -- a process maps it, reads the
> entire mapping and exits. The page is now dirty and potentially will have
> to be rewritten to swap. That seems bad. Did I miss your point?

My point was that I mistakenly thought s390 must already be behaving
like that, so wanted it to continue that way, but with cleaner source.

But the CONFIG_S390 in SetPageUptodate makes sure that the zeroed page
starts out storage-key-clean: so you're exactly right, my suggestion
would result in more writes to swap for it, which is not acceptable.

(Plus, having insisted that ramfs is also affected, I went on
to forget that, and was imagining a simple change in mm/shmem.c.)

Hugh

> 
> > A separate worry came to mind as I thought about your patch: where
> > in page migration is s390's dirty storage key migrated from old page
> > to new?  And if there is a problem there, that too should be fixed
> > by what I propose in the previous paragraph.
> > 
> 
> hmm, very good question. It should have been checked in
> migrate_page_copy() where it could be done under the page lock before
> the PageDirty check. Martin?
> 
> -- 
> Mel Gorman
> SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH] mm: Fix XFS oops due to dirty pages without buffers on s390
  2012-10-09  8:18     ` Martin Schwidefsky
  (?)
@ 2012-10-09 23:21       ` Hugh Dickins
  -1 siblings, 0 replies; 61+ messages in thread
From: Hugh Dickins @ 2012-10-09 23:21 UTC (permalink / raw)
  To: Martin Schwidefsky; +Cc: Jan Kara, linux-mm, LKML, xfs, Mel Gorman, linux-s390

On Tue, 9 Oct 2012, Martin Schwidefsky wrote:
> On Mon, 8 Oct 2012 21:24:40 -0700 (PDT)
> Hugh Dickins <hughd@google.com> wrote:
> > On Mon, 1 Oct 2012, Jan Kara wrote:
> > 
> > > On s390 any write to a page (even from kernel itself) sets architecture
> > > specific page dirty bit. Thus when a page is written to via standard write, HW
> > > dirty bit gets set and when we later map and unmap the page, page_remove_rmap()
> > > finds the dirty bit and calls set_page_dirty().
> > > 
> > > Dirtying of a page which shouldn't be dirty can cause all sorts of problems to
> > > filesystems. The bug we observed in practice is that buffers from the page get
> > > freed, so when the page gets later marked as dirty and writeback writes it, XFS
> > > crashes due to an assertion BUG_ON(!PagePrivate(page)) in page_buffers() called
> > > from xfs_count_page_state().
> > 
> > What changed recently?  Was XFS hardly used on s390 until now?
> 
> One thing that changed is that the zero_user_segment for the remaining bytes between
> i_size and the end of the page has been moved to block_write_full_page_endio, see
> git commit eebd2aa355692afa. That changed the timing of the race window in regard
> to map/unmap of the page by user space. And yes XFS is in use on s390.

February 2008: I think we have different ideas of "recently" ;)

>  
> > > 
> > > Similar problem can also happen when zero_user_segment() call from
> > > xfs_vm_writepage() (or block_write_full_page() for that matter) set the
> > > hardware dirty bit during writeback, later buffers get freed, and then page
> > > unmapped.
> > > 
> > > Fix the issue by ignoring s390 HW dirty bit for page cache pages in
> > > page_mkclean() and page_remove_rmap(). This is safe because when a page gets
> > > marked as writeable in PTE it is also marked dirty in do_wp_page() or
> > > do_page_fault(). When the dirty bit is cleared by clear_page_dirty_for_io(),
> > > the page gets writeprotected in page_mkclean(). So pagecache page is writeable
> > > if and only if it is dirty.
> > 
> > Very interesting patch...
> 
> Yes, it is an interesting idea. I really like the part that we'll use less storage
> key operations, as these are freaking expensive.

As I said to Mel and will repeat to Jan, though an optimization would
be nice, I don't think we should necessarily mix it with the bugfix.

> 
> > > 
> > > CC: Martin Schwidefsky <schwidefsky@de.ibm.com>
> > 
> > which I'd very much like Martin's opinion on...
> 
> Until you pointed out the short-comings of the patch I really liked it ..
> 
> > > ---
> > >  mm/rmap.c |   16 ++++++++++++++--
> > >  1 files changed, 14 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/mm/rmap.c b/mm/rmap.c
> > > index 0f3b7cd..6ce8ddb 100644
> > > --- a/mm/rmap.c
> > > +++ b/mm/rmap.c
> > > @@ -973,7 +973,15 @@ int page_mkclean(struct page *page)
> > >  		struct address_space *mapping = page_mapping(page);
> > >  		if (mapping) {
> > >  			ret = page_mkclean_file(mapping, page);
> > > -			if (page_test_and_clear_dirty(page_to_pfn(page), 1))
> > > +			/*
> > > +			 * We ignore dirty bit for pagecache pages. It is safe
> > > +			 * as page is marked dirty iff it is writeable (page is
> > > +			 * marked as dirty when it is made writeable and
> > > +			 * clear_page_dirty_for_io() writeprotects the page
> > > +			 * again).
> > > +			 */
> > > +			if (PageSwapCache(page) &&
> > > +			    page_test_and_clear_dirty(page_to_pfn(page), 1))
> > >  				ret = 1;
> > 
> > This part you could cut out: page_mkclean() is not used on SwapCache pages.
> > I believe you are safe to remove the page_test_and_clear_dirty() from here.
> 
> Hmm, who guarantees that page_mkclean won't be used for SwapCache in the
> future? At least we should add a comment there.

I set out to do so, to add a comment there; but honestly, it's a strange
place for such a comment when there's no longer even the code to comment
upon.  And page_mkclean_file(), called in the line above, already says
BUG_ON(PageAnon(page)), so it would soon fire if we ever make a change
that sends PageSwapCache pages this way.  It is possible that one day we
shall want to send tmpfs and swapcache down this route, I'm not ruling
that out; but then we shall have to extend page_mkclean(), yes.

> 
> The patch relies on the software dirty bit tracking for file backed pages,
> if dirty bit tracking is not done for tmpfs and ramfs we are borked.
>  
> > You mention above that even the kernel writing to the page would mark
> > the s390 storage key dirty.  I think that means that these shm and
> > tmpfs and ramfs pages would all have dirty storage keys just from the
> > clear_highpage() used to prepare them originally, and so would have
> > been found dirty anyway by the existing code here in page_remove_rmap(),
> > even though other architectures would regard them as clean and removable.
> 
> No, the clear_highpage() will set the dirty bit in the storage key but
> the SetPageUptodate will clear the complete storage key including the
> dirty bit.

Ah, thank you Martin, that clears that up...

>  
> > If that's the case, then maybe we'd do better just to mark them dirty
> > when faulted in the s390 case.  Then your patch above should (I think)
> > be safe.  Though I'd then be VERY tempted to adjust the SwapCache case
> > too (I've not thought through exactly what that patch would be, just
> > one or two suitably placed SetPageDirtys, I think), and eliminate
> > page_test_and_clear_dirty() altogether - no tears shed by any of us!

... so I should not hurt your performance with a change of that kind.

> 
> I am seriously tempted to switch to pure software dirty bits by using
> page protection for writable but clean pages. The worry is the number of
> additional protection faults we would get. But as we do software dirty
> bit tracking for the most part anyway this might not be as bad as it
> used to be.

That's exactly the same reason why tmpfs opts out of dirty tracking, fear
of unnecessary extra faults.  Anomalous as s390 is here, tmpfs is being
anomalous too, and I'd be a hypocrite to push for you to make that change.

> 
> > A separate worry came to mind as I thought about your patch: where
> > in page migration is s390's dirty storage key migrated from old page
> > to new?  And if there is a problem there, that too should be fixed
> > by what I propose in the previous paragraph.
> 
> That is covered by the SetPageUptodate() in migrate_page_copy().,> 

I don't think so: that makes sure that the newpage is not marked
dirty in storage key just because of the copy_highpage to it; but
I see nothing to mark the newpage dirty in storage key when the
old page was dirty there.

Hugh

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH] mm: Fix XFS oops due to dirty pages without buffers on s390
@ 2012-10-09 23:21       ` Hugh Dickins
  0 siblings, 0 replies; 61+ messages in thread
From: Hugh Dickins @ 2012-10-09 23:21 UTC (permalink / raw)
  To: Martin Schwidefsky; +Cc: linux-s390, Jan Kara, LKML, xfs, linux-mm, Mel Gorman

On Tue, 9 Oct 2012, Martin Schwidefsky wrote:
> On Mon, 8 Oct 2012 21:24:40 -0700 (PDT)
> Hugh Dickins <hughd@google.com> wrote:
> > On Mon, 1 Oct 2012, Jan Kara wrote:
> > 
> > > On s390 any write to a page (even from kernel itself) sets architecture
> > > specific page dirty bit. Thus when a page is written to via standard write, HW
> > > dirty bit gets set and when we later map and unmap the page, page_remove_rmap()
> > > finds the dirty bit and calls set_page_dirty().
> > > 
> > > Dirtying of a page which shouldn't be dirty can cause all sorts of problems to
> > > filesystems. The bug we observed in practice is that buffers from the page get
> > > freed, so when the page gets later marked as dirty and writeback writes it, XFS
> > > crashes due to an assertion BUG_ON(!PagePrivate(page)) in page_buffers() called
> > > from xfs_count_page_state().
> > 
> > What changed recently?  Was XFS hardly used on s390 until now?
> 
> One thing that changed is that the zero_user_segment for the remaining bytes between
> i_size and the end of the page has been moved to block_write_full_page_endio, see
> git commit eebd2aa355692afa. That changed the timing of the race window in regard
> to map/unmap of the page by user space. And yes XFS is in use on s390.

February 2008: I think we have different ideas of "recently" ;)

>  
> > > 
> > > Similar problem can also happen when zero_user_segment() call from
> > > xfs_vm_writepage() (or block_write_full_page() for that matter) set the
> > > hardware dirty bit during writeback, later buffers get freed, and then page
> > > unmapped.
> > > 
> > > Fix the issue by ignoring s390 HW dirty bit for page cache pages in
> > > page_mkclean() and page_remove_rmap(). This is safe because when a page gets
> > > marked as writeable in PTE it is also marked dirty in do_wp_page() or
> > > do_page_fault(). When the dirty bit is cleared by clear_page_dirty_for_io(),
> > > the page gets writeprotected in page_mkclean(). So pagecache page is writeable
> > > if and only if it is dirty.
> > 
> > Very interesting patch...
> 
> Yes, it is an interesting idea. I really like the part that we'll use less storage
> key operations, as these are freaking expensive.

As I said to Mel and will repeat to Jan, though an optimization would
be nice, I don't think we should necessarily mix it with the bugfix.

> 
> > > 
> > > CC: Martin Schwidefsky <schwidefsky@de.ibm.com>
> > 
> > which I'd very much like Martin's opinion on...
> 
> Until you pointed out the short-comings of the patch I really liked it ..
> 
> > > ---
> > >  mm/rmap.c |   16 ++++++++++++++--
> > >  1 files changed, 14 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/mm/rmap.c b/mm/rmap.c
> > > index 0f3b7cd..6ce8ddb 100644
> > > --- a/mm/rmap.c
> > > +++ b/mm/rmap.c
> > > @@ -973,7 +973,15 @@ int page_mkclean(struct page *page)
> > >  		struct address_space *mapping = page_mapping(page);
> > >  		if (mapping) {
> > >  			ret = page_mkclean_file(mapping, page);
> > > -			if (page_test_and_clear_dirty(page_to_pfn(page), 1))
> > > +			/*
> > > +			 * We ignore dirty bit for pagecache pages. It is safe
> > > +			 * as page is marked dirty iff it is writeable (page is
> > > +			 * marked as dirty when it is made writeable and
> > > +			 * clear_page_dirty_for_io() writeprotects the page
> > > +			 * again).
> > > +			 */
> > > +			if (PageSwapCache(page) &&
> > > +			    page_test_and_clear_dirty(page_to_pfn(page), 1))
> > >  				ret = 1;
> > 
> > This part you could cut out: page_mkclean() is not used on SwapCache pages.
> > I believe you are safe to remove the page_test_and_clear_dirty() from here.
> 
> Hmm, who guarantees that page_mkclean won't be used for SwapCache in the
> future? At least we should add a comment there.

I set out to do so, to add a comment there; but honestly, it's a strange
place for such a comment when there's no longer even the code to comment
upon.  And page_mkclean_file(), called in the line above, already says
BUG_ON(PageAnon(page)), so it would soon fire if we ever make a change
that sends PageSwapCache pages this way.  It is possible that one day we
shall want to send tmpfs and swapcache down this route, I'm not ruling
that out; but then we shall have to extend page_mkclean(), yes.

> 
> The patch relies on the software dirty bit tracking for file backed pages,
> if dirty bit tracking is not done for tmpfs and ramfs we are borked.
>  
> > You mention above that even the kernel writing to the page would mark
> > the s390 storage key dirty.  I think that means that these shm and
> > tmpfs and ramfs pages would all have dirty storage keys just from the
> > clear_highpage() used to prepare them originally, and so would have
> > been found dirty anyway by the existing code here in page_remove_rmap(),
> > even though other architectures would regard them as clean and removable.
> 
> No, the clear_highpage() will set the dirty bit in the storage key but
> the SetPageUptodate will clear the complete storage key including the
> dirty bit.

Ah, thank you Martin, that clears that up...

>  
> > If that's the case, then maybe we'd do better just to mark them dirty
> > when faulted in the s390 case.  Then your patch above should (I think)
> > be safe.  Though I'd then be VERY tempted to adjust the SwapCache case
> > too (I've not thought through exactly what that patch would be, just
> > one or two suitably placed SetPageDirtys, I think), and eliminate
> > page_test_and_clear_dirty() altogether - no tears shed by any of us!

... so I should not hurt your performance with a change of that kind.

> 
> I am seriously tempted to switch to pure software dirty bits by using
> page protection for writable but clean pages. The worry is the number of
> additional protection faults we would get. But as we do software dirty
> bit tracking for the most part anyway this might not be as bad as it
> used to be.

That's exactly the same reason why tmpfs opts out of dirty tracking, fear
of unnecessary extra faults.  Anomalous as s390 is here, tmpfs is being
anomalous too, and I'd be a hypocrite to push for you to make that change.

> 
> > A separate worry came to mind as I thought about your patch: where
> > in page migration is s390's dirty storage key migrated from old page
> > to new?  And if there is a problem there, that too should be fixed
> > by what I propose in the previous paragraph.
> 
> That is covered by the SetPageUptodate() in migrate_page_copy().,> 

I don't think so: that makes sure that the newpage is not marked
dirty in storage key just because of the copy_highpage to it; but
I see nothing to mark the newpage dirty in storage key when the
old page was dirty there.

Hugh

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH] mm: Fix XFS oops due to dirty pages without buffers on s390
@ 2012-10-09 23:21       ` Hugh Dickins
  0 siblings, 0 replies; 61+ messages in thread
From: Hugh Dickins @ 2012-10-09 23:21 UTC (permalink / raw)
  To: Martin Schwidefsky; +Cc: Jan Kara, linux-mm, LKML, xfs, Mel Gorman, linux-s390

On Tue, 9 Oct 2012, Martin Schwidefsky wrote:
> On Mon, 8 Oct 2012 21:24:40 -0700 (PDT)
> Hugh Dickins <hughd@google.com> wrote:
> > On Mon, 1 Oct 2012, Jan Kara wrote:
> > 
> > > On s390 any write to a page (even from kernel itself) sets architecture
> > > specific page dirty bit. Thus when a page is written to via standard write, HW
> > > dirty bit gets set and when we later map and unmap the page, page_remove_rmap()
> > > finds the dirty bit and calls set_page_dirty().
> > > 
> > > Dirtying of a page which shouldn't be dirty can cause all sorts of problems to
> > > filesystems. The bug we observed in practice is that buffers from the page get
> > > freed, so when the page gets later marked as dirty and writeback writes it, XFS
> > > crashes due to an assertion BUG_ON(!PagePrivate(page)) in page_buffers() called
> > > from xfs_count_page_state().
> > 
> > What changed recently?  Was XFS hardly used on s390 until now?
> 
> One thing that changed is that the zero_user_segment for the remaining bytes between
> i_size and the end of the page has been moved to block_write_full_page_endio, see
> git commit eebd2aa355692afa. That changed the timing of the race window in regard
> to map/unmap of the page by user space. And yes XFS is in use on s390.

February 2008: I think we have different ideas of "recently" ;)

>  
> > > 
> > > Similar problem can also happen when zero_user_segment() call from
> > > xfs_vm_writepage() (or block_write_full_page() for that matter) set the
> > > hardware dirty bit during writeback, later buffers get freed, and then page
> > > unmapped.
> > > 
> > > Fix the issue by ignoring s390 HW dirty bit for page cache pages in
> > > page_mkclean() and page_remove_rmap(). This is safe because when a page gets
> > > marked as writeable in PTE it is also marked dirty in do_wp_page() or
> > > do_page_fault(). When the dirty bit is cleared by clear_page_dirty_for_io(),
> > > the page gets writeprotected in page_mkclean(). So pagecache page is writeable
> > > if and only if it is dirty.
> > 
> > Very interesting patch...
> 
> Yes, it is an interesting idea. I really like the part that we'll use less storage
> key operations, as these are freaking expensive.

As I said to Mel and will repeat to Jan, though an optimization would
be nice, I don't think we should necessarily mix it with the bugfix.

> 
> > > 
> > > CC: Martin Schwidefsky <schwidefsky@de.ibm.com>
> > 
> > which I'd very much like Martin's opinion on...
> 
> Until you pointed out the short-comings of the patch I really liked it ..
> 
> > > ---
> > >  mm/rmap.c |   16 ++++++++++++++--
> > >  1 files changed, 14 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/mm/rmap.c b/mm/rmap.c
> > > index 0f3b7cd..6ce8ddb 100644
> > > --- a/mm/rmap.c
> > > +++ b/mm/rmap.c
> > > @@ -973,7 +973,15 @@ int page_mkclean(struct page *page)
> > >  		struct address_space *mapping = page_mapping(page);
> > >  		if (mapping) {
> > >  			ret = page_mkclean_file(mapping, page);
> > > -			if (page_test_and_clear_dirty(page_to_pfn(page), 1))
> > > +			/*
> > > +			 * We ignore dirty bit for pagecache pages. It is safe
> > > +			 * as page is marked dirty iff it is writeable (page is
> > > +			 * marked as dirty when it is made writeable and
> > > +			 * clear_page_dirty_for_io() writeprotects the page
> > > +			 * again).
> > > +			 */
> > > +			if (PageSwapCache(page) &&
> > > +			    page_test_and_clear_dirty(page_to_pfn(page), 1))
> > >  				ret = 1;
> > 
> > This part you could cut out: page_mkclean() is not used on SwapCache pages.
> > I believe you are safe to remove the page_test_and_clear_dirty() from here.
> 
> Hmm, who guarantees that page_mkclean won't be used for SwapCache in the
> future? At least we should add a comment there.

I set out to do so, to add a comment there; but honestly, it's a strange
place for such a comment when there's no longer even the code to comment
upon.  And page_mkclean_file(), called in the line above, already says
BUG_ON(PageAnon(page)), so it would soon fire if we ever make a change
that sends PageSwapCache pages this way.  It is possible that one day we
shall want to send tmpfs and swapcache down this route, I'm not ruling
that out; but then we shall have to extend page_mkclean(), yes.

> 
> The patch relies on the software dirty bit tracking for file backed pages,
> if dirty bit tracking is not done for tmpfs and ramfs we are borked.
>  
> > You mention above that even the kernel writing to the page would mark
> > the s390 storage key dirty.  I think that means that these shm and
> > tmpfs and ramfs pages would all have dirty storage keys just from the
> > clear_highpage() used to prepare them originally, and so would have
> > been found dirty anyway by the existing code here in page_remove_rmap(),
> > even though other architectures would regard them as clean and removable.
> 
> No, the clear_highpage() will set the dirty bit in the storage key but
> the SetPageUptodate will clear the complete storage key including the
> dirty bit.

Ah, thank you Martin, that clears that up...

>  
> > If that's the case, then maybe we'd do better just to mark them dirty
> > when faulted in the s390 case.  Then your patch above should (I think)
> > be safe.  Though I'd then be VERY tempted to adjust the SwapCache case
> > too (I've not thought through exactly what that patch would be, just
> > one or two suitably placed SetPageDirtys, I think), and eliminate
> > page_test_and_clear_dirty() altogether - no tears shed by any of us!

... so I should not hurt your performance with a change of that kind.

> 
> I am seriously tempted to switch to pure software dirty bits by using
> page protection for writable but clean pages. The worry is the number of
> additional protection faults we would get. But as we do software dirty
> bit tracking for the most part anyway this might not be as bad as it
> used to be.

That's exactly the same reason why tmpfs opts out of dirty tracking, fear
of unnecessary extra faults.  Anomalous as s390 is here, tmpfs is being
anomalous too, and I'd be a hypocrite to push for you to make that change.

> 
> > A separate worry came to mind as I thought about your patch: where
> > in page migration is s390's dirty storage key migrated from old page
> > to new?  And if there is a problem there, that too should be fixed
> > by what I propose in the previous paragraph.
> 
> That is covered by the SetPageUptodate() in migrate_page_copy().,> 

I don't think so: that makes sure that the newpage is not marked
dirty in storage key just because of the copy_highpage to it; but
I see nothing to mark the newpage dirty in storage key when the
old page was dirty there.

Hugh

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH] mm: Fix XFS oops due to dirty pages without buffers on s390
  2012-10-09 16:21     ` Jan Kara
  (?)
@ 2012-10-10  2:19       ` Hugh Dickins
  -1 siblings, 0 replies; 61+ messages in thread
From: Hugh Dickins @ 2012-10-10  2:19 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-mm, LKML, xfs, Martin Schwidefsky, Mel Gorman, linux-s390

On Tue, 9 Oct 2012, Jan Kara wrote:
> On Mon 08-10-12 21:24:40, Hugh Dickins wrote:
> > On Mon, 1 Oct 2012, Jan Kara wrote:
> > 
> > > On s390 any write to a page (even from kernel itself) sets architecture
> > > specific page dirty bit. Thus when a page is written to via standard write, HW
> > > dirty bit gets set and when we later map and unmap the page, page_remove_rmap()
> > > finds the dirty bit and calls set_page_dirty().
> > > 
> > > Dirtying of a page which shouldn't be dirty can cause all sorts of problems to
> > > filesystems. The bug we observed in practice is that buffers from the page get
> > > freed, so when the page gets later marked as dirty and writeback writes it, XFS
> > > crashes due to an assertion BUG_ON(!PagePrivate(page)) in page_buffers() called
> > > from xfs_count_page_state().
> > 
> > What changed recently?  Was XFS hardly used on s390 until now?
>   The problem was originally hit on SLE11-SP2 which is 3.0 based after
> migration of our s390 build machines from SLE11-SP1 (2.6.32 based). I think
> XFS just started to be more peevish about what pages it gets between these
> two releases ;) (e.g. ext3 or ext4 just says "oh, well" and fixes things
> up).

Right, in 2.6.32 xfs_vm_writepage() had a !page_has_buffers(page) case,
whereas by 3.0 that had become ASSERT(page_has_buffers(page)), with the
ASSERT usually compiled out, stumbling later in page_buffers() as you say.

> 
> > > Similar problem can also happen when zero_user_segment() call from
> > > xfs_vm_writepage() (or block_write_full_page() for that matter) set the
> > > hardware dirty bit during writeback, later buffers get freed, and then page
> > > unmapped.

Similar problem, or is that the whole of the problem?  Where else does
the page get written to, after clearing page dirty?  (It may not be worth
spending time to answer me, I feel I'm wasting too much time on this.)

I keep trying to put my finger on the precise bug.  I said in earlier
mails to Mel and to Martin that we're mixing a bugfix and an optimization,
but I cannot quite point to the bug.  Could one say that it's precisely at
the "page straddles i_size" zero_user_segment(), in XFS or in other FSes?
that the storage key ought to be re-cleaned after that?

What if one day I happened to copy that code into shmem_writepage()?
I've no intention to do so!  And it wouldn't cause a BUG.  Ah, and we
never write shmem to swap while it's still mapped, so it wouldn't even
have a chance to redirty the page in page_remove_rmap().

I guess I'm worrying too much; but it's not crystal clear to me why any
!mapping_cap_account_dirty mapping would necessarily not have the problem.

> > But here's where I think the problem is.  You're assuming that all
> > filesystems go the same mapping_cap_account_writeback_dirty() (yeah,
> > there's no such function, just a confusing maze of three) route as XFS.
> > 
> > But filesystems like tmpfs and ramfs (perhaps they're the only two
> > that matter here) don't participate in that, and wait for an mmap'ed
> > page to be seen modified by the user (usually via pte_dirty, but that's
> > a no-op on s390) before page is marked dirty; and page reclaim throws
> > away undirtied pages.
>   I admit I haven't thought of tmpfs and similar. After some discussion Mel
> pointed me to the code in mmap which makes a difference. So if I get it
> right, the difference which causes us problems is that on tmpfs we map the
> page writeably even during read-only fault. OK, then if I make the above
> code in page_remove_rmap():
> 	if ((PageSwapCache(page) ||
> 	     (!anon && !mapping_cap_account_dirty(page->mapping))) &&
> 	    page_test_and_clear_dirty(page_to_pfn(page), 1))
> 		set_page_dirty(page);
> 
>   Things should be ok (modulo the ugliness of this condition), right?

(Setting aside my reservations above...) That's almost exactly right, but
I think the issue of a racing truncation (which could reset page->mapping
to NULL at any moment) means we have to be a bit more careful.  Usually
we guard against that with page lock, but here we can rely on mapcount.

page_mapping(page), with its built-in PageSwapCache check, actually ends
up making the condition look less ugly; and so far as I could tell,
the extra code does get optimized out on x86 (unless CONFIG_DEBUG_VM,
when we are left with its VM_BUG_ON(PageSlab(page))).

But please look this over very critically and test (and if you like it,
please adopt it as your own): I'm not entirely convinced yet myself.

(One day, I do want to move that block further down page_remove_rmap(),
outside the mem_cgroup_[begin,end]_update_stat() bracketing: I don't think
there's an actual problem at present in calling set_page_dirty() there,
but I have seen patches which could give it a lock-ordering issue, so
better to untangle them.  No reason to muddle that in with your fix,
but I thought I'd mention it while we're all staring at this.)

Hugh

---

 mm/rmap.c |   20 +++++++++++++++-----
 1 file changed, 15 insertions(+), 5 deletions(-)

--- 3.6.0+/mm/rmap.c	2012-10-09 14:01:12.356379322 -0700
+++ linux/mm/rmap.c	2012-10-09 14:58:48.160445605 -0700
@@ -56,6 +56,7 @@
 #include <linux/mmu_notifier.h>
 #include <linux/migrate.h>
 #include <linux/hugetlb.h>
+#include <linux/backing-dev.h>

 #include <asm/tlbflush.h>

@@ -926,11 +927,8 @@ int page_mkclean(struct page *page)

 	if (page_mapped(page)) {
 		struct address_space *mapping = page_mapping(page);
-		if (mapping) {
+		if (mapping)
 			ret = page_mkclean_file(mapping, page);
-			if (page_test_and_clear_dirty(page_to_pfn(page), 1))
-				ret = 1;
-		}
 	}

 	return ret;
@@ -1116,6 +1114,7 @@ void page_add_file_rmap(struct page *pag
  */
 void page_remove_rmap(struct page *page)
 {
+	struct address_space *mapping = page_mapping(page);
 	bool anon = PageAnon(page);
 	bool locked;
 	unsigned long flags;
@@ -1138,8 +1137,19 @@ void page_remove_rmap(struct page *page)
 	 * this if the page is anon, so about to be freed; but perhaps
 	 * not if it's in swapcache - there might be another pte slot
 	 * containing the swap entry, but page not yet written to swap.
+	 *
+	 * And we can skip it on file pages, so long as the filesystem
+	 * participates in dirty tracking; but need to catch shm and tmpfs
+	 * and ramfs pages which have been modified since creation by read
+	 * fault.
+	 *
+	 * Note that mapping must be decided above, before decrementing
+	 * mapcount (which luckily provides a barrier): once page is unmapped,
+	 * it could be truncated and page->mapping reset to NULL at any moment.
+	 * Note also that we are relying on page_mapping(page) to set mapping
+	 * to &swapper_space when PageSwapCache(page).
 	 */
-	if ((!anon || PageSwapCache(page)) &&
+	if (mapping && !mapping_cap_account_dirty(mapping) &&
 	    page_test_and_clear_dirty(page_to_pfn(page), 1))
 		set_page_dirty(page);
 	/*

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH] mm: Fix XFS oops due to dirty pages without buffers on s390
@ 2012-10-10  2:19       ` Hugh Dickins
  0 siblings, 0 replies; 61+ messages in thread
From: Hugh Dickins @ 2012-10-10  2:19 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-s390, LKML, xfs, linux-mm, Mel Gorman, Martin Schwidefsky

On Tue, 9 Oct 2012, Jan Kara wrote:
> On Mon 08-10-12 21:24:40, Hugh Dickins wrote:
> > On Mon, 1 Oct 2012, Jan Kara wrote:
> > 
> > > On s390 any write to a page (even from kernel itself) sets architecture
> > > specific page dirty bit. Thus when a page is written to via standard write, HW
> > > dirty bit gets set and when we later map and unmap the page, page_remove_rmap()
> > > finds the dirty bit and calls set_page_dirty().
> > > 
> > > Dirtying of a page which shouldn't be dirty can cause all sorts of problems to
> > > filesystems. The bug we observed in practice is that buffers from the page get
> > > freed, so when the page gets later marked as dirty and writeback writes it, XFS
> > > crashes due to an assertion BUG_ON(!PagePrivate(page)) in page_buffers() called
> > > from xfs_count_page_state().
> > 
> > What changed recently?  Was XFS hardly used on s390 until now?
>   The problem was originally hit on SLE11-SP2 which is 3.0 based after
> migration of our s390 build machines from SLE11-SP1 (2.6.32 based). I think
> XFS just started to be more peevish about what pages it gets between these
> two releases ;) (e.g. ext3 or ext4 just says "oh, well" and fixes things
> up).

Right, in 2.6.32 xfs_vm_writepage() had a !page_has_buffers(page) case,
whereas by 3.0 that had become ASSERT(page_has_buffers(page)), with the
ASSERT usually compiled out, stumbling later in page_buffers() as you say.

> 
> > > Similar problem can also happen when zero_user_segment() call from
> > > xfs_vm_writepage() (or block_write_full_page() for that matter) set the
> > > hardware dirty bit during writeback, later buffers get freed, and then page
> > > unmapped.

Similar problem, or is that the whole of the problem?  Where else does
the page get written to, after clearing page dirty?  (It may not be worth
spending time to answer me, I feel I'm wasting too much time on this.)

I keep trying to put my finger on the precise bug.  I said in earlier
mails to Mel and to Martin that we're mixing a bugfix and an optimization,
but I cannot quite point to the bug.  Could one say that it's precisely at
the "page straddles i_size" zero_user_segment(), in XFS or in other FSes?
that the storage key ought to be re-cleaned after that?

What if one day I happened to copy that code into shmem_writepage()?
I've no intention to do so!  And it wouldn't cause a BUG.  Ah, and we
never write shmem to swap while it's still mapped, so it wouldn't even
have a chance to redirty the page in page_remove_rmap().

I guess I'm worrying too much; but it's not crystal clear to me why any
!mapping_cap_account_dirty mapping would necessarily not have the problem.

> > But here's where I think the problem is.  You're assuming that all
> > filesystems go the same mapping_cap_account_writeback_dirty() (yeah,
> > there's no such function, just a confusing maze of three) route as XFS.
> > 
> > But filesystems like tmpfs and ramfs (perhaps they're the only two
> > that matter here) don't participate in that, and wait for an mmap'ed
> > page to be seen modified by the user (usually via pte_dirty, but that's
> > a no-op on s390) before page is marked dirty; and page reclaim throws
> > away undirtied pages.
>   I admit I haven't thought of tmpfs and similar. After some discussion Mel
> pointed me to the code in mmap which makes a difference. So if I get it
> right, the difference which causes us problems is that on tmpfs we map the
> page writeably even during read-only fault. OK, then if I make the above
> code in page_remove_rmap():
> 	if ((PageSwapCache(page) ||
> 	     (!anon && !mapping_cap_account_dirty(page->mapping))) &&
> 	    page_test_and_clear_dirty(page_to_pfn(page), 1))
> 		set_page_dirty(page);
> 
>   Things should be ok (modulo the ugliness of this condition), right?

(Setting aside my reservations above...) That's almost exactly right, but
I think the issue of a racing truncation (which could reset page->mapping
to NULL at any moment) means we have to be a bit more careful.  Usually
we guard against that with page lock, but here we can rely on mapcount.

page_mapping(page), with its built-in PageSwapCache check, actually ends
up making the condition look less ugly; and so far as I could tell,
the extra code does get optimized out on x86 (unless CONFIG_DEBUG_VM,
when we are left with its VM_BUG_ON(PageSlab(page))).

But please look this over very critically and test (and if you like it,
please adopt it as your own): I'm not entirely convinced yet myself.

(One day, I do want to move that block further down page_remove_rmap(),
outside the mem_cgroup_[begin,end]_update_stat() bracketing: I don't think
there's an actual problem at present in calling set_page_dirty() there,
but I have seen patches which could give it a lock-ordering issue, so
better to untangle them.  No reason to muddle that in with your fix,
but I thought I'd mention it while we're all staring at this.)

Hugh

---

 mm/rmap.c |   20 +++++++++++++++-----
 1 file changed, 15 insertions(+), 5 deletions(-)

--- 3.6.0+/mm/rmap.c	2012-10-09 14:01:12.356379322 -0700
+++ linux/mm/rmap.c	2012-10-09 14:58:48.160445605 -0700
@@ -56,6 +56,7 @@
 #include <linux/mmu_notifier.h>
 #include <linux/migrate.h>
 #include <linux/hugetlb.h>
+#include <linux/backing-dev.h>

 #include <asm/tlbflush.h>

@@ -926,11 +927,8 @@ int page_mkclean(struct page *page)

 	if (page_mapped(page)) {
 		struct address_space *mapping = page_mapping(page);
-		if (mapping) {
+		if (mapping)
 			ret = page_mkclean_file(mapping, page);
-			if (page_test_and_clear_dirty(page_to_pfn(page), 1))
-				ret = 1;
-		}
 	}

 	return ret;
@@ -1116,6 +1114,7 @@ void page_add_file_rmap(struct page *pag
  */
 void page_remove_rmap(struct page *page)
 {
+	struct address_space *mapping = page_mapping(page);
 	bool anon = PageAnon(page);
 	bool locked;
 	unsigned long flags;
@@ -1138,8 +1137,19 @@ void page_remove_rmap(struct page *page)
 	 * this if the page is anon, so about to be freed; but perhaps
 	 * not if it's in swapcache - there might be another pte slot
 	 * containing the swap entry, but page not yet written to swap.
+	 *
+	 * And we can skip it on file pages, so long as the filesystem
+	 * participates in dirty tracking; but need to catch shm and tmpfs
+	 * and ramfs pages which have been modified since creation by read
+	 * fault.
+	 *
+	 * Note that mapping must be decided above, before decrementing
+	 * mapcount (which luckily provides a barrier): once page is unmapped,
+	 * it could be truncated and page->mapping reset to NULL at any moment.
+	 * Note also that we are relying on page_mapping(page) to set mapping
+	 * to &swapper_space when PageSwapCache(page).
 	 */
-	if ((!anon || PageSwapCache(page)) &&
+	if (mapping && !mapping_cap_account_dirty(mapping) &&
 	    page_test_and_clear_dirty(page_to_pfn(page), 1))
 		set_page_dirty(page);
 	/*

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH] mm: Fix XFS oops due to dirty pages without buffers on s390
@ 2012-10-10  2:19       ` Hugh Dickins
  0 siblings, 0 replies; 61+ messages in thread
From: Hugh Dickins @ 2012-10-10  2:19 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-mm, LKML, xfs, Martin Schwidefsky, Mel Gorman, linux-s390

On Tue, 9 Oct 2012, Jan Kara wrote:
> On Mon 08-10-12 21:24:40, Hugh Dickins wrote:
> > On Mon, 1 Oct 2012, Jan Kara wrote:
> > 
> > > On s390 any write to a page (even from kernel itself) sets architecture
> > > specific page dirty bit. Thus when a page is written to via standard write, HW
> > > dirty bit gets set and when we later map and unmap the page, page_remove_rmap()
> > > finds the dirty bit and calls set_page_dirty().
> > > 
> > > Dirtying of a page which shouldn't be dirty can cause all sorts of problems to
> > > filesystems. The bug we observed in practice is that buffers from the page get
> > > freed, so when the page gets later marked as dirty and writeback writes it, XFS
> > > crashes due to an assertion BUG_ON(!PagePrivate(page)) in page_buffers() called
> > > from xfs_count_page_state().
> > 
> > What changed recently?  Was XFS hardly used on s390 until now?
>   The problem was originally hit on SLE11-SP2 which is 3.0 based after
> migration of our s390 build machines from SLE11-SP1 (2.6.32 based). I think
> XFS just started to be more peevish about what pages it gets between these
> two releases ;) (e.g. ext3 or ext4 just says "oh, well" and fixes things
> up).

Right, in 2.6.32 xfs_vm_writepage() had a !page_has_buffers(page) case,
whereas by 3.0 that had become ASSERT(page_has_buffers(page)), with the
ASSERT usually compiled out, stumbling later in page_buffers() as you say.

> 
> > > Similar problem can also happen when zero_user_segment() call from
> > > xfs_vm_writepage() (or block_write_full_page() for that matter) set the
> > > hardware dirty bit during writeback, later buffers get freed, and then page
> > > unmapped.

Similar problem, or is that the whole of the problem?  Where else does
the page get written to, after clearing page dirty?  (It may not be worth
spending time to answer me, I feel I'm wasting too much time on this.)

I keep trying to put my finger on the precise bug.  I said in earlier
mails to Mel and to Martin that we're mixing a bugfix and an optimization,
but I cannot quite point to the bug.  Could one say that it's precisely at
the "page straddles i_size" zero_user_segment(), in XFS or in other FSes?
that the storage key ought to be re-cleaned after that?

What if one day I happened to copy that code into shmem_writepage()?
I've no intention to do so!  And it wouldn't cause a BUG.  Ah, and we
never write shmem to swap while it's still mapped, so it wouldn't even
have a chance to redirty the page in page_remove_rmap().

I guess I'm worrying too much; but it's not crystal clear to me why any
!mapping_cap_account_dirty mapping would necessarily not have the problem.

> > But here's where I think the problem is.  You're assuming that all
> > filesystems go the same mapping_cap_account_writeback_dirty() (yeah,
> > there's no such function, just a confusing maze of three) route as XFS.
> > 
> > But filesystems like tmpfs and ramfs (perhaps they're the only two
> > that matter here) don't participate in that, and wait for an mmap'ed
> > page to be seen modified by the user (usually via pte_dirty, but that's
> > a no-op on s390) before page is marked dirty; and page reclaim throws
> > away undirtied pages.
>   I admit I haven't thought of tmpfs and similar. After some discussion Mel
> pointed me to the code in mmap which makes a difference. So if I get it
> right, the difference which causes us problems is that on tmpfs we map the
> page writeably even during read-only fault. OK, then if I make the above
> code in page_remove_rmap():
> 	if ((PageSwapCache(page) ||
> 	     (!anon && !mapping_cap_account_dirty(page->mapping))) &&
> 	    page_test_and_clear_dirty(page_to_pfn(page), 1))
> 		set_page_dirty(page);
> 
>   Things should be ok (modulo the ugliness of this condition), right?

(Setting aside my reservations above...) That's almost exactly right, but
I think the issue of a racing truncation (which could reset page->mapping
to NULL at any moment) means we have to be a bit more careful.  Usually
we guard against that with page lock, but here we can rely on mapcount.

page_mapping(page), with its built-in PageSwapCache check, actually ends
up making the condition look less ugly; and so far as I could tell,
the extra code does get optimized out on x86 (unless CONFIG_DEBUG_VM,
when we are left with its VM_BUG_ON(PageSlab(page))).

But please look this over very critically and test (and if you like it,
please adopt it as your own): I'm not entirely convinced yet myself.

(One day, I do want to move that block further down page_remove_rmap(),
outside the mem_cgroup_[begin,end]_update_stat() bracketing: I don't think
there's an actual problem at present in calling set_page_dirty() there,
but I have seen patches which could give it a lock-ordering issue, so
better to untangle them.  No reason to muddle that in with your fix,
but I thought I'd mention it while we're all staring at this.)

Hugh

---

 mm/rmap.c |   20 +++++++++++++++-----
 1 file changed, 15 insertions(+), 5 deletions(-)

--- 3.6.0+/mm/rmap.c	2012-10-09 14:01:12.356379322 -0700
+++ linux/mm/rmap.c	2012-10-09 14:58:48.160445605 -0700
@@ -56,6 +56,7 @@
 #include <linux/mmu_notifier.h>
 #include <linux/migrate.h>
 #include <linux/hugetlb.h>
+#include <linux/backing-dev.h>

 #include <asm/tlbflush.h>

@@ -926,11 +927,8 @@ int page_mkclean(struct page *page)

 	if (page_mapped(page)) {
 		struct address_space *mapping = page_mapping(page);
-		if (mapping) {
+		if (mapping)
 			ret = page_mkclean_file(mapping, page);
-			if (page_test_and_clear_dirty(page_to_pfn(page), 1))
-				ret = 1;
-		}
 	}

 	return ret;
@@ -1116,6 +1114,7 @@ void page_add_file_rmap(struct page *pag
  */
 void page_remove_rmap(struct page *page)
 {
+	struct address_space *mapping = page_mapping(page);
 	bool anon = PageAnon(page);
 	bool locked;
 	unsigned long flags;
@@ -1138,8 +1137,19 @@ void page_remove_rmap(struct page *page)
 	 * this if the page is anon, so about to be freed; but perhaps
 	 * not if it's in swapcache - there might be another pte slot
 	 * containing the swap entry, but page not yet written to swap.
+	 *
+	 * And we can skip it on file pages, so long as the filesystem
+	 * participates in dirty tracking; but need to catch shm and tmpfs
+	 * and ramfs pages which have been modified since creation by read
+	 * fault.
+	 *
+	 * Note that mapping must be decided above, before decrementing
+	 * mapcount (which luckily provides a barrier): once page is unmapped,
+	 * it could be truncated and page->mapping reset to NULL at any moment.
+	 * Note also that we are relying on page_mapping(page) to set mapping
+	 * to &swapper_space when PageSwapCache(page).
 	 */
-	if ((!anon || PageSwapCache(page)) &&
+	if (mapping && !mapping_cap_account_dirty(mapping) &&
 	    page_test_and_clear_dirty(page_to_pfn(page), 1))
 		set_page_dirty(page);
 	/*

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH] mm: Fix XFS oops due to dirty pages without buffers on s390
  2012-10-10  2:19       ` Hugh Dickins
  (?)
@ 2012-10-10  8:55         ` Jan Kara
  -1 siblings, 0 replies; 61+ messages in thread
From: Jan Kara @ 2012-10-10  8:55 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Jan Kara, linux-mm, LKML, xfs, Martin Schwidefsky, Mel Gorman,
	linux-s390

On Tue 09-10-12 19:19:09, Hugh Dickins wrote:
> On Tue, 9 Oct 2012, Jan Kara wrote:
> > On Mon 08-10-12 21:24:40, Hugh Dickins wrote:
> > > On Mon, 1 Oct 2012, Jan Kara wrote:
> > > 
> > > > On s390 any write to a page (even from kernel itself) sets architecture
> > > > specific page dirty bit. Thus when a page is written to via standard write, HW
> > > > dirty bit gets set and when we later map and unmap the page, page_remove_rmap()
> > > > finds the dirty bit and calls set_page_dirty().
> > > > 
> > > > Dirtying of a page which shouldn't be dirty can cause all sorts of problems to
> > > > filesystems. The bug we observed in practice is that buffers from the page get
> > > > freed, so when the page gets later marked as dirty and writeback writes it, XFS
> > > > crashes due to an assertion BUG_ON(!PagePrivate(page)) in page_buffers() called
> > > > from xfs_count_page_state().
...
> > > > Similar problem can also happen when zero_user_segment() call from
> > > > xfs_vm_writepage() (or block_write_full_page() for that matter) set the
> > > > hardware dirty bit during writeback, later buffers get freed, and then page
> > > > unmapped.
> 
> Similar problem, or is that the whole of the problem?  Where else does
> the page get written to, after clearing page dirty?  (It may not be worth
> spending time to answer me, I feel I'm wasting too much time on this.)
  I think the devil is in "after clearing page dirty" -
clear_page_dirty_for_io() has an optimization that it does not bother
transfering pte or storage key dirty bits to page dirty bit when page is
not mapped. On s390 that results in storage key dirty bit set once buffered
write modifies the page.

BTW there's no other place I'm aware of (and I was looking for some time
before I realized that storage key could remain set from buffered write as
described above).
> 
> I keep trying to put my finger on the precise bug.  I said in earlier
> mails to Mel and to Martin that we're mixing a bugfix and an optimization,
> but I cannot quite point to the bug.  Could one say that it's precisely at
> the "page straddles i_size" zero_user_segment(), in XFS or in other FSes?
> that the storage key ought to be re-cleaned after that?
  I think the precise bug is that we can leave dirty bit in storage key set
after writes from kernel while some parts of kernel assume the bit can be
set only via user mapping.

In a perfect world with infinite computation resources, all writes to
pages from kernel could look like:
	.. assume locked page ..
	page_mkclean(page);
	if (page_test_and_clear_dirty(page))
		set_page_dirty(page);
	write to page
	page_test_and_clear_dirty(page);	/* Clean storage key */

This would be bulletproof ... and ridiculously expensive.

> What if one day I happened to copy that code into shmem_writepage()?
> I've no intention to do so!  And it wouldn't cause a BUG.  Ah, and we
> never write shmem to swap while it's still mapped, so it wouldn't even
> have a chance to redirty the page in page_remove_rmap().
> 
> I guess I'm worrying too much; but it's not crystal clear to me why any
> !mapping_cap_account_dirty mapping would necessarily not have the problem.
  They can have a problem - if they cared that page_remove_rmap() can mark
as dirty a page which was never written to via mmap. So far we are lucky
and all !mapping_cap_account_dirty users don't care.

> > > But here's where I think the problem is.  You're assuming that all
> > > filesystems go the same mapping_cap_account_writeback_dirty() (yeah,
> > > there's no such function, just a confusing maze of three) route as XFS.
> > > 
> > > But filesystems like tmpfs and ramfs (perhaps they're the only two
> > > that matter here) don't participate in that, and wait for an mmap'ed
> > > page to be seen modified by the user (usually via pte_dirty, but that's
> > > a no-op on s390) before page is marked dirty; and page reclaim throws
> > > away undirtied pages.
> >   I admit I haven't thought of tmpfs and similar. After some discussion Mel
> > pointed me to the code in mmap which makes a difference. So if I get it
> > right, the difference which causes us problems is that on tmpfs we map the
> > page writeably even during read-only fault. OK, then if I make the above
> > code in page_remove_rmap():
> > 	if ((PageSwapCache(page) ||
> > 	     (!anon && !mapping_cap_account_dirty(page->mapping))) &&
> > 	    page_test_and_clear_dirty(page_to_pfn(page), 1))
> > 		set_page_dirty(page);
> > 
> >   Things should be ok (modulo the ugliness of this condition), right?
> 
> (Setting aside my reservations above...) That's almost exactly right, but
> I think the issue of a racing truncation (which could reset page->mapping
> to NULL at any moment) means we have to be a bit more careful.  Usually
> we guard against that with page lock, but here we can rely on mapcount.
> 
> page_mapping(page), with its built-in PageSwapCache check, actually ends
> up making the condition look less ugly; and so far as I could tell,
> the extra code does get optimized out on x86 (unless CONFIG_DEBUG_VM,
> when we are left with its VM_BUG_ON(PageSlab(page))).
> 
> But please look this over very critically and test (and if you like it,
> please adopt it as your own): I'm not entirely convinced yet myself.
  OK, I'll push the kernel with your updated patch to our build machines
and let it run there for a few days (it took about a day to reproduce the
issue originally). Thanks a lot for helping me with this.

								Honza


>  mm/rmap.c |   20 +++++++++++++++-----
>  1 file changed, 15 insertions(+), 5 deletions(-)
> 
> --- 3.6.0+/mm/rmap.c	2012-10-09 14:01:12.356379322 -0700
> +++ linux/mm/rmap.c	2012-10-09 14:58:48.160445605 -0700
> @@ -56,6 +56,7 @@
>  #include <linux/mmu_notifier.h>
>  #include <linux/migrate.h>
>  #include <linux/hugetlb.h>
> +#include <linux/backing-dev.h>
>  
>  #include <asm/tlbflush.h>
>  
> @@ -926,11 +927,8 @@ int page_mkclean(struct page *page)
>  
>  	if (page_mapped(page)) {
>  		struct address_space *mapping = page_mapping(page);
> -		if (mapping) {
> +		if (mapping)
>  			ret = page_mkclean_file(mapping, page);
> -			if (page_test_and_clear_dirty(page_to_pfn(page), 1))
> -				ret = 1;
> -		}
>  	}
>  
>  	return ret;
> @@ -1116,6 +1114,7 @@ void page_add_file_rmap(struct page *pag
>   */
>  void page_remove_rmap(struct page *page)
>  {
> +	struct address_space *mapping = page_mapping(page);
>  	bool anon = PageAnon(page);
>  	bool locked;
>  	unsigned long flags;
> @@ -1138,8 +1137,19 @@ void page_remove_rmap(struct page *page)
>  	 * this if the page is anon, so about to be freed; but perhaps
>  	 * not if it's in swapcache - there might be another pte slot
>  	 * containing the swap entry, but page not yet written to swap.
> +	 *
> +	 * And we can skip it on file pages, so long as the filesystem
> +	 * participates in dirty tracking; but need to catch shm and tmpfs
> +	 * and ramfs pages which have been modified since creation by read
> +	 * fault.
> +	 *
> +	 * Note that mapping must be decided above, before decrementing
> +	 * mapcount (which luckily provides a barrier): once page is unmapped,
> +	 * it could be truncated and page->mapping reset to NULL at any moment.
> +	 * Note also that we are relying on page_mapping(page) to set mapping
> +	 * to &swapper_space when PageSwapCache(page).
>  	 */
> -	if ((!anon || PageSwapCache(page)) &&
> +	if (mapping && !mapping_cap_account_dirty(mapping) &&
>  	    page_test_and_clear_dirty(page_to_pfn(page), 1))
>  		set_page_dirty(page);
>  	/*
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH] mm: Fix XFS oops due to dirty pages without buffers on s390
@ 2012-10-10  8:55         ` Jan Kara
  0 siblings, 0 replies; 61+ messages in thread
From: Jan Kara @ 2012-10-10  8:55 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: linux-s390, Jan Kara, LKML, xfs, linux-mm, Mel Gorman,
	Martin Schwidefsky

On Tue 09-10-12 19:19:09, Hugh Dickins wrote:
> On Tue, 9 Oct 2012, Jan Kara wrote:
> > On Mon 08-10-12 21:24:40, Hugh Dickins wrote:
> > > On Mon, 1 Oct 2012, Jan Kara wrote:
> > > 
> > > > On s390 any write to a page (even from kernel itself) sets architecture
> > > > specific page dirty bit. Thus when a page is written to via standard write, HW
> > > > dirty bit gets set and when we later map and unmap the page, page_remove_rmap()
> > > > finds the dirty bit and calls set_page_dirty().
> > > > 
> > > > Dirtying of a page which shouldn't be dirty can cause all sorts of problems to
> > > > filesystems. The bug we observed in practice is that buffers from the page get
> > > > freed, so when the page gets later marked as dirty and writeback writes it, XFS
> > > > crashes due to an assertion BUG_ON(!PagePrivate(page)) in page_buffers() called
> > > > from xfs_count_page_state().
...
> > > > Similar problem can also happen when zero_user_segment() call from
> > > > xfs_vm_writepage() (or block_write_full_page() for that matter) set the
> > > > hardware dirty bit during writeback, later buffers get freed, and then page
> > > > unmapped.
> 
> Similar problem, or is that the whole of the problem?  Where else does
> the page get written to, after clearing page dirty?  (It may not be worth
> spending time to answer me, I feel I'm wasting too much time on this.)
  I think the devil is in "after clearing page dirty" -
clear_page_dirty_for_io() has an optimization that it does not bother
transfering pte or storage key dirty bits to page dirty bit when page is
not mapped. On s390 that results in storage key dirty bit set once buffered
write modifies the page.

BTW there's no other place I'm aware of (and I was looking for some time
before I realized that storage key could remain set from buffered write as
described above).
> 
> I keep trying to put my finger on the precise bug.  I said in earlier
> mails to Mel and to Martin that we're mixing a bugfix and an optimization,
> but I cannot quite point to the bug.  Could one say that it's precisely at
> the "page straddles i_size" zero_user_segment(), in XFS or in other FSes?
> that the storage key ought to be re-cleaned after that?
  I think the precise bug is that we can leave dirty bit in storage key set
after writes from kernel while some parts of kernel assume the bit can be
set only via user mapping.

In a perfect world with infinite computation resources, all writes to
pages from kernel could look like:
	.. assume locked page ..
	page_mkclean(page);
	if (page_test_and_clear_dirty(page))
		set_page_dirty(page);
	write to page
	page_test_and_clear_dirty(page);	/* Clean storage key */

This would be bulletproof ... and ridiculously expensive.

> What if one day I happened to copy that code into shmem_writepage()?
> I've no intention to do so!  And it wouldn't cause a BUG.  Ah, and we
> never write shmem to swap while it's still mapped, so it wouldn't even
> have a chance to redirty the page in page_remove_rmap().
> 
> I guess I'm worrying too much; but it's not crystal clear to me why any
> !mapping_cap_account_dirty mapping would necessarily not have the problem.
  They can have a problem - if they cared that page_remove_rmap() can mark
as dirty a page which was never written to via mmap. So far we are lucky
and all !mapping_cap_account_dirty users don't care.

> > > But here's where I think the problem is.  You're assuming that all
> > > filesystems go the same mapping_cap_account_writeback_dirty() (yeah,
> > > there's no such function, just a confusing maze of three) route as XFS.
> > > 
> > > But filesystems like tmpfs and ramfs (perhaps they're the only two
> > > that matter here) don't participate in that, and wait for an mmap'ed
> > > page to be seen modified by the user (usually via pte_dirty, but that's
> > > a no-op on s390) before page is marked dirty; and page reclaim throws
> > > away undirtied pages.
> >   I admit I haven't thought of tmpfs and similar. After some discussion Mel
> > pointed me to the code in mmap which makes a difference. So if I get it
> > right, the difference which causes us problems is that on tmpfs we map the
> > page writeably even during read-only fault. OK, then if I make the above
> > code in page_remove_rmap():
> > 	if ((PageSwapCache(page) ||
> > 	     (!anon && !mapping_cap_account_dirty(page->mapping))) &&
> > 	    page_test_and_clear_dirty(page_to_pfn(page), 1))
> > 		set_page_dirty(page);
> > 
> >   Things should be ok (modulo the ugliness of this condition), right?
> 
> (Setting aside my reservations above...) That's almost exactly right, but
> I think the issue of a racing truncation (which could reset page->mapping
> to NULL at any moment) means we have to be a bit more careful.  Usually
> we guard against that with page lock, but here we can rely on mapcount.
> 
> page_mapping(page), with its built-in PageSwapCache check, actually ends
> up making the condition look less ugly; and so far as I could tell,
> the extra code does get optimized out on x86 (unless CONFIG_DEBUG_VM,
> when we are left with its VM_BUG_ON(PageSlab(page))).
> 
> But please look this over very critically and test (and if you like it,
> please adopt it as your own): I'm not entirely convinced yet myself.
  OK, I'll push the kernel with your updated patch to our build machines
and let it run there for a few days (it took about a day to reproduce the
issue originally). Thanks a lot for helping me with this.

								Honza


>  mm/rmap.c |   20 +++++++++++++++-----
>  1 file changed, 15 insertions(+), 5 deletions(-)
> 
> --- 3.6.0+/mm/rmap.c	2012-10-09 14:01:12.356379322 -0700
> +++ linux/mm/rmap.c	2012-10-09 14:58:48.160445605 -0700
> @@ -56,6 +56,7 @@
>  #include <linux/mmu_notifier.h>
>  #include <linux/migrate.h>
>  #include <linux/hugetlb.h>
> +#include <linux/backing-dev.h>
>  
>  #include <asm/tlbflush.h>
>  
> @@ -926,11 +927,8 @@ int page_mkclean(struct page *page)
>  
>  	if (page_mapped(page)) {
>  		struct address_space *mapping = page_mapping(page);
> -		if (mapping) {
> +		if (mapping)
>  			ret = page_mkclean_file(mapping, page);
> -			if (page_test_and_clear_dirty(page_to_pfn(page), 1))
> -				ret = 1;
> -		}
>  	}
>  
>  	return ret;
> @@ -1116,6 +1114,7 @@ void page_add_file_rmap(struct page *pag
>   */
>  void page_remove_rmap(struct page *page)
>  {
> +	struct address_space *mapping = page_mapping(page);
>  	bool anon = PageAnon(page);
>  	bool locked;
>  	unsigned long flags;
> @@ -1138,8 +1137,19 @@ void page_remove_rmap(struct page *page)
>  	 * this if the page is anon, so about to be freed; but perhaps
>  	 * not if it's in swapcache - there might be another pte slot
>  	 * containing the swap entry, but page not yet written to swap.
> +	 *
> +	 * And we can skip it on file pages, so long as the filesystem
> +	 * participates in dirty tracking; but need to catch shm and tmpfs
> +	 * and ramfs pages which have been modified since creation by read
> +	 * fault.
> +	 *
> +	 * Note that mapping must be decided above, before decrementing
> +	 * mapcount (which luckily provides a barrier): once page is unmapped,
> +	 * it could be truncated and page->mapping reset to NULL at any moment.
> +	 * Note also that we are relying on page_mapping(page) to set mapping
> +	 * to &swapper_space when PageSwapCache(page).
>  	 */
> -	if ((!anon || PageSwapCache(page)) &&
> +	if (mapping && !mapping_cap_account_dirty(mapping) &&
>  	    page_test_and_clear_dirty(page_to_pfn(page), 1))
>  		set_page_dirty(page);
>  	/*
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH] mm: Fix XFS oops due to dirty pages without buffers on s390
@ 2012-10-10  8:55         ` Jan Kara
  0 siblings, 0 replies; 61+ messages in thread
From: Jan Kara @ 2012-10-10  8:55 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Jan Kara, linux-mm, LKML, xfs, Martin Schwidefsky, Mel Gorman,
	linux-s390

On Tue 09-10-12 19:19:09, Hugh Dickins wrote:
> On Tue, 9 Oct 2012, Jan Kara wrote:
> > On Mon 08-10-12 21:24:40, Hugh Dickins wrote:
> > > On Mon, 1 Oct 2012, Jan Kara wrote:
> > > 
> > > > On s390 any write to a page (even from kernel itself) sets architecture
> > > > specific page dirty bit. Thus when a page is written to via standard write, HW
> > > > dirty bit gets set and when we later map and unmap the page, page_remove_rmap()
> > > > finds the dirty bit and calls set_page_dirty().
> > > > 
> > > > Dirtying of a page which shouldn't be dirty can cause all sorts of problems to
> > > > filesystems. The bug we observed in practice is that buffers from the page get
> > > > freed, so when the page gets later marked as dirty and writeback writes it, XFS
> > > > crashes due to an assertion BUG_ON(!PagePrivate(page)) in page_buffers() called
> > > > from xfs_count_page_state().
...
> > > > Similar problem can also happen when zero_user_segment() call from
> > > > xfs_vm_writepage() (or block_write_full_page() for that matter) set the
> > > > hardware dirty bit during writeback, later buffers get freed, and then page
> > > > unmapped.
> 
> Similar problem, or is that the whole of the problem?  Where else does
> the page get written to, after clearing page dirty?  (It may not be worth
> spending time to answer me, I feel I'm wasting too much time on this.)
  I think the devil is in "after clearing page dirty" -
clear_page_dirty_for_io() has an optimization that it does not bother
transfering pte or storage key dirty bits to page dirty bit when page is
not mapped. On s390 that results in storage key dirty bit set once buffered
write modifies the page.

BTW there's no other place I'm aware of (and I was looking for some time
before I realized that storage key could remain set from buffered write as
described above).
> 
> I keep trying to put my finger on the precise bug.  I said in earlier
> mails to Mel and to Martin that we're mixing a bugfix and an optimization,
> but I cannot quite point to the bug.  Could one say that it's precisely at
> the "page straddles i_size" zero_user_segment(), in XFS or in other FSes?
> that the storage key ought to be re-cleaned after that?
  I think the precise bug is that we can leave dirty bit in storage key set
after writes from kernel while some parts of kernel assume the bit can be
set only via user mapping.

In a perfect world with infinite computation resources, all writes to
pages from kernel could look like:
	.. assume locked page ..
	page_mkclean(page);
	if (page_test_and_clear_dirty(page))
		set_page_dirty(page);
	write to page
	page_test_and_clear_dirty(page);	/* Clean storage key */

This would be bulletproof ... and ridiculously expensive.

> What if one day I happened to copy that code into shmem_writepage()?
> I've no intention to do so!  And it wouldn't cause a BUG.  Ah, and we
> never write shmem to swap while it's still mapped, so it wouldn't even
> have a chance to redirty the page in page_remove_rmap().
> 
> I guess I'm worrying too much; but it's not crystal clear to me why any
> !mapping_cap_account_dirty mapping would necessarily not have the problem.
  They can have a problem - if they cared that page_remove_rmap() can mark
as dirty a page which was never written to via mmap. So far we are lucky
and all !mapping_cap_account_dirty users don't care.

> > > But here's where I think the problem is.  You're assuming that all
> > > filesystems go the same mapping_cap_account_writeback_dirty() (yeah,
> > > there's no such function, just a confusing maze of three) route as XFS.
> > > 
> > > But filesystems like tmpfs and ramfs (perhaps they're the only two
> > > that matter here) don't participate in that, and wait for an mmap'ed
> > > page to be seen modified by the user (usually via pte_dirty, but that's
> > > a no-op on s390) before page is marked dirty; and page reclaim throws
> > > away undirtied pages.
> >   I admit I haven't thought of tmpfs and similar. After some discussion Mel
> > pointed me to the code in mmap which makes a difference. So if I get it
> > right, the difference which causes us problems is that on tmpfs we map the
> > page writeably even during read-only fault. OK, then if I make the above
> > code in page_remove_rmap():
> > 	if ((PageSwapCache(page) ||
> > 	     (!anon && !mapping_cap_account_dirty(page->mapping))) &&
> > 	    page_test_and_clear_dirty(page_to_pfn(page), 1))
> > 		set_page_dirty(page);
> > 
> >   Things should be ok (modulo the ugliness of this condition), right?
> 
> (Setting aside my reservations above...) That's almost exactly right, but
> I think the issue of a racing truncation (which could reset page->mapping
> to NULL at any moment) means we have to be a bit more careful.  Usually
> we guard against that with page lock, but here we can rely on mapcount.
> 
> page_mapping(page), with its built-in PageSwapCache check, actually ends
> up making the condition look less ugly; and so far as I could tell,
> the extra code does get optimized out on x86 (unless CONFIG_DEBUG_VM,
> when we are left with its VM_BUG_ON(PageSlab(page))).
> 
> But please look this over very critically and test (and if you like it,
> please adopt it as your own): I'm not entirely convinced yet myself.
  OK, I'll push the kernel with your updated patch to our build machines
and let it run there for a few days (it took about a day to reproduce the
issue originally). Thanks a lot for helping me with this.

								Honza


>  mm/rmap.c |   20 +++++++++++++++-----
>  1 file changed, 15 insertions(+), 5 deletions(-)
> 
> --- 3.6.0+/mm/rmap.c	2012-10-09 14:01:12.356379322 -0700
> +++ linux/mm/rmap.c	2012-10-09 14:58:48.160445605 -0700
> @@ -56,6 +56,7 @@
>  #include <linux/mmu_notifier.h>
>  #include <linux/migrate.h>
>  #include <linux/hugetlb.h>
> +#include <linux/backing-dev.h>
>  
>  #include <asm/tlbflush.h>
>  
> @@ -926,11 +927,8 @@ int page_mkclean(struct page *page)
>  
>  	if (page_mapped(page)) {
>  		struct address_space *mapping = page_mapping(page);
> -		if (mapping) {
> +		if (mapping)
>  			ret = page_mkclean_file(mapping, page);
> -			if (page_test_and_clear_dirty(page_to_pfn(page), 1))
> -				ret = 1;
> -		}
>  	}
>  
>  	return ret;
> @@ -1116,6 +1114,7 @@ void page_add_file_rmap(struct page *pag
>   */
>  void page_remove_rmap(struct page *page)
>  {
> +	struct address_space *mapping = page_mapping(page);
>  	bool anon = PageAnon(page);
>  	bool locked;
>  	unsigned long flags;
> @@ -1138,8 +1137,19 @@ void page_remove_rmap(struct page *page)
>  	 * this if the page is anon, so about to be freed; but perhaps
>  	 * not if it's in swapcache - there might be another pte slot
>  	 * containing the swap entry, but page not yet written to swap.
> +	 *
> +	 * And we can skip it on file pages, so long as the filesystem
> +	 * participates in dirty tracking; but need to catch shm and tmpfs
> +	 * and ramfs pages which have been modified since creation by read
> +	 * fault.
> +	 *
> +	 * Note that mapping must be decided above, before decrementing
> +	 * mapcount (which luckily provides a barrier): once page is unmapped,
> +	 * it could be truncated and page->mapping reset to NULL at any moment.
> +	 * Note also that we are relying on page_mapping(page) to set mapping
> +	 * to &swapper_space when PageSwapCache(page).
>  	 */
> -	if ((!anon || PageSwapCache(page)) &&
> +	if (mapping && !mapping_cap_account_dirty(mapping) &&
>  	    page_test_and_clear_dirty(page_to_pfn(page), 1))
>  		set_page_dirty(page);
>  	/*
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH] mm: Fix XFS oops due to dirty pages without buffers on s390
  2012-10-10  8:55         ` Jan Kara
  (?)
@ 2012-10-10 21:28           ` Hugh Dickins
  -1 siblings, 0 replies; 61+ messages in thread
From: Hugh Dickins @ 2012-10-10 21:28 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-mm, LKML, xfs, Martin Schwidefsky, Mel Gorman, linux-s390

On Wed, 10 Oct 2012, Jan Kara wrote:
> On Tue 09-10-12 19:19:09, Hugh Dickins wrote:
> > On Tue, 9 Oct 2012, Jan Kara wrote:
> > > On Mon 08-10-12 21:24:40, Hugh Dickins wrote:
> > > > On Mon, 1 Oct 2012, Jan Kara wrote:
> > > > 
> > > > > On s390 any write to a page (even from kernel itself) sets architecture
> > > > > specific page dirty bit. Thus when a page is written to via standard write, HW
> > > > > dirty bit gets set and when we later map and unmap the page, page_remove_rmap()
> > > > > finds the dirty bit and calls set_page_dirty().
> > > > > 
> > > > > Dirtying of a page which shouldn't be dirty can cause all sorts of problems to
> > > > > filesystems. The bug we observed in practice is that buffers from the page get
> > > > > freed, so when the page gets later marked as dirty and writeback writes it, XFS
> > > > > crashes due to an assertion BUG_ON(!PagePrivate(page)) in page_buffers() called
> > > > > from xfs_count_page_state().
> ...
> > > > > Similar problem can also happen when zero_user_segment() call from
> > > > > xfs_vm_writepage() (or block_write_full_page() for that matter) set the
> > > > > hardware dirty bit during writeback, later buffers get freed, and then page
> > > > > unmapped.
> > 
> > Similar problem, or is that the whole of the problem?  Where else does
> > the page get written to, after clearing page dirty?  (It may not be worth
> > spending time to answer me, I feel I'm wasting too much time on this.)
>   I think the devil is in "after clearing page dirty" -
> clear_page_dirty_for_io() has an optimization that it does not bother
> transfering pte or storage key dirty bits to page dirty bit when page is
> not mapped.

Right, its "if (page_mkclean) set_page_dirty".

> On s390 that results in storage key dirty bit set once buffered
> write modifies the page.

Ah yes, because set_page_dirty does not clean the storage key,
as perhaps I was expecting (and we wouldn't want to add that if
everything is working without).

> 
> BTW there's no other place I'm aware of (and I was looking for some time
> before I realized that storage key could remain set from buffered write as
> described above).

> > I guess I'm worrying too much; but it's not crystal clear to me why any
> > !mapping_cap_account_dirty mapping would necessarily not have the problem.
>   They can have a problem - if they cared that page_remove_rmap() can mark
> as dirty a page which was never written to via mmap. So far we are lucky
> and all !mapping_cap_account_dirty users don't care.

Yes, I think it's good enough: it's a workaround rather than a thorough
future-proof fix; a workaround with a nice optimization bonus for s390.

> > >   Things should be ok (modulo the ugliness of this condition), right?
> > 
> > (Setting aside my reservations above...) That's almost exactly right, but
> > I think the issue of a racing truncation (which could reset page->mapping
> > to NULL at any moment) means we have to be a bit more careful.  Usually
> > we guard against that with page lock, but here we can rely on mapcount.
> > 
> > page_mapping(page), with its built-in PageSwapCache check, actually ends
> > up making the condition look less ugly; and so far as I could tell,
> > the extra code does get optimized out on x86 (unless CONFIG_DEBUG_VM,
> > when we are left with its VM_BUG_ON(PageSlab(page))).
> > 
> > But please look this over very critically and test (and if you like it,
> > please adopt it as your own): I'm not entirely convinced yet myself.
>   OK, I'll push the kernel with your updated patch to our build machines
> and let it run there for a few days (it took about a day to reproduce the
> issue originally). Thanks a lot for helping me with this.

And thank you for explaining it repeatedly for me.

I expect you're most interested in testing the XFS end of it; but if
you've time to check the swap/tmpfs aspect too, fsx on tmpfs while
heavily swapping should do it.

But perhaps these machines aren't much into heavy swapping.  Now, 
if Martin would send me a nice little zSeries netbook for Xmas,
I could then test that end of it myself ;)

I've just arrived at the conclusion that page migration does _not_
have a problem with transferring the dirty storage key: I had been
thinking that your testing might stumble on that issue, and need a
further patch, but I'll explain in other mail why now I think not.

Hugh

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH] mm: Fix XFS oops due to dirty pages without buffers on s390
@ 2012-10-10 21:28           ` Hugh Dickins
  0 siblings, 0 replies; 61+ messages in thread
From: Hugh Dickins @ 2012-10-10 21:28 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-s390, LKML, xfs, linux-mm, Mel Gorman, Martin Schwidefsky

On Wed, 10 Oct 2012, Jan Kara wrote:
> On Tue 09-10-12 19:19:09, Hugh Dickins wrote:
> > On Tue, 9 Oct 2012, Jan Kara wrote:
> > > On Mon 08-10-12 21:24:40, Hugh Dickins wrote:
> > > > On Mon, 1 Oct 2012, Jan Kara wrote:
> > > > 
> > > > > On s390 any write to a page (even from kernel itself) sets architecture
> > > > > specific page dirty bit. Thus when a page is written to via standard write, HW
> > > > > dirty bit gets set and when we later map and unmap the page, page_remove_rmap()
> > > > > finds the dirty bit and calls set_page_dirty().
> > > > > 
> > > > > Dirtying of a page which shouldn't be dirty can cause all sorts of problems to
> > > > > filesystems. The bug we observed in practice is that buffers from the page get
> > > > > freed, so when the page gets later marked as dirty and writeback writes it, XFS
> > > > > crashes due to an assertion BUG_ON(!PagePrivate(page)) in page_buffers() called
> > > > > from xfs_count_page_state().
> ...
> > > > > Similar problem can also happen when zero_user_segment() call from
> > > > > xfs_vm_writepage() (or block_write_full_page() for that matter) set the
> > > > > hardware dirty bit during writeback, later buffers get freed, and then page
> > > > > unmapped.
> > 
> > Similar problem, or is that the whole of the problem?  Where else does
> > the page get written to, after clearing page dirty?  (It may not be worth
> > spending time to answer me, I feel I'm wasting too much time on this.)
>   I think the devil is in "after clearing page dirty" -
> clear_page_dirty_for_io() has an optimization that it does not bother
> transfering pte or storage key dirty bits to page dirty bit when page is
> not mapped.

Right, its "if (page_mkclean) set_page_dirty".

> On s390 that results in storage key dirty bit set once buffered
> write modifies the page.

Ah yes, because set_page_dirty does not clean the storage key,
as perhaps I was expecting (and we wouldn't want to add that if
everything is working without).

> 
> BTW there's no other place I'm aware of (and I was looking for some time
> before I realized that storage key could remain set from buffered write as
> described above).

> > I guess I'm worrying too much; but it's not crystal clear to me why any
> > !mapping_cap_account_dirty mapping would necessarily not have the problem.
>   They can have a problem - if they cared that page_remove_rmap() can mark
> as dirty a page which was never written to via mmap. So far we are lucky
> and all !mapping_cap_account_dirty users don't care.

Yes, I think it's good enough: it's a workaround rather than a thorough
future-proof fix; a workaround with a nice optimization bonus for s390.

> > >   Things should be ok (modulo the ugliness of this condition), right?
> > 
> > (Setting aside my reservations above...) That's almost exactly right, but
> > I think the issue of a racing truncation (which could reset page->mapping
> > to NULL at any moment) means we have to be a bit more careful.  Usually
> > we guard against that with page lock, but here we can rely on mapcount.
> > 
> > page_mapping(page), with its built-in PageSwapCache check, actually ends
> > up making the condition look less ugly; and so far as I could tell,
> > the extra code does get optimized out on x86 (unless CONFIG_DEBUG_VM,
> > when we are left with its VM_BUG_ON(PageSlab(page))).
> > 
> > But please look this over very critically and test (and if you like it,
> > please adopt it as your own): I'm not entirely convinced yet myself.
>   OK, I'll push the kernel with your updated patch to our build machines
> and let it run there for a few days (it took about a day to reproduce the
> issue originally). Thanks a lot for helping me with this.

And thank you for explaining it repeatedly for me.

I expect you're most interested in testing the XFS end of it; but if
you've time to check the swap/tmpfs aspect too, fsx on tmpfs while
heavily swapping should do it.

But perhaps these machines aren't much into heavy swapping.  Now, 
if Martin would send me a nice little zSeries netbook for Xmas,
I could then test that end of it myself ;)

I've just arrived at the conclusion that page migration does _not_
have a problem with transferring the dirty storage key: I had been
thinking that your testing might stumble on that issue, and need a
further patch, but I'll explain in other mail why now I think not.

Hugh

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH] mm: Fix XFS oops due to dirty pages without buffers on s390
@ 2012-10-10 21:28           ` Hugh Dickins
  0 siblings, 0 replies; 61+ messages in thread
From: Hugh Dickins @ 2012-10-10 21:28 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-mm, LKML, xfs, Martin Schwidefsky, Mel Gorman, linux-s390

On Wed, 10 Oct 2012, Jan Kara wrote:
> On Tue 09-10-12 19:19:09, Hugh Dickins wrote:
> > On Tue, 9 Oct 2012, Jan Kara wrote:
> > > On Mon 08-10-12 21:24:40, Hugh Dickins wrote:
> > > > On Mon, 1 Oct 2012, Jan Kara wrote:
> > > > 
> > > > > On s390 any write to a page (even from kernel itself) sets architecture
> > > > > specific page dirty bit. Thus when a page is written to via standard write, HW
> > > > > dirty bit gets set and when we later map and unmap the page, page_remove_rmap()
> > > > > finds the dirty bit and calls set_page_dirty().
> > > > > 
> > > > > Dirtying of a page which shouldn't be dirty can cause all sorts of problems to
> > > > > filesystems. The bug we observed in practice is that buffers from the page get
> > > > > freed, so when the page gets later marked as dirty and writeback writes it, XFS
> > > > > crashes due to an assertion BUG_ON(!PagePrivate(page)) in page_buffers() called
> > > > > from xfs_count_page_state().
> ...
> > > > > Similar problem can also happen when zero_user_segment() call from
> > > > > xfs_vm_writepage() (or block_write_full_page() for that matter) set the
> > > > > hardware dirty bit during writeback, later buffers get freed, and then page
> > > > > unmapped.
> > 
> > Similar problem, or is that the whole of the problem?  Where else does
> > the page get written to, after clearing page dirty?  (It may not be worth
> > spending time to answer me, I feel I'm wasting too much time on this.)
>   I think the devil is in "after clearing page dirty" -
> clear_page_dirty_for_io() has an optimization that it does not bother
> transfering pte or storage key dirty bits to page dirty bit when page is
> not mapped.

Right, its "if (page_mkclean) set_page_dirty".

> On s390 that results in storage key dirty bit set once buffered
> write modifies the page.

Ah yes, because set_page_dirty does not clean the storage key,
as perhaps I was expecting (and we wouldn't want to add that if
everything is working without).

> 
> BTW there's no other place I'm aware of (and I was looking for some time
> before I realized that storage key could remain set from buffered write as
> described above).

> > I guess I'm worrying too much; but it's not crystal clear to me why any
> > !mapping_cap_account_dirty mapping would necessarily not have the problem.
>   They can have a problem - if they cared that page_remove_rmap() can mark
> as dirty a page which was never written to via mmap. So far we are lucky
> and all !mapping_cap_account_dirty users don't care.

Yes, I think it's good enough: it's a workaround rather than a thorough
future-proof fix; a workaround with a nice optimization bonus for s390.

> > >   Things should be ok (modulo the ugliness of this condition), right?
> > 
> > (Setting aside my reservations above...) That's almost exactly right, but
> > I think the issue of a racing truncation (which could reset page->mapping
> > to NULL at any moment) means we have to be a bit more careful.  Usually
> > we guard against that with page lock, but here we can rely on mapcount.
> > 
> > page_mapping(page), with its built-in PageSwapCache check, actually ends
> > up making the condition look less ugly; and so far as I could tell,
> > the extra code does get optimized out on x86 (unless CONFIG_DEBUG_VM,
> > when we are left with its VM_BUG_ON(PageSlab(page))).
> > 
> > But please look this over very critically and test (and if you like it,
> > please adopt it as your own): I'm not entirely convinced yet myself.
>   OK, I'll push the kernel with your updated patch to our build machines
> and let it run there for a few days (it took about a day to reproduce the
> issue originally). Thanks a lot for helping me with this.

And thank you for explaining it repeatedly for me.

I expect you're most interested in testing the XFS end of it; but if
you've time to check the swap/tmpfs aspect too, fsx on tmpfs while
heavily swapping should do it.

But perhaps these machines aren't much into heavy swapping.  Now, 
if Martin would send me a nice little zSeries netbook for Xmas,
I could then test that end of it myself ;)

I've just arrived at the conclusion that page migration does _not_
have a problem with transferring the dirty storage key: I had been
thinking that your testing might stumble on that issue, and need a
further patch, but I'll explain in other mail why now I think not.

Hugh

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH] mm: Fix XFS oops due to dirty pages without buffers on s390
  2012-10-10  2:19       ` Hugh Dickins
  (?)
@ 2012-10-10 21:56         ` Dave Chinner
  -1 siblings, 0 replies; 61+ messages in thread
From: Dave Chinner @ 2012-10-10 21:56 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Jan Kara, linux-mm, LKML, xfs, Martin Schwidefsky, Mel Gorman,
	linux-s390

On Tue, Oct 09, 2012 at 07:19:09PM -0700, Hugh Dickins wrote:
> On Tue, 9 Oct 2012, Jan Kara wrote:
> > On Mon 08-10-12 21:24:40, Hugh Dickins wrote:
> > > On Mon, 1 Oct 2012, Jan Kara wrote:
> > > 
> > > > On s390 any write to a page (even from kernel itself) sets architecture
> > > > specific page dirty bit. Thus when a page is written to via standard write, HW
> > > > dirty bit gets set and when we later map and unmap the page, page_remove_rmap()
> > > > finds the dirty bit and calls set_page_dirty().
> > > > 
> > > > Dirtying of a page which shouldn't be dirty can cause all sorts of problems to
> > > > filesystems. The bug we observed in practice is that buffers from the page get
> > > > freed, so when the page gets later marked as dirty and writeback writes it, XFS
> > > > crashes due to an assertion BUG_ON(!PagePrivate(page)) in page_buffers() called
> > > > from xfs_count_page_state().
> > > 
> > > What changed recently?  Was XFS hardly used on s390 until now?
> >   The problem was originally hit on SLE11-SP2 which is 3.0 based after
> > migration of our s390 build machines from SLE11-SP1 (2.6.32 based). I think
> > XFS just started to be more peevish about what pages it gets between these
> > two releases ;) (e.g. ext3 or ext4 just says "oh, well" and fixes things
> > up).
> 
> Right, in 2.6.32 xfs_vm_writepage() had a !page_has_buffers(page) case,
> whereas by 3.0 that had become ASSERT(page_has_buffers(page)), with the
> ASSERT usually compiled out, stumbling later in page_buffers() as you say.

What that says is that no-one is running xfstests-based QA on s390
with CONFIG_XFS_DEBUG enabled, otherwise this would have been found.
I've never tested XFS on s390 before, and I doubt any of the
upstream developers have, either, because not many peopl ehave s390
machines in their basement. So this is probably just an oversight
in the distro QA environment more than anything....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH] mm: Fix XFS oops due to dirty pages without buffers on s390
@ 2012-10-10 21:56         ` Dave Chinner
  0 siblings, 0 replies; 61+ messages in thread
From: Dave Chinner @ 2012-10-10 21:56 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: linux-s390, Jan Kara, LKML, xfs, linux-mm, Mel Gorman,
	Martin Schwidefsky

On Tue, Oct 09, 2012 at 07:19:09PM -0700, Hugh Dickins wrote:
> On Tue, 9 Oct 2012, Jan Kara wrote:
> > On Mon 08-10-12 21:24:40, Hugh Dickins wrote:
> > > On Mon, 1 Oct 2012, Jan Kara wrote:
> > > 
> > > > On s390 any write to a page (even from kernel itself) sets architecture
> > > > specific page dirty bit. Thus when a page is written to via standard write, HW
> > > > dirty bit gets set and when we later map and unmap the page, page_remove_rmap()
> > > > finds the dirty bit and calls set_page_dirty().
> > > > 
> > > > Dirtying of a page which shouldn't be dirty can cause all sorts of problems to
> > > > filesystems. The bug we observed in practice is that buffers from the page get
> > > > freed, so when the page gets later marked as dirty and writeback writes it, XFS
> > > > crashes due to an assertion BUG_ON(!PagePrivate(page)) in page_buffers() called
> > > > from xfs_count_page_state().
> > > 
> > > What changed recently?  Was XFS hardly used on s390 until now?
> >   The problem was originally hit on SLE11-SP2 which is 3.0 based after
> > migration of our s390 build machines from SLE11-SP1 (2.6.32 based). I think
> > XFS just started to be more peevish about what pages it gets between these
> > two releases ;) (e.g. ext3 or ext4 just says "oh, well" and fixes things
> > up).
> 
> Right, in 2.6.32 xfs_vm_writepage() had a !page_has_buffers(page) case,
> whereas by 3.0 that had become ASSERT(page_has_buffers(page)), with the
> ASSERT usually compiled out, stumbling later in page_buffers() as you say.

What that says is that no-one is running xfstests-based QA on s390
with CONFIG_XFS_DEBUG enabled, otherwise this would have been found.
I've never tested XFS on s390 before, and I doubt any of the
upstream developers have, either, because not many peopl ehave s390
machines in their basement. So this is probably just an oversight
in the distro QA environment more than anything....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH] mm: Fix XFS oops due to dirty pages without buffers on s390
@ 2012-10-10 21:56         ` Dave Chinner
  0 siblings, 0 replies; 61+ messages in thread
From: Dave Chinner @ 2012-10-10 21:56 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Jan Kara, linux-mm, LKML, xfs, Martin Schwidefsky, Mel Gorman,
	linux-s390

On Tue, Oct 09, 2012 at 07:19:09PM -0700, Hugh Dickins wrote:
> On Tue, 9 Oct 2012, Jan Kara wrote:
> > On Mon 08-10-12 21:24:40, Hugh Dickins wrote:
> > > On Mon, 1 Oct 2012, Jan Kara wrote:
> > > 
> > > > On s390 any write to a page (even from kernel itself) sets architecture
> > > > specific page dirty bit. Thus when a page is written to via standard write, HW
> > > > dirty bit gets set and when we later map and unmap the page, page_remove_rmap()
> > > > finds the dirty bit and calls set_page_dirty().
> > > > 
> > > > Dirtying of a page which shouldn't be dirty can cause all sorts of problems to
> > > > filesystems. The bug we observed in practice is that buffers from the page get
> > > > freed, so when the page gets later marked as dirty and writeback writes it, XFS
> > > > crashes due to an assertion BUG_ON(!PagePrivate(page)) in page_buffers() called
> > > > from xfs_count_page_state().
> > > 
> > > What changed recently?  Was XFS hardly used on s390 until now?
> >   The problem was originally hit on SLE11-SP2 which is 3.0 based after
> > migration of our s390 build machines from SLE11-SP1 (2.6.32 based). I think
> > XFS just started to be more peevish about what pages it gets between these
> > two releases ;) (e.g. ext3 or ext4 just says "oh, well" and fixes things
> > up).
> 
> Right, in 2.6.32 xfs_vm_writepage() had a !page_has_buffers(page) case,
> whereas by 3.0 that had become ASSERT(page_has_buffers(page)), with the
> ASSERT usually compiled out, stumbling later in page_buffers() as you say.

What that says is that no-one is running xfstests-based QA on s390
with CONFIG_XFS_DEBUG enabled, otherwise this would have been found.
I've never tested XFS on s390 before, and I doubt any of the
upstream developers have, either, because not many peopl ehave s390
machines in their basement. So this is probably just an oversight
in the distro QA environment more than anything....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH] mm: Fix XFS oops due to dirty pages without buffers on s390
  2012-10-09 23:21       ` Hugh Dickins
  (?)
@ 2012-10-10 21:57         ` Hugh Dickins
  -1 siblings, 0 replies; 61+ messages in thread
From: Hugh Dickins @ 2012-10-10 21:57 UTC (permalink / raw)
  To: Martin Schwidefsky; +Cc: Jan Kara, linux-mm, LKML, xfs, Mel Gorman, linux-s390

On Tue, 9 Oct 2012, Hugh Dickins wrote:
> On Tue, 9 Oct 2012, Martin Schwidefsky wrote:
> > On Mon, 8 Oct 2012 21:24:40 -0700 (PDT)
> > Hugh Dickins <hughd@google.com> wrote:
> > 
> > > A separate worry came to mind as I thought about your patch: where
> > > in page migration is s390's dirty storage key migrated from old page
> > > to new?  And if there is a problem there, that too should be fixed
> > > by what I propose in the previous paragraph.
> > 
> > That is covered by the SetPageUptodate() in migrate_page_copy().
> 
> I don't think so: that makes sure that the newpage is not marked
> dirty in storage key just because of the copy_highpage to it; but
> I see nothing to mark the newpage dirty in storage key when the
> old page was dirty there.

I went to prepare a patch to fix this, and ended up finding no such
problem to fix - which fits with how no such problem has been reported.

Most of it is handled by page migration's unmap_and_move() having to
unmap the old page first: so the old page will pass through the final
page_remove_rmap(), which will transfer storage key to page_dirty in
those cases which it deals with (with the old code, any file or swap
page; with the new code, any unaccounted file or swap page, now that
we realize the accounted files don't even need this); and page_dirty
is already properly migrated to the new page.

But that does leave one case behind: an anonymous page not yet in
swapcache, migrated via a swap-like migration entry.  But this case
is not a problem because PageDirty doesn't actually affect anything
for an anonymous page not in swapcache.  There are various places
where we set it, and its life-history is hard to make sense of, but
in fact it's meaningless in 2.6, where page reclaim adds anon to swap
(and sets PageDirty) whether the page was marked dirty before or not
(which makes sense when we use the ZERO_PAGE for anon read faults).

2.4 did behave differently: it was liable to free anon pages not
marked dirty, and I think most of our anon SetPageDirtys are just a
relic of those days - I do have a patch from 18 months ago to remove
them (adding PG_dirty to the flags which should not be set when a
page is freed), but there are usually more urgent things to attend
to than rebase and retest that.

Hugh

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH] mm: Fix XFS oops due to dirty pages without buffers on s390
@ 2012-10-10 21:57         ` Hugh Dickins
  0 siblings, 0 replies; 61+ messages in thread
From: Hugh Dickins @ 2012-10-10 21:57 UTC (permalink / raw)
  To: Martin Schwidefsky; +Cc: linux-s390, Jan Kara, LKML, xfs, linux-mm, Mel Gorman

On Tue, 9 Oct 2012, Hugh Dickins wrote:
> On Tue, 9 Oct 2012, Martin Schwidefsky wrote:
> > On Mon, 8 Oct 2012 21:24:40 -0700 (PDT)
> > Hugh Dickins <hughd@google.com> wrote:
> > 
> > > A separate worry came to mind as I thought about your patch: where
> > > in page migration is s390's dirty storage key migrated from old page
> > > to new?  And if there is a problem there, that too should be fixed
> > > by what I propose in the previous paragraph.
> > 
> > That is covered by the SetPageUptodate() in migrate_page_copy().
> 
> I don't think so: that makes sure that the newpage is not marked
> dirty in storage key just because of the copy_highpage to it; but
> I see nothing to mark the newpage dirty in storage key when the
> old page was dirty there.

I went to prepare a patch to fix this, and ended up finding no such
problem to fix - which fits with how no such problem has been reported.

Most of it is handled by page migration's unmap_and_move() having to
unmap the old page first: so the old page will pass through the final
page_remove_rmap(), which will transfer storage key to page_dirty in
those cases which it deals with (with the old code, any file or swap
page; with the new code, any unaccounted file or swap page, now that
we realize the accounted files don't even need this); and page_dirty
is already properly migrated to the new page.

But that does leave one case behind: an anonymous page not yet in
swapcache, migrated via a swap-like migration entry.  But this case
is not a problem because PageDirty doesn't actually affect anything
for an anonymous page not in swapcache.  There are various places
where we set it, and its life-history is hard to make sense of, but
in fact it's meaningless in 2.6, where page reclaim adds anon to swap
(and sets PageDirty) whether the page was marked dirty before or not
(which makes sense when we use the ZERO_PAGE for anon read faults).

2.4 did behave differently: it was liable to free anon pages not
marked dirty, and I think most of our anon SetPageDirtys are just a
relic of those days - I do have a patch from 18 months ago to remove
them (adding PG_dirty to the flags which should not be set when a
page is freed), but there are usually more urgent things to attend
to than rebase and retest that.

Hugh

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH] mm: Fix XFS oops due to dirty pages without buffers on s390
@ 2012-10-10 21:57         ` Hugh Dickins
  0 siblings, 0 replies; 61+ messages in thread
From: Hugh Dickins @ 2012-10-10 21:57 UTC (permalink / raw)
  To: Martin Schwidefsky; +Cc: Jan Kara, linux-mm, LKML, xfs, Mel Gorman, linux-s390

On Tue, 9 Oct 2012, Hugh Dickins wrote:
> On Tue, 9 Oct 2012, Martin Schwidefsky wrote:
> > On Mon, 8 Oct 2012 21:24:40 -0700 (PDT)
> > Hugh Dickins <hughd@google.com> wrote:
> > 
> > > A separate worry came to mind as I thought about your patch: where
> > > in page migration is s390's dirty storage key migrated from old page
> > > to new?  And if there is a problem there, that too should be fixed
> > > by what I propose in the previous paragraph.
> > 
> > That is covered by the SetPageUptodate() in migrate_page_copy().
> 
> I don't think so: that makes sure that the newpage is not marked
> dirty in storage key just because of the copy_highpage to it; but
> I see nothing to mark the newpage dirty in storage key when the
> old page was dirty there.

I went to prepare a patch to fix this, and ended up finding no such
problem to fix - which fits with how no such problem has been reported.

Most of it is handled by page migration's unmap_and_move() having to
unmap the old page first: so the old page will pass through the final
page_remove_rmap(), which will transfer storage key to page_dirty in
those cases which it deals with (with the old code, any file or swap
page; with the new code, any unaccounted file or swap page, now that
we realize the accounted files don't even need this); and page_dirty
is already properly migrated to the new page.

But that does leave one case behind: an anonymous page not yet in
swapcache, migrated via a swap-like migration entry.  But this case
is not a problem because PageDirty doesn't actually affect anything
for an anonymous page not in swapcache.  There are various places
where we set it, and its life-history is hard to make sense of, but
in fact it's meaningless in 2.6, where page reclaim adds anon to swap
(and sets PageDirty) whether the page was marked dirty before or not
(which makes sense when we use the ZERO_PAGE for anon read faults).

2.4 did behave differently: it was liable to free anon pages not
marked dirty, and I think most of our anon SetPageDirtys are just a
relic of those days - I do have a patch from 18 months ago to remove
them (adding PG_dirty to the flags which should not be set when a
page is freed), but there are usually more urgent things to attend
to than rebase and retest that.

Hugh

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH] mm: Fix XFS oops due to dirty pages without buffers on s390
  2012-10-10 21:28           ` Hugh Dickins
  (?)
@ 2012-10-11  7:42             ` Martin Schwidefsky
  -1 siblings, 0 replies; 61+ messages in thread
From: Martin Schwidefsky @ 2012-10-11  7:42 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Jan Kara, linux-mm, LKML, xfs, Mel Gorman, linux-s390

On Wed, 10 Oct 2012 14:28:32 -0700 (PDT)
Hugh Dickins <hughd@google.com> wrote:

> But perhaps these machines aren't much into heavy swapping.  Now, 
> if Martin would send me a nice little zSeries netbook for Xmas,
> I could then test that end of it myself ;)

Are you sure about that? The electricity cost alone for such a beast
is quite high ;-)

> I've just arrived at the conclusion that page migration does _not_
> have a problem with transferring the dirty storage key: I had been
> thinking that your testing might stumble on that issue, and need a
> further patch, but I'll explain in other mail why now I think not.

That is good to know, one problem less on the list.

-- 
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.


^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH] mm: Fix XFS oops due to dirty pages without buffers on s390
@ 2012-10-11  7:42             ` Martin Schwidefsky
  0 siblings, 0 replies; 61+ messages in thread
From: Martin Schwidefsky @ 2012-10-11  7:42 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: linux-s390, Jan Kara, LKML, xfs, linux-mm, Mel Gorman

On Wed, 10 Oct 2012 14:28:32 -0700 (PDT)
Hugh Dickins <hughd@google.com> wrote:

> But perhaps these machines aren't much into heavy swapping.  Now, 
> if Martin would send me a nice little zSeries netbook for Xmas,
> I could then test that end of it myself ;)

Are you sure about that? The electricity cost alone for such a beast
is quite high ;-)

> I've just arrived at the conclusion that page migration does _not_
> have a problem with transferring the dirty storage key: I had been
> thinking that your testing might stumble on that issue, and need a
> further patch, but I'll explain in other mail why now I think not.

That is good to know, one problem less on the list.

-- 
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH] mm: Fix XFS oops due to dirty pages without buffers on s390
@ 2012-10-11  7:42             ` Martin Schwidefsky
  0 siblings, 0 replies; 61+ messages in thread
From: Martin Schwidefsky @ 2012-10-11  7:42 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Jan Kara, linux-mm, LKML, xfs, Mel Gorman, linux-s390

On Wed, 10 Oct 2012 14:28:32 -0700 (PDT)
Hugh Dickins <hughd@google.com> wrote:

> But perhaps these machines aren't much into heavy swapping.  Now, 
> if Martin would send me a nice little zSeries netbook for Xmas,
> I could then test that end of it myself ;)

Are you sure about that? The electricity cost alone for such a beast
is quite high ;-)

> I've just arrived at the conclusion that page migration does _not_
> have a problem with transferring the dirty storage key: I had been
> thinking that your testing might stumble on that issue, and need a
> further patch, but I'll explain in other mail why now I think not.

That is good to know, one problem less on the list.

-- 
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH] mm: Fix XFS oops due to dirty pages without buffers on s390
  2012-10-10 21:56         ` Dave Chinner
  (?)
@ 2012-10-11  7:44           ` Martin Schwidefsky
  -1 siblings, 0 replies; 61+ messages in thread
From: Martin Schwidefsky @ 2012-10-11  7:44 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Hugh Dickins, Jan Kara, linux-mm, LKML, xfs, Mel Gorman, linux-s390

On Thu, 11 Oct 2012 08:56:00 +1100
Dave Chinner <david@fromorbit.com> wrote:

> On Tue, Oct 09, 2012 at 07:19:09PM -0700, Hugh Dickins wrote:
> > On Tue, 9 Oct 2012, Jan Kara wrote:
> > > On Mon 08-10-12 21:24:40, Hugh Dickins wrote:
> > > > On Mon, 1 Oct 2012, Jan Kara wrote:
> > > > 
> > > > > On s390 any write to a page (even from kernel itself) sets architecture
> > > > > specific page dirty bit. Thus when a page is written to via standard write, HW
> > > > > dirty bit gets set and when we later map and unmap the page, page_remove_rmap()
> > > > > finds the dirty bit and calls set_page_dirty().
> > > > > 
> > > > > Dirtying of a page which shouldn't be dirty can cause all sorts of problems to
> > > > > filesystems. The bug we observed in practice is that buffers from the page get
> > > > > freed, so when the page gets later marked as dirty and writeback writes it, XFS
> > > > > crashes due to an assertion BUG_ON(!PagePrivate(page)) in page_buffers() called
> > > > > from xfs_count_page_state().
> > > > 
> > > > What changed recently?  Was XFS hardly used on s390 until now?
> > >   The problem was originally hit on SLE11-SP2 which is 3.0 based after
> > > migration of our s390 build machines from SLE11-SP1 (2.6.32 based). I think
> > > XFS just started to be more peevish about what pages it gets between these
> > > two releases ;) (e.g. ext3 or ext4 just says "oh, well" and fixes things
> > > up).
> > 
> > Right, in 2.6.32 xfs_vm_writepage() had a !page_has_buffers(page) case,
> > whereas by 3.0 that had become ASSERT(page_has_buffers(page)), with the
> > ASSERT usually compiled out, stumbling later in page_buffers() as you say.
> 
> What that says is that no-one is running xfstests-based QA on s390
> with CONFIG_XFS_DEBUG enabled, otherwise this would have been found.
> I've never tested XFS on s390 before, and I doubt any of the
> upstream developers have, either, because not many peopl ehave s390
> machines in their basement. So this is probably just an oversight
> in the distro QA environment more than anything....

Our internal builds indeed have CONFIG_XFS_DEBUG=n, I'll change that and
watch for the fallout.

-- 
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.


^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH] mm: Fix XFS oops due to dirty pages without buffers on s390
@ 2012-10-11  7:44           ` Martin Schwidefsky
  0 siblings, 0 replies; 61+ messages in thread
From: Martin Schwidefsky @ 2012-10-11  7:44 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-s390, Jan Kara, Hugh Dickins, LKML, xfs, linux-mm, Mel Gorman

On Thu, 11 Oct 2012 08:56:00 +1100
Dave Chinner <david@fromorbit.com> wrote:

> On Tue, Oct 09, 2012 at 07:19:09PM -0700, Hugh Dickins wrote:
> > On Tue, 9 Oct 2012, Jan Kara wrote:
> > > On Mon 08-10-12 21:24:40, Hugh Dickins wrote:
> > > > On Mon, 1 Oct 2012, Jan Kara wrote:
> > > > 
> > > > > On s390 any write to a page (even from kernel itself) sets architecture
> > > > > specific page dirty bit. Thus when a page is written to via standard write, HW
> > > > > dirty bit gets set and when we later map and unmap the page, page_remove_rmap()
> > > > > finds the dirty bit and calls set_page_dirty().
> > > > > 
> > > > > Dirtying of a page which shouldn't be dirty can cause all sorts of problems to
> > > > > filesystems. The bug we observed in practice is that buffers from the page get
> > > > > freed, so when the page gets later marked as dirty and writeback writes it, XFS
> > > > > crashes due to an assertion BUG_ON(!PagePrivate(page)) in page_buffers() called
> > > > > from xfs_count_page_state().
> > > > 
> > > > What changed recently?  Was XFS hardly used on s390 until now?
> > >   The problem was originally hit on SLE11-SP2 which is 3.0 based after
> > > migration of our s390 build machines from SLE11-SP1 (2.6.32 based). I think
> > > XFS just started to be more peevish about what pages it gets between these
> > > two releases ;) (e.g. ext3 or ext4 just says "oh, well" and fixes things
> > > up).
> > 
> > Right, in 2.6.32 xfs_vm_writepage() had a !page_has_buffers(page) case,
> > whereas by 3.0 that had become ASSERT(page_has_buffers(page)), with the
> > ASSERT usually compiled out, stumbling later in page_buffers() as you say.
> 
> What that says is that no-one is running xfstests-based QA on s390
> with CONFIG_XFS_DEBUG enabled, otherwise this would have been found.
> I've never tested XFS on s390 before, and I doubt any of the
> upstream developers have, either, because not many peopl ehave s390
> machines in their basement. So this is probably just an oversight
> in the distro QA environment more than anything....

Our internal builds indeed have CONFIG_XFS_DEBUG=n, I'll change that and
watch for the fallout.

-- 
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH] mm: Fix XFS oops due to dirty pages without buffers on s390
@ 2012-10-11  7:44           ` Martin Schwidefsky
  0 siblings, 0 replies; 61+ messages in thread
From: Martin Schwidefsky @ 2012-10-11  7:44 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Hugh Dickins, Jan Kara, linux-mm, LKML, xfs, Mel Gorman, linux-s390

On Thu, 11 Oct 2012 08:56:00 +1100
Dave Chinner <david@fromorbit.com> wrote:

> On Tue, Oct 09, 2012 at 07:19:09PM -0700, Hugh Dickins wrote:
> > On Tue, 9 Oct 2012, Jan Kara wrote:
> > > On Mon 08-10-12 21:24:40, Hugh Dickins wrote:
> > > > On Mon, 1 Oct 2012, Jan Kara wrote:
> > > > 
> > > > > On s390 any write to a page (even from kernel itself) sets architecture
> > > > > specific page dirty bit. Thus when a page is written to via standard write, HW
> > > > > dirty bit gets set and when we later map and unmap the page, page_remove_rmap()
> > > > > finds the dirty bit and calls set_page_dirty().
> > > > > 
> > > > > Dirtying of a page which shouldn't be dirty can cause all sorts of problems to
> > > > > filesystems. The bug we observed in practice is that buffers from the page get
> > > > > freed, so when the page gets later marked as dirty and writeback writes it, XFS
> > > > > crashes due to an assertion BUG_ON(!PagePrivate(page)) in page_buffers() called
> > > > > from xfs_count_page_state().
> > > > 
> > > > What changed recently?  Was XFS hardly used on s390 until now?
> > >   The problem was originally hit on SLE11-SP2 which is 3.0 based after
> > > migration of our s390 build machines from SLE11-SP1 (2.6.32 based). I think
> > > XFS just started to be more peevish about what pages it gets between these
> > > two releases ;) (e.g. ext3 or ext4 just says "oh, well" and fixes things
> > > up).
> > 
> > Right, in 2.6.32 xfs_vm_writepage() had a !page_has_buffers(page) case,
> > whereas by 3.0 that had become ASSERT(page_has_buffers(page)), with the
> > ASSERT usually compiled out, stumbling later in page_buffers() as you say.
> 
> What that says is that no-one is running xfstests-based QA on s390
> with CONFIG_XFS_DEBUG enabled, otherwise this would have been found.
> I've never tested XFS on s390 before, and I doubt any of the
> upstream developers have, either, because not many peopl ehave s390
> machines in their basement. So this is probably just an oversight
> in the distro QA environment more than anything....

Our internal builds indeed have CONFIG_XFS_DEBUG=n, I'll change that and
watch for the fallout.

-- 
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH] mm: Fix XFS oops due to dirty pages without buffers on s390
  2012-10-10  2:19       ` Hugh Dickins
  (?)
@ 2012-10-17  0:43         ` Jan Kara
  -1 siblings, 0 replies; 61+ messages in thread
From: Jan Kara @ 2012-10-17  0:43 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Jan Kara, linux-mm, LKML, xfs, Martin Schwidefsky, Mel Gorman,
	linux-s390

On Tue 09-10-12 19:19:09, Hugh Dickins wrote:
> On Tue, 9 Oct 2012, Jan Kara wrote:
<snip a lot>
> > > But here's where I think the problem is.  You're assuming that all
> > > filesystems go the same mapping_cap_account_writeback_dirty() (yeah,
> > > there's no such function, just a confusing maze of three) route as XFS.
> > > 
> > > But filesystems like tmpfs and ramfs (perhaps they're the only two
> > > that matter here) don't participate in that, and wait for an mmap'ed
> > > page to be seen modified by the user (usually via pte_dirty, but that's
> > > a no-op on s390) before page is marked dirty; and page reclaim throws
> > > away undirtied pages.
> >   I admit I haven't thought of tmpfs and similar. After some discussion Mel
> > pointed me to the code in mmap which makes a difference. So if I get it
> > right, the difference which causes us problems is that on tmpfs we map the
> > page writeably even during read-only fault. OK, then if I make the above
> > code in page_remove_rmap():
> > 	if ((PageSwapCache(page) ||
> > 	     (!anon && !mapping_cap_account_dirty(page->mapping))) &&
> > 	    page_test_and_clear_dirty(page_to_pfn(page), 1))
> > 		set_page_dirty(page);
> > 
> >   Things should be ok (modulo the ugliness of this condition), right?
> 
> (Setting aside my reservations above...) That's almost exactly right, but
> I think the issue of a racing truncation (which could reset page->mapping
> to NULL at any moment) means we have to be a bit more careful.  Usually
> we guard against that with page lock, but here we can rely on mapcount.
> 
> page_mapping(page), with its built-in PageSwapCache check, actually ends
> up making the condition look less ugly; and so far as I could tell,
> the extra code does get optimized out on x86 (unless CONFIG_DEBUG_VM,
> when we are left with its VM_BUG_ON(PageSlab(page))).
> 
> But please look this over very critically and test (and if you like it,
> please adopt it as your own): I'm not entirely convinced yet myself.
  Just to followup on this. The new version of the patch runs fine for
several days on our s390 build machines. I was also running fsx-linux on
tmpfs while pushing the machine to swap. fsx ran fine but I hit
WARN_ON(delalloc) in xfs_vm_releasepage(). The exact stack trace is:
 [<000003c008edb38e>] xfs_vm_releasepage+0xc6/0xd4 [xfs]
 [<0000000000213326>] shrink_page_list+0x6ba/0x734
 [<0000000000213924>] shrink_inactive_list+0x230/0x578
 [<0000000000214148>] shrink_list+0x6c/0x120
 [<00000000002143ee>] shrink_zone+0x1f2/0x238
 [<0000000000215482>] balance_pgdat+0x5f6/0x86c
 [<00000000002158b8>] kswapd+0x1c0/0x248
 [<000000000017642a>] kthread+0xa6/0xb0
 [<00000000004e58be>] kernel_thread_starter+0x6/0xc
 [<00000000004e58b8>] kernel_thread_starter+0x0/0xc

I don't think it is really related but I'll hold off the patch for a while
to investigate what's going on...

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH] mm: Fix XFS oops due to dirty pages without buffers on s390
@ 2012-10-17  0:43         ` Jan Kara
  0 siblings, 0 replies; 61+ messages in thread
From: Jan Kara @ 2012-10-17  0:43 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: linux-s390, Jan Kara, LKML, xfs, linux-mm, Mel Gorman,
	Martin Schwidefsky

On Tue 09-10-12 19:19:09, Hugh Dickins wrote:
> On Tue, 9 Oct 2012, Jan Kara wrote:
<snip a lot>
> > > But here's where I think the problem is.  You're assuming that all
> > > filesystems go the same mapping_cap_account_writeback_dirty() (yeah,
> > > there's no such function, just a confusing maze of three) route as XFS.
> > > 
> > > But filesystems like tmpfs and ramfs (perhaps they're the only two
> > > that matter here) don't participate in that, and wait for an mmap'ed
> > > page to be seen modified by the user (usually via pte_dirty, but that's
> > > a no-op on s390) before page is marked dirty; and page reclaim throws
> > > away undirtied pages.
> >   I admit I haven't thought of tmpfs and similar. After some discussion Mel
> > pointed me to the code in mmap which makes a difference. So if I get it
> > right, the difference which causes us problems is that on tmpfs we map the
> > page writeably even during read-only fault. OK, then if I make the above
> > code in page_remove_rmap():
> > 	if ((PageSwapCache(page) ||
> > 	     (!anon && !mapping_cap_account_dirty(page->mapping))) &&
> > 	    page_test_and_clear_dirty(page_to_pfn(page), 1))
> > 		set_page_dirty(page);
> > 
> >   Things should be ok (modulo the ugliness of this condition), right?
> 
> (Setting aside my reservations above...) That's almost exactly right, but
> I think the issue of a racing truncation (which could reset page->mapping
> to NULL at any moment) means we have to be a bit more careful.  Usually
> we guard against that with page lock, but here we can rely on mapcount.
> 
> page_mapping(page), with its built-in PageSwapCache check, actually ends
> up making the condition look less ugly; and so far as I could tell,
> the extra code does get optimized out on x86 (unless CONFIG_DEBUG_VM,
> when we are left with its VM_BUG_ON(PageSlab(page))).
> 
> But please look this over very critically and test (and if you like it,
> please adopt it as your own): I'm not entirely convinced yet myself.
  Just to followup on this. The new version of the patch runs fine for
several days on our s390 build machines. I was also running fsx-linux on
tmpfs while pushing the machine to swap. fsx ran fine but I hit
WARN_ON(delalloc) in xfs_vm_releasepage(). The exact stack trace is:
 [<000003c008edb38e>] xfs_vm_releasepage+0xc6/0xd4 [xfs]
 [<0000000000213326>] shrink_page_list+0x6ba/0x734
 [<0000000000213924>] shrink_inactive_list+0x230/0x578
 [<0000000000214148>] shrink_list+0x6c/0x120
 [<00000000002143ee>] shrink_zone+0x1f2/0x238
 [<0000000000215482>] balance_pgdat+0x5f6/0x86c
 [<00000000002158b8>] kswapd+0x1c0/0x248
 [<000000000017642a>] kthread+0xa6/0xb0
 [<00000000004e58be>] kernel_thread_starter+0x6/0xc
 [<00000000004e58b8>] kernel_thread_starter+0x0/0xc

I don't think it is really related but I'll hold off the patch for a while
to investigate what's going on...

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH] mm: Fix XFS oops due to dirty pages without buffers on s390
@ 2012-10-17  0:43         ` Jan Kara
  0 siblings, 0 replies; 61+ messages in thread
From: Jan Kara @ 2012-10-17  0:43 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Jan Kara, linux-mm, LKML, xfs, Martin Schwidefsky, Mel Gorman,
	linux-s390

On Tue 09-10-12 19:19:09, Hugh Dickins wrote:
> On Tue, 9 Oct 2012, Jan Kara wrote:
<snip a lot>
> > > But here's where I think the problem is.  You're assuming that all
> > > filesystems go the same mapping_cap_account_writeback_dirty() (yeah,
> > > there's no such function, just a confusing maze of three) route as XFS.
> > > 
> > > But filesystems like tmpfs and ramfs (perhaps they're the only two
> > > that matter here) don't participate in that, and wait for an mmap'ed
> > > page to be seen modified by the user (usually via pte_dirty, but that's
> > > a no-op on s390) before page is marked dirty; and page reclaim throws
> > > away undirtied pages.
> >   I admit I haven't thought of tmpfs and similar. After some discussion Mel
> > pointed me to the code in mmap which makes a difference. So if I get it
> > right, the difference which causes us problems is that on tmpfs we map the
> > page writeably even during read-only fault. OK, then if I make the above
> > code in page_remove_rmap():
> > 	if ((PageSwapCache(page) ||
> > 	     (!anon && !mapping_cap_account_dirty(page->mapping))) &&
> > 	    page_test_and_clear_dirty(page_to_pfn(page), 1))
> > 		set_page_dirty(page);
> > 
> >   Things should be ok (modulo the ugliness of this condition), right?
> 
> (Setting aside my reservations above...) That's almost exactly right, but
> I think the issue of a racing truncation (which could reset page->mapping
> to NULL at any moment) means we have to be a bit more careful.  Usually
> we guard against that with page lock, but here we can rely on mapcount.
> 
> page_mapping(page), with its built-in PageSwapCache check, actually ends
> up making the condition look less ugly; and so far as I could tell,
> the extra code does get optimized out on x86 (unless CONFIG_DEBUG_VM,
> when we are left with its VM_BUG_ON(PageSlab(page))).
> 
> But please look this over very critically and test (and if you like it,
> please adopt it as your own): I'm not entirely convinced yet myself.
  Just to followup on this. The new version of the patch runs fine for
several days on our s390 build machines. I was also running fsx-linux on
tmpfs while pushing the machine to swap. fsx ran fine but I hit
WARN_ON(delalloc) in xfs_vm_releasepage(). The exact stack trace is:
 [<000003c008edb38e>] xfs_vm_releasepage+0xc6/0xd4 [xfs]
 [<0000000000213326>] shrink_page_list+0x6ba/0x734
 [<0000000000213924>] shrink_inactive_list+0x230/0x578
 [<0000000000214148>] shrink_list+0x6c/0x120
 [<00000000002143ee>] shrink_zone+0x1f2/0x238
 [<0000000000215482>] balance_pgdat+0x5f6/0x86c
 [<00000000002158b8>] kswapd+0x1c0/0x248
 [<000000000017642a>] kthread+0xa6/0xb0
 [<00000000004e58be>] kernel_thread_starter+0x6/0xc
 [<00000000004e58b8>] kernel_thread_starter+0x0/0xc

I don't think it is really related but I'll hold off the patch for a while
to investigate what's going on...

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH] mm: Fix XFS oops due to dirty pages without buffers on s390
  2012-10-09 23:21       ` Hugh Dickins
  (?)
@ 2012-10-19 14:38         ` Martin Schwidefsky
  -1 siblings, 0 replies; 61+ messages in thread
From: Martin Schwidefsky @ 2012-10-19 14:38 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Jan Kara, linux-mm, LKML, xfs, Mel Gorman, linux-s390

On Tue, 9 Oct 2012 16:21:24 -0700 (PDT)
Hugh Dickins <hughd@google.com> wrote:

> > 
> > I am seriously tempted to switch to pure software dirty bits by using
> > page protection for writable but clean pages. The worry is the number of
> > additional protection faults we would get. But as we do software dirty
> > bit tracking for the most part anyway this might not be as bad as it
> > used to be.  
> 
> That's exactly the same reason why tmpfs opts out of dirty tracking, fear
> of unnecessary extra faults.  Anomalous as s390 is here, tmpfs is being
> anomalous too, and I'd be a hypocrite to push for you to make that change.

I tested the waters with the software dirty bit idea. Using kernel compile
as test case I got these numbers:

disk backing, swdirty: 10,023,870 minor-faults 18 major-faults
disk backing, hwdirty: 10,023,829 minor-faults 21 major-faults                          

tmpfs backing, swdirty: 10,019,552 minor-faults 49 major-faults
tmpfs backing, hwdirty: 10,032,909 minor-faults 81 major-faults

That does not look bad at all. One test I found that shows an effect is
lat_mmap from LMBench:

disk backing, hwdirty: 30,894 minor-faults 0 major-faults
disk backing, swdirty: 30,894 minor-faults 0 major-faults

tmpfs backing, hwdirty: 22,574 minor-faults 0 major-faults
tmpfs backing, swdirty: 36,652 minor-faults 0 major-faults 

The runtime between the hwdirty vs. the swdirty setup is very similar,
encouraging enough for me to ask our performance team to run a larger test.

-- 
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.


^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH] mm: Fix XFS oops due to dirty pages without buffers on s390
@ 2012-10-19 14:38         ` Martin Schwidefsky
  0 siblings, 0 replies; 61+ messages in thread
From: Martin Schwidefsky @ 2012-10-19 14:38 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: linux-s390, Jan Kara, LKML, xfs, linux-mm, Mel Gorman

On Tue, 9 Oct 2012 16:21:24 -0700 (PDT)
Hugh Dickins <hughd@google.com> wrote:

> > 
> > I am seriously tempted to switch to pure software dirty bits by using
> > page protection for writable but clean pages. The worry is the number of
> > additional protection faults we would get. But as we do software dirty
> > bit tracking for the most part anyway this might not be as bad as it
> > used to be.  
> 
> That's exactly the same reason why tmpfs opts out of dirty tracking, fear
> of unnecessary extra faults.  Anomalous as s390 is here, tmpfs is being
> anomalous too, and I'd be a hypocrite to push for you to make that change.

I tested the waters with the software dirty bit idea. Using kernel compile
as test case I got these numbers:

disk backing, swdirty: 10,023,870 minor-faults 18 major-faults
disk backing, hwdirty: 10,023,829 minor-faults 21 major-faults                          

tmpfs backing, swdirty: 10,019,552 minor-faults 49 major-faults
tmpfs backing, hwdirty: 10,032,909 minor-faults 81 major-faults

That does not look bad at all. One test I found that shows an effect is
lat_mmap from LMBench:

disk backing, hwdirty: 30,894 minor-faults 0 major-faults
disk backing, swdirty: 30,894 minor-faults 0 major-faults

tmpfs backing, hwdirty: 22,574 minor-faults 0 major-faults
tmpfs backing, swdirty: 36,652 minor-faults 0 major-faults 

The runtime between the hwdirty vs. the swdirty setup is very similar,
encouraging enough for me to ask our performance team to run a larger test.

-- 
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH] mm: Fix XFS oops due to dirty pages without buffers on s390
@ 2012-10-19 14:38         ` Martin Schwidefsky
  0 siblings, 0 replies; 61+ messages in thread
From: Martin Schwidefsky @ 2012-10-19 14:38 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Jan Kara, linux-mm, LKML, xfs, Mel Gorman, linux-s390

On Tue, 9 Oct 2012 16:21:24 -0700 (PDT)
Hugh Dickins <hughd@google.com> wrote:

> > 
> > I am seriously tempted to switch to pure software dirty bits by using
> > page protection for writable but clean pages. The worry is the number of
> > additional protection faults we would get. But as we do software dirty
> > bit tracking for the most part anyway this might not be as bad as it
> > used to be.  
> 
> That's exactly the same reason why tmpfs opts out of dirty tracking, fear
> of unnecessary extra faults.  Anomalous as s390 is here, tmpfs is being
> anomalous too, and I'd be a hypocrite to push for you to make that change.

I tested the waters with the software dirty bit idea. Using kernel compile
as test case I got these numbers:

disk backing, swdirty: 10,023,870 minor-faults 18 major-faults
disk backing, hwdirty: 10,023,829 minor-faults 21 major-faults                          

tmpfs backing, swdirty: 10,019,552 minor-faults 49 major-faults
tmpfs backing, hwdirty: 10,032,909 minor-faults 81 major-faults

That does not look bad at all. One test I found that shows an effect is
lat_mmap from LMBench:

disk backing, hwdirty: 30,894 minor-faults 0 major-faults
disk backing, swdirty: 30,894 minor-faults 0 major-faults

tmpfs backing, hwdirty: 22,574 minor-faults 0 major-faults
tmpfs backing, swdirty: 36,652 minor-faults 0 major-faults 

The runtime between the hwdirty vs. the swdirty setup is very similar,
encouraging enough for me to ask our performance team to run a larger test.

-- 
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH] mm: Fix XFS oops due to dirty pages without buffers on s390
  2012-12-17 23:31           ` Hugh Dickins
@ 2012-12-18  7:30             ` Martin Schwidefsky
  0 siblings, 0 replies; 61+ messages in thread
From: Martin Schwidefsky @ 2012-12-18  7:30 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Jan Kara, Andrew Morton, linux-mm, Mel Gorman, linux-s390

On Mon, 17 Dec 2012 15:31:47 -0800 (PST)
Hugh Dickins <hughd@google.com> wrote:

> On Fri, 14 Dec 2012, Martin Schwidefsky wrote:
> > 
> > The patch got delayed a bit,
> 
> Thanks a lot for finding the time to do this:
> I never expected it to get priority.
> 
> > the main issue is to get conclusive performance
> > measurements about the effects of the patch. I am pretty sure that the patch
> > works and will not cause any major degradation so it is time to ask for your
> > opinion. Here we go:
> 
> If if works reliably and efficiently for you on s390, then I'm strongly in
> favour of it; and I cannot imagine who would not be - it removes several
> hunks of surprising and poorly understood code from the generic mm end.
> 
> I'm slightly disappointed to be reminded of page_test_and_clear_young(),
> and find it still there; but it's been an order of magnitude less
> troubling than the _dirty, so not worth more effort I guess.

To remove the dependency on the referenced-bit in the storage key would
require to set the invalid bit on the pte until the first access has been
done. Then the referenced bit would have to be set and a valid pte can
be established. That would be costly, because we would get a lot more
program checks on the invalid, old ptes. So the page_test_and_clear_young
needs to stay. The situation for the referenced bits is much more relaxed
though, we can afford to loose the one of the other referenced bit
without ill effect. I would not worry about page_test_and_clear_young
too much.

-- 
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH] mm: Fix XFS oops due to dirty pages without buffers on s390
  2012-12-14  8:45         ` Martin Schwidefsky
@ 2012-12-17 23:31           ` Hugh Dickins
  2012-12-18  7:30             ` Martin Schwidefsky
  0 siblings, 1 reply; 61+ messages in thread
From: Hugh Dickins @ 2012-12-17 23:31 UTC (permalink / raw)
  To: Martin Schwidefsky
  Cc: Jan Kara, Andrew Morton, linux-mm, Mel Gorman, linux-s390

On Fri, 14 Dec 2012, Martin Schwidefsky wrote:
> 
> The patch got delayed a bit,

Thanks a lot for finding the time to do this:
I never expected it to get priority.

> the main issue is to get conclusive performance
> measurements about the effects of the patch. I am pretty sure that the patch
> works and will not cause any major degradation so it is time to ask for your
> opinion. Here we go:

If if works reliably and efficiently for you on s390, then I'm strongly in
favour of it; and I cannot imagine who would not be - it removes several
hunks of surprising and poorly understood code from the generic mm end.

I'm slightly disappointed to be reminded of page_test_and_clear_young(),
and find it still there; but it's been an order of magnitude less
troubling than the _dirty, so not worth more effort I guess.

Hugh

> --
> Subject: [PATCH] s390/mm: implement software dirty bits
> 
> From: Martin Schwidefsky <schwidefsky@de.ibm.com>
> 
> The s390 architecture is unique in respect to dirty page detection,
> it uses the change bit in the per-page storage key to track page
> modifications. All other architectures track dirty bits by means
> of page table entries. This property of s390 has caused numerous
> problems in the past, e.g. see git commit ef5d437f71afdf4a
> "mm: fix XFS oops due to dirty pages without buffers on s390".
> 
> To avoid future issues in regard to per-page dirty bits convert
> s390 to a fault based software dirty bit detection mechanism. All
> user page table entries which are marked as clean will be hardware
> read-only, even if the pte is supposed to be writable. A write by
> the user process will trigger a protection fault which will cause
> the user pte to be marked as dirty and the hardware read-only bit
> is removed.
> 
> With this change the dirty bit in the storage key is irrelevant
> for Linux as a host, but the storage key is still required for
> KVM guests. The effect is that page_test_and_clear_dirty and the
> related code can be removed. The referenced bit in the storage
> key is still used by the page_test_and_clear_young primitive to
> provide page age information.
> 
> For page cache pages of mappings with mapping_cap_account_dirty
> there will not be any change in behavior as the dirty bit tracking
> already uses read-only ptes to control the amount of dirty pages.
> Only for swap cache pages and pages of mappings without
> mapping_cap_account_dirty there can be additional protection faults.
> To avoid an excessive number of additional faults the mk_pte
> primitive checks for PageDirty if the pgprot value allows for writes
> and pre-dirties the pte. That avoids all additional faults for
> tmpfs and shmem pages until these pages are added to the swap cache.
> 
> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
> ---
>  arch/s390/include/asm/page.h    |  22 -------
>  arch/s390/include/asm/pgtable.h | 131 +++++++++++++++++++++++++++-------------
>  arch/s390/include/asm/sclp.h    |   1 -
>  arch/s390/include/asm/setup.h   |  16 ++---
>  arch/s390/kvm/kvm-s390.c        |   2 +-
>  arch/s390/lib/uaccess_pt.c      |   2 +-
>  arch/s390/mm/pageattr.c         |   2 +-
>  arch/s390/mm/vmem.c             |  24 +++-----
>  drivers/s390/char/sclp_cmd.c    |  10 +--
>  include/asm-generic/pgtable.h   |  10 ---
>  include/linux/page-flags.h      |   8 ---
>  mm/rmap.c                       |  23 -------
>  12 files changed, 112 insertions(+), 139 deletions(-)
> 
> diff --git a/arch/s390/include/asm/page.h b/arch/s390/include/asm/page.h
> index a86ad40840..75ce9b0 100644
> --- a/arch/s390/include/asm/page.h
> +++ b/arch/s390/include/asm/page.h
> @@ -155,28 +155,6 @@ static inline int page_reset_referenced(unsigned long addr)
>  #define _PAGE_ACC_BITS		0xf0	/* HW access control bits	*/
>  
>  /*
> - * Test and clear dirty bit in storage key.
> - * We can't clear the changed bit atomically. This is a potential
> - * race against modification of the referenced bit. This function
> - * should therefore only be called if it is not mapped in any
> - * address space.
> - *
> - * Note that the bit gets set whenever page content is changed. That means
> - * also when the page is modified by DMA or from inside the kernel.
> - */
> -#define __HAVE_ARCH_PAGE_TEST_AND_CLEAR_DIRTY
> -static inline int page_test_and_clear_dirty(unsigned long pfn, int mapped)
> -{
> -	unsigned char skey;
> -
> -	skey = page_get_storage_key(pfn << PAGE_SHIFT);
> -	if (!(skey & _PAGE_CHANGED))
> -		return 0;
> -	page_set_storage_key(pfn << PAGE_SHIFT, skey & ~_PAGE_CHANGED, mapped);
> -	return 1;
> -}
> -
> -/*
>   * Test and clear referenced bit in storage key.
>   */
>  #define __HAVE_ARCH_PAGE_TEST_AND_CLEAR_YOUNG
> diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
> index 33aeb77..66d3b2a 100644
> --- a/arch/s390/include/asm/pgtable.h
> +++ b/arch/s390/include/asm/pgtable.h
> @@ -29,6 +29,7 @@
>  #ifndef __ASSEMBLY__
>  #include <linux/sched.h>
>  #include <linux/mm_types.h>
> +#include <linux/page-flags.h>
>  #include <asm/bug.h>
>  #include <asm/page.h>
>  
> @@ -221,13 +222,15 @@ extern unsigned long MODULES_END;
>  /* Software bits in the page table entry */
>  #define _PAGE_SWT	0x001		/* SW pte type bit t */
>  #define _PAGE_SWX	0x002		/* SW pte type bit x */
> -#define _PAGE_SWC	0x004		/* SW pte changed bit (for KVM) */
> -#define _PAGE_SWR	0x008		/* SW pte referenced bit (for KVM) */
> -#define _PAGE_SPECIAL	0x010		/* SW associated with special page */
> +#define _PAGE_SWC	0x004		/* SW pte changed bit */
> +#define _PAGE_SWR	0x008		/* SW pte referenced bit */
> +#define _PAGE_SWW	0x010		/* SW pte write bit */
> +#define _PAGE_SPECIAL	0x020		/* SW associated with special page */
>  #define __HAVE_ARCH_PTE_SPECIAL
>  
>  /* Set of bits not changed in pte_modify */
> -#define _PAGE_CHG_MASK	(PAGE_MASK | _PAGE_SPECIAL | _PAGE_SWC | _PAGE_SWR)
> +#define _PAGE_CHG_MASK		(PAGE_MASK | _PAGE_SPECIAL | _PAGE_CO | \
> +				 _PAGE_SWC | _PAGE_SWR)
>  
>  /* Six different types of pages. */
>  #define _PAGE_TYPE_EMPTY	0x400
> @@ -321,6 +324,7 @@ extern unsigned long MODULES_END;
>  
>  /* Bits in the region table entry */
>  #define _REGION_ENTRY_ORIGIN	~0xfffUL/* region/segment table origin	    */
> +#define _REGION_ENTRY_RO	0x200	/* region protection bit	    */
>  #define _REGION_ENTRY_INV	0x20	/* invalid region table entry	    */
>  #define _REGION_ENTRY_TYPE_MASK	0x0c	/* region/segment table type mask   */
>  #define _REGION_ENTRY_TYPE_R1	0x0c	/* region first table type	    */
> @@ -382,9 +386,10 @@ extern unsigned long MODULES_END;
>   */
>  #define PAGE_NONE	__pgprot(_PAGE_TYPE_NONE)
>  #define PAGE_RO		__pgprot(_PAGE_TYPE_RO)
> -#define PAGE_RW		__pgprot(_PAGE_TYPE_RW)
> +#define PAGE_RW		__pgprot(_PAGE_TYPE_RO | _PAGE_SWW)
> +#define PAGE_RWC	__pgprot(_PAGE_TYPE_RW | _PAGE_SWW | _PAGE_SWC)
>  
> -#define PAGE_KERNEL	PAGE_RW
> +#define PAGE_KERNEL	PAGE_RWC
>  #define PAGE_COPY	PAGE_RO
>  
>  /*
> @@ -625,23 +630,23 @@ static inline pgste_t pgste_update_all(pte_t *ptep, pgste_t pgste)
>  	bits = skey & (_PAGE_CHANGED | _PAGE_REFERENCED);
>  	/* Clear page changed & referenced bit in the storage key */
>  	if (bits & _PAGE_CHANGED)
> -		page_set_storage_key(address, skey ^ bits, 1);
> +		page_set_storage_key(address, skey ^ bits, 0);
>  	else if (bits)
>  		page_reset_referenced(address);
>  	/* Transfer page changed & referenced bit to guest bits in pgste */
>  	pgste_val(pgste) |= bits << 48;		/* RCP_GR_BIT & RCP_GC_BIT */
>  	/* Get host changed & referenced bits from pgste */
>  	bits |= (pgste_val(pgste) & (RCP_HR_BIT | RCP_HC_BIT)) >> 52;
> -	/* Clear host bits in pgste. */
> +	/* Transfer page changed & referenced bit to kvm user bits */
> +	pgste_val(pgste) |= bits << 45;		/* KVM_UR_BIT & KVM_UC_BIT */
> +	/* Clear relevant host bits in pgste. */
>  	pgste_val(pgste) &= ~(RCP_HR_BIT | RCP_HC_BIT);
>  	pgste_val(pgste) &= ~(RCP_ACC_BITS | RCP_FP_BIT);
>  	/* Copy page access key and fetch protection bit to pgste */
>  	pgste_val(pgste) |=
>  		(unsigned long) (skey & (_PAGE_ACC_BITS | _PAGE_FP_BIT)) << 56;
> -	/* Transfer changed and referenced to kvm user bits */
> -	pgste_val(pgste) |= bits << 45;		/* KVM_UR_BIT & KVM_UC_BIT */
> -	/* Transfer changed & referenced to pte sofware bits */
> -	pte_val(*ptep) |= bits << 1;		/* _PAGE_SWR & _PAGE_SWC */
> +	/* Transfer referenced bit to pte */
> +	pte_val(*ptep) |= (bits & _PAGE_REFERENCED) << 1;
>  #endif
>  	return pgste;
>  
> @@ -654,20 +659,25 @@ static inline pgste_t pgste_update_young(pte_t *ptep, pgste_t pgste)
>  
>  	if (!pte_present(*ptep))
>  		return pgste;
> +	/* Get referenced bit from storage key */
>  	young = page_reset_referenced(pte_val(*ptep) & PAGE_MASK);
> -	/* Transfer page referenced bit to pte software bit (host view) */
> -	if (young || (pgste_val(pgste) & RCP_HR_BIT))
> +	if (young)
> +		pgste_val(pgste) |= RCP_GR_BIT;
> +	/* Get host referenced bit from pgste */
> +	if (pgste_val(pgste) & RCP_HR_BIT) {
> +		pgste_val(pgste) &= ~RCP_HR_BIT;
> +		young = 1;
> +	}
> +	/* Transfer referenced bit to kvm user bits and pte */
> +	if (young) {
> +		pgste_val(pgste) |= KVM_UR_BIT;
>  		pte_val(*ptep) |= _PAGE_SWR;
> -	/* Clear host referenced bit in pgste. */
> -	pgste_val(pgste) &= ~RCP_HR_BIT;
> -	/* Transfer page referenced bit to guest bit in pgste */
> -	pgste_val(pgste) |= (unsigned long) young << 50; /* set RCP_GR_BIT */
> +	}
>  #endif
>  	return pgste;
> -
>  }
>  
> -static inline void pgste_set_pte(pte_t *ptep, pgste_t pgste, pte_t entry)
> +static inline void pgste_set_key(pte_t *ptep, pgste_t pgste, pte_t entry)
>  {
>  #ifdef CONFIG_PGSTE
>  	unsigned long address;
> @@ -681,10 +691,23 @@ static inline void pgste_set_pte(pte_t *ptep, pgste_t pgste, pte_t entry)
>  	/* Set page access key and fetch protection bit from pgste */
>  	nkey |= (pgste_val(pgste) & (RCP_ACC_BITS | RCP_FP_BIT)) >> 56;
>  	if (okey != nkey)
> -		page_set_storage_key(address, nkey, 1);
> +		page_set_storage_key(address, nkey, 0);
>  #endif
>  }
>  
> +static inline void pgste_set_pte(pte_t *ptep, pte_t entry)
> +{
> +	if (!MACHINE_HAS_ESOP && (pte_val(entry) & _PAGE_SWW)) {
> +		/*
> +		 * Without enhanced suppression-on-protection force
> +		 * the dirty bit on for all writable ptes.
> +		 */
> +		pte_val(entry) |= _PAGE_SWC;
> +		pte_val(entry) &= ~_PAGE_RO;
> +	}
> +	*ptep = entry;
> +}
> +
>  /**
>   * struct gmap_struct - guest address space
>   * @mm: pointer to the parent mm_struct
> @@ -743,11 +766,14 @@ static inline void set_pte_at(struct mm_struct *mm, unsigned long addr,
>  
>  	if (mm_has_pgste(mm)) {
>  		pgste = pgste_get_lock(ptep);
> -		pgste_set_pte(ptep, pgste, entry);
> -		*ptep = entry;
> +		pgste_set_key(ptep, pgste, entry);
> +		pgste_set_pte(ptep, entry);
>  		pgste_set_unlock(ptep, pgste);
> -	} else
> +	} else {
> +		if (!(pte_val(entry) & _PAGE_INVALID) && MACHINE_HAS_EDAT1)
> +			pte_val(entry) |= _PAGE_CO;
>  		*ptep = entry;
> +	}
>  }
>  
>  /*
> @@ -756,16 +782,12 @@ static inline void set_pte_at(struct mm_struct *mm, unsigned long addr,
>   */
>  static inline int pte_write(pte_t pte)
>  {
> -	return (pte_val(pte) & _PAGE_RO) == 0;
> +	return (pte_val(pte) & _PAGE_SWW) != 0;
>  }
>  
>  static inline int pte_dirty(pte_t pte)
>  {
> -#ifdef CONFIG_PGSTE
> -	if (pte_val(pte) & _PAGE_SWC)
> -		return 1;
> -#endif
> -	return 0;
> +	return (pte_val(pte) & _PAGE_SWC) != 0;
>  }
>  
>  static inline int pte_young(pte_t pte)
> @@ -815,11 +837,14 @@ static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)
>  {
>  	pte_val(pte) &= _PAGE_CHG_MASK;
>  	pte_val(pte) |= pgprot_val(newprot);
> +	if ((pte_val(pte) & _PAGE_SWC) && (pte_val(pte) & _PAGE_SWW))
> +		pte_val(pte) &= ~_PAGE_RO;
>  	return pte;
>  }
>  
>  static inline pte_t pte_wrprotect(pte_t pte)
>  {
> +	pte_val(pte) &= ~_PAGE_SWW;
>  	/* Do not clobber _PAGE_TYPE_NONE pages!  */
>  	if (!(pte_val(pte) & _PAGE_INVALID))
>  		pte_val(pte) |= _PAGE_RO;
> @@ -828,20 +853,26 @@ static inline pte_t pte_wrprotect(pte_t pte)
>  
>  static inline pte_t pte_mkwrite(pte_t pte)
>  {
> -	pte_val(pte) &= ~_PAGE_RO;
> +	pte_val(pte) |= _PAGE_SWW;
> +	if (pte_val(pte) & _PAGE_SWC)
> +		pte_val(pte) &= ~_PAGE_RO;
>  	return pte;
>  }
>  
>  static inline pte_t pte_mkclean(pte_t pte)
>  {
> -#ifdef CONFIG_PGSTE
>  	pte_val(pte) &= ~_PAGE_SWC;
> -#endif
> +	/* Do not clobber _PAGE_TYPE_NONE pages!  */
> +	if (!(pte_val(pte) & _PAGE_INVALID))
> +		pte_val(pte) |= _PAGE_RO;
>  	return pte;
>  }
>  
>  static inline pte_t pte_mkdirty(pte_t pte)
>  {
> +	pte_val(pte) |= _PAGE_SWC;
> +	if (pte_val(pte) & _PAGE_SWW)
> +		pte_val(pte) &= ~_PAGE_RO;
>  	return pte;
>  }
>  
> @@ -879,10 +910,10 @@ static inline pte_t pte_mkhuge(pte_t pte)
>  		pte_val(pte) |= _SEGMENT_ENTRY_INV;
>  	}
>  	/*
> -	 * Clear SW pte bits SWT and SWX, there are no SW bits in a segment
> -	 * table entry.
> +	 * Clear SW pte bits, there are no SW bits in a segment table entry.
>  	 */
> -	pte_val(pte) &= ~(_PAGE_SWT | _PAGE_SWX);
> +	pte_val(pte) &= ~(_PAGE_SWT | _PAGE_SWX | _PAGE_SWC |
> +			  _PAGE_SWR | _PAGE_SWW);
>  	/*
>  	 * Also set the change-override bit because we don't need dirty bit
>  	 * tracking for hugetlbfs pages.
> @@ -1053,9 +1084,11 @@ static inline void ptep_modify_prot_commit(struct mm_struct *mm,
>  					   unsigned long address,
>  					   pte_t *ptep, pte_t pte)
>  {
> -	*ptep = pte;
> -	if (mm_has_pgste(mm))
> +	if (mm_has_pgste(mm)) {
> +		pgste_set_pte(ptep, pte);
>  		pgste_set_unlock(ptep, *(pgste_t *)(ptep + PTRS_PER_PTE));
> +	} else
> +		*ptep = pte;
>  }
>  
>  #define __HAVE_ARCH_PTEP_CLEAR_FLUSH
> @@ -1121,10 +1154,13 @@ static inline pte_t ptep_set_wrprotect(struct mm_struct *mm,
>  			pgste = pgste_get_lock(ptep);
>  
>  		ptep_flush_lazy(mm, address, ptep);
> -		*ptep = pte_wrprotect(pte);
> +		pte = pte_wrprotect(pte);
>  
> -		if (mm_has_pgste(mm))
> +		if (mm_has_pgste(mm)) {
> +			pgste_set_pte(ptep, pte);
>  			pgste_set_unlock(ptep, pgste);
> +		} else
> +			*ptep = pte;
>  	}
>  	return pte;
>  }
> @@ -1142,10 +1178,12 @@ static inline int ptep_set_access_flags(struct vm_area_struct *vma,
>  		pgste = pgste_get_lock(ptep);
>  
>  	__ptep_ipte(address, ptep);
> -	*ptep = entry;
>  
> -	if (mm_has_pgste(vma->vm_mm))
> +	if (mm_has_pgste(vma->vm_mm)) {
> +		pgste_set_pte(ptep, entry);
>  		pgste_set_unlock(ptep, pgste);
> +	} else
> +		*ptep = entry;
>  	return 1;
>  }
>  
> @@ -1163,8 +1201,13 @@ static inline pte_t mk_pte_phys(unsigned long physpage, pgprot_t pgprot)
>  static inline pte_t mk_pte(struct page *page, pgprot_t pgprot)
>  {
>  	unsigned long physpage = page_to_phys(page);
> +	pte_t __pte = mk_pte_phys(physpage, pgprot);
>  
> -	return mk_pte_phys(physpage, pgprot);
> +	if ((pte_val(__pte) & _PAGE_SWW) && PageDirty(page)) {
> +		pte_val(__pte) |= _PAGE_SWC;
> +		pte_val(__pte) &= ~_PAGE_RO;
> +	}
> +	return __pte;
>  }
>  
>  #define pgd_index(address) (((address) >> PGDIR_SHIFT) & (PTRS_PER_PGD-1))
> @@ -1256,6 +1299,8 @@ static inline int pmd_trans_splitting(pmd_t pmd)
>  static inline void set_pmd_at(struct mm_struct *mm, unsigned long addr,
>  			      pmd_t *pmdp, pmd_t entry)
>  {
> +	if (!(pmd_val(entry) & _SEGMENT_ENTRY_INV) && MACHINE_HAS_EDAT1)
> +		pmd_val(entry) |= _SEGMENT_ENTRY_CO;
>  	*pmdp = entry;
>  }
>  
> diff --git a/arch/s390/include/asm/sclp.h b/arch/s390/include/asm/sclp.h
> index 8337886..06a1361 100644
> --- a/arch/s390/include/asm/sclp.h
> +++ b/arch/s390/include/asm/sclp.h
> @@ -46,7 +46,6 @@ int sclp_cpu_deconfigure(u8 cpu);
>  void sclp_facilities_detect(void);
>  unsigned long long sclp_get_rnmax(void);
>  unsigned long long sclp_get_rzm(void);
> -u8 sclp_get_fac85(void);
>  int sclp_sdias_blk_count(void);
>  int sclp_sdias_copy(void *dest, int blk_num, int nr_blks);
>  int sclp_chp_configure(struct chp_id chpid);
> diff --git a/arch/s390/include/asm/setup.h b/arch/s390/include/asm/setup.h
> index f69f76b..f685751 100644
> --- a/arch/s390/include/asm/setup.h
> +++ b/arch/s390/include/asm/setup.h
> @@ -64,13 +64,14 @@ extern unsigned int s390_user_mode;
>  
>  #define MACHINE_FLAG_VM		(1UL << 0)
>  #define MACHINE_FLAG_IEEE	(1UL << 1)
> -#define MACHINE_FLAG_CSP	(1UL << 3)
> -#define MACHINE_FLAG_MVPG	(1UL << 4)
> -#define MACHINE_FLAG_DIAG44	(1UL << 5)
> -#define MACHINE_FLAG_IDTE	(1UL << 6)
> -#define MACHINE_FLAG_DIAG9C	(1UL << 7)
> -#define MACHINE_FLAG_MVCOS	(1UL << 8)
> -#define MACHINE_FLAG_KVM	(1UL << 9)
> +#define MACHINE_FLAG_CSP	(1UL << 2)
> +#define MACHINE_FLAG_MVPG	(1UL << 3)
> +#define MACHINE_FLAG_DIAG44	(1UL << 4)
> +#define MACHINE_FLAG_IDTE	(1UL << 5)
> +#define MACHINE_FLAG_DIAG9C	(1UL << 6)
> +#define MACHINE_FLAG_MVCOS	(1UL << 7)
> +#define MACHINE_FLAG_KVM	(1UL << 8)
> +#define MACHINE_FLAG_ESOP	(1UL << 9)
>  #define MACHINE_FLAG_EDAT1	(1UL << 10)
>  #define MACHINE_FLAG_EDAT2	(1UL << 11)
>  #define MACHINE_FLAG_LPAR	(1UL << 12)
> @@ -84,6 +85,7 @@ extern unsigned int s390_user_mode;
>  #define MACHINE_IS_LPAR		(S390_lowcore.machine_flags & MACHINE_FLAG_LPAR)
>  
>  #define MACHINE_HAS_DIAG9C	(S390_lowcore.machine_flags & MACHINE_FLAG_DIAG9C)
> +#define MACHINE_HAS_ESOP	(S390_lowcore.machine_flags & MACHINE_FLAG_ESOP)
>  #define MACHINE_HAS_PFMF	MACHINE_HAS_EDAT1
>  #define MACHINE_HAS_HPAGE	MACHINE_HAS_EDAT1
>  
> diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
> index c9011bf..4659b62 100644
> --- a/arch/s390/kvm/kvm-s390.c
> +++ b/arch/s390/kvm/kvm-s390.c
> @@ -147,7 +147,7 @@ int kvm_dev_ioctl_check_extension(long ext)
>  		r = KVM_MAX_VCPUS;
>  		break;
>  	case KVM_CAP_S390_COW:
> -		r = sclp_get_fac85() & 0x2;
> +		r = MACHINE_HAS_ESOP;
>  		break;
>  	default:
>  		r = 0;
> diff --git a/arch/s390/lib/uaccess_pt.c b/arch/s390/lib/uaccess_pt.c
> index 9017a63..a70ee84 100644
> --- a/arch/s390/lib/uaccess_pt.c
> +++ b/arch/s390/lib/uaccess_pt.c
> @@ -50,7 +50,7 @@ static __always_inline unsigned long follow_table(struct mm_struct *mm,
>  	ptep = pte_offset_map(pmd, addr);
>  	if (!pte_present(*ptep))
>  		return -0x11UL;
> -	if (write && !pte_write(*ptep))
> +	if (write && (!pte_write(*ptep) || !pte_dirty(*ptep)))
>  		return -0x04UL;
>  
>  	return (pte_val(*ptep) & PAGE_MASK) + (addr & ~PAGE_MASK);
> diff --git a/arch/s390/mm/pageattr.c b/arch/s390/mm/pageattr.c
> index 29ccee3..d21040e 100644
> --- a/arch/s390/mm/pageattr.c
> +++ b/arch/s390/mm/pageattr.c
> @@ -127,7 +127,7 @@ void kernel_map_pages(struct page *page, int numpages, int enable)
>  			pte_val(*pte) = _PAGE_TYPE_EMPTY;
>  			continue;
>  		}
> -		*pte = mk_pte_phys(address, __pgprot(_PAGE_TYPE_RW));
> +		pte_val(*pte) = __pa(address);
>  	}
>  }
>  
> diff --git a/arch/s390/mm/vmem.c b/arch/s390/mm/vmem.c
> index 6ed1426..79699f46 100644
> --- a/arch/s390/mm/vmem.c
> +++ b/arch/s390/mm/vmem.c
> @@ -85,11 +85,9 @@ static int vmem_add_mem(unsigned long start, unsigned long size, int ro)
>  	pud_t *pu_dir;
>  	pmd_t *pm_dir;
>  	pte_t *pt_dir;
> -	pte_t  pte;
>  	int ret = -ENOMEM;
>  
>  	while (address < end) {
> -		pte = mk_pte_phys(address, __pgprot(ro ? _PAGE_RO : 0));
>  		pg_dir = pgd_offset_k(address);
>  		if (pgd_none(*pg_dir)) {
>  			pu_dir = vmem_pud_alloc();
> @@ -101,9 +99,9 @@ static int vmem_add_mem(unsigned long start, unsigned long size, int ro)
>  #if defined(CONFIG_64BIT) && !defined(CONFIG_DEBUG_PAGEALLOC)
>  		if (MACHINE_HAS_EDAT2 && pud_none(*pu_dir) && address &&
>  		    !(address & ~PUD_MASK) && (address + PUD_SIZE <= end)) {
> -			pte_val(pte) |= _REGION3_ENTRY_LARGE;
> -			pte_val(pte) |= _REGION_ENTRY_TYPE_R3;
> -			pud_val(*pu_dir) = pte_val(pte);
> +			pud_val(*pu_dir) = __pa(address) |
> +				_REGION_ENTRY_TYPE_R3 | _REGION3_ENTRY_LARGE |
> +				(ro ? _REGION_ENTRY_RO : 0);
>  			address += PUD_SIZE;
>  			continue;
>  		}
> @@ -118,8 +116,9 @@ static int vmem_add_mem(unsigned long start, unsigned long size, int ro)
>  #if defined(CONFIG_64BIT) && !defined(CONFIG_DEBUG_PAGEALLOC)
>  		if (MACHINE_HAS_EDAT1 && pmd_none(*pm_dir) && address &&
>  		    !(address & ~PMD_MASK) && (address + PMD_SIZE <= end)) {
> -			pte_val(pte) |= _SEGMENT_ENTRY_LARGE;
> -			pmd_val(*pm_dir) = pte_val(pte);
> +			pmd_val(*pm_dir) = __pa(address) |
> +				_SEGMENT_ENTRY | _SEGMENT_ENTRY_LARGE |
> +				(ro ? _SEGMENT_ENTRY_RO : 0);
>  			address += PMD_SIZE;
>  			continue;
>  		}
> @@ -132,7 +131,7 @@ static int vmem_add_mem(unsigned long start, unsigned long size, int ro)
>  		}
>  
>  		pt_dir = pte_offset_kernel(pm_dir, address);
> -		*pt_dir = pte;
> +		pte_val(*pt_dir) = __pa(address) | (ro ? _PAGE_RO : 0);
>  		address += PAGE_SIZE;
>  	}
>  	ret = 0;
> @@ -199,7 +198,6 @@ int __meminit vmemmap_populate(struct page *start, unsigned long nr, int node)
>  	pud_t *pu_dir;
>  	pmd_t *pm_dir;
>  	pte_t *pt_dir;
> -	pte_t  pte;
>  	int ret = -ENOMEM;
>  
>  	start_addr = (unsigned long) start;
> @@ -237,9 +235,8 @@ int __meminit vmemmap_populate(struct page *start, unsigned long nr, int node)
>  				new_page = vmemmap_alloc_block(PMD_SIZE, node);
>  				if (!new_page)
>  					goto out;
> -				pte = mk_pte_phys(__pa(new_page), PAGE_RW);
> -				pte_val(pte) |= _SEGMENT_ENTRY_LARGE;
> -				pmd_val(*pm_dir) = pte_val(pte);
> +				pmd_val(*pm_dir) = __pa(new_page) |
> +					_SEGMENT_ENTRY | _SEGMENT_ENTRY_LARGE;
>  				address = (address + PMD_SIZE) & PMD_MASK;
>  				continue;
>  			}
> @@ -260,8 +257,7 @@ int __meminit vmemmap_populate(struct page *start, unsigned long nr, int node)
>  			new_page =__pa(vmem_alloc_pages(0));
>  			if (!new_page)
>  				goto out;
> -			pte = pfn_pte(new_page >> PAGE_SHIFT, PAGE_KERNEL);
> -			*pt_dir = pte;
> +			pte_val(*pt_dir) = __pa(new_page);
>  		}
>  		address += PAGE_SIZE;
>  	}
> diff --git a/drivers/s390/char/sclp_cmd.c b/drivers/s390/char/sclp_cmd.c
> index c44d13f..30a2255 100644
> --- a/drivers/s390/char/sclp_cmd.c
> +++ b/drivers/s390/char/sclp_cmd.c
> @@ -56,7 +56,6 @@ static int __initdata early_read_info_sccb_valid;
>  
>  u64 sclp_facilities;
>  static u8 sclp_fac84;
> -static u8 sclp_fac85;
>  static unsigned long long rzm;
>  static unsigned long long rnmax;
>  
> @@ -131,7 +130,8 @@ void __init sclp_facilities_detect(void)
>  	sccb = &early_read_info_sccb;
>  	sclp_facilities = sccb->facilities;
>  	sclp_fac84 = sccb->fac84;
> -	sclp_fac85 = sccb->fac85;
> +	if (sccb->fac85 & 0x02)
> +		S390_lowcore.machine_flags |= MACHINE_FLAG_ESOP;
>  	rnmax = sccb->rnmax ? sccb->rnmax : sccb->rnmax2;
>  	rzm = sccb->rnsize ? sccb->rnsize : sccb->rnsize2;
>  	rzm <<= 20;
> @@ -171,12 +171,6 @@ unsigned long long sclp_get_rzm(void)
>  	return rzm;
>  }
>  
> -u8 sclp_get_fac85(void)
> -{
> -	return sclp_fac85;
> -}
> -EXPORT_SYMBOL_GPL(sclp_get_fac85);
> -
>  /*
>   * This function will be called after sclp_facilities_detect(), which gets
>   * called from early.c code. Therefore the sccb should have valid contents.
> diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
> index 83b54ed..bdd7fac 100644
> --- a/include/asm-generic/pgtable.h
> +++ b/include/asm-generic/pgtable.h
> @@ -197,16 +197,6 @@ static inline int pmd_same(pmd_t pmd_a, pmd_t pmd_b)
>  #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>  #endif
>  
> -#ifndef __HAVE_ARCH_PAGE_TEST_AND_CLEAR_DIRTY
> -#define page_test_and_clear_dirty(pfn, mapped)	(0)
> -#endif
> -
> -#ifndef __HAVE_ARCH_PAGE_TEST_AND_CLEAR_DIRTY
> -#define pte_maybe_dirty(pte)		pte_dirty(pte)
> -#else
> -#define pte_maybe_dirty(pte)		(1)
> -#endif
> -
>  #ifndef __HAVE_ARCH_PAGE_TEST_AND_CLEAR_YOUNG
>  #define page_test_and_clear_young(pfn) (0)
>  #endif
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index b5d1384..4c0c8eb 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -303,21 +303,13 @@ static inline void __SetPageUptodate(struct page *page)
>  
>  static inline void SetPageUptodate(struct page *page)
>  {
> -#ifdef CONFIG_S390
> -	if (!test_and_set_bit(PG_uptodate, &page->flags))
> -		page_set_storage_key(page_to_phys(page), PAGE_DEFAULT_KEY, 0);
> -#else
>  	/*
>  	 * Memory barrier must be issued before setting the PG_uptodate bit,
>  	 * so that all previous stores issued in order to bring the page
>  	 * uptodate are actually visible before PageUptodate becomes true.
> -	 *
> -	 * s390 doesn't need an explicit smp_wmb here because the test and
> -	 * set bit already provides full barriers.
>  	 */
>  	smp_wmb();
>  	set_bit(PG_uptodate, &(page)->flags);
> -#endif
>  }
>  
>  CLEARPAGEFLAG(Uptodate, uptodate)
> diff --git a/mm/rmap.c b/mm/rmap.c
> index face808..ef75a7d 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1144,29 +1144,6 @@ void page_remove_rmap(struct page *page)
>  		goto out;
>  
>  	/*
> -	 * Now that the last pte has gone, s390 must transfer dirty
> -	 * flag from storage key to struct page.  We can usually skip
> -	 * this if the page is anon, so about to be freed; but perhaps
> -	 * not if it's in swapcache - there might be another pte slot
> -	 * containing the swap entry, but page not yet written to swap.
> -	 *
> -	 * And we can skip it on file pages, so long as the filesystem
> -	 * participates in dirty tracking (note that this is not only an
> -	 * optimization but also solves problems caused by dirty flag in
> -	 * storage key getting set by a write from inside kernel); but need to
> -	 * catch shm and tmpfs and ramfs pages which have been modified since
> -	 * creation by read fault.
> -	 *
> -	 * Note that mapping must be decided above, before decrementing
> -	 * mapcount (which luckily provides a barrier): once page is unmapped,
> -	 * it could be truncated and page->mapping reset to NULL at any moment.
> -	 * Note also that we are relying on page_mapping(page) to set mapping
> -	 * to &swapper_space when PageSwapCache(page).
> -	 */
> -	if (mapping && !mapping_cap_account_dirty(mapping) &&
> -	    page_test_and_clear_dirty(page_to_pfn(page), 1))
> -		set_page_dirty(page);
> -	/*
>  	 * Hugepages are not counted in NR_ANON_PAGES nor NR_FILE_MAPPED
>  	 * and not charged by memcg for now.
>  	 */
> -- 
> 1.7.12.4
> 
> -- 
> blue skies,
>    Martin.
> 
> "Reality continues to ruin my life." - Calvin.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH] mm: Fix XFS oops due to dirty pages without buffers on s390
  2012-10-25 20:01       ` Jan Kara
@ 2012-12-14  8:45         ` Martin Schwidefsky
  2012-12-17 23:31           ` Hugh Dickins
  0 siblings, 1 reply; 61+ messages in thread
From: Martin Schwidefsky @ 2012-12-14  8:45 UTC (permalink / raw)
  To: Jan Kara; +Cc: Andrew Morton, linux-mm, Mel Gorman, linux-s390, Hugh Dickins

On Thu, 25 Oct 2012 22:01:41 +0200
Jan Kara <jack@suse.cz> wrote:

> On Tue 23-10-12 14:56:36, Andrew Morton wrote:
> > On Tue, 23 Oct 2012 12:21:53 +0200
> > Jan Kara <jack@suse.cz> wrote:
> > 
> > > > That seems a fairly serious problem.  To which kernel version(s) should
> > > > we apply the fix?
> > >   Well, XFS will crash starting from 2.6.36 kernel where the assertion was
> > > added. Previously XFS just silently added buffers (as other filesystems do
> > > it) and wrote / redirtied the page (unnecessarily). So looking into
> > > maintained -stable branches I think pushing the patch to -stable from 3.0
> > > on should be enough.
> > 
> > OK, thanks, I made it so.
> > 
> > > > > diff --git a/mm/rmap.c b/mm/rmap.c
> > > > 
> > > > It's a bit surprising that none of the added comments mention the s390
> > > > pte-dirtying oddity.  I don't see an obvious place to mention this, but
> > > > I for one didn't know about this and it would be good if we could
> > > > capture the info _somewhere_?
> > >   As Hugh says, the comment before page_test_and_clear_dirty() is somewhat
> > > updated. But do you mean recording somewhere the catch that s390 HW dirty
> > > bit gets set also whenever we write to a page from kernel?
> > 
> > Yes, this.  It's surprising behaviour which we may trip over again, so
> > how do we inform developers about it?
> > 
> > > I guess we could
> > > add that also to the comment before page_test_and_clear_dirty() in
> > > page_remove_rmap() and also before definition of
> > > page_test_and_clear_dirty(). So most people that will add / remove these
> > > calls will be warned. OK?
> > 
> > Sounds good, thanks.
>   OK, the patch is attached. As Martin says, it may be obsolete soon but just
> in case Martin's patch set gets delayed...
> 
> 								Honza

The patch got delayed a bit, the main issue is to get conclusive performance
measurements about the effects of the patch. I am pretty sure that the patch
works and will not cause any major degradation so it is time to ask for your
opinion. Here we go:
--
Subject: [PATCH] s390/mm: implement software dirty bits

From: Martin Schwidefsky <schwidefsky@de.ibm.com>

The s390 architecture is unique in respect to dirty page detection,
it uses the change bit in the per-page storage key to track page
modifications. All other architectures track dirty bits by means
of page table entries. This property of s390 has caused numerous
problems in the past, e.g. see git commit ef5d437f71afdf4a
"mm: fix XFS oops due to dirty pages without buffers on s390".

To avoid future issues in regard to per-page dirty bits convert
s390 to a fault based software dirty bit detection mechanism. All
user page table entries which are marked as clean will be hardware
read-only, even if the pte is supposed to be writable. A write by
the user process will trigger a protection fault which will cause
the user pte to be marked as dirty and the hardware read-only bit
is removed.

With this change the dirty bit in the storage key is irrelevant
for Linux as a host, but the storage key is still required for
KVM guests. The effect is that page_test_and_clear_dirty and the
related code can be removed. The referenced bit in the storage
key is still used by the page_test_and_clear_young primitive to
provide page age information.

For page cache pages of mappings with mapping_cap_account_dirty
there will not be any change in behavior as the dirty bit tracking
already uses read-only ptes to control the amount of dirty pages.
Only for swap cache pages and pages of mappings without
mapping_cap_account_dirty there can be additional protection faults.
To avoid an excessive number of additional faults the mk_pte
primitive checks for PageDirty if the pgprot value allows for writes
and pre-dirties the pte. That avoids all additional faults for
tmpfs and shmem pages until these pages are added to the swap cache.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
---
 arch/s390/include/asm/page.h    |  22 -------
 arch/s390/include/asm/pgtable.h | 131 +++++++++++++++++++++++++++-------------
 arch/s390/include/asm/sclp.h    |   1 -
 arch/s390/include/asm/setup.h   |  16 ++---
 arch/s390/kvm/kvm-s390.c        |   2 +-
 arch/s390/lib/uaccess_pt.c      |   2 +-
 arch/s390/mm/pageattr.c         |   2 +-
 arch/s390/mm/vmem.c             |  24 +++-----
 drivers/s390/char/sclp_cmd.c    |  10 +--
 include/asm-generic/pgtable.h   |  10 ---
 include/linux/page-flags.h      |   8 ---
 mm/rmap.c                       |  23 -------
 12 files changed, 112 insertions(+), 139 deletions(-)

diff --git a/arch/s390/include/asm/page.h b/arch/s390/include/asm/page.h
index a86ad40840..75ce9b0 100644
--- a/arch/s390/include/asm/page.h
+++ b/arch/s390/include/asm/page.h
@@ -155,28 +155,6 @@ static inline int page_reset_referenced(unsigned long addr)
 #define _PAGE_ACC_BITS		0xf0	/* HW access control bits	*/
 
 /*
- * Test and clear dirty bit in storage key.
- * We can't clear the changed bit atomically. This is a potential
- * race against modification of the referenced bit. This function
- * should therefore only be called if it is not mapped in any
- * address space.
- *
- * Note that the bit gets set whenever page content is changed. That means
- * also when the page is modified by DMA or from inside the kernel.
- */
-#define __HAVE_ARCH_PAGE_TEST_AND_CLEAR_DIRTY
-static inline int page_test_and_clear_dirty(unsigned long pfn, int mapped)
-{
-	unsigned char skey;
-
-	skey = page_get_storage_key(pfn << PAGE_SHIFT);
-	if (!(skey & _PAGE_CHANGED))
-		return 0;
-	page_set_storage_key(pfn << PAGE_SHIFT, skey & ~_PAGE_CHANGED, mapped);
-	return 1;
-}
-
-/*
  * Test and clear referenced bit in storage key.
  */
 #define __HAVE_ARCH_PAGE_TEST_AND_CLEAR_YOUNG
diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index 33aeb77..66d3b2a 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -29,6 +29,7 @@
 #ifndef __ASSEMBLY__
 #include <linux/sched.h>
 #include <linux/mm_types.h>
+#include <linux/page-flags.h>
 #include <asm/bug.h>
 #include <asm/page.h>
 
@@ -221,13 +222,15 @@ extern unsigned long MODULES_END;
 /* Software bits in the page table entry */
 #define _PAGE_SWT	0x001		/* SW pte type bit t */
 #define _PAGE_SWX	0x002		/* SW pte type bit x */
-#define _PAGE_SWC	0x004		/* SW pte changed bit (for KVM) */
-#define _PAGE_SWR	0x008		/* SW pte referenced bit (for KVM) */
-#define _PAGE_SPECIAL	0x010		/* SW associated with special page */
+#define _PAGE_SWC	0x004		/* SW pte changed bit */
+#define _PAGE_SWR	0x008		/* SW pte referenced bit */
+#define _PAGE_SWW	0x010		/* SW pte write bit */
+#define _PAGE_SPECIAL	0x020		/* SW associated with special page */
 #define __HAVE_ARCH_PTE_SPECIAL
 
 /* Set of bits not changed in pte_modify */
-#define _PAGE_CHG_MASK	(PAGE_MASK | _PAGE_SPECIAL | _PAGE_SWC | _PAGE_SWR)
+#define _PAGE_CHG_MASK		(PAGE_MASK | _PAGE_SPECIAL | _PAGE_CO | \
+				 _PAGE_SWC | _PAGE_SWR)
 
 /* Six different types of pages. */
 #define _PAGE_TYPE_EMPTY	0x400
@@ -321,6 +324,7 @@ extern unsigned long MODULES_END;
 
 /* Bits in the region table entry */
 #define _REGION_ENTRY_ORIGIN	~0xfffUL/* region/segment table origin	    */
+#define _REGION_ENTRY_RO	0x200	/* region protection bit	    */
 #define _REGION_ENTRY_INV	0x20	/* invalid region table entry	    */
 #define _REGION_ENTRY_TYPE_MASK	0x0c	/* region/segment table type mask   */
 #define _REGION_ENTRY_TYPE_R1	0x0c	/* region first table type	    */
@@ -382,9 +386,10 @@ extern unsigned long MODULES_END;
  */
 #define PAGE_NONE	__pgprot(_PAGE_TYPE_NONE)
 #define PAGE_RO		__pgprot(_PAGE_TYPE_RO)
-#define PAGE_RW		__pgprot(_PAGE_TYPE_RW)
+#define PAGE_RW		__pgprot(_PAGE_TYPE_RO | _PAGE_SWW)
+#define PAGE_RWC	__pgprot(_PAGE_TYPE_RW | _PAGE_SWW | _PAGE_SWC)
 
-#define PAGE_KERNEL	PAGE_RW
+#define PAGE_KERNEL	PAGE_RWC
 #define PAGE_COPY	PAGE_RO
 
 /*
@@ -625,23 +630,23 @@ static inline pgste_t pgste_update_all(pte_t *ptep, pgste_t pgste)
 	bits = skey & (_PAGE_CHANGED | _PAGE_REFERENCED);
 	/* Clear page changed & referenced bit in the storage key */
 	if (bits & _PAGE_CHANGED)
-		page_set_storage_key(address, skey ^ bits, 1);
+		page_set_storage_key(address, skey ^ bits, 0);
 	else if (bits)
 		page_reset_referenced(address);
 	/* Transfer page changed & referenced bit to guest bits in pgste */
 	pgste_val(pgste) |= bits << 48;		/* RCP_GR_BIT & RCP_GC_BIT */
 	/* Get host changed & referenced bits from pgste */
 	bits |= (pgste_val(pgste) & (RCP_HR_BIT | RCP_HC_BIT)) >> 52;
-	/* Clear host bits in pgste. */
+	/* Transfer page changed & referenced bit to kvm user bits */
+	pgste_val(pgste) |= bits << 45;		/* KVM_UR_BIT & KVM_UC_BIT */
+	/* Clear relevant host bits in pgste. */
 	pgste_val(pgste) &= ~(RCP_HR_BIT | RCP_HC_BIT);
 	pgste_val(pgste) &= ~(RCP_ACC_BITS | RCP_FP_BIT);
 	/* Copy page access key and fetch protection bit to pgste */
 	pgste_val(pgste) |=
 		(unsigned long) (skey & (_PAGE_ACC_BITS | _PAGE_FP_BIT)) << 56;
-	/* Transfer changed and referenced to kvm user bits */
-	pgste_val(pgste) |= bits << 45;		/* KVM_UR_BIT & KVM_UC_BIT */
-	/* Transfer changed & referenced to pte sofware bits */
-	pte_val(*ptep) |= bits << 1;		/* _PAGE_SWR & _PAGE_SWC */
+	/* Transfer referenced bit to pte */
+	pte_val(*ptep) |= (bits & _PAGE_REFERENCED) << 1;
 #endif
 	return pgste;
 
@@ -654,20 +659,25 @@ static inline pgste_t pgste_update_young(pte_t *ptep, pgste_t pgste)
 
 	if (!pte_present(*ptep))
 		return pgste;
+	/* Get referenced bit from storage key */
 	young = page_reset_referenced(pte_val(*ptep) & PAGE_MASK);
-	/* Transfer page referenced bit to pte software bit (host view) */
-	if (young || (pgste_val(pgste) & RCP_HR_BIT))
+	if (young)
+		pgste_val(pgste) |= RCP_GR_BIT;
+	/* Get host referenced bit from pgste */
+	if (pgste_val(pgste) & RCP_HR_BIT) {
+		pgste_val(pgste) &= ~RCP_HR_BIT;
+		young = 1;
+	}
+	/* Transfer referenced bit to kvm user bits and pte */
+	if (young) {
+		pgste_val(pgste) |= KVM_UR_BIT;
 		pte_val(*ptep) |= _PAGE_SWR;
-	/* Clear host referenced bit in pgste. */
-	pgste_val(pgste) &= ~RCP_HR_BIT;
-	/* Transfer page referenced bit to guest bit in pgste */
-	pgste_val(pgste) |= (unsigned long) young << 50; /* set RCP_GR_BIT */
+	}
 #endif
 	return pgste;
-
 }
 
-static inline void pgste_set_pte(pte_t *ptep, pgste_t pgste, pte_t entry)
+static inline void pgste_set_key(pte_t *ptep, pgste_t pgste, pte_t entry)
 {
 #ifdef CONFIG_PGSTE
 	unsigned long address;
@@ -681,10 +691,23 @@ static inline void pgste_set_pte(pte_t *ptep, pgste_t pgste, pte_t entry)
 	/* Set page access key and fetch protection bit from pgste */
 	nkey |= (pgste_val(pgste) & (RCP_ACC_BITS | RCP_FP_BIT)) >> 56;
 	if (okey != nkey)
-		page_set_storage_key(address, nkey, 1);
+		page_set_storage_key(address, nkey, 0);
 #endif
 }
 
+static inline void pgste_set_pte(pte_t *ptep, pte_t entry)
+{
+	if (!MACHINE_HAS_ESOP && (pte_val(entry) & _PAGE_SWW)) {
+		/*
+		 * Without enhanced suppression-on-protection force
+		 * the dirty bit on for all writable ptes.
+		 */
+		pte_val(entry) |= _PAGE_SWC;
+		pte_val(entry) &= ~_PAGE_RO;
+	}
+	*ptep = entry;
+}
+
 /**
  * struct gmap_struct - guest address space
  * @mm: pointer to the parent mm_struct
@@ -743,11 +766,14 @@ static inline void set_pte_at(struct mm_struct *mm, unsigned long addr,
 
 	if (mm_has_pgste(mm)) {
 		pgste = pgste_get_lock(ptep);
-		pgste_set_pte(ptep, pgste, entry);
-		*ptep = entry;
+		pgste_set_key(ptep, pgste, entry);
+		pgste_set_pte(ptep, entry);
 		pgste_set_unlock(ptep, pgste);
-	} else
+	} else {
+		if (!(pte_val(entry) & _PAGE_INVALID) && MACHINE_HAS_EDAT1)
+			pte_val(entry) |= _PAGE_CO;
 		*ptep = entry;
+	}
 }
 
 /*
@@ -756,16 +782,12 @@ static inline void set_pte_at(struct mm_struct *mm, unsigned long addr,
  */
 static inline int pte_write(pte_t pte)
 {
-	return (pte_val(pte) & _PAGE_RO) == 0;
+	return (pte_val(pte) & _PAGE_SWW) != 0;
 }
 
 static inline int pte_dirty(pte_t pte)
 {
-#ifdef CONFIG_PGSTE
-	if (pte_val(pte) & _PAGE_SWC)
-		return 1;
-#endif
-	return 0;
+	return (pte_val(pte) & _PAGE_SWC) != 0;
 }
 
 static inline int pte_young(pte_t pte)
@@ -815,11 +837,14 @@ static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)
 {
 	pte_val(pte) &= _PAGE_CHG_MASK;
 	pte_val(pte) |= pgprot_val(newprot);
+	if ((pte_val(pte) & _PAGE_SWC) && (pte_val(pte) & _PAGE_SWW))
+		pte_val(pte) &= ~_PAGE_RO;
 	return pte;
 }
 
 static inline pte_t pte_wrprotect(pte_t pte)
 {
+	pte_val(pte) &= ~_PAGE_SWW;
 	/* Do not clobber _PAGE_TYPE_NONE pages!  */
 	if (!(pte_val(pte) & _PAGE_INVALID))
 		pte_val(pte) |= _PAGE_RO;
@@ -828,20 +853,26 @@ static inline pte_t pte_wrprotect(pte_t pte)
 
 static inline pte_t pte_mkwrite(pte_t pte)
 {
-	pte_val(pte) &= ~_PAGE_RO;
+	pte_val(pte) |= _PAGE_SWW;
+	if (pte_val(pte) & _PAGE_SWC)
+		pte_val(pte) &= ~_PAGE_RO;
 	return pte;
 }
 
 static inline pte_t pte_mkclean(pte_t pte)
 {
-#ifdef CONFIG_PGSTE
 	pte_val(pte) &= ~_PAGE_SWC;
-#endif
+	/* Do not clobber _PAGE_TYPE_NONE pages!  */
+	if (!(pte_val(pte) & _PAGE_INVALID))
+		pte_val(pte) |= _PAGE_RO;
 	return pte;
 }
 
 static inline pte_t pte_mkdirty(pte_t pte)
 {
+	pte_val(pte) |= _PAGE_SWC;
+	if (pte_val(pte) & _PAGE_SWW)
+		pte_val(pte) &= ~_PAGE_RO;
 	return pte;
 }
 
@@ -879,10 +910,10 @@ static inline pte_t pte_mkhuge(pte_t pte)
 		pte_val(pte) |= _SEGMENT_ENTRY_INV;
 	}
 	/*
-	 * Clear SW pte bits SWT and SWX, there are no SW bits in a segment
-	 * table entry.
+	 * Clear SW pte bits, there are no SW bits in a segment table entry.
 	 */
-	pte_val(pte) &= ~(_PAGE_SWT | _PAGE_SWX);
+	pte_val(pte) &= ~(_PAGE_SWT | _PAGE_SWX | _PAGE_SWC |
+			  _PAGE_SWR | _PAGE_SWW);
 	/*
 	 * Also set the change-override bit because we don't need dirty bit
 	 * tracking for hugetlbfs pages.
@@ -1053,9 +1084,11 @@ static inline void ptep_modify_prot_commit(struct mm_struct *mm,
 					   unsigned long address,
 					   pte_t *ptep, pte_t pte)
 {
-	*ptep = pte;
-	if (mm_has_pgste(mm))
+	if (mm_has_pgste(mm)) {
+		pgste_set_pte(ptep, pte);
 		pgste_set_unlock(ptep, *(pgste_t *)(ptep + PTRS_PER_PTE));
+	} else
+		*ptep = pte;
 }
 
 #define __HAVE_ARCH_PTEP_CLEAR_FLUSH
@@ -1121,10 +1154,13 @@ static inline pte_t ptep_set_wrprotect(struct mm_struct *mm,
 			pgste = pgste_get_lock(ptep);
 
 		ptep_flush_lazy(mm, address, ptep);
-		*ptep = pte_wrprotect(pte);
+		pte = pte_wrprotect(pte);
 
-		if (mm_has_pgste(mm))
+		if (mm_has_pgste(mm)) {
+			pgste_set_pte(ptep, pte);
 			pgste_set_unlock(ptep, pgste);
+		} else
+			*ptep = pte;
 	}
 	return pte;
 }
@@ -1142,10 +1178,12 @@ static inline int ptep_set_access_flags(struct vm_area_struct *vma,
 		pgste = pgste_get_lock(ptep);
 
 	__ptep_ipte(address, ptep);
-	*ptep = entry;
 
-	if (mm_has_pgste(vma->vm_mm))
+	if (mm_has_pgste(vma->vm_mm)) {
+		pgste_set_pte(ptep, entry);
 		pgste_set_unlock(ptep, pgste);
+	} else
+		*ptep = entry;
 	return 1;
 }
 
@@ -1163,8 +1201,13 @@ static inline pte_t mk_pte_phys(unsigned long physpage, pgprot_t pgprot)
 static inline pte_t mk_pte(struct page *page, pgprot_t pgprot)
 {
 	unsigned long physpage = page_to_phys(page);
+	pte_t __pte = mk_pte_phys(physpage, pgprot);
 
-	return mk_pte_phys(physpage, pgprot);
+	if ((pte_val(__pte) & _PAGE_SWW) && PageDirty(page)) {
+		pte_val(__pte) |= _PAGE_SWC;
+		pte_val(__pte) &= ~_PAGE_RO;
+	}
+	return __pte;
 }
 
 #define pgd_index(address) (((address) >> PGDIR_SHIFT) & (PTRS_PER_PGD-1))
@@ -1256,6 +1299,8 @@ static inline int pmd_trans_splitting(pmd_t pmd)
 static inline void set_pmd_at(struct mm_struct *mm, unsigned long addr,
 			      pmd_t *pmdp, pmd_t entry)
 {
+	if (!(pmd_val(entry) & _SEGMENT_ENTRY_INV) && MACHINE_HAS_EDAT1)
+		pmd_val(entry) |= _SEGMENT_ENTRY_CO;
 	*pmdp = entry;
 }
 
diff --git a/arch/s390/include/asm/sclp.h b/arch/s390/include/asm/sclp.h
index 8337886..06a1361 100644
--- a/arch/s390/include/asm/sclp.h
+++ b/arch/s390/include/asm/sclp.h
@@ -46,7 +46,6 @@ int sclp_cpu_deconfigure(u8 cpu);
 void sclp_facilities_detect(void);
 unsigned long long sclp_get_rnmax(void);
 unsigned long long sclp_get_rzm(void);
-u8 sclp_get_fac85(void);
 int sclp_sdias_blk_count(void);
 int sclp_sdias_copy(void *dest, int blk_num, int nr_blks);
 int sclp_chp_configure(struct chp_id chpid);
diff --git a/arch/s390/include/asm/setup.h b/arch/s390/include/asm/setup.h
index f69f76b..f685751 100644
--- a/arch/s390/include/asm/setup.h
+++ b/arch/s390/include/asm/setup.h
@@ -64,13 +64,14 @@ extern unsigned int s390_user_mode;
 
 #define MACHINE_FLAG_VM		(1UL << 0)
 #define MACHINE_FLAG_IEEE	(1UL << 1)
-#define MACHINE_FLAG_CSP	(1UL << 3)
-#define MACHINE_FLAG_MVPG	(1UL << 4)
-#define MACHINE_FLAG_DIAG44	(1UL << 5)
-#define MACHINE_FLAG_IDTE	(1UL << 6)
-#define MACHINE_FLAG_DIAG9C	(1UL << 7)
-#define MACHINE_FLAG_MVCOS	(1UL << 8)
-#define MACHINE_FLAG_KVM	(1UL << 9)
+#define MACHINE_FLAG_CSP	(1UL << 2)
+#define MACHINE_FLAG_MVPG	(1UL << 3)
+#define MACHINE_FLAG_DIAG44	(1UL << 4)
+#define MACHINE_FLAG_IDTE	(1UL << 5)
+#define MACHINE_FLAG_DIAG9C	(1UL << 6)
+#define MACHINE_FLAG_MVCOS	(1UL << 7)
+#define MACHINE_FLAG_KVM	(1UL << 8)
+#define MACHINE_FLAG_ESOP	(1UL << 9)
 #define MACHINE_FLAG_EDAT1	(1UL << 10)
 #define MACHINE_FLAG_EDAT2	(1UL << 11)
 #define MACHINE_FLAG_LPAR	(1UL << 12)
@@ -84,6 +85,7 @@ extern unsigned int s390_user_mode;
 #define MACHINE_IS_LPAR		(S390_lowcore.machine_flags & MACHINE_FLAG_LPAR)
 
 #define MACHINE_HAS_DIAG9C	(S390_lowcore.machine_flags & MACHINE_FLAG_DIAG9C)
+#define MACHINE_HAS_ESOP	(S390_lowcore.machine_flags & MACHINE_FLAG_ESOP)
 #define MACHINE_HAS_PFMF	MACHINE_HAS_EDAT1
 #define MACHINE_HAS_HPAGE	MACHINE_HAS_EDAT1
 
diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
index c9011bf..4659b62 100644
--- a/arch/s390/kvm/kvm-s390.c
+++ b/arch/s390/kvm/kvm-s390.c
@@ -147,7 +147,7 @@ int kvm_dev_ioctl_check_extension(long ext)
 		r = KVM_MAX_VCPUS;
 		break;
 	case KVM_CAP_S390_COW:
-		r = sclp_get_fac85() & 0x2;
+		r = MACHINE_HAS_ESOP;
 		break;
 	default:
 		r = 0;
diff --git a/arch/s390/lib/uaccess_pt.c b/arch/s390/lib/uaccess_pt.c
index 9017a63..a70ee84 100644
--- a/arch/s390/lib/uaccess_pt.c
+++ b/arch/s390/lib/uaccess_pt.c
@@ -50,7 +50,7 @@ static __always_inline unsigned long follow_table(struct mm_struct *mm,
 	ptep = pte_offset_map(pmd, addr);
 	if (!pte_present(*ptep))
 		return -0x11UL;
-	if (write && !pte_write(*ptep))
+	if (write && (!pte_write(*ptep) || !pte_dirty(*ptep)))
 		return -0x04UL;
 
 	return (pte_val(*ptep) & PAGE_MASK) + (addr & ~PAGE_MASK);
diff --git a/arch/s390/mm/pageattr.c b/arch/s390/mm/pageattr.c
index 29ccee3..d21040e 100644
--- a/arch/s390/mm/pageattr.c
+++ b/arch/s390/mm/pageattr.c
@@ -127,7 +127,7 @@ void kernel_map_pages(struct page *page, int numpages, int enable)
 			pte_val(*pte) = _PAGE_TYPE_EMPTY;
 			continue;
 		}
-		*pte = mk_pte_phys(address, __pgprot(_PAGE_TYPE_RW));
+		pte_val(*pte) = __pa(address);
 	}
 }
 
diff --git a/arch/s390/mm/vmem.c b/arch/s390/mm/vmem.c
index 6ed1426..79699f46 100644
--- a/arch/s390/mm/vmem.c
+++ b/arch/s390/mm/vmem.c
@@ -85,11 +85,9 @@ static int vmem_add_mem(unsigned long start, unsigned long size, int ro)
 	pud_t *pu_dir;
 	pmd_t *pm_dir;
 	pte_t *pt_dir;
-	pte_t  pte;
 	int ret = -ENOMEM;
 
 	while (address < end) {
-		pte = mk_pte_phys(address, __pgprot(ro ? _PAGE_RO : 0));
 		pg_dir = pgd_offset_k(address);
 		if (pgd_none(*pg_dir)) {
 			pu_dir = vmem_pud_alloc();
@@ -101,9 +99,9 @@ static int vmem_add_mem(unsigned long start, unsigned long size, int ro)
 #if defined(CONFIG_64BIT) && !defined(CONFIG_DEBUG_PAGEALLOC)
 		if (MACHINE_HAS_EDAT2 && pud_none(*pu_dir) && address &&
 		    !(address & ~PUD_MASK) && (address + PUD_SIZE <= end)) {
-			pte_val(pte) |= _REGION3_ENTRY_LARGE;
-			pte_val(pte) |= _REGION_ENTRY_TYPE_R3;
-			pud_val(*pu_dir) = pte_val(pte);
+			pud_val(*pu_dir) = __pa(address) |
+				_REGION_ENTRY_TYPE_R3 | _REGION3_ENTRY_LARGE |
+				(ro ? _REGION_ENTRY_RO : 0);
 			address += PUD_SIZE;
 			continue;
 		}
@@ -118,8 +116,9 @@ static int vmem_add_mem(unsigned long start, unsigned long size, int ro)
 #if defined(CONFIG_64BIT) && !defined(CONFIG_DEBUG_PAGEALLOC)
 		if (MACHINE_HAS_EDAT1 && pmd_none(*pm_dir) && address &&
 		    !(address & ~PMD_MASK) && (address + PMD_SIZE <= end)) {
-			pte_val(pte) |= _SEGMENT_ENTRY_LARGE;
-			pmd_val(*pm_dir) = pte_val(pte);
+			pmd_val(*pm_dir) = __pa(address) |
+				_SEGMENT_ENTRY | _SEGMENT_ENTRY_LARGE |
+				(ro ? _SEGMENT_ENTRY_RO : 0);
 			address += PMD_SIZE;
 			continue;
 		}
@@ -132,7 +131,7 @@ static int vmem_add_mem(unsigned long start, unsigned long size, int ro)
 		}
 
 		pt_dir = pte_offset_kernel(pm_dir, address);
-		*pt_dir = pte;
+		pte_val(*pt_dir) = __pa(address) | (ro ? _PAGE_RO : 0);
 		address += PAGE_SIZE;
 	}
 	ret = 0;
@@ -199,7 +198,6 @@ int __meminit vmemmap_populate(struct page *start, unsigned long nr, int node)
 	pud_t *pu_dir;
 	pmd_t *pm_dir;
 	pte_t *pt_dir;
-	pte_t  pte;
 	int ret = -ENOMEM;
 
 	start_addr = (unsigned long) start;
@@ -237,9 +235,8 @@ int __meminit vmemmap_populate(struct page *start, unsigned long nr, int node)
 				new_page = vmemmap_alloc_block(PMD_SIZE, node);
 				if (!new_page)
 					goto out;
-				pte = mk_pte_phys(__pa(new_page), PAGE_RW);
-				pte_val(pte) |= _SEGMENT_ENTRY_LARGE;
-				pmd_val(*pm_dir) = pte_val(pte);
+				pmd_val(*pm_dir) = __pa(new_page) |
+					_SEGMENT_ENTRY | _SEGMENT_ENTRY_LARGE;
 				address = (address + PMD_SIZE) & PMD_MASK;
 				continue;
 			}
@@ -260,8 +257,7 @@ int __meminit vmemmap_populate(struct page *start, unsigned long nr, int node)
 			new_page =__pa(vmem_alloc_pages(0));
 			if (!new_page)
 				goto out;
-			pte = pfn_pte(new_page >> PAGE_SHIFT, PAGE_KERNEL);
-			*pt_dir = pte;
+			pte_val(*pt_dir) = __pa(new_page);
 		}
 		address += PAGE_SIZE;
 	}
diff --git a/drivers/s390/char/sclp_cmd.c b/drivers/s390/char/sclp_cmd.c
index c44d13f..30a2255 100644
--- a/drivers/s390/char/sclp_cmd.c
+++ b/drivers/s390/char/sclp_cmd.c
@@ -56,7 +56,6 @@ static int __initdata early_read_info_sccb_valid;
 
 u64 sclp_facilities;
 static u8 sclp_fac84;
-static u8 sclp_fac85;
 static unsigned long long rzm;
 static unsigned long long rnmax;
 
@@ -131,7 +130,8 @@ void __init sclp_facilities_detect(void)
 	sccb = &early_read_info_sccb;
 	sclp_facilities = sccb->facilities;
 	sclp_fac84 = sccb->fac84;
-	sclp_fac85 = sccb->fac85;
+	if (sccb->fac85 & 0x02)
+		S390_lowcore.machine_flags |= MACHINE_FLAG_ESOP;
 	rnmax = sccb->rnmax ? sccb->rnmax : sccb->rnmax2;
 	rzm = sccb->rnsize ? sccb->rnsize : sccb->rnsize2;
 	rzm <<= 20;
@@ -171,12 +171,6 @@ unsigned long long sclp_get_rzm(void)
 	return rzm;
 }
 
-u8 sclp_get_fac85(void)
-{
-	return sclp_fac85;
-}
-EXPORT_SYMBOL_GPL(sclp_get_fac85);
-
 /*
  * This function will be called after sclp_facilities_detect(), which gets
  * called from early.c code. Therefore the sccb should have valid contents.
diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index 83b54ed..bdd7fac 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -197,16 +197,6 @@ static inline int pmd_same(pmd_t pmd_a, pmd_t pmd_b)
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 #endif
 
-#ifndef __HAVE_ARCH_PAGE_TEST_AND_CLEAR_DIRTY
-#define page_test_and_clear_dirty(pfn, mapped)	(0)
-#endif
-
-#ifndef __HAVE_ARCH_PAGE_TEST_AND_CLEAR_DIRTY
-#define pte_maybe_dirty(pte)		pte_dirty(pte)
-#else
-#define pte_maybe_dirty(pte)		(1)
-#endif
-
 #ifndef __HAVE_ARCH_PAGE_TEST_AND_CLEAR_YOUNG
 #define page_test_and_clear_young(pfn) (0)
 #endif
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index b5d1384..4c0c8eb 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -303,21 +303,13 @@ static inline void __SetPageUptodate(struct page *page)
 
 static inline void SetPageUptodate(struct page *page)
 {
-#ifdef CONFIG_S390
-	if (!test_and_set_bit(PG_uptodate, &page->flags))
-		page_set_storage_key(page_to_phys(page), PAGE_DEFAULT_KEY, 0);
-#else
 	/*
 	 * Memory barrier must be issued before setting the PG_uptodate bit,
 	 * so that all previous stores issued in order to bring the page
 	 * uptodate are actually visible before PageUptodate becomes true.
-	 *
-	 * s390 doesn't need an explicit smp_wmb here because the test and
-	 * set bit already provides full barriers.
 	 */
 	smp_wmb();
 	set_bit(PG_uptodate, &(page)->flags);
-#endif
 }
 
 CLEARPAGEFLAG(Uptodate, uptodate)
diff --git a/mm/rmap.c b/mm/rmap.c
index face808..ef75a7d 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1144,29 +1144,6 @@ void page_remove_rmap(struct page *page)
 		goto out;
 
 	/*
-	 * Now that the last pte has gone, s390 must transfer dirty
-	 * flag from storage key to struct page.  We can usually skip
-	 * this if the page is anon, so about to be freed; but perhaps
-	 * not if it's in swapcache - there might be another pte slot
-	 * containing the swap entry, but page not yet written to swap.
-	 *
-	 * And we can skip it on file pages, so long as the filesystem
-	 * participates in dirty tracking (note that this is not only an
-	 * optimization but also solves problems caused by dirty flag in
-	 * storage key getting set by a write from inside kernel); but need to
-	 * catch shm and tmpfs and ramfs pages which have been modified since
-	 * creation by read fault.
-	 *
-	 * Note that mapping must be decided above, before decrementing
-	 * mapcount (which luckily provides a barrier): once page is unmapped,
-	 * it could be truncated and page->mapping reset to NULL at any moment.
-	 * Note also that we are relying on page_mapping(page) to set mapping
-	 * to &swapper_space when PageSwapCache(page).
-	 */
-	if (mapping && !mapping_cap_account_dirty(mapping) &&
-	    page_test_and_clear_dirty(page_to_pfn(page), 1))
-		set_page_dirty(page);
-	/*
 	 * Hugepages are not counted in NR_ANON_PAGES nor NR_FILE_MAPPED
 	 * and not charged by memcg for now.
 	 */
-- 
1.7.12.4

-- 
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 61+ messages in thread

* Re: [PATCH] mm: Fix XFS oops due to dirty pages without buffers on s390
  2012-10-23 21:56     ` Andrew Morton
  2012-10-24  8:30       ` Martin Schwidefsky
@ 2012-10-25 20:01       ` Jan Kara
  2012-12-14  8:45         ` Martin Schwidefsky
  1 sibling, 1 reply; 61+ messages in thread
From: Jan Kara @ 2012-10-25 20:01 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, linux-mm, Martin Schwidefsky, Mel Gorman, linux-s390,
	Hugh Dickins

[-- Attachment #1: Type: text/plain, Size: 1697 bytes --]

On Tue 23-10-12 14:56:36, Andrew Morton wrote:
> On Tue, 23 Oct 2012 12:21:53 +0200
> Jan Kara <jack@suse.cz> wrote:
> 
> > > That seems a fairly serious problem.  To which kernel version(s) should
> > > we apply the fix?
> >   Well, XFS will crash starting from 2.6.36 kernel where the assertion was
> > added. Previously XFS just silently added buffers (as other filesystems do
> > it) and wrote / redirtied the page (unnecessarily). So looking into
> > maintained -stable branches I think pushing the patch to -stable from 3.0
> > on should be enough.
> 
> OK, thanks, I made it so.
> 
> > > > diff --git a/mm/rmap.c b/mm/rmap.c
> > > 
> > > It's a bit surprising that none of the added comments mention the s390
> > > pte-dirtying oddity.  I don't see an obvious place to mention this, but
> > > I for one didn't know about this and it would be good if we could
> > > capture the info _somewhere_?
> >   As Hugh says, the comment before page_test_and_clear_dirty() is somewhat
> > updated. But do you mean recording somewhere the catch that s390 HW dirty
> > bit gets set also whenever we write to a page from kernel?
> 
> Yes, this.  It's surprising behaviour which we may trip over again, so
> how do we inform developers about it?
> 
> > I guess we could
> > add that also to the comment before page_test_and_clear_dirty() in
> > page_remove_rmap() and also before definition of
> > page_test_and_clear_dirty(). So most people that will add / remove these
> > calls will be warned. OK?
> 
> Sounds good, thanks.
  OK, the patch is attached. As Martin says, it may be obsolete soon but just
in case Martin's patch set gets delayed...

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

[-- Attachment #2: 0001-mm-Comment-on-storage-key-dirty-bit-semantics.patch --]
[-- Type: text/x-patch, Size: 0 bytes --]



^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH] mm: Fix XFS oops due to dirty pages without buffers on s390
  2012-10-23 21:56     ` Andrew Morton
@ 2012-10-24  8:30       ` Martin Schwidefsky
  2012-10-25 20:01       ` Jan Kara
  1 sibling, 0 replies; 61+ messages in thread
From: Martin Schwidefsky @ 2012-10-24  8:30 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Jan Kara, linux-mm, Mel Gorman, linux-s390, Hugh Dickins

On Tue, 23 Oct 2012 14:56:36 -0700
Andrew Morton <akpm@linux-foundation.org> wrote:

> On Tue, 23 Oct 2012 12:21:53 +0200
> Jan Kara <jack@suse.cz> wrote:
> 
> > > > diff --git a/mm/rmap.c b/mm/rmap.c
> > > 
> > > It's a bit surprising that none of the added comments mention the s390
> > > pte-dirtying oddity.  I don't see an obvious place to mention this, but
> > > I for one didn't know about this and it would be good if we could
> > > capture the info _somewhere_?
> >   As Hugh says, the comment before page_test_and_clear_dirty() is somewhat
> > updated. But do you mean recording somewhere the catch that s390 HW dirty
> > bit gets set also whenever we write to a page from kernel?
> 
> Yes, this.  It's surprising behaviour which we may trip over again, so
> how do we inform developers about it?

That is what I worry about as well. It is not the first time we tripped over
the per-page dirty bit and I guess it won't be the last time. Therefore I
created a patch to switch s390 over to fault based dirty bits, the sneak
performance test are promising. If we do not find any major performance
degradation this would be my preferred way to fix this problem for good.

-- 
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH] mm: Fix XFS oops due to dirty pages without buffers on s390
  2012-10-23 10:21   ` Jan Kara
@ 2012-10-23 21:56     ` Andrew Morton
  2012-10-24  8:30       ` Martin Schwidefsky
  2012-10-25 20:01       ` Jan Kara
  0 siblings, 2 replies; 61+ messages in thread
From: Andrew Morton @ 2012-10-23 21:56 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-mm, Martin Schwidefsky, Mel Gorman, linux-s390, Hugh Dickins

On Tue, 23 Oct 2012 12:21:53 +0200
Jan Kara <jack@suse.cz> wrote:

> > That seems a fairly serious problem.  To which kernel version(s) should
> > we apply the fix?
>   Well, XFS will crash starting from 2.6.36 kernel where the assertion was
> added. Previously XFS just silently added buffers (as other filesystems do
> it) and wrote / redirtied the page (unnecessarily). So looking into
> maintained -stable branches I think pushing the patch to -stable from 3.0
> on should be enough.

OK, thanks, I made it so.

> > > diff --git a/mm/rmap.c b/mm/rmap.c
> > 
> > It's a bit surprising that none of the added comments mention the s390
> > pte-dirtying oddity.  I don't see an obvious place to mention this, but
> > I for one didn't know about this and it would be good if we could
> > capture the info _somewhere_?
>   As Hugh says, the comment before page_test_and_clear_dirty() is somewhat
> updated. But do you mean recording somewhere the catch that s390 HW dirty
> bit gets set also whenever we write to a page from kernel?

Yes, this.  It's surprising behaviour which we may trip over again, so
how do we inform developers about it?

> I guess we could
> add that also to the comment before page_test_and_clear_dirty() in
> page_remove_rmap() and also before definition of
> page_test_and_clear_dirty(). So most people that will add / remove these
> calls will be warned. OK?

Sounds good, thanks.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH] mm: Fix XFS oops due to dirty pages without buffers on s390
  2012-10-22 19:38 ` Andrew Morton
  2012-10-23  4:40   ` Hugh Dickins
@ 2012-10-23 10:21   ` Jan Kara
  2012-10-23 21:56     ` Andrew Morton
  1 sibling, 1 reply; 61+ messages in thread
From: Jan Kara @ 2012-10-23 10:21 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, linux-mm, Martin Schwidefsky, Mel Gorman, linux-s390,
	Hugh Dickins

On Mon 22-10-12 12:38:52, Andrew Morton wrote:
> On Mon, 22 Oct 2012 17:06:46 +0200
> Jan Kara <jack@suse.cz> wrote:
> 
> > On s390 any write to a page (even from kernel itself) sets architecture
> > specific page dirty bit. Thus when a page is written to via buffered write, HW
> > dirty bit gets set and when we later map and unmap the page, page_remove_rmap()
> > finds the dirty bit and calls set_page_dirty().
> > 
> > Dirtying of a page which shouldn't be dirty can cause all sorts of problems to
> > filesystems. The bug we observed in practice is that buffers from the page get
> > freed, so when the page gets later marked as dirty and writeback writes it, XFS
> > crashes due to an assertion BUG_ON(!PagePrivate(page)) in page_buffers() called
> > from xfs_count_page_state().
> > 
> > Similar problem can also happen when zero_user_segment() call from
> > xfs_vm_writepage() (or block_write_full_page() for that matter) set the
> > hardware dirty bit during writeback, later buffers get freed, and then page
> > unmapped.
> > 
> > Fix the issue by ignoring s390 HW dirty bit for page cache pages of mappings
> > with mapping_cap_account_dirty(). This is safe because for such mappings when a
> > page gets marked as writeable in PTE it is also marked dirty in do_wp_page() or
> > do_page_fault(). When the dirty bit is cleared by clear_page_dirty_for_io(),
> > the page gets writeprotected in page_mkclean(). So pagecache page is writeable
> > if and only if it is dirty.
> > 
> > Thanks to Hugh Dickins <hughd@google.com> for pointing out mapping has to have
> > mapping_cap_account_dirty() for things to work and proposing a cleaned up
> > variant of the patch.
> > 
> > The patch has survived about two hours of running fsx-linux on tmpfs while
> > heavily swapping and several days of running on out build machines where the
> > original problem was triggered.
> 
> That seems a fairly serious problem.  To which kernel version(s) should
> we apply the fix?
  Well, XFS will crash starting from 2.6.36 kernel where the assertion was
added. Previously XFS just silently added buffers (as other filesystems do
it) and wrote / redirtied the page (unnecessarily). So looking into
maintained -stable branches I think pushing the patch to -stable from 3.0
on should be enough.

> > diff --git a/mm/rmap.c b/mm/rmap.c
> 
> It's a bit surprising that none of the added comments mention the s390
> pte-dirtying oddity.  I don't see an obvious place to mention this, but
> I for one didn't know about this and it would be good if we could
> capture the info _somewhere_?
  As Hugh says, the comment before page_test_and_clear_dirty() is somewhat
updated. But do you mean recording somewhere the catch that s390 HW dirty
bit gets set also whenever we write to a page from kernel? I guess we could
add that also to the comment before page_test_and_clear_dirty() in
page_remove_rmap() and also before definition of
page_test_and_clear_dirty(). So most people that will add / remove these
calls will be warned. OK?

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH] mm: Fix XFS oops due to dirty pages without buffers on s390
  2012-10-22 19:38 ` Andrew Morton
@ 2012-10-23  4:40   ` Hugh Dickins
  2012-10-23 10:21   ` Jan Kara
  1 sibling, 0 replies; 61+ messages in thread
From: Hugh Dickins @ 2012-10-23  4:40 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, linux-mm, Martin Schwidefsky, Mel Gorman, linux-s390

On Mon, 22 Oct 2012, Andrew Morton wrote:
> On Mon, 22 Oct 2012 17:06:46 +0200
> Jan Kara <jack@suse.cz> wrote:
> 
> > On s390 any write to a page (even from kernel itself) sets architecture
> > specific page dirty bit. Thus when a page is written to via buffered write, HW
> > dirty bit gets set and when we later map and unmap the page, page_remove_rmap()
> > finds the dirty bit and calls set_page_dirty().
> > 
> > Dirtying of a page which shouldn't be dirty can cause all sorts of problems to
> > filesystems. The bug we observed in practice is that buffers from the page get
> > freed, so when the page gets later marked as dirty and writeback writes it, XFS
> > crashes due to an assertion BUG_ON(!PagePrivate(page)) in page_buffers() called
> > from xfs_count_page_state().
> > 
> > Similar problem can also happen when zero_user_segment() call from
> > xfs_vm_writepage() (or block_write_full_page() for that matter) set the
> > hardware dirty bit during writeback, later buffers get freed, and then page
> > unmapped.
> > 
> > Fix the issue by ignoring s390 HW dirty bit for page cache pages of mappings
> > with mapping_cap_account_dirty(). This is safe because for such mappings when a
> > page gets marked as writeable in PTE it is also marked dirty in do_wp_page() or
> > do_page_fault(). When the dirty bit is cleared by clear_page_dirty_for_io(),
> > the page gets writeprotected in page_mkclean(). So pagecache page is writeable
> > if and only if it is dirty.
> > 
> > Thanks to Hugh Dickins <hughd@google.com> for pointing out mapping has to have
> > mapping_cap_account_dirty() for things to work and proposing a cleaned up
> > variant of the patch.
> > 
> > The patch has survived about two hours of running fsx-linux on tmpfs while
> > heavily swapping and several days of running on out build machines where the
> > original problem was triggered.
> 
> That seems a fairly serious problem.  To which kernel version(s) should
> we apply the fix?

That I'll leave Jan and/or Martin to answer.

> 
> > diff --git a/mm/rmap.c b/mm/rmap.c
> 
> It's a bit surprising that none of the added comments mention the s390
> pte-dirtying oddity.  I don't see an obvious place to mention this, but
> I for one didn't know about this and it would be good if we could
> capture the info _somewhere_?

I think it's okay: the comment you can see in Jan's patch is extending
this existing comment in page_remove_rmap(), that I added sometime in
the past (largely because "page_test_and_clear_dirty" sounds so
magisterially generic, when in actuality it's specific to s390):

	/*
	 * Now that the last pte has gone, s390 must transfer dirty
	 * flag from storage key to struct page.  We can usually skip
	 * this if the page is anon, so about to be freed; but perhaps
	 * not if it's in swapcache - there might be another pte slot
	 * containing the swap entry, but page not yet written to swap.
	 */

And one of the delights of Jan's patch is that it removes the other
callsite completely.

Hugh

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH] mm: Fix XFS oops due to dirty pages without buffers on s390
  2012-10-22 15:06 Jan Kara
@ 2012-10-22 19:38 ` Andrew Morton
  2012-10-23  4:40   ` Hugh Dickins
  2012-10-23 10:21   ` Jan Kara
  0 siblings, 2 replies; 61+ messages in thread
From: Andrew Morton @ 2012-10-22 19:38 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-mm, Martin Schwidefsky, Mel Gorman, linux-s390, Hugh Dickins

On Mon, 22 Oct 2012 17:06:46 +0200
Jan Kara <jack@suse.cz> wrote:

> On s390 any write to a page (even from kernel itself) sets architecture
> specific page dirty bit. Thus when a page is written to via buffered write, HW
> dirty bit gets set and when we later map and unmap the page, page_remove_rmap()
> finds the dirty bit and calls set_page_dirty().
> 
> Dirtying of a page which shouldn't be dirty can cause all sorts of problems to
> filesystems. The bug we observed in practice is that buffers from the page get
> freed, so when the page gets later marked as dirty and writeback writes it, XFS
> crashes due to an assertion BUG_ON(!PagePrivate(page)) in page_buffers() called
> from xfs_count_page_state().
> 
> Similar problem can also happen when zero_user_segment() call from
> xfs_vm_writepage() (or block_write_full_page() for that matter) set the
> hardware dirty bit during writeback, later buffers get freed, and then page
> unmapped.
> 
> Fix the issue by ignoring s390 HW dirty bit for page cache pages of mappings
> with mapping_cap_account_dirty(). This is safe because for such mappings when a
> page gets marked as writeable in PTE it is also marked dirty in do_wp_page() or
> do_page_fault(). When the dirty bit is cleared by clear_page_dirty_for_io(),
> the page gets writeprotected in page_mkclean(). So pagecache page is writeable
> if and only if it is dirty.
> 
> Thanks to Hugh Dickins <hughd@google.com> for pointing out mapping has to have
> mapping_cap_account_dirty() for things to work and proposing a cleaned up
> variant of the patch.
> 
> The patch has survived about two hours of running fsx-linux on tmpfs while
> heavily swapping and several days of running on out build machines where the
> original problem was triggered.

That seems a fairly serious problem.  To which kernel version(s) should
we apply the fix?

> diff --git a/mm/rmap.c b/mm/rmap.c

It's a bit surprising that none of the added comments mention the s390
pte-dirtying oddity.  I don't see an obvious place to mention this, but
I for one didn't know about this and it would be good if we could
capture the info _somewhere_?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 61+ messages in thread

* [PATCH] mm: Fix XFS oops due to dirty pages without buffers on s390
@ 2012-10-22 15:06 Jan Kara
  2012-10-22 19:38 ` Andrew Morton
  0 siblings, 1 reply; 61+ messages in thread
From: Jan Kara @ 2012-10-22 15:06 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, Jan Kara, Martin Schwidefsky, Mel Gorman, linux-s390,
	Hugh Dickins

On s390 any write to a page (even from kernel itself) sets architecture
specific page dirty bit. Thus when a page is written to via buffered write, HW
dirty bit gets set and when we later map and unmap the page, page_remove_rmap()
finds the dirty bit and calls set_page_dirty().

Dirtying of a page which shouldn't be dirty can cause all sorts of problems to
filesystems. The bug we observed in practice is that buffers from the page get
freed, so when the page gets later marked as dirty and writeback writes it, XFS
crashes due to an assertion BUG_ON(!PagePrivate(page)) in page_buffers() called
from xfs_count_page_state().

Similar problem can also happen when zero_user_segment() call from
xfs_vm_writepage() (or block_write_full_page() for that matter) set the
hardware dirty bit during writeback, later buffers get freed, and then page
unmapped.

Fix the issue by ignoring s390 HW dirty bit for page cache pages of mappings
with mapping_cap_account_dirty(). This is safe because for such mappings when a
page gets marked as writeable in PTE it is also marked dirty in do_wp_page() or
do_page_fault(). When the dirty bit is cleared by clear_page_dirty_for_io(),
the page gets writeprotected in page_mkclean(). So pagecache page is writeable
if and only if it is dirty.

Thanks to Hugh Dickins <hughd@google.com> for pointing out mapping has to have
mapping_cap_account_dirty() for things to work and proposing a cleaned up
variant of the patch.

The patch has survived about two hours of running fsx-linux on tmpfs while
heavily swapping and several days of running on out build machines where the
original problem was triggered.

CC: Martin Schwidefsky <schwidefsky@de.ibm.com>
CC: Mel Gorman <mgorman@suse.de>
CC: linux-s390@vger.kernel.org
CC: Hugh Dickins <hughd@google.com>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 mm/rmap.c |   20 +++++++++++++++-----
 1 files changed, 15 insertions(+), 5 deletions(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index 7df7984..2ee1ef0 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -56,6 +56,7 @@
 #include <linux/mmu_notifier.h>
 #include <linux/migrate.h>
 #include <linux/hugetlb.h>
+#include <linux/backing-dev.h>

 #include <asm/tlbflush.h>

@@ -926,11 +927,8 @@ int page_mkclean(struct page *page)

 	if (page_mapped(page)) {
 		struct address_space *mapping = page_mapping(page);
-		if (mapping) {
+		if (mapping)
 			ret = page_mkclean_file(mapping, page);
-			if (page_test_and_clear_dirty(page_to_pfn(page), 1))
-				ret = 1;
-		}
 	}

 	return ret;
@@ -1116,6 +1114,7 @@ void page_add_file_rmap(struct page *page)
  */
 void page_remove_rmap(struct page *page)
 {
+	struct address_space *mapping = page_mapping(page);
 	bool anon = PageAnon(page);
 	bool locked;
 	unsigned long flags;
@@ -1138,8 +1137,19 @@ void page_remove_rmap(struct page *page)
 	 * this if the page is anon, so about to be freed; but perhaps
 	 * not if it's in swapcache - there might be another pte slot
 	 * containing the swap entry, but page not yet written to swap.
+	 *
+	 * And we can skip it on file pages, so long as the filesystem
+	 * participates in dirty tracking; but need to catch shm and tmpfs
+	 * and ramfs pages which have been modified since creation by read
+	 * fault.
+	 *
+	 * Note that mapping must be decided above, before decrementing
+	 * mapcount (which luckily provides a barrier): once page is unmapped,
+	 * it could be truncated and page->mapping reset to NULL at any moment.
+	 * Note also that we are relying on page_mapping(page) to set mapping
+	 * to &swapper_space when PageSwapCache(page).
 	 */
-	if ((!anon || PageSwapCache(page)) &&
+	if (mapping && !mapping_cap_account_dirty(mapping) &&
 	    page_test_and_clear_dirty(page_to_pfn(page), 1))
 		set_page_dirty(page);
 	/*
-- 
1.7.1

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 61+ messages in thread

end of thread, other threads:[~2012-12-18  7:30 UTC | newest]

Thread overview: 61+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-10-01 16:26 [PATCH] mm: Fix XFS oops due to dirty pages without buffers on s390 Jan Kara
2012-10-01 16:26 ` Jan Kara
2012-10-01 16:26 ` Jan Kara
2012-10-08 14:28 ` Mel Gorman
2012-10-08 14:28   ` Mel Gorman
2012-10-08 14:28   ` Mel Gorman
2012-10-09  4:24 ` Hugh Dickins
2012-10-09  4:24   ` Hugh Dickins
2012-10-09  4:24   ` Hugh Dickins
2012-10-09  8:18   ` Martin Schwidefsky
2012-10-09  8:18     ` Martin Schwidefsky
2012-10-09  8:18     ` Martin Schwidefsky
2012-10-09 23:21     ` Hugh Dickins
2012-10-09 23:21       ` Hugh Dickins
2012-10-09 23:21       ` Hugh Dickins
2012-10-10 21:57       ` Hugh Dickins
2012-10-10 21:57         ` Hugh Dickins
2012-10-10 21:57         ` Hugh Dickins
2012-10-19 14:38       ` Martin Schwidefsky
2012-10-19 14:38         ` Martin Schwidefsky
2012-10-19 14:38         ` Martin Schwidefsky
2012-10-09  9:32   ` Mel Gorman
2012-10-09  9:32     ` Mel Gorman
2012-10-09  9:32     ` Mel Gorman
2012-10-09 23:00     ` Hugh Dickins
2012-10-09 23:00       ` Hugh Dickins
2012-10-09 23:00       ` Hugh Dickins
2012-10-09 16:21   ` Jan Kara
2012-10-09 16:21     ` Jan Kara
2012-10-09 16:21     ` Jan Kara
2012-10-10  2:19     ` Hugh Dickins
2012-10-10  2:19       ` Hugh Dickins
2012-10-10  2:19       ` Hugh Dickins
2012-10-10  8:55       ` Jan Kara
2012-10-10  8:55         ` Jan Kara
2012-10-10  8:55         ` Jan Kara
2012-10-10 21:28         ` Hugh Dickins
2012-10-10 21:28           ` Hugh Dickins
2012-10-10 21:28           ` Hugh Dickins
2012-10-11  7:42           ` Martin Schwidefsky
2012-10-11  7:42             ` Martin Schwidefsky
2012-10-11  7:42             ` Martin Schwidefsky
2012-10-10 21:56       ` Dave Chinner
2012-10-10 21:56         ` Dave Chinner
2012-10-10 21:56         ` Dave Chinner
2012-10-11  7:44         ` Martin Schwidefsky
2012-10-11  7:44           ` Martin Schwidefsky
2012-10-11  7:44           ` Martin Schwidefsky
2012-10-17  0:43       ` Jan Kara
2012-10-17  0:43         ` Jan Kara
2012-10-17  0:43         ` Jan Kara
2012-10-22 15:06 Jan Kara
2012-10-22 19:38 ` Andrew Morton
2012-10-23  4:40   ` Hugh Dickins
2012-10-23 10:21   ` Jan Kara
2012-10-23 21:56     ` Andrew Morton
2012-10-24  8:30       ` Martin Schwidefsky
2012-10-25 20:01       ` Jan Kara
2012-12-14  8:45         ` Martin Schwidefsky
2012-12-17 23:31           ` Hugh Dickins
2012-12-18  7:30             ` Martin Schwidefsky

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.