All of lore.kernel.org
 help / color / mirror / Atom feed
* [RFC PATCH] s390: mm: rmap: Transfer storage key to struct page under the page lock
@ 2012-04-16 14:14 ` Mel Gorman
  0 siblings, 0 replies; 32+ messages in thread
From: Mel Gorman @ 2012-04-16 14:14 UTC (permalink / raw)
  To: Martin Schwidefsky, Heiko Carstens
  Cc: Hugh Dickins, Rik van Riel, Linux-MM, Linux-S390, LKML

This patch is horribly ugly and there has to be a better way of doing
it. I'm looking for suggestions on what s390 can do here that is not
painful or broken. 

The following bug was reported on s390

kernel BUG at
/usr/src/packages/BUILD/kernel-default-3.0.13/linux-3.0/lib/radix-tree.c:477!
illegal operation: 0001 [#1] SMP
Modules linked in: ext2 dm_round_robin dm_multipath sg sd_mod crc_t10dif fuse
loop dm_mod ipv6 qeth_l3 ipv6_lib zfcp scsi_transport_fc scsi_tgt qeth qdio
ccwgroup ext3 jbd mbcache dasd_eckd_mod dasd_mod scsi_dh_rdac scsi_dh_emc
scsi_dh_alua scsi_dh_hp_sw scsi_dh scsi_mod
Supported: Yes
CPU: 3 Not tainted 3.0.13-0.27-default #1
Process kpartx_id (pid: 24381, task: 000000004d938138, ksp: 00000000733539f0)
Krnl PSW : 0404000180000000 000000000037935e
(radix_tree_tag_set+0xfe/0x118)
           R:0 T:1 IO:0 EX:0 Key:0 M:1 W:0 P:0 AS:0 CC:0 PM:0 EA:3
Krnl GPRS: 0000000000000002 0000000000000018 00000000004e82a8 0000000000000000
           0000000000000007 0000000000000003 00000000004e82a8 00000000007280c0
           0000000000000000 0000000000000000 00000000fac55d5f 0000000000000000
           000003e000000006 0000000000529f88 0000000000000000 0000000073353ad0
Krnl Code: 0000000000379352: a7cafffa           ahi     %r12,-6
           0000000000379356: a7f4ffb5           brc     15,3792c0
           000000000037935a: a7f40001           brc     15,37935c
          >000000000037935e: a7f40000           brc     15,37935e
           0000000000379362: b90200bb           ltgr    %r11,%r11
           0000000000379366: a784fff0           brc     8,379346
           000000000037936a: a7f4ffe1           brc     15,37932c
           000000000037936e: a7f40001           brc     15,379370
Call Trace:
([<00000000002080ae>] __set_page_dirty_nobuffers+0x10a/0x1d0)
 [<0000000000234970>] page_remove_rmap+0x100/0x104
 [<0000000000228608>] zap_pte_range+0x194/0x608
 [<0000000000228caa>] unmap_page_range+0x22e/0x33c
 [<0000000000228ec2>] unmap_vmas+0x10a/0x274
 [<000000000022f2f6>] exit_mmap+0xd2/0x254
 [<00000000001472d6>] mmput+0x5e/0x144
 [<000000000014d306>] exit_mm+0x196/0x1d4
 [<000000000014f650>] do_exit+0x18c/0x408
 [<000000000014f92c>] do_group_exit+0x60/0xec
 [<000000000014f9ea>] SyS_exit_group+0x32/0x40
 [<00000000004e1660>] sysc_noemu+0x16/0x1c
 [<000003fffd23ab96>] 0x3fffd23ab96
Last Breaking-Event-Address:
 [<000000000037935a>] radix_tree_tag_set+0xfa/0x118

While this bug was reproduced on a 3.0.x kernel, there is no reason why
it should not happen in mainline.

The bug was triggered because a page had a valid mapping but by the time
set_page_dirty() was called there was no valid entry in the radix tree.
This was reproduced while underlying storage was unplugged but this may
be indirectly related to the problem.

This bug only triggers on s390 and may be explained by a race. Normally
when pages are being readahead in read_cache_pages(), an attempt is made
to add the page to the page cache. If that fails, the page is invalidated
by locking it, giving it a valid mapping, calling do_invalidatepage()
and then setting mapping to NULL again. It's similar with page cache is
being deleted. The page is locked, the tree lock is taken, it's removed
from the radix tree and the mapping is set to NULL.

In many cases looking up the radix tree is protected by a combination of
the page lock and the tree lock. When zapping pages, the page lock is not
taken which does not matter as most architectures do nothing special with
the mapping and for file-backed pages, the mapping should be pinned by
the open file. However, s390 also calls set_dirty_page() for
PageSwapCache() pages where it is potentially racing against
reuse_swap_page() for example.

This patch splits the propagation of the storage key into a separate
function and introduces a new variant of page_remove_rmap() called
page_remove_rmap_nolock() which is only used during page table zapping
that attempts to acquire the page lock. The approach is ugly although it
works in that the bug can no longer be triggered. At the time the page
lock is required, the PTL is held so it cannot sleep so it busy waits on
trylock_page(). That potentially deadlocks against reclaim which holds
the page table lock and is trying to acquire the pagetable lock so in
some cases there is no choice except to race. In the file-backed case,
this is ok because the address space will be valid (zap_pte_range() makes
the same assumption. However, s390 needs a better way of guarding against
PageSwapCache pages being removed from the radix tree while set_page_dirty()
is being called. The patch would be marginally better if in the PageSwapCache
case we simply tried to lock once and in the contended case just fail to
propogate the storage key. I lack familiarity with the s390 architecture
to be certain if this is safe or not. Suggestions on a better fix?

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/rmap.h |    1 +
 mm/memory.c          |    2 +-
 mm/rmap.c            |  109 +++++++++++++++++++++++++++++++++++++++++++-------
 3 files changed, 96 insertions(+), 16 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 2148b12..59146af 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -142,6 +142,7 @@ void do_page_add_anon_rmap(struct page *, struct vm_area_struct *,
 void page_add_new_anon_rmap(struct page *, struct vm_area_struct *, unsigned long);
 void page_add_file_rmap(struct page *);
 void page_remove_rmap(struct page *);
+void page_remove_rmap_nolock(struct page *);
 
 void hugepage_add_anon_rmap(struct page *, struct vm_area_struct *,
 			    unsigned long);
diff --git a/mm/memory.c b/mm/memory.c
index f1d788e..4f1bf96 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1184,7 +1184,7 @@ again:
 					mark_page_accessed(page);
 				rss[MM_FILEPAGES]--;
 			}
-			page_remove_rmap(page);
+			page_remove_rmap_nolock(page);
 			if (unlikely(page_mapcount(page) < 0))
 				print_bad_pte(vma, addr, ptent, page);
 			force_flush = !__tlb_remove_page(tlb, page);
diff --git a/mm/rmap.c b/mm/rmap.c
index 6546da9..9d2279b 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1114,29 +1114,81 @@ void page_add_file_rmap(struct page *page)
 	}
 }
 
-/**
- * page_remove_rmap - take down pte mapping from a page
- * @page: page to remove mapping from
+/*
+ * When the late PTE has gone, s390 must transfer the dirty flag from the
+ * storage key to struct page. We can usually skip this if the page is anon,
+ * so about to be freed; but perhaps not if it's in swapcache - there might
+ * be another pte slot containing the swap entry, but page not yet written to
+ * swap.
  *
- * The caller needs to hold the pte lock.
+ * set_page_dirty() is called while the page_mapcount is still postive and
+ * under the page lock to avoid races with the mapping being invalidated.
  */
-void page_remove_rmap(struct page *page)
+static void propogate_storage_key(struct page *page, bool lock_required)
+{
+	if (page_mapcount(page) == 1 &&
+			(!PageAnon(page) || PageSwapCache(page)) &&
+	    		page_test_and_clear_dirty(page_to_pfn(page), 1)) {
+		if (lock_required) {
+			bool locked;
+
+			/*
+			 * This is called from zap_pte_range which holds the
+			 * PTL. The page lock is normally needed to avoid
+			 * truncation races and cannot sleep as a result.
+			 * During zap_pte_range, it is assumed that the open
+			 * file pins the mapping and a check is made under the
+			 * tree_lock. The same does not hold true for SwapCache
+			 * pages because it may be getting reused.
+			 *
+			 * Ideally we would take the page lock but in this
+			 * context the PTL is held so we can't sleep. We
+			 * are also potentially contending with processes
+			 * in reclaim context that hold the page lock but
+			 * are trying to acquire the PTL which leads to
+			 * an ABBA deadlock.
+			 *
+			 * Hence this resulting mess. s390 needs a better
+			 * way to guard against races. Suggestions?
+			 */
+			while ((locked = trylock_page(page)) == false) {
+				cpu_relax();
+
+				if (need_resched())
+					break;
+			}
+
+			/* If the page is locked, it's safe to call set_page_dirty */
+			if (locked) {
+				set_page_dirty(page);
+				unlock_page(page);
+			} else {
+
+				/*
+				 * For swap cache pages, we have no real choice
+				 * except to lose the storage key information
+				 * and expect the new user to preserve dirty
+				 * information.
+				 *
+				 * For file-backed pages, the open file should
+				 * be enough to pin the mapping similar which
+				 * is also assumed by zap_pte_range().
+				 */
+				if (WARN_ON_ONCE(!PageSwapCache(page)))
+					set_page_dirty(page);
+			}
+		} else
+			set_page_dirty(page);
+	}
+}
+
+static void __page_remove_rmap(struct page *page)
 {
 	/* page still mapped by someone else? */
 	if (!atomic_add_negative(-1, &page->_mapcount))
 		return;
 
 	/*
-	 * Now that the last pte has gone, s390 must transfer dirty
-	 * flag from storage key to struct page.  We can usually skip
-	 * this if the page is anon, so about to be freed; but perhaps
-	 * not if it's in swapcache - there might be another pte slot
-	 * containing the swap entry, but page not yet written to swap.
-	 */
-	if ((!PageAnon(page) || PageSwapCache(page)) &&
-	    page_test_and_clear_dirty(page_to_pfn(page), 1))
-		set_page_dirty(page);
-	/*
 	 * Hugepages are not counted in NR_ANON_PAGES nor NR_FILE_MAPPED
 	 * and not charged by memcg for now.
 	 */
@@ -1164,6 +1216,33 @@ void page_remove_rmap(struct page *page)
 	 */
 }
 
+/**
+ * page_remove_rmap - take down pte mapping from an unlocked page
+ * @page: page to remove mapping from
+ *
+ * The caller needs to hold the pte lock.
+ */
+void page_remove_rmap(struct page *page)
+{
+	propogate_storage_key(page, false);
+	__page_remove_rmap(page);
+}
+
+/**
+ * page_remove_rmap_nolock - take down pte mapping where the caller does not have the mapping pinned
+ * @page: page to remove mapping from
+ *
+ * The caller needs to hold the PTE lock and is called from a context where
+ * the page is neither locked nor the mapping->host pinned. On s390 in this
+ * case the page lock will be taken to pin the mapping if the page needs to
+ * be set dirty.
+ */
+void page_remove_rmap_nolock(struct page *page)
+{
+	propogate_storage_key(page, true);
+	__page_remove_rmap(page);
+}
+
 /*
  * Subfunctions of try_to_unmap: try_to_unmap_one called
  * repeatedly from either try_to_unmap_anon or try_to_unmap_file.

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [RFC PATCH] s390: mm: rmap: Transfer storage key to struct page under the page lock
@ 2012-04-16 14:14 ` Mel Gorman
  0 siblings, 0 replies; 32+ messages in thread
From: Mel Gorman @ 2012-04-16 14:14 UTC (permalink / raw)
  To: Martin Schwidefsky, Heiko Carstens
  Cc: Hugh Dickins, Rik van Riel, Linux-MM, Linux-S390, LKML

This patch is horribly ugly and there has to be a better way of doing
it. I'm looking for suggestions on what s390 can do here that is not
painful or broken. 

The following bug was reported on s390

kernel BUG at
/usr/src/packages/BUILD/kernel-default-3.0.13/linux-3.0/lib/radix-tree.c:477!
illegal operation: 0001 [#1] SMP
Modules linked in: ext2 dm_round_robin dm_multipath sg sd_mod crc_t10dif fuse
loop dm_mod ipv6 qeth_l3 ipv6_lib zfcp scsi_transport_fc scsi_tgt qeth qdio
ccwgroup ext3 jbd mbcache dasd_eckd_mod dasd_mod scsi_dh_rdac scsi_dh_emc
scsi_dh_alua scsi_dh_hp_sw scsi_dh scsi_mod
Supported: Yes
CPU: 3 Not tainted 3.0.13-0.27-default #1
Process kpartx_id (pid: 24381, task: 000000004d938138, ksp: 00000000733539f0)
Krnl PSW : 0404000180000000 000000000037935e
(radix_tree_tag_set+0xfe/0x118)
           R:0 T:1 IO:0 EX:0 Key:0 M:1 W:0 P:0 AS:0 CC:0 PM:0 EA:3
Krnl GPRS: 0000000000000002 0000000000000018 00000000004e82a8 0000000000000000
           0000000000000007 0000000000000003 00000000004e82a8 00000000007280c0
           0000000000000000 0000000000000000 00000000fac55d5f 0000000000000000
           000003e000000006 0000000000529f88 0000000000000000 0000000073353ad0
Krnl Code: 0000000000379352: a7cafffa           ahi     %r12,-6
           0000000000379356: a7f4ffb5           brc     15,3792c0
           000000000037935a: a7f40001           brc     15,37935c
          >000000000037935e: a7f40000           brc     15,37935e
           0000000000379362: b90200bb           ltgr    %r11,%r11
           0000000000379366: a784fff0           brc     8,379346
           000000000037936a: a7f4ffe1           brc     15,37932c
           000000000037936e: a7f40001           brc     15,379370
Call Trace:
([<00000000002080ae>] __set_page_dirty_nobuffers+0x10a/0x1d0)
 [<0000000000234970>] page_remove_rmap+0x100/0x104
 [<0000000000228608>] zap_pte_range+0x194/0x608
 [<0000000000228caa>] unmap_page_range+0x22e/0x33c
 [<0000000000228ec2>] unmap_vmas+0x10a/0x274
 [<000000000022f2f6>] exit_mmap+0xd2/0x254
 [<00000000001472d6>] mmput+0x5e/0x144
 [<000000000014d306>] exit_mm+0x196/0x1d4
 [<000000000014f650>] do_exit+0x18c/0x408
 [<000000000014f92c>] do_group_exit+0x60/0xec
 [<000000000014f9ea>] SyS_exit_group+0x32/0x40
 [<00000000004e1660>] sysc_noemu+0x16/0x1c
 [<000003fffd23ab96>] 0x3fffd23ab96
Last Breaking-Event-Address:
 [<000000000037935a>] radix_tree_tag_set+0xfa/0x118

While this bug was reproduced on a 3.0.x kernel, there is no reason why
it should not happen in mainline.

The bug was triggered because a page had a valid mapping but by the time
set_page_dirty() was called there was no valid entry in the radix tree.
This was reproduced while underlying storage was unplugged but this may
be indirectly related to the problem.

This bug only triggers on s390 and may be explained by a race. Normally
when pages are being readahead in read_cache_pages(), an attempt is made
to add the page to the page cache. If that fails, the page is invalidated
by locking it, giving it a valid mapping, calling do_invalidatepage()
and then setting mapping to NULL again. It's similar with page cache is
being deleted. The page is locked, the tree lock is taken, it's removed
from the radix tree and the mapping is set to NULL.

In many cases looking up the radix tree is protected by a combination of
the page lock and the tree lock. When zapping pages, the page lock is not
taken which does not matter as most architectures do nothing special with
the mapping and for file-backed pages, the mapping should be pinned by
the open file. However, s390 also calls set_dirty_page() for
PageSwapCache() pages where it is potentially racing against
reuse_swap_page() for example.

This patch splits the propagation of the storage key into a separate
function and introduces a new variant of page_remove_rmap() called
page_remove_rmap_nolock() which is only used during page table zapping
that attempts to acquire the page lock. The approach is ugly although it
works in that the bug can no longer be triggered. At the time the page
lock is required, the PTL is held so it cannot sleep so it busy waits on
trylock_page(). That potentially deadlocks against reclaim which holds
the page table lock and is trying to acquire the pagetable lock so in
some cases there is no choice except to race. In the file-backed case,
this is ok because the address space will be valid (zap_pte_range() makes
the same assumption. However, s390 needs a better way of guarding against
PageSwapCache pages being removed from the radix tree while set_page_dirty()
is being called. The patch would be marginally better if in the PageSwapCache
case we simply tried to lock once and in the contended case just fail to
propogate the storage key. I lack familiarity with the s390 architecture
to be certain if this is safe or not. Suggestions on a better fix?

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/rmap.h |    1 +
 mm/memory.c          |    2 +-
 mm/rmap.c            |  109 +++++++++++++++++++++++++++++++++++++++++++-------
 3 files changed, 96 insertions(+), 16 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 2148b12..59146af 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -142,6 +142,7 @@ void do_page_add_anon_rmap(struct page *, struct vm_area_struct *,
 void page_add_new_anon_rmap(struct page *, struct vm_area_struct *, unsigned long);
 void page_add_file_rmap(struct page *);
 void page_remove_rmap(struct page *);
+void page_remove_rmap_nolock(struct page *);
 
 void hugepage_add_anon_rmap(struct page *, struct vm_area_struct *,
 			    unsigned long);
diff --git a/mm/memory.c b/mm/memory.c
index f1d788e..4f1bf96 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1184,7 +1184,7 @@ again:
 					mark_page_accessed(page);
 				rss[MM_FILEPAGES]--;
 			}
-			page_remove_rmap(page);
+			page_remove_rmap_nolock(page);
 			if (unlikely(page_mapcount(page) < 0))
 				print_bad_pte(vma, addr, ptent, page);
 			force_flush = !__tlb_remove_page(tlb, page);
diff --git a/mm/rmap.c b/mm/rmap.c
index 6546da9..9d2279b 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1114,29 +1114,81 @@ void page_add_file_rmap(struct page *page)
 	}
 }
 
-/**
- * page_remove_rmap - take down pte mapping from a page
- * @page: page to remove mapping from
+/*
+ * When the late PTE has gone, s390 must transfer the dirty flag from the
+ * storage key to struct page. We can usually skip this if the page is anon,
+ * so about to be freed; but perhaps not if it's in swapcache - there might
+ * be another pte slot containing the swap entry, but page not yet written to
+ * swap.
  *
- * The caller needs to hold the pte lock.
+ * set_page_dirty() is called while the page_mapcount is still postive and
+ * under the page lock to avoid races with the mapping being invalidated.
  */
-void page_remove_rmap(struct page *page)
+static void propogate_storage_key(struct page *page, bool lock_required)
+{
+	if (page_mapcount(page) == 1 &&
+			(!PageAnon(page) || PageSwapCache(page)) &&
+	    		page_test_and_clear_dirty(page_to_pfn(page), 1)) {
+		if (lock_required) {
+			bool locked;
+
+			/*
+			 * This is called from zap_pte_range which holds the
+			 * PTL. The page lock is normally needed to avoid
+			 * truncation races and cannot sleep as a result.
+			 * During zap_pte_range, it is assumed that the open
+			 * file pins the mapping and a check is made under the
+			 * tree_lock. The same does not hold true for SwapCache
+			 * pages because it may be getting reused.
+			 *
+			 * Ideally we would take the page lock but in this
+			 * context the PTL is held so we can't sleep. We
+			 * are also potentially contending with processes
+			 * in reclaim context that hold the page lock but
+			 * are trying to acquire the PTL which leads to
+			 * an ABBA deadlock.
+			 *
+			 * Hence this resulting mess. s390 needs a better
+			 * way to guard against races. Suggestions?
+			 */
+			while ((locked = trylock_page(page)) == false) {
+				cpu_relax();
+
+				if (need_resched())
+					break;
+			}
+
+			/* If the page is locked, it's safe to call set_page_dirty */
+			if (locked) {
+				set_page_dirty(page);
+				unlock_page(page);
+			} else {
+
+				/*
+				 * For swap cache pages, we have no real choice
+				 * except to lose the storage key information
+				 * and expect the new user to preserve dirty
+				 * information.
+				 *
+				 * For file-backed pages, the open file should
+				 * be enough to pin the mapping similar which
+				 * is also assumed by zap_pte_range().
+				 */
+				if (WARN_ON_ONCE(!PageSwapCache(page)))
+					set_page_dirty(page);
+			}
+		} else
+			set_page_dirty(page);
+	}
+}
+
+static void __page_remove_rmap(struct page *page)
 {
 	/* page still mapped by someone else? */
 	if (!atomic_add_negative(-1, &page->_mapcount))
 		return;
 
 	/*
-	 * Now that the last pte has gone, s390 must transfer dirty
-	 * flag from storage key to struct page.  We can usually skip
-	 * this if the page is anon, so about to be freed; but perhaps
-	 * not if it's in swapcache - there might be another pte slot
-	 * containing the swap entry, but page not yet written to swap.
-	 */
-	if ((!PageAnon(page) || PageSwapCache(page)) &&
-	    page_test_and_clear_dirty(page_to_pfn(page), 1))
-		set_page_dirty(page);
-	/*
 	 * Hugepages are not counted in NR_ANON_PAGES nor NR_FILE_MAPPED
 	 * and not charged by memcg for now.
 	 */
@@ -1164,6 +1216,33 @@ void page_remove_rmap(struct page *page)
 	 */
 }
 
+/**
+ * page_remove_rmap - take down pte mapping from an unlocked page
+ * @page: page to remove mapping from
+ *
+ * The caller needs to hold the pte lock.
+ */
+void page_remove_rmap(struct page *page)
+{
+	propogate_storage_key(page, false);
+	__page_remove_rmap(page);
+}
+
+/**
+ * page_remove_rmap_nolock - take down pte mapping where the caller does not have the mapping pinned
+ * @page: page to remove mapping from
+ *
+ * The caller needs to hold the PTE lock and is called from a context where
+ * the page is neither locked nor the mapping->host pinned. On s390 in this
+ * case the page lock will be taken to pin the mapping if the page needs to
+ * be set dirty.
+ */
+void page_remove_rmap_nolock(struct page *page)
+{
+	propogate_storage_key(page, true);
+	__page_remove_rmap(page);
+}
+
 /*
  * Subfunctions of try_to_unmap: try_to_unmap_one called
  * repeatedly from either try_to_unmap_anon or try_to_unmap_file.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH] s390: mm: rmap: Transfer storage key to struct page under the page lock
  2012-04-16 14:14 ` Mel Gorman
@ 2012-04-16 14:53   ` Rik van Riel
  -1 siblings, 0 replies; 32+ messages in thread
From: Rik van Riel @ 2012-04-16 14:53 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Martin Schwidefsky, Heiko Carstens, Hugh Dickins, Linux-MM,
	Linux-S390, LKML

On 04/16/2012 10:14 AM, Mel Gorman wrote:
> This patch is horribly ugly and there has to be a better way of doing
> it. I'm looking for suggestions on what s390 can do here that is not
> painful or broken.

I'm hoping the S390 arch maintainers have an idea.

Ugly or not, we'll need something to fix the bug.

> + * When the late PTE has gone, s390 must transfer the dirty flag from the
> + * storage key to struct page. We can usually skip this if the page is anon,
> + * so about to be freed; but perhaps not if it's in swapcache - there might
> + * be another pte slot containing the swap entry, but page not yet written to
> + * swap.
>    *
> - * The caller needs to hold the pte lock.
> + * set_page_dirty() is called while the page_mapcount is still postive and
> + * under the page lock to avoid races with the mapping being invalidated.
>    */
> -void page_remove_rmap(struct page *page)
> +static void propogate_storage_key(struct page *page, bool lock_required)

Do you mean "propAgate" ?

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH] s390: mm: rmap: Transfer storage key to struct page under the page lock
@ 2012-04-16 14:53   ` Rik van Riel
  0 siblings, 0 replies; 32+ messages in thread
From: Rik van Riel @ 2012-04-16 14:53 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Martin Schwidefsky, Heiko Carstens, Hugh Dickins, Linux-MM,
	Linux-S390, LKML

On 04/16/2012 10:14 AM, Mel Gorman wrote:
> This patch is horribly ugly and there has to be a better way of doing
> it. I'm looking for suggestions on what s390 can do here that is not
> painful or broken.

I'm hoping the S390 arch maintainers have an idea.

Ugly or not, we'll need something to fix the bug.

> + * When the late PTE has gone, s390 must transfer the dirty flag from the
> + * storage key to struct page. We can usually skip this if the page is anon,
> + * so about to be freed; but perhaps not if it's in swapcache - there might
> + * be another pte slot containing the swap entry, but page not yet written to
> + * swap.
>    *
> - * The caller needs to hold the pte lock.
> + * set_page_dirty() is called while the page_mapcount is still postive and
> + * under the page lock to avoid races with the mapping being invalidated.
>    */
> -void page_remove_rmap(struct page *page)
> +static void propogate_storage_key(struct page *page, bool lock_required)

Do you mean "propAgate" ?

-- 
All rights reversed

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH] s390: mm: rmap: Transfer storage key to struct page under the page lock
  2012-04-16 14:53   ` Rik van Riel
@ 2012-04-16 15:02     ` Mel Gorman
  -1 siblings, 0 replies; 32+ messages in thread
From: Mel Gorman @ 2012-04-16 15:02 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Martin Schwidefsky, Heiko Carstens, Hugh Dickins, Linux-MM,
	Linux-S390, LKML

On Mon, Apr 16, 2012 at 10:53:07AM -0400, Rik van Riel wrote:
> On 04/16/2012 10:14 AM, Mel Gorman wrote:
> >This patch is horribly ugly and there has to be a better way of doing
> >it. I'm looking for suggestions on what s390 can do here that is not
> >painful or broken.
> 
> I'm hoping the S390 arch maintainers have an idea.
> 
> Ugly or not, we'll need something to fix the bug.
> 

Indeed.

> >+ * When the late PTE has gone, s390 must transfer the dirty flag from the
> >+ * storage key to struct page. We can usually skip this if the page is anon,
> >+ * so about to be freed; but perhaps not if it's in swapcache - there might
> >+ * be another pte slot containing the swap entry, but page not yet written to
> >+ * swap.
> >   *
> >- * The caller needs to hold the pte lock.
> >+ * set_page_dirty() is called while the page_mapcount is still postive and
> >+ * under the page lock to avoid races with the mapping being invalidated.
> >   */
> >-void page_remove_rmap(struct page *page)
> >+static void propogate_storage_key(struct page *page, bool lock_required)
> 
> Do you mean "propAgate" ?
> 

Yes.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH] s390: mm: rmap: Transfer storage key to struct page under the page lock
@ 2012-04-16 15:02     ` Mel Gorman
  0 siblings, 0 replies; 32+ messages in thread
From: Mel Gorman @ 2012-04-16 15:02 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Martin Schwidefsky, Heiko Carstens, Hugh Dickins, Linux-MM,
	Linux-S390, LKML

On Mon, Apr 16, 2012 at 10:53:07AM -0400, Rik van Riel wrote:
> On 04/16/2012 10:14 AM, Mel Gorman wrote:
> >This patch is horribly ugly and there has to be a better way of doing
> >it. I'm looking for suggestions on what s390 can do here that is not
> >painful or broken.
> 
> I'm hoping the S390 arch maintainers have an idea.
> 
> Ugly or not, we'll need something to fix the bug.
> 

Indeed.

> >+ * When the late PTE has gone, s390 must transfer the dirty flag from the
> >+ * storage key to struct page. We can usually skip this if the page is anon,
> >+ * so about to be freed; but perhaps not if it's in swapcache - there might
> >+ * be another pte slot containing the swap entry, but page not yet written to
> >+ * swap.
> >   *
> >- * The caller needs to hold the pte lock.
> >+ * set_page_dirty() is called while the page_mapcount is still postive and
> >+ * under the page lock to avoid races with the mapping being invalidated.
> >   */
> >-void page_remove_rmap(struct page *page)
> >+static void propogate_storage_key(struct page *page, bool lock_required)
> 
> Do you mean "propAgate" ?
> 

Yes.

-- 
Mel Gorman
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH] s390: mm: rmap: Transfer storage key to struct page under the page lock
  2012-04-16 14:14 ` Mel Gorman
@ 2012-04-16 15:50   ` Martin Schwidefsky
  -1 siblings, 0 replies; 32+ messages in thread
From: Martin Schwidefsky @ 2012-04-16 15:50 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Heiko Carstens, Hugh Dickins, Rik van Riel, Linux-MM, Linux-S390, LKML

On Mon, 16 Apr 2012 15:14:23 +0100
Mel Gorman <mgorman@suse.de> wrote:

> This patch is horribly ugly and there has to be a better way of doing
> it. I'm looking for suggestions on what s390 can do here that is not
> painful or broken. 
> 
> However, s390 needs a better way of guarding against
> PageSwapCache pages being removed from the radix tree while set_page_dirty()
> is being called. The patch would be marginally better if in the PageSwapCache
> case we simply tried to lock once and in the contended case just fail to
> propogate the storage key. I lack familiarity with the s390 architecture
> to be certain if this is safe or not. Suggestions on a better fix?

One though that crossed my mind is that maybe a better approach would be
to move the page_test_and_clear_dirty check out of page_remove_rmap.
What we need to look out for are code sequences of the form:

	if (pte_dirty(pte))
		set_page_dirty(page);
	...
	page_remove_rmap(page);

There are four of those as far as I can see: in try_to_unmap_one,
try_to_unmap_cluster, zap_pte, and zap_pte_range.

A valid implementation for s390 would be to test and clear the changed
bit in the storage key for every of those pte_dirty() calls.

	if (pte_dirty(pte) || page_test_and_clear_dirty(page))
		set_page_dirty(page);
	...
	page_remove_rmap(page); /* w/o page_test_clear_dirty */

Trouble is that the ISKE and SSKE instructions are very expensive, that
is why we currently have the operation in page_remove_rmap after the
map counter dropped to zero (which is wrong as we now have learned the
hard way). The additional check for (!PageAnon || PageSwapCache) is
just another variation of avoiding ISKE/SSKE.

Thinking about a function like this:

static inline int page_test_dirty_eco(struct page *page)
{
	if (page_mapcount(page) > 1)
		return 0;
	if (PageAnon(page) && !PageSwapCache(page))
		return 0;
	return page_test_and_clear_dirty(page);
}

and use it alongside the pte_dirty() check. The worry I have is the
map counter. What guarantees us that the map counter is not decremented
concurrently? Which is probably a problem with the current patch as
well, checking atomic_add_negative(-1, &page->_mapcount) against zero
works, doing (page_mapcount(page) == 1) followed by the decrement 
can race. And we better not forget a dirty bit ..

-- 
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH] s390: mm: rmap: Transfer storage key to struct page under the page lock
@ 2012-04-16 15:50   ` Martin Schwidefsky
  0 siblings, 0 replies; 32+ messages in thread
From: Martin Schwidefsky @ 2012-04-16 15:50 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Heiko Carstens, Hugh Dickins, Rik van Riel, Linux-MM, Linux-S390, LKML

On Mon, 16 Apr 2012 15:14:23 +0100
Mel Gorman <mgorman@suse.de> wrote:

> This patch is horribly ugly and there has to be a better way of doing
> it. I'm looking for suggestions on what s390 can do here that is not
> painful or broken. 
> 
> However, s390 needs a better way of guarding against
> PageSwapCache pages being removed from the radix tree while set_page_dirty()
> is being called. The patch would be marginally better if in the PageSwapCache
> case we simply tried to lock once and in the contended case just fail to
> propogate the storage key. I lack familiarity with the s390 architecture
> to be certain if this is safe or not. Suggestions on a better fix?

One though that crossed my mind is that maybe a better approach would be
to move the page_test_and_clear_dirty check out of page_remove_rmap.
What we need to look out for are code sequences of the form:

	if (pte_dirty(pte))
		set_page_dirty(page);
	...
	page_remove_rmap(page);

There are four of those as far as I can see: in try_to_unmap_one,
try_to_unmap_cluster, zap_pte, and zap_pte_range.

A valid implementation for s390 would be to test and clear the changed
bit in the storage key for every of those pte_dirty() calls.

	if (pte_dirty(pte) || page_test_and_clear_dirty(page))
		set_page_dirty(page);
	...
	page_remove_rmap(page); /* w/o page_test_clear_dirty */

Trouble is that the ISKE and SSKE instructions are very expensive, that
is why we currently have the operation in page_remove_rmap after the
map counter dropped to zero (which is wrong as we now have learned the
hard way). The additional check for (!PageAnon || PageSwapCache) is
just another variation of avoiding ISKE/SSKE.

Thinking about a function like this:

static inline int page_test_dirty_eco(struct page *page)
{
	if (page_mapcount(page) > 1)
		return 0;
	if (PageAnon(page) && !PageSwapCache(page))
		return 0;
	return page_test_and_clear_dirty(page);
}

and use it alongside the pte_dirty() check. The worry I have is the
map counter. What guarantees us that the map counter is not decremented
concurrently? Which is probably a problem with the current patch as
well, checking atomic_add_negative(-1, &page->_mapcount) against zero
works, doing (page_mapcount(page) == 1) followed by the decrement 
can race. And we better not forget a dirty bit ..

-- 
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH] s390: mm: rmap: Transfer storage key to struct page under the page lock
  2012-04-16 14:14 ` Mel Gorman
@ 2012-04-16 21:14   ` Hugh Dickins
  -1 siblings, 0 replies; 32+ messages in thread
From: Hugh Dickins @ 2012-04-16 21:14 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Martin Schwidefsky, Heiko Carstens, Rik van Riel, Linux-MM,
	Linux-S390, LKML

On Mon, 16 Apr 2012, Mel Gorman wrote:

> This patch is horribly ugly and there has to be a better way of doing
> it. I'm looking for suggestions on what s390 can do here that is not
> painful or broken. 
> 
> The following bug was reported on s390
> 
> kernel BUG at
> /usr/src/packages/BUILD/kernel-default-3.0.13/linux-3.0/lib/radix-tree.c:477!
> illegal operation: 0001 [#1] SMP
> Modules linked in: ext2 dm_round_robin dm_multipath sg sd_mod crc_t10dif fuse
> loop dm_mod ipv6 qeth_l3 ipv6_lib zfcp scsi_transport_fc scsi_tgt qeth qdio
> ccwgroup ext3 jbd mbcache dasd_eckd_mod dasd_mod scsi_dh_rdac scsi_dh_emc
> scsi_dh_alua scsi_dh_hp_sw scsi_dh scsi_mod
> Supported: Yes
> CPU: 3 Not tainted 3.0.13-0.27-default #1
> Process kpartx_id (pid: 24381, task: 000000004d938138, ksp: 00000000733539f0)
> Krnl PSW : 0404000180000000 000000000037935e
> (radix_tree_tag_set+0xfe/0x118)
>            R:0 T:1 IO:0 EX:0 Key:0 M:1 W:0 P:0 AS:0 CC:0 PM:0 EA:3
> Krnl GPRS: 0000000000000002 0000000000000018 00000000004e82a8 0000000000000000
>            0000000000000007 0000000000000003 00000000004e82a8 00000000007280c0
>            0000000000000000 0000000000000000 00000000fac55d5f 0000000000000000
>            000003e000000006 0000000000529f88 0000000000000000 0000000073353ad0
> Krnl Code: 0000000000379352: a7cafffa           ahi     %r12,-6
>            0000000000379356: a7f4ffb5           brc     15,3792c0
>            000000000037935a: a7f40001           brc     15,37935c
>           >000000000037935e: a7f40000           brc     15,37935e
>            0000000000379362: b90200bb           ltgr    %r11,%r11
>            0000000000379366: a784fff0           brc     8,379346
>            000000000037936a: a7f4ffe1           brc     15,37932c
>            000000000037936e: a7f40001           brc     15,379370
> Call Trace:
> ([<00000000002080ae>] __set_page_dirty_nobuffers+0x10a/0x1d0)
>  [<0000000000234970>] page_remove_rmap+0x100/0x104
>  [<0000000000228608>] zap_pte_range+0x194/0x608
>  [<0000000000228caa>] unmap_page_range+0x22e/0x33c
>  [<0000000000228ec2>] unmap_vmas+0x10a/0x274
>  [<000000000022f2f6>] exit_mmap+0xd2/0x254
>  [<00000000001472d6>] mmput+0x5e/0x144
>  [<000000000014d306>] exit_mm+0x196/0x1d4
>  [<000000000014f650>] do_exit+0x18c/0x408
>  [<000000000014f92c>] do_group_exit+0x60/0xec
>  [<000000000014f9ea>] SyS_exit_group+0x32/0x40
>  [<00000000004e1660>] sysc_noemu+0x16/0x1c
>  [<000003fffd23ab96>] 0x3fffd23ab96
> Last Breaking-Event-Address:
>  [<000000000037935a>] radix_tree_tag_set+0xfa/0x118
> 
> While this bug was reproduced on a 3.0.x kernel, there is no reason why
> it should not happen in mainline.
> 
> The bug was triggered because a page had a valid mapping but by the time
> set_page_dirty() was called there was no valid entry in the radix tree.
> This was reproduced while underlying storage was unplugged but this may
> be indirectly related to the problem.
> 
> This bug only triggers on s390 and may be explained by a race. Normally
> when pages are being readahead in read_cache_pages(), an attempt is made
> to add the page to the page cache. If that fails, the page is invalidated
> by locking it, giving it a valid mapping, calling do_invalidatepage()
> and then setting mapping to NULL again. It's similar with page cache is
> being deleted. The page is locked, the tree lock is taken, it's removed
> from the radix tree and the mapping is set to NULL.
> 
> In many cases looking up the radix tree is protected by a combination of
> the page lock and the tree lock. When zapping pages, the page lock is not
> taken which does not matter as most architectures do nothing special with
> the mapping and for file-backed pages, the mapping should be pinned by
> the open file. However, s390 also calls set_dirty_page() for
> PageSwapCache() pages where it is potentially racing against
> reuse_swap_page() for example.
> 
> This patch splits the propagation of the storage key into a separate
> function and introduces a new variant of page_remove_rmap() called
> page_remove_rmap_nolock() which is only used during page table zapping
> that attempts to acquire the page lock. The approach is ugly although it
> works in that the bug can no longer be triggered. At the time the page
> lock is required, the PTL is held so it cannot sleep so it busy waits on
> trylock_page(). That potentially deadlocks against reclaim which holds
> the page table lock and is trying to acquire the pagetable lock so in
> some cases there is no choice except to race. In the file-backed case,
> this is ok because the address space will be valid (zap_pte_range() makes
> the same assumption. However, s390 needs a better way of guarding against
> PageSwapCache pages being removed from the radix tree while set_page_dirty()
> is being called. The patch would be marginally better if in the PageSwapCache
> case we simply tried to lock once and in the contended case just fail to
> propogate the storage key. I lack familiarity with the s390 architecture
> to be certain if this is safe or not. Suggestions on a better fix?
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>

I'm very confused - and wonder if you are too ;)

Please forgive me if I've not been patient enough in reading through
all you've written, and knocking it into my skull: ask me to go back
and read again.

But it seems to me that you're approaching this from the wrong end
(page_remove_rmap), instead of from the end that's giving rise to the
problem - perhaps because you've not yet identified that other end?

I'm confused as to whether you see this problem with file pages,
or with anon-swap-cache pages, or with both, or not yet determined.

The obvious places which set page->mapping = NULL or ClearPageSwapCache,
once they've been visible to the outside world, are careful to do so at
the same time as removing the radix_tree entry, holding mapping->tree_lock.

And __set_page_dirty_nobuffers(), the set_page_dirty variant implicated
in the trace above, is careful to recheck mapping under tree_lock, as I
think are the others.  There is an issue if mapping itself could vanish
while in there, but that's not a danger in page_remove_rmap().

(You do remind me that I meant years ago to switch swapper_space over
to the much simpler __set_page_dirty_no_writeback(), which shmem has
used for ages; but as far as this problem goes, that would probably
be at best a workaround, rather than the proper fix.)

You indicate read_cache_pages_invalidate_page() above, and yes that's
weird the way it sets and unsets page->mapping; I never came across it
before, but IIUC it's safe because those pages have not yet been made
visible to the outside world.  (Or do we have a speculative pagecache
issue here?)

You compare do_invalidatepage() with deletion from page cache, but I hope
that's mistaken: aren't such pages usually being invalidated precisely
because they raced with pages found already there in the page cache,
which need to be left in place?

So, is there somewhere which is setting page->mapping = NULL, or
doing ClearPageSwapCache, in a separate block from removing that
page's entry from the radix_tree?  If there is, then I think it
could pose a problem on more than s390.

Hmm, mm/migrate.c.

Hugh

> ---
>  include/linux/rmap.h |    1 +
>  mm/memory.c          |    2 +-
>  mm/rmap.c            |  109 +++++++++++++++++++++++++++++++++++++++++++-------
>  3 files changed, 96 insertions(+), 16 deletions(-)
> 
> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> index 2148b12..59146af 100644
> --- a/include/linux/rmap.h
> +++ b/include/linux/rmap.h
> @@ -142,6 +142,7 @@ void do_page_add_anon_rmap(struct page *, struct vm_area_struct *,
>  void page_add_new_anon_rmap(struct page *, struct vm_area_struct *, unsigned long);
>  void page_add_file_rmap(struct page *);
>  void page_remove_rmap(struct page *);
> +void page_remove_rmap_nolock(struct page *);
>  
>  void hugepage_add_anon_rmap(struct page *, struct vm_area_struct *,
>  			    unsigned long);
> diff --git a/mm/memory.c b/mm/memory.c
> index f1d788e..4f1bf96 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1184,7 +1184,7 @@ again:
>  					mark_page_accessed(page);
>  				rss[MM_FILEPAGES]--;
>  			}
> -			page_remove_rmap(page);
> +			page_remove_rmap_nolock(page);
>  			if (unlikely(page_mapcount(page) < 0))
>  				print_bad_pte(vma, addr, ptent, page);
>  			force_flush = !__tlb_remove_page(tlb, page);
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 6546da9..9d2279b 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1114,29 +1114,81 @@ void page_add_file_rmap(struct page *page)
>  	}
>  }
>  
> -/**
> - * page_remove_rmap - take down pte mapping from a page
> - * @page: page to remove mapping from
> +/*
> + * When the late PTE has gone, s390 must transfer the dirty flag from the
> + * storage key to struct page. We can usually skip this if the page is anon,
> + * so about to be freed; but perhaps not if it's in swapcache - there might
> + * be another pte slot containing the swap entry, but page not yet written to
> + * swap.
>   *
> - * The caller needs to hold the pte lock.
> + * set_page_dirty() is called while the page_mapcount is still postive and
> + * under the page lock to avoid races with the mapping being invalidated.
>   */
> -void page_remove_rmap(struct page *page)
> +static void propogate_storage_key(struct page *page, bool lock_required)
> +{
> +	if (page_mapcount(page) == 1 &&
> +			(!PageAnon(page) || PageSwapCache(page)) &&
> +	    		page_test_and_clear_dirty(page_to_pfn(page), 1)) {
> +		if (lock_required) {
> +			bool locked;
> +
> +			/*
> +			 * This is called from zap_pte_range which holds the
> +			 * PTL. The page lock is normally needed to avoid
> +			 * truncation races and cannot sleep as a result.
> +			 * During zap_pte_range, it is assumed that the open
> +			 * file pins the mapping and a check is made under the
> +			 * tree_lock. The same does not hold true for SwapCache
> +			 * pages because it may be getting reused.
> +			 *
> +			 * Ideally we would take the page lock but in this
> +			 * context the PTL is held so we can't sleep. We
> +			 * are also potentially contending with processes
> +			 * in reclaim context that hold the page lock but
> +			 * are trying to acquire the PTL which leads to
> +			 * an ABBA deadlock.
> +			 *
> +			 * Hence this resulting mess. s390 needs a better
> +			 * way to guard against races. Suggestions?
> +			 */
> +			while ((locked = trylock_page(page)) == false) {
> +				cpu_relax();
> +
> +				if (need_resched())
> +					break;
> +			}
> +
> +			/* If the page is locked, it's safe to call set_page_dirty */
> +			if (locked) {
> +				set_page_dirty(page);
> +				unlock_page(page);
> +			} else {
> +
> +				/*
> +				 * For swap cache pages, we have no real choice
> +				 * except to lose the storage key information
> +				 * and expect the new user to preserve dirty
> +				 * information.
> +				 *
> +				 * For file-backed pages, the open file should
> +				 * be enough to pin the mapping similar which
> +				 * is also assumed by zap_pte_range().
> +				 */
> +				if (WARN_ON_ONCE(!PageSwapCache(page)))
> +					set_page_dirty(page);
> +			}
> +		} else
> +			set_page_dirty(page);
> +	}
> +}
> +
> +static void __page_remove_rmap(struct page *page)
>  {
>  	/* page still mapped by someone else? */
>  	if (!atomic_add_negative(-1, &page->_mapcount))
>  		return;
>  
>  	/*
> -	 * Now that the last pte has gone, s390 must transfer dirty
> -	 * flag from storage key to struct page.  We can usually skip
> -	 * this if the page is anon, so about to be freed; but perhaps
> -	 * not if it's in swapcache - there might be another pte slot
> -	 * containing the swap entry, but page not yet written to swap.
> -	 */
> -	if ((!PageAnon(page) || PageSwapCache(page)) &&
> -	    page_test_and_clear_dirty(page_to_pfn(page), 1))
> -		set_page_dirty(page);
> -	/*
>  	 * Hugepages are not counted in NR_ANON_PAGES nor NR_FILE_MAPPED
>  	 * and not charged by memcg for now.
>  	 */
> @@ -1164,6 +1216,33 @@ void page_remove_rmap(struct page *page)
>  	 */
>  }
>  
> +/**
> + * page_remove_rmap - take down pte mapping from an unlocked page
> + * @page: page to remove mapping from
> + *
> + * The caller needs to hold the pte lock.
> + */
> +void page_remove_rmap(struct page *page)
> +{
> +	propogate_storage_key(page, false);
> +	__page_remove_rmap(page);
> +}
> +
> +/**
> + * page_remove_rmap_nolock - take down pte mapping where the caller does not have the mapping pinned
> + * @page: page to remove mapping from
> + *
> + * The caller needs to hold the PTE lock and is called from a context where
> + * the page is neither locked nor the mapping->host pinned. On s390 in this
> + * case the page lock will be taken to pin the mapping if the page needs to
> + * be set dirty.
> + */
> +void page_remove_rmap_nolock(struct page *page)
> +{
> +	propogate_storage_key(page, true);
> +	__page_remove_rmap(page);
> +}
> +
>  /*
>   * Subfunctions of try_to_unmap: try_to_unmap_one called
>   * repeatedly from either try_to_unmap_anon or try_to_unmap_file.
> 

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH] s390: mm: rmap: Transfer storage key to struct page under the page lock
@ 2012-04-16 21:14   ` Hugh Dickins
  0 siblings, 0 replies; 32+ messages in thread
From: Hugh Dickins @ 2012-04-16 21:14 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Martin Schwidefsky, Heiko Carstens, Rik van Riel, Linux-MM,
	Linux-S390, LKML

On Mon, 16 Apr 2012, Mel Gorman wrote:

> This patch is horribly ugly and there has to be a better way of doing
> it. I'm looking for suggestions on what s390 can do here that is not
> painful or broken. 
> 
> The following bug was reported on s390
> 
> kernel BUG at
> /usr/src/packages/BUILD/kernel-default-3.0.13/linux-3.0/lib/radix-tree.c:477!
> illegal operation: 0001 [#1] SMP
> Modules linked in: ext2 dm_round_robin dm_multipath sg sd_mod crc_t10dif fuse
> loop dm_mod ipv6 qeth_l3 ipv6_lib zfcp scsi_transport_fc scsi_tgt qeth qdio
> ccwgroup ext3 jbd mbcache dasd_eckd_mod dasd_mod scsi_dh_rdac scsi_dh_emc
> scsi_dh_alua scsi_dh_hp_sw scsi_dh scsi_mod
> Supported: Yes
> CPU: 3 Not tainted 3.0.13-0.27-default #1
> Process kpartx_id (pid: 24381, task: 000000004d938138, ksp: 00000000733539f0)
> Krnl PSW : 0404000180000000 000000000037935e
> (radix_tree_tag_set+0xfe/0x118)
>            R:0 T:1 IO:0 EX:0 Key:0 M:1 W:0 P:0 AS:0 CC:0 PM:0 EA:3
> Krnl GPRS: 0000000000000002 0000000000000018 00000000004e82a8 0000000000000000
>            0000000000000007 0000000000000003 00000000004e82a8 00000000007280c0
>            0000000000000000 0000000000000000 00000000fac55d5f 0000000000000000
>            000003e000000006 0000000000529f88 0000000000000000 0000000073353ad0
> Krnl Code: 0000000000379352: a7cafffa           ahi     %r12,-6
>            0000000000379356: a7f4ffb5           brc     15,3792c0
>            000000000037935a: a7f40001           brc     15,37935c
>           >000000000037935e: a7f40000           brc     15,37935e
>            0000000000379362: b90200bb           ltgr    %r11,%r11
>            0000000000379366: a784fff0           brc     8,379346
>            000000000037936a: a7f4ffe1           brc     15,37932c
>            000000000037936e: a7f40001           brc     15,379370
> Call Trace:
> ([<00000000002080ae>] __set_page_dirty_nobuffers+0x10a/0x1d0)
>  [<0000000000234970>] page_remove_rmap+0x100/0x104
>  [<0000000000228608>] zap_pte_range+0x194/0x608
>  [<0000000000228caa>] unmap_page_range+0x22e/0x33c
>  [<0000000000228ec2>] unmap_vmas+0x10a/0x274
>  [<000000000022f2f6>] exit_mmap+0xd2/0x254
>  [<00000000001472d6>] mmput+0x5e/0x144
>  [<000000000014d306>] exit_mm+0x196/0x1d4
>  [<000000000014f650>] do_exit+0x18c/0x408
>  [<000000000014f92c>] do_group_exit+0x60/0xec
>  [<000000000014f9ea>] SyS_exit_group+0x32/0x40
>  [<00000000004e1660>] sysc_noemu+0x16/0x1c
>  [<000003fffd23ab96>] 0x3fffd23ab96
> Last Breaking-Event-Address:
>  [<000000000037935a>] radix_tree_tag_set+0xfa/0x118
> 
> While this bug was reproduced on a 3.0.x kernel, there is no reason why
> it should not happen in mainline.
> 
> The bug was triggered because a page had a valid mapping but by the time
> set_page_dirty() was called there was no valid entry in the radix tree.
> This was reproduced while underlying storage was unplugged but this may
> be indirectly related to the problem.
> 
> This bug only triggers on s390 and may be explained by a race. Normally
> when pages are being readahead in read_cache_pages(), an attempt is made
> to add the page to the page cache. If that fails, the page is invalidated
> by locking it, giving it a valid mapping, calling do_invalidatepage()
> and then setting mapping to NULL again. It's similar with page cache is
> being deleted. The page is locked, the tree lock is taken, it's removed
> from the radix tree and the mapping is set to NULL.
> 
> In many cases looking up the radix tree is protected by a combination of
> the page lock and the tree lock. When zapping pages, the page lock is not
> taken which does not matter as most architectures do nothing special with
> the mapping and for file-backed pages, the mapping should be pinned by
> the open file. However, s390 also calls set_dirty_page() for
> PageSwapCache() pages where it is potentially racing against
> reuse_swap_page() for example.
> 
> This patch splits the propagation of the storage key into a separate
> function and introduces a new variant of page_remove_rmap() called
> page_remove_rmap_nolock() which is only used during page table zapping
> that attempts to acquire the page lock. The approach is ugly although it
> works in that the bug can no longer be triggered. At the time the page
> lock is required, the PTL is held so it cannot sleep so it busy waits on
> trylock_page(). That potentially deadlocks against reclaim which holds
> the page table lock and is trying to acquire the pagetable lock so in
> some cases there is no choice except to race. In the file-backed case,
> this is ok because the address space will be valid (zap_pte_range() makes
> the same assumption. However, s390 needs a better way of guarding against
> PageSwapCache pages being removed from the radix tree while set_page_dirty()
> is being called. The patch would be marginally better if in the PageSwapCache
> case we simply tried to lock once and in the contended case just fail to
> propogate the storage key. I lack familiarity with the s390 architecture
> to be certain if this is safe or not. Suggestions on a better fix?
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>

I'm very confused - and wonder if you are too ;)

Please forgive me if I've not been patient enough in reading through
all you've written, and knocking it into my skull: ask me to go back
and read again.

But it seems to me that you're approaching this from the wrong end
(page_remove_rmap), instead of from the end that's giving rise to the
problem - perhaps because you've not yet identified that other end?

I'm confused as to whether you see this problem with file pages,
or with anon-swap-cache pages, or with both, or not yet determined.

The obvious places which set page->mapping = NULL or ClearPageSwapCache,
once they've been visible to the outside world, are careful to do so at
the same time as removing the radix_tree entry, holding mapping->tree_lock.

And __set_page_dirty_nobuffers(), the set_page_dirty variant implicated
in the trace above, is careful to recheck mapping under tree_lock, as I
think are the others.  There is an issue if mapping itself could vanish
while in there, but that's not a danger in page_remove_rmap().

(You do remind me that I meant years ago to switch swapper_space over
to the much simpler __set_page_dirty_no_writeback(), which shmem has
used for ages; but as far as this problem goes, that would probably
be at best a workaround, rather than the proper fix.)

You indicate read_cache_pages_invalidate_page() above, and yes that's
weird the way it sets and unsets page->mapping; I never came across it
before, but IIUC it's safe because those pages have not yet been made
visible to the outside world.  (Or do we have a speculative pagecache
issue here?)

You compare do_invalidatepage() with deletion from page cache, but I hope
that's mistaken: aren't such pages usually being invalidated precisely
because they raced with pages found already there in the page cache,
which need to be left in place?

So, is there somewhere which is setting page->mapping = NULL, or
doing ClearPageSwapCache, in a separate block from removing that
page's entry from the radix_tree?  If there is, then I think it
could pose a problem on more than s390.

Hmm, mm/migrate.c.

Hugh

> ---
>  include/linux/rmap.h |    1 +
>  mm/memory.c          |    2 +-
>  mm/rmap.c            |  109 +++++++++++++++++++++++++++++++++++++++++++-------
>  3 files changed, 96 insertions(+), 16 deletions(-)
> 
> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> index 2148b12..59146af 100644
> --- a/include/linux/rmap.h
> +++ b/include/linux/rmap.h
> @@ -142,6 +142,7 @@ void do_page_add_anon_rmap(struct page *, struct vm_area_struct *,
>  void page_add_new_anon_rmap(struct page *, struct vm_area_struct *, unsigned long);
>  void page_add_file_rmap(struct page *);
>  void page_remove_rmap(struct page *);
> +void page_remove_rmap_nolock(struct page *);
>  
>  void hugepage_add_anon_rmap(struct page *, struct vm_area_struct *,
>  			    unsigned long);
> diff --git a/mm/memory.c b/mm/memory.c
> index f1d788e..4f1bf96 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1184,7 +1184,7 @@ again:
>  					mark_page_accessed(page);
>  				rss[MM_FILEPAGES]--;
>  			}
> -			page_remove_rmap(page);
> +			page_remove_rmap_nolock(page);
>  			if (unlikely(page_mapcount(page) < 0))
>  				print_bad_pte(vma, addr, ptent, page);
>  			force_flush = !__tlb_remove_page(tlb, page);
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 6546da9..9d2279b 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1114,29 +1114,81 @@ void page_add_file_rmap(struct page *page)
>  	}
>  }
>  
> -/**
> - * page_remove_rmap - take down pte mapping from a page
> - * @page: page to remove mapping from
> +/*
> + * When the late PTE has gone, s390 must transfer the dirty flag from the
> + * storage key to struct page. We can usually skip this if the page is anon,
> + * so about to be freed; but perhaps not if it's in swapcache - there might
> + * be another pte slot containing the swap entry, but page not yet written to
> + * swap.
>   *
> - * The caller needs to hold the pte lock.
> + * set_page_dirty() is called while the page_mapcount is still postive and
> + * under the page lock to avoid races with the mapping being invalidated.
>   */
> -void page_remove_rmap(struct page *page)
> +static void propogate_storage_key(struct page *page, bool lock_required)
> +{
> +	if (page_mapcount(page) == 1 &&
> +			(!PageAnon(page) || PageSwapCache(page)) &&
> +	    		page_test_and_clear_dirty(page_to_pfn(page), 1)) {
> +		if (lock_required) {
> +			bool locked;
> +
> +			/*
> +			 * This is called from zap_pte_range which holds the
> +			 * PTL. The page lock is normally needed to avoid
> +			 * truncation races and cannot sleep as a result.
> +			 * During zap_pte_range, it is assumed that the open
> +			 * file pins the mapping and a check is made under the
> +			 * tree_lock. The same does not hold true for SwapCache
> +			 * pages because it may be getting reused.
> +			 *
> +			 * Ideally we would take the page lock but in this
> +			 * context the PTL is held so we can't sleep. We
> +			 * are also potentially contending with processes
> +			 * in reclaim context that hold the page lock but
> +			 * are trying to acquire the PTL which leads to
> +			 * an ABBA deadlock.
> +			 *
> +			 * Hence this resulting mess. s390 needs a better
> +			 * way to guard against races. Suggestions?
> +			 */
> +			while ((locked = trylock_page(page)) == false) {
> +				cpu_relax();
> +
> +				if (need_resched())
> +					break;
> +			}
> +
> +			/* If the page is locked, it's safe to call set_page_dirty */
> +			if (locked) {
> +				set_page_dirty(page);
> +				unlock_page(page);
> +			} else {
> +
> +				/*
> +				 * For swap cache pages, we have no real choice
> +				 * except to lose the storage key information
> +				 * and expect the new user to preserve dirty
> +				 * information.
> +				 *
> +				 * For file-backed pages, the open file should
> +				 * be enough to pin the mapping similar which
> +				 * is also assumed by zap_pte_range().
> +				 */
> +				if (WARN_ON_ONCE(!PageSwapCache(page)))
> +					set_page_dirty(page);
> +			}
> +		} else
> +			set_page_dirty(page);
> +	}
> +}
> +
> +static void __page_remove_rmap(struct page *page)
>  {
>  	/* page still mapped by someone else? */
>  	if (!atomic_add_negative(-1, &page->_mapcount))
>  		return;
>  
>  	/*
> -	 * Now that the last pte has gone, s390 must transfer dirty
> -	 * flag from storage key to struct page.  We can usually skip
> -	 * this if the page is anon, so about to be freed; but perhaps
> -	 * not if it's in swapcache - there might be another pte slot
> -	 * containing the swap entry, but page not yet written to swap.
> -	 */
> -	if ((!PageAnon(page) || PageSwapCache(page)) &&
> -	    page_test_and_clear_dirty(page_to_pfn(page), 1))
> -		set_page_dirty(page);
> -	/*
>  	 * Hugepages are not counted in NR_ANON_PAGES nor NR_FILE_MAPPED
>  	 * and not charged by memcg for now.
>  	 */
> @@ -1164,6 +1216,33 @@ void page_remove_rmap(struct page *page)
>  	 */
>  }
>  
> +/**
> + * page_remove_rmap - take down pte mapping from an unlocked page
> + * @page: page to remove mapping from
> + *
> + * The caller needs to hold the pte lock.
> + */
> +void page_remove_rmap(struct page *page)
> +{
> +	propogate_storage_key(page, false);
> +	__page_remove_rmap(page);
> +}
> +
> +/**
> + * page_remove_rmap_nolock - take down pte mapping where the caller does not have the mapping pinned
> + * @page: page to remove mapping from
> + *
> + * The caller needs to hold the PTE lock and is called from a context where
> + * the page is neither locked nor the mapping->host pinned. On s390 in this
> + * case the page lock will be taken to pin the mapping if the page needs to
> + * be set dirty.
> + */
> +void page_remove_rmap_nolock(struct page *page)
> +{
> +	propogate_storage_key(page, true);
> +	__page_remove_rmap(page);
> +}
> +
>  /*
>   * Subfunctions of try_to_unmap: try_to_unmap_one called
>   * repeatedly from either try_to_unmap_anon or try_to_unmap_file.
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH] s390: mm: rmap: Transfer storage key to struct page under the page lock
  2012-04-16 21:14   ` Hugh Dickins
@ 2012-04-17 12:22     ` Mel Gorman
  -1 siblings, 0 replies; 32+ messages in thread
From: Mel Gorman @ 2012-04-17 12:22 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Martin Schwidefsky, Heiko Carstens, Rik van Riel, Linux-MM,
	Linux-S390, LKML

On Mon, Apr 16, 2012 at 02:14:09PM -0700, Hugh Dickins wrote:
> On Mon, 16 Apr 2012, Mel Gorman wrote:
> 
> > This patch is horribly ugly and there has to be a better way of doing
> > it. I'm looking for suggestions on what s390 can do here that is not
> > painful or broken. 
> > 
> > The following bug was reported on s390
> > 
> > kernel BUG at
> > /usr/src/packages/BUILD/kernel-default-3.0.13/linux-3.0/lib/radix-tree.c:477!
> > illegal operation: 0001 [#1] SMP
> > Modules linked in: ext2 dm_round_robin dm_multipath sg sd_mod crc_t10dif fuse
> > loop dm_mod ipv6 qeth_l3 ipv6_lib zfcp scsi_transport_fc scsi_tgt qeth qdio
> > ccwgroup ext3 jbd mbcache dasd_eckd_mod dasd_mod scsi_dh_rdac scsi_dh_emc
> > scsi_dh_alua scsi_dh_hp_sw scsi_dh scsi_mod
> > Supported: Yes
> > CPU: 3 Not tainted 3.0.13-0.27-default #1
> > Process kpartx_id (pid: 24381, task: 000000004d938138, ksp: 00000000733539f0)
> > Krnl PSW : 0404000180000000 000000000037935e
> > (radix_tree_tag_set+0xfe/0x118)
> >            R:0 T:1 IO:0 EX:0 Key:0 M:1 W:0 P:0 AS:0 CC:0 PM:0 EA:3
> > Krnl GPRS: 0000000000000002 0000000000000018 00000000004e82a8 0000000000000000
> >            0000000000000007 0000000000000003 00000000004e82a8 00000000007280c0
> >            0000000000000000 0000000000000000 00000000fac55d5f 0000000000000000
> >            000003e000000006 0000000000529f88 0000000000000000 0000000073353ad0
> > Krnl Code: 0000000000379352: a7cafffa           ahi     %r12,-6
> >            0000000000379356: a7f4ffb5           brc     15,3792c0
> >            000000000037935a: a7f40001           brc     15,37935c
> >           >000000000037935e: a7f40000           brc     15,37935e
> >            0000000000379362: b90200bb           ltgr    %r11,%r11
> >            0000000000379366: a784fff0           brc     8,379346
> >            000000000037936a: a7f4ffe1           brc     15,37932c
> >            000000000037936e: a7f40001           brc     15,379370
> > Call Trace:
> > ([<00000000002080ae>] __set_page_dirty_nobuffers+0x10a/0x1d0)
> >  [<0000000000234970>] page_remove_rmap+0x100/0x104
> >  [<0000000000228608>] zap_pte_range+0x194/0x608
> >  [<0000000000228caa>] unmap_page_range+0x22e/0x33c
> >  [<0000000000228ec2>] unmap_vmas+0x10a/0x274
> >  [<000000000022f2f6>] exit_mmap+0xd2/0x254
> >  [<00000000001472d6>] mmput+0x5e/0x144
> >  [<000000000014d306>] exit_mm+0x196/0x1d4
> >  [<000000000014f650>] do_exit+0x18c/0x408
> >  [<000000000014f92c>] do_group_exit+0x60/0xec
> >  [<000000000014f9ea>] SyS_exit_group+0x32/0x40
> >  [<00000000004e1660>] sysc_noemu+0x16/0x1c
> >  [<000003fffd23ab96>] 0x3fffd23ab96
> > Last Breaking-Event-Address:
> >  [<000000000037935a>] radix_tree_tag_set+0xfa/0x118
> > 
> > While this bug was reproduced on a 3.0.x kernel, there is no reason why
> > it should not happen in mainline.
> > 
> > The bug was triggered because a page had a valid mapping but by the time
> > set_page_dirty() was called there was no valid entry in the radix tree.
> > This was reproduced while underlying storage was unplugged but this may
> > be indirectly related to the problem.
> > 
> > This bug only triggers on s390 and may be explained by a race. Normally
> > when pages are being readahead in read_cache_pages(), an attempt is made
> > to add the page to the page cache. If that fails, the page is invalidated
> > by locking it, giving it a valid mapping, calling do_invalidatepage()
> > and then setting mapping to NULL again. It's similar with page cache is
> > being deleted. The page is locked, the tree lock is taken, it's removed
> > from the radix tree and the mapping is set to NULL.
> > 
> > In many cases looking up the radix tree is protected by a combination of
> > the page lock and the tree lock. When zapping pages, the page lock is not
> > taken which does not matter as most architectures do nothing special with
> > the mapping and for file-backed pages, the mapping should be pinned by
> > the open file. However, s390 also calls set_dirty_page() for
> > PageSwapCache() pages where it is potentially racing against
> > reuse_swap_page() for example.
> > 
> > This patch splits the propagation of the storage key into a separate
> > function and introduces a new variant of page_remove_rmap() called
> > page_remove_rmap_nolock() which is only used during page table zapping
> > that attempts to acquire the page lock. The approach is ugly although it
> > works in that the bug can no longer be triggered. At the time the page
> > lock is required, the PTL is held so it cannot sleep so it busy waits on
> > trylock_page(). That potentially deadlocks against reclaim which holds
> > the page table lock and is trying to acquire the pagetable lock so in
> > some cases there is no choice except to race. In the file-backed case,
> > this is ok because the address space will be valid (zap_pte_range() makes
> > the same assumption. However, s390 needs a better way of guarding against
> > PageSwapCache pages being removed from the radix tree while set_page_dirty()
> > is being called. The patch would be marginally better if in the PageSwapCache
> > case we simply tried to lock once and in the contended case just fail to
> > propogate the storage key. I lack familiarity with the s390 architecture
> > to be certain if this is safe or not. Suggestions on a better fix?
> > 
> > Signed-off-by: Mel Gorman <mgorman@suse.de>
> 
> I'm very confused - and wonder if you are too ;)
> 

It would not be the first time :)

> Please forgive me if I've not been patient enough in reading through
> all you've written, and knocking it into my skull: ask me to go back
> and read again.
> 
> But it seems to me that you're approaching this from the wrong end
> (page_remove_rmap), instead of from the end that's giving rise to the
> problem - perhaps because you've not yet identified that other end?
> 

Not as precisely as I would have like. I do not have access to the machine
in question unfortunately.

> I'm confused as to whether you see this problem with file pages,
> or with anon-swap-cache pages, or with both, or not yet determined.
> 

PageSwapCache pages only.

> The obvious places which set page->mapping = NULL or ClearPageSwapCache,
> once they've been visible to the outside world, are careful to do so at
> the same time as removing the radix_tree entry, holding mapping->tree_lock.
> 

The problem here is not that page->mapping is NULL but that the page->mapping
is valid while the page is not in the radix tree.

> And __set_page_dirty_nobuffers(), the set_page_dirty variant implicated
> in the trace above, is careful to recheck mapping under tree_lock, as I
> think are the others.  There is an issue if mapping itself could vanish
> while in there, but that's not a danger in page_remove_rmap().
> 
> (You do remind me that I meant years ago to switch swapper_space over
> to the much simpler __set_page_dirty_no_writeback(), which shmem has
> used for ages; but as far as this problem goes, that would probably
> be at best a workaround, rather than the proper fix.)
> 

It would be a workaround. If in the future we wanted to treat swapper
space more like a normal file inode and writeback dirty pages from
the flusher thread then this bug would just pop its head back up.

> You indicate read_cache_pages_invalidate_page() above, and yes that's
> weird the way it sets and unsets page->mapping; I never came across it
> before, but IIUC it's safe because those pages have not yet been made
> visible to the outside world.  (Or do we have a speculative pagecache
> issue here?)
> 

We do not have a speculative pagecache issue here that I'm aware of. 

> You compare do_invalidatepage() with deletion from page cache, but I hope
> that's mistaken: aren't such pages usually being invalidated precisely
> because they raced with pages found already there in the page cache,
> which need to be left in place?
> 

True. What I was trying to do was point out that we usually handle
pages being removed from the page cache with a combination of the page
lock and tree lock. s390 does not have the same type of protection in
page_remove_rmap() when handling PageSwapCache pages.

> So, is there somewhere which is setting page->mapping = NULL, or
> doing ClearPageSwapCache, in a separate block from removing that
> page's entry from the radix_tree?  If there is, then I think it
> could pose a problem on more than s390.
> 
> Hmm, mm/migrate.c.
> 

Migration moves the page mapping under the tree lock so
__set_page_dirty_nobuffers() I don't think that is it.

I think the race is against something like reuse_swap_page() which locks
the page and removes it from swap cache while page_remove_rmap() looks
up the same page.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH] s390: mm: rmap: Transfer storage key to struct page under the page lock
@ 2012-04-17 12:22     ` Mel Gorman
  0 siblings, 0 replies; 32+ messages in thread
From: Mel Gorman @ 2012-04-17 12:22 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Martin Schwidefsky, Heiko Carstens, Rik van Riel, Linux-MM,
	Linux-S390, LKML

On Mon, Apr 16, 2012 at 02:14:09PM -0700, Hugh Dickins wrote:
> On Mon, 16 Apr 2012, Mel Gorman wrote:
> 
> > This patch is horribly ugly and there has to be a better way of doing
> > it. I'm looking for suggestions on what s390 can do here that is not
> > painful or broken. 
> > 
> > The following bug was reported on s390
> > 
> > kernel BUG at
> > /usr/src/packages/BUILD/kernel-default-3.0.13/linux-3.0/lib/radix-tree.c:477!
> > illegal operation: 0001 [#1] SMP
> > Modules linked in: ext2 dm_round_robin dm_multipath sg sd_mod crc_t10dif fuse
> > loop dm_mod ipv6 qeth_l3 ipv6_lib zfcp scsi_transport_fc scsi_tgt qeth qdio
> > ccwgroup ext3 jbd mbcache dasd_eckd_mod dasd_mod scsi_dh_rdac scsi_dh_emc
> > scsi_dh_alua scsi_dh_hp_sw scsi_dh scsi_mod
> > Supported: Yes
> > CPU: 3 Not tainted 3.0.13-0.27-default #1
> > Process kpartx_id (pid: 24381, task: 000000004d938138, ksp: 00000000733539f0)
> > Krnl PSW : 0404000180000000 000000000037935e
> > (radix_tree_tag_set+0xfe/0x118)
> >            R:0 T:1 IO:0 EX:0 Key:0 M:1 W:0 P:0 AS:0 CC:0 PM:0 EA:3
> > Krnl GPRS: 0000000000000002 0000000000000018 00000000004e82a8 0000000000000000
> >            0000000000000007 0000000000000003 00000000004e82a8 00000000007280c0
> >            0000000000000000 0000000000000000 00000000fac55d5f 0000000000000000
> >            000003e000000006 0000000000529f88 0000000000000000 0000000073353ad0
> > Krnl Code: 0000000000379352: a7cafffa           ahi     %r12,-6
> >            0000000000379356: a7f4ffb5           brc     15,3792c0
> >            000000000037935a: a7f40001           brc     15,37935c
> >           >000000000037935e: a7f40000           brc     15,37935e
> >            0000000000379362: b90200bb           ltgr    %r11,%r11
> >            0000000000379366: a784fff0           brc     8,379346
> >            000000000037936a: a7f4ffe1           brc     15,37932c
> >            000000000037936e: a7f40001           brc     15,379370
> > Call Trace:
> > ([<00000000002080ae>] __set_page_dirty_nobuffers+0x10a/0x1d0)
> >  [<0000000000234970>] page_remove_rmap+0x100/0x104
> >  [<0000000000228608>] zap_pte_range+0x194/0x608
> >  [<0000000000228caa>] unmap_page_range+0x22e/0x33c
> >  [<0000000000228ec2>] unmap_vmas+0x10a/0x274
> >  [<000000000022f2f6>] exit_mmap+0xd2/0x254
> >  [<00000000001472d6>] mmput+0x5e/0x144
> >  [<000000000014d306>] exit_mm+0x196/0x1d4
> >  [<000000000014f650>] do_exit+0x18c/0x408
> >  [<000000000014f92c>] do_group_exit+0x60/0xec
> >  [<000000000014f9ea>] SyS_exit_group+0x32/0x40
> >  [<00000000004e1660>] sysc_noemu+0x16/0x1c
> >  [<000003fffd23ab96>] 0x3fffd23ab96
> > Last Breaking-Event-Address:
> >  [<000000000037935a>] radix_tree_tag_set+0xfa/0x118
> > 
> > While this bug was reproduced on a 3.0.x kernel, there is no reason why
> > it should not happen in mainline.
> > 
> > The bug was triggered because a page had a valid mapping but by the time
> > set_page_dirty() was called there was no valid entry in the radix tree.
> > This was reproduced while underlying storage was unplugged but this may
> > be indirectly related to the problem.
> > 
> > This bug only triggers on s390 and may be explained by a race. Normally
> > when pages are being readahead in read_cache_pages(), an attempt is made
> > to add the page to the page cache. If that fails, the page is invalidated
> > by locking it, giving it a valid mapping, calling do_invalidatepage()
> > and then setting mapping to NULL again. It's similar with page cache is
> > being deleted. The page is locked, the tree lock is taken, it's removed
> > from the radix tree and the mapping is set to NULL.
> > 
> > In many cases looking up the radix tree is protected by a combination of
> > the page lock and the tree lock. When zapping pages, the page lock is not
> > taken which does not matter as most architectures do nothing special with
> > the mapping and for file-backed pages, the mapping should be pinned by
> > the open file. However, s390 also calls set_dirty_page() for
> > PageSwapCache() pages where it is potentially racing against
> > reuse_swap_page() for example.
> > 
> > This patch splits the propagation of the storage key into a separate
> > function and introduces a new variant of page_remove_rmap() called
> > page_remove_rmap_nolock() which is only used during page table zapping
> > that attempts to acquire the page lock. The approach is ugly although it
> > works in that the bug can no longer be triggered. At the time the page
> > lock is required, the PTL is held so it cannot sleep so it busy waits on
> > trylock_page(). That potentially deadlocks against reclaim which holds
> > the page table lock and is trying to acquire the pagetable lock so in
> > some cases there is no choice except to race. In the file-backed case,
> > this is ok because the address space will be valid (zap_pte_range() makes
> > the same assumption. However, s390 needs a better way of guarding against
> > PageSwapCache pages being removed from the radix tree while set_page_dirty()
> > is being called. The patch would be marginally better if in the PageSwapCache
> > case we simply tried to lock once and in the contended case just fail to
> > propogate the storage key. I lack familiarity with the s390 architecture
> > to be certain if this is safe or not. Suggestions on a better fix?
> > 
> > Signed-off-by: Mel Gorman <mgorman@suse.de>
> 
> I'm very confused - and wonder if you are too ;)
> 

It would not be the first time :)

> Please forgive me if I've not been patient enough in reading through
> all you've written, and knocking it into my skull: ask me to go back
> and read again.
> 
> But it seems to me that you're approaching this from the wrong end
> (page_remove_rmap), instead of from the end that's giving rise to the
> problem - perhaps because you've not yet identified that other end?
> 

Not as precisely as I would have like. I do not have access to the machine
in question unfortunately.

> I'm confused as to whether you see this problem with file pages,
> or with anon-swap-cache pages, or with both, or not yet determined.
> 

PageSwapCache pages only.

> The obvious places which set page->mapping = NULL or ClearPageSwapCache,
> once they've been visible to the outside world, are careful to do so at
> the same time as removing the radix_tree entry, holding mapping->tree_lock.
> 

The problem here is not that page->mapping is NULL but that the page->mapping
is valid while the page is not in the radix tree.

> And __set_page_dirty_nobuffers(), the set_page_dirty variant implicated
> in the trace above, is careful to recheck mapping under tree_lock, as I
> think are the others.  There is an issue if mapping itself could vanish
> while in there, but that's not a danger in page_remove_rmap().
> 
> (You do remind me that I meant years ago to switch swapper_space over
> to the much simpler __set_page_dirty_no_writeback(), which shmem has
> used for ages; but as far as this problem goes, that would probably
> be at best a workaround, rather than the proper fix.)
> 

It would be a workaround. If in the future we wanted to treat swapper
space more like a normal file inode and writeback dirty pages from
the flusher thread then this bug would just pop its head back up.

> You indicate read_cache_pages_invalidate_page() above, and yes that's
> weird the way it sets and unsets page->mapping; I never came across it
> before, but IIUC it's safe because those pages have not yet been made
> visible to the outside world.  (Or do we have a speculative pagecache
> issue here?)
> 

We do not have a speculative pagecache issue here that I'm aware of. 

> You compare do_invalidatepage() with deletion from page cache, but I hope
> that's mistaken: aren't such pages usually being invalidated precisely
> because they raced with pages found already there in the page cache,
> which need to be left in place?
> 

True. What I was trying to do was point out that we usually handle
pages being removed from the page cache with a combination of the page
lock and tree lock. s390 does not have the same type of protection in
page_remove_rmap() when handling PageSwapCache pages.

> So, is there somewhere which is setting page->mapping = NULL, or
> doing ClearPageSwapCache, in a separate block from removing that
> page's entry from the radix_tree?  If there is, then I think it
> could pose a problem on more than s390.
> 
> Hmm, mm/migrate.c.
> 

Migration moves the page mapping under the tree lock so
__set_page_dirty_nobuffers() I don't think that is it.

I think the race is against something like reuse_swap_page() which locks
the page and removes it from swap cache while page_remove_rmap() looks
up the same page.

-- 
Mel Gorman
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH] s390: mm: rmap: Transfer storage key to struct page under the page lock
  2012-04-16 15:50   ` Martin Schwidefsky
@ 2012-04-17 12:29     ` Mel Gorman
  -1 siblings, 0 replies; 32+ messages in thread
From: Mel Gorman @ 2012-04-17 12:29 UTC (permalink / raw)
  To: Martin Schwidefsky
  Cc: Heiko Carstens, Hugh Dickins, Rik van Riel, Linux-MM, Linux-S390, LKML

On Mon, Apr 16, 2012 at 05:50:40PM +0200, Martin Schwidefsky wrote:
> On Mon, 16 Apr 2012 15:14:23 +0100
> Mel Gorman <mgorman@suse.de> wrote:
> 
> > This patch is horribly ugly and there has to be a better way of doing
> > it. I'm looking for suggestions on what s390 can do here that is not
> > painful or broken. 
> > 
> > However, s390 needs a better way of guarding against
> > PageSwapCache pages being removed from the radix tree while set_page_dirty()
> > is being called. The patch would be marginally better if in the PageSwapCache
> > case we simply tried to lock once and in the contended case just fail to
> > propogate the storage key. I lack familiarity with the s390 architecture
> > to be certain if this is safe or not. Suggestions on a better fix?
> 
> One though that crossed my mind is that maybe a better approach would be
> to move the page_test_and_clear_dirty check out of page_remove_rmap.
> What we need to look out for are code sequences of the form:
> 
> 	if (pte_dirty(pte))
> 		set_page_dirty(page);
> 	...
> 	page_remove_rmap(page);
> 
> There are four of those as far as I can see: in try_to_unmap_one,
> try_to_unmap_cluster, zap_pte, and zap_pte_range.
> 
> A valid implementation for s390 would be to test and clear the changed
> bit in the storage key for every of those pte_dirty() calls.
> 
> 	if (pte_dirty(pte) || page_test_and_clear_dirty(page))
> 		set_page_dirty(page);
> 	...
> 	page_remove_rmap(page); /* w/o page_test_clear_dirty */
> 

In the zap_pte_range() case at least, pte_dirty() is only being checked
for !PageAnon pages so if we took this approach we would miss
PageSwapCache pages. If we added the check then the same problem is hit
and we'd need additional logic there for s390 to drop the PTL, take the
page lock and retry the operation. It'd still be ugly :(

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH] s390: mm: rmap: Transfer storage key to struct page under the page lock
@ 2012-04-17 12:29     ` Mel Gorman
  0 siblings, 0 replies; 32+ messages in thread
From: Mel Gorman @ 2012-04-17 12:29 UTC (permalink / raw)
  To: Martin Schwidefsky
  Cc: Heiko Carstens, Hugh Dickins, Rik van Riel, Linux-MM, Linux-S390, LKML

On Mon, Apr 16, 2012 at 05:50:40PM +0200, Martin Schwidefsky wrote:
> On Mon, 16 Apr 2012 15:14:23 +0100
> Mel Gorman <mgorman@suse.de> wrote:
> 
> > This patch is horribly ugly and there has to be a better way of doing
> > it. I'm looking for suggestions on what s390 can do here that is not
> > painful or broken. 
> > 
> > However, s390 needs a better way of guarding against
> > PageSwapCache pages being removed from the radix tree while set_page_dirty()
> > is being called. The patch would be marginally better if in the PageSwapCache
> > case we simply tried to lock once and in the contended case just fail to
> > propogate the storage key. I lack familiarity with the s390 architecture
> > to be certain if this is safe or not. Suggestions on a better fix?
> 
> One though that crossed my mind is that maybe a better approach would be
> to move the page_test_and_clear_dirty check out of page_remove_rmap.
> What we need to look out for are code sequences of the form:
> 
> 	if (pte_dirty(pte))
> 		set_page_dirty(page);
> 	...
> 	page_remove_rmap(page);
> 
> There are four of those as far as I can see: in try_to_unmap_one,
> try_to_unmap_cluster, zap_pte, and zap_pte_range.
> 
> A valid implementation for s390 would be to test and clear the changed
> bit in the storage key for every of those pte_dirty() calls.
> 
> 	if (pte_dirty(pte) || page_test_and_clear_dirty(page))
> 		set_page_dirty(page);
> 	...
> 	page_remove_rmap(page); /* w/o page_test_clear_dirty */
> 

In the zap_pte_range() case at least, pte_dirty() is only being checked
for !PageAnon pages so if we took this approach we would miss
PageSwapCache pages. If we added the check then the same problem is hit
and we'd need additional logic there for s390 to drop the PTL, take the
page lock and retry the operation. It'd still be ugly :(

-- 
Mel Gorman
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH] s390: mm: rmap: Transfer storage key to struct page under the page lock
  2012-04-17 12:29     ` Mel Gorman
@ 2012-04-17 13:02       ` Martin Schwidefsky
  -1 siblings, 0 replies; 32+ messages in thread
From: Martin Schwidefsky @ 2012-04-17 13:02 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Heiko Carstens, Hugh Dickins, Rik van Riel, Linux-MM, Linux-S390, LKML

On Tue, 17 Apr 2012 13:29:25 +0100
Mel Gorman <mgorman@suse.de> wrote:

> On Mon, Apr 16, 2012 at 05:50:40PM +0200, Martin Schwidefsky wrote:
> > On Mon, 16 Apr 2012 15:14:23 +0100
> > Mel Gorman <mgorman@suse.de> wrote:
> > 
> > > This patch is horribly ugly and there has to be a better way of doing
> > > it. I'm looking for suggestions on what s390 can do here that is not
> > > painful or broken. 
> > > 
> > > However, s390 needs a better way of guarding against
> > > PageSwapCache pages being removed from the radix tree while set_page_dirty()
> > > is being called. The patch would be marginally better if in the PageSwapCache
> > > case we simply tried to lock once and in the contended case just fail to
> > > propogate the storage key. I lack familiarity with the s390 architecture
> > > to be certain if this is safe or not. Suggestions on a better fix?
> > 
> > One though that crossed my mind is that maybe a better approach would be
> > to move the page_test_and_clear_dirty check out of page_remove_rmap.
> > What we need to look out for are code sequences of the form:
> > 
> > 	if (pte_dirty(pte))
> > 		set_page_dirty(page);
> > 	...
> > 	page_remove_rmap(page);
> > 
> > There are four of those as far as I can see: in try_to_unmap_one,
> > try_to_unmap_cluster, zap_pte, and zap_pte_range.
> > 
> > A valid implementation for s390 would be to test and clear the changed
> > bit in the storage key for every of those pte_dirty() calls.
> > 
> > 	if (pte_dirty(pte) || page_test_and_clear_dirty(page))
> > 		set_page_dirty(page);
> > 	...
> > 	page_remove_rmap(page); /* w/o page_test_clear_dirty */
> > 
> 
> In the zap_pte_range() case at least, pte_dirty() is only being checked
> for !PageAnon pages so if we took this approach we would miss
> PageSwapCache pages. If we added the check then the same problem is hit
> and we'd need additional logic there for s390 to drop the PTL, take the
> page lock and retry the operation. It'd still be ugly :(

Well if x86 can get away with ignoring PageSwapCache pages in zap_pte_range()
pages then s390 should be able to get away with it as well, no ?

-- 
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH] s390: mm: rmap: Transfer storage key to struct page under the page lock
@ 2012-04-17 13:02       ` Martin Schwidefsky
  0 siblings, 0 replies; 32+ messages in thread
From: Martin Schwidefsky @ 2012-04-17 13:02 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Heiko Carstens, Hugh Dickins, Rik van Riel, Linux-MM, Linux-S390, LKML

On Tue, 17 Apr 2012 13:29:25 +0100
Mel Gorman <mgorman@suse.de> wrote:

> On Mon, Apr 16, 2012 at 05:50:40PM +0200, Martin Schwidefsky wrote:
> > On Mon, 16 Apr 2012 15:14:23 +0100
> > Mel Gorman <mgorman@suse.de> wrote:
> > 
> > > This patch is horribly ugly and there has to be a better way of doing
> > > it. I'm looking for suggestions on what s390 can do here that is not
> > > painful or broken. 
> > > 
> > > However, s390 needs a better way of guarding against
> > > PageSwapCache pages being removed from the radix tree while set_page_dirty()
> > > is being called. The patch would be marginally better if in the PageSwapCache
> > > case we simply tried to lock once and in the contended case just fail to
> > > propogate the storage key. I lack familiarity with the s390 architecture
> > > to be certain if this is safe or not. Suggestions on a better fix?
> > 
> > One though that crossed my mind is that maybe a better approach would be
> > to move the page_test_and_clear_dirty check out of page_remove_rmap.
> > What we need to look out for are code sequences of the form:
> > 
> > 	if (pte_dirty(pte))
> > 		set_page_dirty(page);
> > 	...
> > 	page_remove_rmap(page);
> > 
> > There are four of those as far as I can see: in try_to_unmap_one,
> > try_to_unmap_cluster, zap_pte, and zap_pte_range.
> > 
> > A valid implementation for s390 would be to test and clear the changed
> > bit in the storage key for every of those pte_dirty() calls.
> > 
> > 	if (pte_dirty(pte) || page_test_and_clear_dirty(page))
> > 		set_page_dirty(page);
> > 	...
> > 	page_remove_rmap(page); /* w/o page_test_clear_dirty */
> > 
> 
> In the zap_pte_range() case at least, pte_dirty() is only being checked
> for !PageAnon pages so if we took this approach we would miss
> PageSwapCache pages. If we added the check then the same problem is hit
> and we'd need additional logic there for s390 to drop the PTL, take the
> page lock and retry the operation. It'd still be ugly :(

Well if x86 can get away with ignoring PageSwapCache pages in zap_pte_range()
pages then s390 should be able to get away with it as well, no ?

-- 
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH] s390: mm: rmap: Transfer storage key to struct page under the page lock
  2012-04-17 12:22     ` Mel Gorman
@ 2012-04-18  3:52       ` Hugh Dickins
  -1 siblings, 0 replies; 32+ messages in thread
From: Hugh Dickins @ 2012-04-18  3:52 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Martin Schwidefsky, Heiko Carstens, Rik van Riel, Ken Chen,
	Linux-MM, Linux-S390, LKML

On Tue, 17 Apr 2012, Mel Gorman wrote:
> On Mon, Apr 16, 2012 at 02:14:09PM -0700, Hugh Dickins wrote:
> > On Mon, 16 Apr 2012, Mel Gorman wrote:
> > 
> > > This patch is horribly ugly and there has to be a better way of doing
> > > it. I'm looking for suggestions on what s390 can do here that is not
> > > painful or broken. 
> > > 
> > > The following bug was reported on s390
> > > 
> > > kernel BUG at
> > > /usr/src/packages/BUILD/kernel-default-3.0.13/linux-3.0/lib/radix-tree.c:477!
>... 
> > I'm confused as to whether you see this problem with file pages,
> > or with anon-swap-cache pages, or with both, or not yet determined.
> 
> PageSwapCache pages only.

Oh good, thanks, that narrowed the search space a lot.

> > (You do remind me that I meant years ago to switch swapper_space over
> > to the much simpler __set_page_dirty_no_writeback(), which shmem has
> > used for ages; but as far as this problem goes, that would probably
> > be at best a workaround, rather than the proper fix.)
> 
> It would be a workaround. If in the future we wanted to treat swapper
> space more like a normal file inode and writeback dirty pages from
> the flusher thread then this bug would just pop its head back up.

It's a no-brainer workaround: patch and more explanation below.  I
can double-fix it if you prefer, but the one-liner appeals more to me.

> > Hmm, mm/migrate.c.
> 
> Migration moves the page mapping under the tree lock so
> __set_page_dirty_nobuffers() I don't think that is it.

Yes, I was worried by the places that set page->mapping = NULL in
migrate.c (later, not under the tree_lock), but those would not be able
to generate this issue at all (ptes already replaced by migration entries).

> I think the race is against something like reuse_swap_page() which locks
> the page and removes it from swap cache while page_remove_rmap() looks
> up the same page.

No, __delete_from_swap_cache() is always doing ClearPageSwapCache under
tree_lock (which __set_page_dirty_no_buffers acquires before proceeding).


[PATCH] mm: fix s390 BUG by using __set_page_dirty_no_writeback on swap

Mel reports a BUG_ON(slot == NULL) in radix_tree_tag_set() on s390 3.0.13:
called from __set_page_dirty_nobuffers() when page_remove_rmap() tries to
transfer dirty flag from s390 storage key to struct page and radix_tree.

That would be because of reclaim's shrink_page_list() calling add_to_swap()
on this page at the same time: first PageSwapCache is set (causing
page_mapping(page) to appear as &swapper_space), then page->private set,
then tree_lock taken, then page inserted into radix_tree - so there's
an interval before taking the lock when the radix_tree slot is empty.

We could fix this by moving __add_to_swap_cache()'s spin_lock_irq up
before SetPageSwapCache, with error case ClearPageSwapCache moved up
under tree_lock too.

But a better fix is just to do what's five years overdue.  Ken Chen
added __set_page_dirty_no_writeback() (if !PageDirty TestSetPageDirty)
for tmpfs to skip all that radix_tree overhead, and swap is just the same:
it ignores the radix_tree tag, and does not participate in dirty page
accounting, so should be using __set_page_dirty_no_writeback() too.

Reported-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: stable@kernel.org
---

 mm/swap_state.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- 3.4-rc2/mm/swap_state.c	2012-03-31 17:42:26.949729938 -0700
+++ linux/mm/swap_state.c	2012-04-17 15:34:05.732086663 -0700
@@ -26,7 +26,7 @@
  */
 static const struct address_space_operations swap_aops = {
 	.writepage	= swap_writepage,
-	.set_page_dirty	= __set_page_dirty_nobuffers,
+	.set_page_dirty	= __set_page_dirty_no_writeback,
 	.migratepage	= migrate_page,
 };
 

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH] s390: mm: rmap: Transfer storage key to struct page under the page lock
@ 2012-04-18  3:52       ` Hugh Dickins
  0 siblings, 0 replies; 32+ messages in thread
From: Hugh Dickins @ 2012-04-18  3:52 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Martin Schwidefsky, Heiko Carstens, Rik van Riel, Ken Chen,
	Linux-MM, Linux-S390, LKML

On Tue, 17 Apr 2012, Mel Gorman wrote:
> On Mon, Apr 16, 2012 at 02:14:09PM -0700, Hugh Dickins wrote:
> > On Mon, 16 Apr 2012, Mel Gorman wrote:
> > 
> > > This patch is horribly ugly and there has to be a better way of doing
> > > it. I'm looking for suggestions on what s390 can do here that is not
> > > painful or broken. 
> > > 
> > > The following bug was reported on s390
> > > 
> > > kernel BUG at
> > > /usr/src/packages/BUILD/kernel-default-3.0.13/linux-3.0/lib/radix-tree.c:477!
>... 
> > I'm confused as to whether you see this problem with file pages,
> > or with anon-swap-cache pages, or with both, or not yet determined.
> 
> PageSwapCache pages only.

Oh good, thanks, that narrowed the search space a lot.

> > (You do remind me that I meant years ago to switch swapper_space over
> > to the much simpler __set_page_dirty_no_writeback(), which shmem has
> > used for ages; but as far as this problem goes, that would probably
> > be at best a workaround, rather than the proper fix.)
> 
> It would be a workaround. If in the future we wanted to treat swapper
> space more like a normal file inode and writeback dirty pages from
> the flusher thread then this bug would just pop its head back up.

It's a no-brainer workaround: patch and more explanation below.  I
can double-fix it if you prefer, but the one-liner appeals more to me.

> > Hmm, mm/migrate.c.
> 
> Migration moves the page mapping under the tree lock so
> __set_page_dirty_nobuffers() I don't think that is it.

Yes, I was worried by the places that set page->mapping = NULL in
migrate.c (later, not under the tree_lock), but those would not be able
to generate this issue at all (ptes already replaced by migration entries).

> I think the race is against something like reuse_swap_page() which locks
> the page and removes it from swap cache while page_remove_rmap() looks
> up the same page.

No, __delete_from_swap_cache() is always doing ClearPageSwapCache under
tree_lock (which __set_page_dirty_no_buffers acquires before proceeding).


[PATCH] mm: fix s390 BUG by using __set_page_dirty_no_writeback on swap

Mel reports a BUG_ON(slot == NULL) in radix_tree_tag_set() on s390 3.0.13:
called from __set_page_dirty_nobuffers() when page_remove_rmap() tries to
transfer dirty flag from s390 storage key to struct page and radix_tree.

That would be because of reclaim's shrink_page_list() calling add_to_swap()
on this page at the same time: first PageSwapCache is set (causing
page_mapping(page) to appear as &swapper_space), then page->private set,
then tree_lock taken, then page inserted into radix_tree - so there's
an interval before taking the lock when the radix_tree slot is empty.

We could fix this by moving __add_to_swap_cache()'s spin_lock_irq up
before SetPageSwapCache, with error case ClearPageSwapCache moved up
under tree_lock too.

But a better fix is just to do what's five years overdue.  Ken Chen
added __set_page_dirty_no_writeback() (if !PageDirty TestSetPageDirty)
for tmpfs to skip all that radix_tree overhead, and swap is just the same:
it ignores the radix_tree tag, and does not participate in dirty page
accounting, so should be using __set_page_dirty_no_writeback() too.

Reported-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: stable@kernel.org
---

 mm/swap_state.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- 3.4-rc2/mm/swap_state.c	2012-03-31 17:42:26.949729938 -0700
+++ linux/mm/swap_state.c	2012-04-17 15:34:05.732086663 -0700
@@ -26,7 +26,7 @@
  */
 static const struct address_space_operations swap_aops = {
 	.writepage	= swap_writepage,
-	.set_page_dirty	= __set_page_dirty_nobuffers,
+	.set_page_dirty	= __set_page_dirty_no_writeback,
 	.migratepage	= migrate_page,
 };
 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH] s390: mm: rmap: Transfer storage key to struct page under the page lock
  2012-04-17 13:02       ` Martin Schwidefsky
@ 2012-04-18  4:00         ` Hugh Dickins
  -1 siblings, 0 replies; 32+ messages in thread
From: Hugh Dickins @ 2012-04-18  4:00 UTC (permalink / raw)
  To: Martin Schwidefsky
  Cc: Mel Gorman, Heiko Carstens, Rik van Riel, Linux-MM, Linux-S390, LKML

On Tue, 17 Apr 2012, Martin Schwidefsky wrote:
> On Tue, 17 Apr 2012 13:29:25 +0100
> Mel Gorman <mgorman@suse.de> wrote:
> > 
> > In the zap_pte_range() case at least, pte_dirty() is only being checked
> > for !PageAnon pages so if we took this approach we would miss
> > PageSwapCache pages. If we added the check then the same problem is hit
> > and we'd need additional logic there for s390 to drop the PTL, take the
> > page lock and retry the operation. It'd still be ugly :(
> 
> Well if x86 can get away with ignoring PageSwapCache pages in zap_pte_range()
> pages then s390 should be able to get away with it as well, no ?

When it's zap_pte_range() calling page_remove_rmap(), yes; but that's not
the only caller of page_remove_rmap(), and I believe there's at least one
caller which absolutely needs it to do that s390 set_page_dirty() on swap.

But I don't see any need to be discussing ugly patches for this any more:
there's a very simple patch which improves the swap path anyway, and if
deemed advisable, we can also rearrange __add_to_swap_cache() a little.

Hugh

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH] s390: mm: rmap: Transfer storage key to struct page under the page lock
@ 2012-04-18  4:00         ` Hugh Dickins
  0 siblings, 0 replies; 32+ messages in thread
From: Hugh Dickins @ 2012-04-18  4:00 UTC (permalink / raw)
  To: Martin Schwidefsky
  Cc: Mel Gorman, Heiko Carstens, Rik van Riel, Linux-MM, Linux-S390, LKML

On Tue, 17 Apr 2012, Martin Schwidefsky wrote:
> On Tue, 17 Apr 2012 13:29:25 +0100
> Mel Gorman <mgorman@suse.de> wrote:
> > 
> > In the zap_pte_range() case at least, pte_dirty() is only being checked
> > for !PageAnon pages so if we took this approach we would miss
> > PageSwapCache pages. If we added the check then the same problem is hit
> > and we'd need additional logic there for s390 to drop the PTL, take the
> > page lock and retry the operation. It'd still be ugly :(
> 
> Well if x86 can get away with ignoring PageSwapCache pages in zap_pte_range()
> pages then s390 should be able to get away with it as well, no ?

When it's zap_pte_range() calling page_remove_rmap(), yes; but that's not
the only caller of page_remove_rmap(), and I believe there's at least one
caller which absolutely needs it to do that s390 set_page_dirty() on swap.

But I don't see any need to be discussing ugly patches for this any more:
there's a very simple patch which improves the swap path anyway, and if
deemed advisable, we can also rearrange __add_to_swap_cache() a little.

Hugh

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH] s390: mm: rmap: Transfer storage key to struct page under the page lock
  2012-04-18  3:52       ` Hugh Dickins
@ 2012-04-18 15:28         ` Mel Gorman
  -1 siblings, 0 replies; 32+ messages in thread
From: Mel Gorman @ 2012-04-18 15:28 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Martin Schwidefsky, Heiko Carstens, Rik van Riel, Ken Chen,
	Linux-MM, Linux-S390, LKML

On Tue, Apr 17, 2012 at 08:52:21PM -0700, Hugh Dickins wrote:
> > > (You do remind me that I meant years ago to switch swapper_space over
> > > to the much simpler __set_page_dirty_no_writeback(), which shmem has
> > > used for ages; but as far as this problem goes, that would probably
> > > be at best a workaround, rather than the proper fix.)
> > 
> > It would be a workaround. If in the future we wanted to treat swapper
> > space more like a normal file inode and writeback dirty pages from
> > the flusher thread then this bug would just pop its head back up.
> 
> It's a no-brainer workaround: patch and more explanation below.  I
> can double-fix it if you prefer, but the one-liner appeals more to me.
> 

Ok, fair enough. While I think swapper space will eventually use the dirty
tag information that day is not today.

> > > Hmm, mm/migrate.c.
> > 
> > Migration moves the page mapping under the tree lock so
> > __set_page_dirty_nobuffers() I don't think that is it.
> 
> Yes, I was worried by the places that set page->mapping = NULL in
> migrate.c (later, not under the tree_lock), but those would not be able
> to generate this issue at all (ptes already replaced by migration entries).
> 

Yes.

> <SNIP>
>
> [PATCH] mm: fix s390 BUG by using __set_page_dirty_no_writeback on swap
> 
> Mel reports a BUG_ON(slot == NULL) in radix_tree_tag_set() on s390 3.0.13:
> called from __set_page_dirty_nobuffers() when page_remove_rmap() tries to
> transfer dirty flag from s390 storage key to struct page and radix_tree.
> 
> That would be because of reclaim's shrink_page_list() calling add_to_swap()
> on this page at the same time: first PageSwapCache is set (causing
> page_mapping(page) to appear as &swapper_space), then page->private set,
> then tree_lock taken, then page inserted into radix_tree - so there's
> an interval before taking the lock when the radix_tree slot is empty.
> 

Yes, makes sense.

> We could fix this by moving __add_to_swap_cache()'s spin_lock_irq up
> before SetPageSwapCache, with error case ClearPageSwapCache moved up
> under tree_lock too.
> 

This can be done if/when swapper_space can make proper use of the dirty
tag information.

> But a better fix is just to do what's five years overdue.  Ken Chen
> added __set_page_dirty_no_writeback() (if !PageDirty TestSetPageDirty)
> for tmpfs to skip all that radix_tree overhead, and swap is just the same:
> it ignores the radix_tree tag, and does not participate in dirty page
> accounting, so should be using __set_page_dirty_no_writeback() too.
> 
> Reported-by: Mel Gorman <mgorman@suse.de>
> Signed-off-by: Hugh Dickins <hughd@google.com>

Acked-by: Mel Gorman <mgorman@suse.de>

I've sent a kernel based on this patch to the s390 folk that originally
reported the bug. Hopefully they'll test and get back to me in a few
days.

Thanks Hugh.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH] s390: mm: rmap: Transfer storage key to struct page under the page lock
@ 2012-04-18 15:28         ` Mel Gorman
  0 siblings, 0 replies; 32+ messages in thread
From: Mel Gorman @ 2012-04-18 15:28 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Martin Schwidefsky, Heiko Carstens, Rik van Riel, Ken Chen,
	Linux-MM, Linux-S390, LKML

On Tue, Apr 17, 2012 at 08:52:21PM -0700, Hugh Dickins wrote:
> > > (You do remind me that I meant years ago to switch swapper_space over
> > > to the much simpler __set_page_dirty_no_writeback(), which shmem has
> > > used for ages; but as far as this problem goes, that would probably
> > > be at best a workaround, rather than the proper fix.)
> > 
> > It would be a workaround. If in the future we wanted to treat swapper
> > space more like a normal file inode and writeback dirty pages from
> > the flusher thread then this bug would just pop its head back up.
> 
> It's a no-brainer workaround: patch and more explanation below.  I
> can double-fix it if you prefer, but the one-liner appeals more to me.
> 

Ok, fair enough. While I think swapper space will eventually use the dirty
tag information that day is not today.

> > > Hmm, mm/migrate.c.
> > 
> > Migration moves the page mapping under the tree lock so
> > __set_page_dirty_nobuffers() I don't think that is it.
> 
> Yes, I was worried by the places that set page->mapping = NULL in
> migrate.c (later, not under the tree_lock), but those would not be able
> to generate this issue at all (ptes already replaced by migration entries).
> 

Yes.

> <SNIP>
>
> [PATCH] mm: fix s390 BUG by using __set_page_dirty_no_writeback on swap
> 
> Mel reports a BUG_ON(slot == NULL) in radix_tree_tag_set() on s390 3.0.13:
> called from __set_page_dirty_nobuffers() when page_remove_rmap() tries to
> transfer dirty flag from s390 storage key to struct page and radix_tree.
> 
> That would be because of reclaim's shrink_page_list() calling add_to_swap()
> on this page at the same time: first PageSwapCache is set (causing
> page_mapping(page) to appear as &swapper_space), then page->private set,
> then tree_lock taken, then page inserted into radix_tree - so there's
> an interval before taking the lock when the radix_tree slot is empty.
> 

Yes, makes sense.

> We could fix this by moving __add_to_swap_cache()'s spin_lock_irq up
> before SetPageSwapCache, with error case ClearPageSwapCache moved up
> under tree_lock too.
> 

This can be done if/when swapper_space can make proper use of the dirty
tag information.

> But a better fix is just to do what's five years overdue.  Ken Chen
> added __set_page_dirty_no_writeback() (if !PageDirty TestSetPageDirty)
> for tmpfs to skip all that radix_tree overhead, and swap is just the same:
> it ignores the radix_tree tag, and does not participate in dirty page
> accounting, so should be using __set_page_dirty_no_writeback() too.
> 
> Reported-by: Mel Gorman <mgorman@suse.de>
> Signed-off-by: Hugh Dickins <hughd@google.com>

Acked-by: Mel Gorman <mgorman@suse.de>

I've sent a kernel based on this patch to the s390 folk that originally
reported the bug. Hopefully they'll test and get back to me in a few
days.

Thanks Hugh.

-- 
Mel Gorman
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH] s390: mm: rmap: Transfer storage key to struct page under the page lock
  2012-04-18 15:28         ` Mel Gorman
@ 2012-04-18 17:09           ` Hugh Dickins
  -1 siblings, 0 replies; 32+ messages in thread
From: Hugh Dickins @ 2012-04-18 17:09 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Martin Schwidefsky, Heiko Carstens, Rik van Riel, Ken Chen,
	Linux-MM, Linux-S390, LKML

On Wed, 18 Apr 2012, Mel Gorman wrote:
> On Tue, Apr 17, 2012 at 08:52:21PM -0700, Hugh Dickins wrote:
> > 
> > It's a no-brainer workaround: patch and more explanation below.  I
> > can double-fix it if you prefer, but the one-liner appeals more to me.

(I became much happier about fixing it from this end, once I understood
how the situation came about.  What I had disliked with yours, was an
admitted ugly patch, when we didn't even understand the root cause.)

> 
> Ok, fair enough. While I think swapper space will eventually use the dirty
> tag information that day is not today.

Yes, it's never been self-evident to me, why swap should not participate
in any of the dirty writeback stuff.  But we've got along for years that
way, and as you say, won't be changing it right now.

> Acked-by: Mel Gorman <mgorman@suse.de>
> 
> I've sent a kernel based on this patch to the s390 folk that originally
> reported the bug. Hopefully they'll test and get back to me in a few
> days.

Thanks - I'll leave it at that for the moment, expecting to hear back.

Hugh

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH] s390: mm: rmap: Transfer storage key to struct page under the page lock
@ 2012-04-18 17:09           ` Hugh Dickins
  0 siblings, 0 replies; 32+ messages in thread
From: Hugh Dickins @ 2012-04-18 17:09 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Martin Schwidefsky, Heiko Carstens, Rik van Riel, Ken Chen,
	Linux-MM, Linux-S390, LKML

On Wed, 18 Apr 2012, Mel Gorman wrote:
> On Tue, Apr 17, 2012 at 08:52:21PM -0700, Hugh Dickins wrote:
> > 
> > It's a no-brainer workaround: patch and more explanation below.  I
> > can double-fix it if you prefer, but the one-liner appeals more to me.

(I became much happier about fixing it from this end, once I understood
how the situation came about.  What I had disliked with yours, was an
admitted ugly patch, when we didn't even understand the root cause.)

> 
> Ok, fair enough. While I think swapper space will eventually use the dirty
> tag information that day is not today.

Yes, it's never been self-evident to me, why swap should not participate
in any of the dirty writeback stuff.  But we've got along for years that
way, and as you say, won't be changing it right now.

> Acked-by: Mel Gorman <mgorman@suse.de>
> 
> I've sent a kernel based on this patch to the s390 folk that originally
> reported the bug. Hopefully they'll test and get back to me in a few
> days.

Thanks - I'll leave it at that for the moment, expecting to hear back.

Hugh

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH] s390: mm: rmap: Transfer storage key to struct page under the page lock
  2012-04-18 15:28         ` Mel Gorman
@ 2012-04-18 18:29           ` Martin Schwidefsky
  -1 siblings, 0 replies; 32+ messages in thread
From: Martin Schwidefsky @ 2012-04-18 18:29 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Hugh Dickins, Heiko Carstens, Rik van Riel, Ken Chen, Linux-MM,
	Linux-S390, LKML

On Wed, 18 Apr 2012 16:28:31 +0100
Mel Gorman <mgorman@suse.de> wrote:

> > <SNIP>
> >
> > [PATCH] mm: fix s390 BUG by using __set_page_dirty_no_writeback on swap
> > 
> > Mel reports a BUG_ON(slot == NULL) in radix_tree_tag_set() on s390 3.0.13:
> > called from __set_page_dirty_nobuffers() when page_remove_rmap() tries to
> > transfer dirty flag from s390 storage key to struct page and radix_tree.
> > 
> > That would be because of reclaim's shrink_page_list() calling add_to_swap()
> > on this page at the same time: first PageSwapCache is set (causing
> > page_mapping(page) to appear as &swapper_space), then page->private set,
> > then tree_lock taken, then page inserted into radix_tree - so there's
> > an interval before taking the lock when the radix_tree slot is empty.
> > 
> 
> Yes, makes sense.
> 
> > We could fix this by moving __add_to_swap_cache()'s spin_lock_irq up
> > before SetPageSwapCache, with error case ClearPageSwapCache moved up
> > under tree_lock too.
> > 
> 
> This can be done if/when swapper_space can make proper use of the dirty
> tag information.
> 
> > But a better fix is just to do what's five years overdue.  Ken Chen
> > added __set_page_dirty_no_writeback() (if !PageDirty TestSetPageDirty)
> > for tmpfs to skip all that radix_tree overhead, and swap is just the same:
> > it ignores the radix_tree tag, and does not participate in dirty page
> > accounting, so should be using __set_page_dirty_no_writeback() too.
> > 
> > Reported-by: Mel Gorman <mgorman@suse.de>
> > Signed-off-by: Hugh Dickins <hughd@google.com>
> 
> Acked-by: Mel Gorman <mgorman@suse.de>
> 
> I've sent a kernel based on this patch to the s390 folk that originally
> reported the bug. Hopefully they'll test and get back to me in a few
> days.
> 
> Thanks Hugh.

Indeed, thanks Hugh. The patches so far were not pretty at all.. 


-- 
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH] s390: mm: rmap: Transfer storage key to struct page under the page lock
@ 2012-04-18 18:29           ` Martin Schwidefsky
  0 siblings, 0 replies; 32+ messages in thread
From: Martin Schwidefsky @ 2012-04-18 18:29 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Hugh Dickins, Heiko Carstens, Rik van Riel, Ken Chen, Linux-MM,
	Linux-S390, LKML

On Wed, 18 Apr 2012 16:28:31 +0100
Mel Gorman <mgorman@suse.de> wrote:

> > <SNIP>
> >
> > [PATCH] mm: fix s390 BUG by using __set_page_dirty_no_writeback on swap
> > 
> > Mel reports a BUG_ON(slot == NULL) in radix_tree_tag_set() on s390 3.0.13:
> > called from __set_page_dirty_nobuffers() when page_remove_rmap() tries to
> > transfer dirty flag from s390 storage key to struct page and radix_tree.
> > 
> > That would be because of reclaim's shrink_page_list() calling add_to_swap()
> > on this page at the same time: first PageSwapCache is set (causing
> > page_mapping(page) to appear as &swapper_space), then page->private set,
> > then tree_lock taken, then page inserted into radix_tree - so there's
> > an interval before taking the lock when the radix_tree slot is empty.
> > 
> 
> Yes, makes sense.
> 
> > We could fix this by moving __add_to_swap_cache()'s spin_lock_irq up
> > before SetPageSwapCache, with error case ClearPageSwapCache moved up
> > under tree_lock too.
> > 
> 
> This can be done if/when swapper_space can make proper use of the dirty
> tag information.
> 
> > But a better fix is just to do what's five years overdue.  Ken Chen
> > added __set_page_dirty_no_writeback() (if !PageDirty TestSetPageDirty)
> > for tmpfs to skip all that radix_tree overhead, and swap is just the same:
> > it ignores the radix_tree tag, and does not participate in dirty page
> > accounting, so should be using __set_page_dirty_no_writeback() too.
> > 
> > Reported-by: Mel Gorman <mgorman@suse.de>
> > Signed-off-by: Hugh Dickins <hughd@google.com>
> 
> Acked-by: Mel Gorman <mgorman@suse.de>
> 
> I've sent a kernel based on this patch to the s390 folk that originally
> reported the bug. Hopefully they'll test and get back to me in a few
> days.
> 
> Thanks Hugh.

Indeed, thanks Hugh. The patches so far were not pretty at all.. 


-- 
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH] s390: mm: rmap: Transfer storage key to struct page under the page lock
  2012-04-18 17:09           ` Hugh Dickins
@ 2012-04-23 12:41             ` Mel Gorman
  -1 siblings, 0 replies; 32+ messages in thread
From: Mel Gorman @ 2012-04-23 12:41 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Martin Schwidefsky, Heiko Carstens, Rik van Riel, Ken Chen,
	Linux-MM, Linux-S390, LKML

On Wed, Apr 18, 2012 at 10:09:24AM -0700, Hugh Dickins wrote:
> > Acked-by: Mel Gorman <mgorman@suse.de>
> > 
> > I've sent a kernel based on this patch to the s390 folk that originally
> > reported the bug. Hopefully they'll test and get back to me in a few
> > days.
> 
> Thanks - I'll leave it at that for the moment, expecting to hear back.
> 

Tests completed successfully confirming that this was certainly a
PageSwapCache issue. Will you resend the patch to Andrew for merging or
will I?

Thanks Hugh.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH] s390: mm: rmap: Transfer storage key to struct page under the page lock
@ 2012-04-23 12:41             ` Mel Gorman
  0 siblings, 0 replies; 32+ messages in thread
From: Mel Gorman @ 2012-04-23 12:41 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Martin Schwidefsky, Heiko Carstens, Rik van Riel, Ken Chen,
	Linux-MM, Linux-S390, LKML

On Wed, Apr 18, 2012 at 10:09:24AM -0700, Hugh Dickins wrote:
> > Acked-by: Mel Gorman <mgorman@suse.de>
> > 
> > I've sent a kernel based on this patch to the s390 folk that originally
> > reported the bug. Hopefully they'll test and get back to me in a few
> > days.
> 
> Thanks - I'll leave it at that for the moment, expecting to hear back.
> 

Tests completed successfully confirming that this was certainly a
PageSwapCache issue. Will you resend the patch to Andrew for merging or
will I?

Thanks Hugh.

-- 
Mel Gorman
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH] s390: mm: rmap: Transfer storage key to struct page under the page lock
  2012-04-23 12:41             ` Mel Gorman
@ 2012-04-23 18:09               ` Hugh Dickins
  -1 siblings, 0 replies; 32+ messages in thread
From: Hugh Dickins @ 2012-04-23 18:09 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Martin Schwidefsky, Heiko Carstens, Rik van Riel, Ken Chen,
	Linux-MM, Linux-S390, LKML

On Mon, 23 Apr 2012, Mel Gorman wrote:
> On Wed, Apr 18, 2012 at 10:09:24AM -0700, Hugh Dickins wrote:
> > > Acked-by: Mel Gorman <mgorman@suse.de>
> > > 
> > > I've sent a kernel based on this patch to the s390 folk that originally
> > > reported the bug. Hopefully they'll test and get back to me in a few
> > > days.
> > 
> > Thanks - I'll leave it at that for the moment, expecting to hear back.
> > 
> 
> Tests completed successfully confirming that this was certainly a
> PageSwapCache issue. Will you resend the patch to Andrew for merging or
> will I?
> 
> Thanks Hugh.

Great, thanks Mel: as it's a one-liner, I'll send it straight to Linus now.

Hugh

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH] s390: mm: rmap: Transfer storage key to struct page under the page lock
@ 2012-04-23 18:09               ` Hugh Dickins
  0 siblings, 0 replies; 32+ messages in thread
From: Hugh Dickins @ 2012-04-23 18:09 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Martin Schwidefsky, Heiko Carstens, Rik van Riel, Ken Chen,
	Linux-MM, Linux-S390, LKML

On Mon, 23 Apr 2012, Mel Gorman wrote:
> On Wed, Apr 18, 2012 at 10:09:24AM -0700, Hugh Dickins wrote:
> > > Acked-by: Mel Gorman <mgorman@suse.de>
> > > 
> > > I've sent a kernel based on this patch to the s390 folk that originally
> > > reported the bug. Hopefully they'll test and get back to me in a few
> > > days.
> > 
> > Thanks - I'll leave it at that for the moment, expecting to hear back.
> > 
> 
> Tests completed successfully confirming that this was certainly a
> PageSwapCache issue. Will you resend the patch to Andrew for merging or
> will I?
> 
> Thanks Hugh.

Great, thanks Mel: as it's a one-liner, I'll send it straight to Linus now.

Hugh

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH] mm: fix s390 BUG by __set_page_dirty_no_writeback on swap
  2012-04-23 18:09               ` Hugh Dickins
@ 2012-04-23 18:14                 ` Hugh Dickins
  -1 siblings, 0 replies; 32+ messages in thread
From: Hugh Dickins @ 2012-04-23 18:14 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Morton, Mel Gorman, Martin Schwidefsky, Heiko Carstens,
	Rik van Riel, Ken Chen, linux-mm, linux-s390, linux-kernel

Mel reports a BUG_ON(slot == NULL) in radix_tree_tag_set() on s390 3.0.13:
called from __set_page_dirty_nobuffers() when page_remove_rmap() tries to
transfer dirty flag from s390 storage key to struct page and radix_tree.

That would be because of reclaim's shrink_page_list() calling add_to_swap()
on this page at the same time: first PageSwapCache is set (causing
page_mapping(page) to appear as &swapper_space), then page->private set,
then tree_lock taken, then page inserted into radix_tree - so there's
an interval before taking the lock when the radix_tree slot is empty.

We could fix this by moving __add_to_swap_cache()'s spin_lock_irq up
before the SetPageSwapCache.  But a better fix is simply to do what's
five years overdue: Ken Chen introduced __set_page_dirty_no_writeback()
(if !PageDirty TestSetPageDirty) for tmpfs to skip all the radix_tree
overhead, and swap is just the same - it ignores the radix_tree tag,
and does not participate in dirty page accounting, so should be using
__set_page_dirty_no_writeback() too.

s390 testing now confirms that this does indeed fix the problem.

Reported-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Ken Chen <kenchen@google.com>
Cc: stable@vger.kernel.org
---

 mm/swap_state.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- 3.4-git/mm/swap_state.c	2012-03-31 17:42:26.949729938 -0700
+++ linux/mm/swap_state.c	2012-04-17 15:34:05.732086663 -0700
@@ -26,7 +26,7 @@
  */
 static const struct address_space_operations swap_aops = {
 	.writepage	= swap_writepage,
-	.set_page_dirty	= __set_page_dirty_nobuffers,
+	.set_page_dirty	= __set_page_dirty_no_writeback,
 	.migratepage	= migrate_page,
 };
 

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH] mm: fix s390 BUG by __set_page_dirty_no_writeback on swap
@ 2012-04-23 18:14                 ` Hugh Dickins
  0 siblings, 0 replies; 32+ messages in thread
From: Hugh Dickins @ 2012-04-23 18:14 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Morton, Mel Gorman, Martin Schwidefsky, Heiko Carstens,
	Rik van Riel, Ken Chen, linux-mm, linux-s390, linux-kernel

Mel reports a BUG_ON(slot == NULL) in radix_tree_tag_set() on s390 3.0.13:
called from __set_page_dirty_nobuffers() when page_remove_rmap() tries to
transfer dirty flag from s390 storage key to struct page and radix_tree.

That would be because of reclaim's shrink_page_list() calling add_to_swap()
on this page at the same time: first PageSwapCache is set (causing
page_mapping(page) to appear as &swapper_space), then page->private set,
then tree_lock taken, then page inserted into radix_tree - so there's
an interval before taking the lock when the radix_tree slot is empty.

We could fix this by moving __add_to_swap_cache()'s spin_lock_irq up
before the SetPageSwapCache.  But a better fix is simply to do what's
five years overdue: Ken Chen introduced __set_page_dirty_no_writeback()
(if !PageDirty TestSetPageDirty) for tmpfs to skip all the radix_tree
overhead, and swap is just the same - it ignores the radix_tree tag,
and does not participate in dirty page accounting, so should be using
__set_page_dirty_no_writeback() too.

s390 testing now confirms that this does indeed fix the problem.

Reported-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Ken Chen <kenchen@google.com>
Cc: stable@vger.kernel.org
---

 mm/swap_state.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- 3.4-git/mm/swap_state.c	2012-03-31 17:42:26.949729938 -0700
+++ linux/mm/swap_state.c	2012-04-17 15:34:05.732086663 -0700
@@ -26,7 +26,7 @@
  */
 static const struct address_space_operations swap_aops = {
 	.writepage	= swap_writepage,
-	.set_page_dirty	= __set_page_dirty_nobuffers,
+	.set_page_dirty	= __set_page_dirty_no_writeback,
 	.migratepage	= migrate_page,
 };
 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2012-04-23 18:15 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-04-16 14:14 [RFC PATCH] s390: mm: rmap: Transfer storage key to struct page under the page lock Mel Gorman
2012-04-16 14:14 ` Mel Gorman
2012-04-16 14:53 ` Rik van Riel
2012-04-16 14:53   ` Rik van Riel
2012-04-16 15:02   ` Mel Gorman
2012-04-16 15:02     ` Mel Gorman
2012-04-16 15:50 ` Martin Schwidefsky
2012-04-16 15:50   ` Martin Schwidefsky
2012-04-17 12:29   ` Mel Gorman
2012-04-17 12:29     ` Mel Gorman
2012-04-17 13:02     ` Martin Schwidefsky
2012-04-17 13:02       ` Martin Schwidefsky
2012-04-18  4:00       ` Hugh Dickins
2012-04-18  4:00         ` Hugh Dickins
2012-04-16 21:14 ` Hugh Dickins
2012-04-16 21:14   ` Hugh Dickins
2012-04-17 12:22   ` Mel Gorman
2012-04-17 12:22     ` Mel Gorman
2012-04-18  3:52     ` Hugh Dickins
2012-04-18  3:52       ` Hugh Dickins
2012-04-18 15:28       ` Mel Gorman
2012-04-18 15:28         ` Mel Gorman
2012-04-18 17:09         ` Hugh Dickins
2012-04-18 17:09           ` Hugh Dickins
2012-04-23 12:41           ` Mel Gorman
2012-04-23 12:41             ` Mel Gorman
2012-04-23 18:09             ` Hugh Dickins
2012-04-23 18:09               ` Hugh Dickins
2012-04-23 18:14               ` [PATCH] mm: fix s390 BUG by __set_page_dirty_no_writeback on swap Hugh Dickins
2012-04-23 18:14                 ` Hugh Dickins
2012-04-18 18:29         ` [RFC PATCH] s390: mm: rmap: Transfer storage key to struct page under the page lock Martin Schwidefsky
2012-04-18 18:29           ` Martin Schwidefsky

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.